MEDCOUPLE computes the medcouple, a robust skewness estimator.
For small n, it is better to compute the the medcouple also on the reflected sample -z and return (medcouple(z) - medcouple(-z))/2; To work on positive values we do this with:
z = chi2rnd(5,25,1); mcm = 1; % use naive method (n is small ...) medoutR = 0.5*(-medcouple(repmat(max(z),size(z,1),1)-z , mcm) + medcouple(z,mcm)); medout = medcouple(z,mcm); disp('Application to a non problematic dataset:'); disp(['result with reflected option = ' num2str(medoutR) ' --- result on z only = ' num2str(medoutR)]);
Application to a non problematic dataset: result with reflected option = -0.1548 --- result on z only = -0.1548
Apply to symmetric and skewed data
rng(5); %orientation = 'horizontal'; orientation = 'vertical'; % normal data with an outlier datain1 = 15+randn(200,1); datain1 = [datain1 ; 35]; % log-normal data with an outlier datain2 = lognrnd(0,1,200,1); datain2 = [datain2 ; 35]; k = 1.5; % limits on normal data + one outlier MC_1 = medcouple(datain1); Q1_1 = quantile(datain1,0.25); Q3_1 = quantile(datain1,0.75); IQR_1 = Q3_1 - Q1_1; DataLimLower_1 = Q1_1 - k * exp(-3.5 * MC_1) * IQR_1; DataLimUpper_1 = Q3_1 + k * exp(4 * MC_1) * IQR_1; % limits on log-normal data + one outlier MC_2 = medcouple(datain2); Q1_2 = quantile(datain2,0.25); Q3_2 = quantile(datain2,0.75); IQR_2 = Q3_2 - Q1_2; DataLimLower_2 = Q1_2 - k * exp(-3.5 * MC_2) * IQR_2; DataLimUpper_2 = Q3_2 + k * exp(4 * MC_2) * IQR_2; % boxplot boxplot([datain1,datain2],'Whisker',k,'Orientation',orientation,... 'OutlierSize',10,'Symbol','*r', ... 'labels',{'Normal(15,1)+outlier','Log-normal(0,1)+outlier'}); title({'Boxplot adjusted for skewness' , 'with medcouple (red limits)'}, 'FontSize' , 20) set(gca,'FontSize',16); switch orientation case 'vertical' % add the limits (for vertical boxplot) xx = xlim(gca); xx2 = xx/1.1; xx2(1) = xx(2)/1.5; xx1 = xx/2; xx1(1) = xx(2)/3.5; line(xx1,[DataLimLower_1 , DataLimLower_1] , 'Color','red'); line(xx1,[DataLimUpper_1 , DataLimUpper_1] , 'Color','red'); line(xx2,[DataLimLower_2 , DataLimLower_2] , 'Color','red'); line(xx2,[DataLimUpper_2 , DataLimUpper_2] , 'Color','red'); case 'horizontal' % add the limits (for horizontal boxplot) yy = ylim(gca); yy2 = yy/1.1; yy2(1) = yy(2)/1.5; yy1 = yy/2; yy1(1) = yy(2)/3.5; line([DataLimLower_1 , DataLimLower_1] , yy1,'Color','red'); line([DataLimUpper_1 , DataLimUpper_1] , yy1,'Color','red'); line([DataLimLower_2 , DataLimLower_2] , yy2,'Color','red'); line([DataLimUpper_2 , DataLimUpper_2] , yy2,'Color','red'); end
% Uses the time execution of the weighted median returned in varargout % Tm=matlab-fast-quickselectFSw; Tw=matlab-fast-whimed % Ts=matlab-simplified; To=matlab-octile; Tq=matlab-quantiles; tm=0; tmmex=0; tw=0; ts=0; to = 0; tq=0; N = 800; % sample size cycles = 1000; % number of repetitions for c = 1:cycles if mod(c,floor(cycles/10))==0, disp(['repetition n. ' num2str(c)]); end % Generate different random sequences (seed 896 is a good one to % compare consistency with the original C code on a difficult sequence) myseed = randi(1000,1,1); rng(myseed , 'twister'); datain = rand(N,1); % REMARK: to generate N random integers (for example between 1 and % 1000) using the Mersenne Twister in both C and MATLAB, note that the % line below is equivalent to randi(1000,N,1): %datain = ceil(datain*1000); % Line below is to test with skewed data from the lognormal %datain = lognrnd(0,1,N,1); % just to ensure that data are stored in a column vector datain = datain(:); % $n\log n$, using quickselectFSw tm0 = tic; [MCm , t_diva] = medcouple(datain,0,0); tm = tm+toc(tm0); % $n\log n$, using quickselectFSwmex tm0mex = tic; [MCmmex , t_diva_mex] = medcouple(datain,0,2); tmmex = tmmex+toc(tm0mex); % $n\log n$, using Haase's weighted median (which is based on MATLAB sortrows) tw0 = tic; [MCw , t_Haase] = medcouple(datain,0,1); tw = tw+toc(tw0); % $n^2$ "naive" algorithm, simplified ts0 = tic; MCs = medcouple(datain,1); ts = ts+toc(ts0); % Quantiles-based approximation to0 = tic; MCo = medcouple(datain,2); to = to+toc(to0); % Octiles-based approximation tq0 = tic; MCq = medcouple(datain,3); tq = tq+toc(tq0); end disp(' '); disp(['OVERALL TIME EXECUTION IN ' num2str(cycles) ' MC COMPUTATION']); disp(['time using WM by Haase = ' num2str(tw)]); disp(['time using quickselectFSw = ' num2str(tm)]); disp(['time using quickselectFSwmex = ' num2str(tmmex)]); disp(['time using the naive = ' num2str(ts)]); disp(['time using quantiles = ' num2str(tq)]); disp(['time using octiles = ' num2str(to)]); disp(' '); disp('TIME SPENT IN COMPUTING WEIGHTED AVERAGES IN A SINGLE MC COMPUTATION') disp(['t_Haase = ' num2str(t_Haase)]); disp(['t_quickselectFSw = ' num2str(t_diva)]); disp(['t_quickselectFSwmex = ' num2str(t_diva_mex)]);
repetition n. 100 repetition n. 200 repetition n. 300 repetition n. 400 repetition n. 500 repetition n. 600 repetition n. 700 repetition n. 800 repetition n. 900 repetition n. 1000 OVERALL TIME EXECUTION IN 1000 MC COMPUTATION time using WM by Haase = 0.89418 time using quickselectFSw = 0.50866 time using quickselectFSwmex = 0.33656 time using the naive = 1.7259 time using quantiles = 0.033919 time using octiles = 0.06674 TIME SPENT IN COMPUTING WEIGHTED AVERAGES IN A SINGLE MC COMPUTATION t_Haase = 0.0005864 t_quickselectFSw = 0.0001909 t_quickselectFSwmex = 6.38e-05
z
— Uni-dimensional array.
The elements of the array z represent a
sample from a univariate distribution.They can be missing (nan) of infinite (inf) values, which are excluded from the computation if present.
Data Types: single| double
mcm
— Scalar.
Medcouple Method.By default (mcm = 0 or not specified) the medcouple is computed using a fast $O(n log n)$ algorithm.
mcm = 1 applies a naive $O^2(n)$ method. With mcm = 2 the medcouple is replaced by a quantile-based approximation.
Similarily, mcm = 3 uses a octile-based approximation.
All methods are discussed in the references reported below. The fast options are the most precise; the simplified (naive) algotithm is also very fast but may be less accurate, as the kernel function of Hubert and Struyf is implemented in a way that does not take into account possible zeros at the denominator (these cases are simply not considered in the calculation). The quantile and octile approximations by definition are robust skewness estimators alternatives to the medcouple.
Example: 'mcm',2
Data Types: single | double
wmm
— Scalar.
Weighted Median Method.By default (wmm=0 or not specified) the weighted median is computed using function quickselectFSw, which csn compute any weighted quantile. If wmm=1 the weighted median is computed using the algorithm by Sven Haase https://it.mathworks.com/matlabcentral/fileexchange/23077-weighted-median Option wmm=2 applies the mex file quickselectFSwmex.
Example: 'wmm',1
Data Types: single | double
varargout
—T : The execution time spent for computing the weighted median.
ScalarThe computation of the weighted median is a critical step in the medcouple and it is also computational demanding.
For assessment purposes it is reported optionally.
The fast algorithm for the medcouple (Brys G., Hubert M. and Struyf A., 2004) is a robust measure of skewness used to adjust the whiskers of a boxplot (Hubert M. and Vandervierenb E., 2008) and avoid wrong declaration of outliers in asymmetric univariate data. The medcouple replaces the classical skewness (that is based on the third moment of a data distribution) with a scaled median difference of the left and right half of the distribution. More precisely, it is defined as the median of the kernel function $h(z_i,z_j)=\frac{(z_j-median(z))-(median(z)-z_i)}{z_j-z_i}$ over all pairs $(z_i,z_j)$ where $z_i<=median(z)$ and $z_j >= median(z)$. The MATLAB toolbox LIBRA (https://wis.kuleuven.be/statdatascience/robust/LIBRA/) and the R package robustbase (https://cran.r-project.org/web/packages/robustbase) implement the fast medcouple with wrapper functions to the same compiled C source, to maximize speed. This medcouple function replicates faithfully the C function mlmc.c in pure MATLAB code, but replaces the key computations of the median and weighted median (which are intensively used) with calls to the FSDA functions quickselectFS and quickselectFSw or, optionally, to the $n\log n$ solution of Sven Haase (2022) weightedMedian. The result is a readable MATLAB function that does not sacrifices performance. The FSDA implementation also optimizes the computation of the medcouple kernel function, to take into account the case where there are multiple values equal to the median, the denominator becomes 0 and the kernel is therefore defined differently: we computed it efficiently in the sub-function calwork, using the signum function. The consistency of the MATLAB and original C implementations have been carefully verified with simulations that ensure the same random numbers generation in the two environments. The code that has been used can be found in the FSDA github project.
Brys G., Hubert M. and Struyf A. (2004), A Robust Measure of Skewness, "Journal of Computational and Graphical Statistics", Vol. 13(4), pp. 996-1017.
Hubert M. and Vandervierenb E. (2008), An adjusted boxplot for skewed distributions, "Computational Statistics and Data Analysis", Vol. 52, pp. 5186-5201.
Johnson D.B. and Mizoguchi T. (1978), Selecting the Kth Element in X + Y and X1 + X2 + ... + Xm, "SIAM Journal of Computing", Vol. 7, pp. 147-153.
Hinkley D.V. (1975), On Power Transformations to Symmetry, "Biometrika", Vol. 62, pp.101-111.