On Thu, Sep 26, 2013 at 6:51 AM, <[email protected]> wrote:
> On Thu, Sep 26, 2013 at 4:21 AM, Nathaniel Smith <[email protected]> wrote:
>> If you want a proper self-consistent correlation/covariance matrix, then
>> pairwise deletion just makes no sense period, I don't see how postprocessing
>> can help.
>
> clipping to [-1, 1] and finding the nearest positive semi-definite matrix.
> For the latter there is some code in statsmodels, and several newer
> algorithms that I haven't looked at.
>
> It's a quite common problem in finance, but usually longer time series
> with not a large number of missing values.
>
>>
>> If you want a matrix of correlations, then pairwise deletion makes sense.
>> It's an interesting point that arguably the current ma.corrcoef code may
>> actually give you a better estimator of the individual correlation
>> coefficients than doing full pairwise deletion, but it's pretty surprising
>> and unexpected... when people call corrcoef I think they are asking "please
>> compute the textbook formula for 'sample correlation'" not "please give me
>> some arbitrary good estimator for the population correlation", so we
>> probably have to change it.
>>
>> (Hopefully no-one has published anything based on the current code.)
>
> I haven't seen a textbook version of this yet.
>
> Calculating every mean (n + 1) * n / 2 times sounds a bit excessive,
> especially if it doesn't really solve the problem.
unless you also calculate each standard deviation (n + 1) * n / 2 times.
But then you lose the relationship between cov and corrcoef.

Josef

>
> Josef
>
>>
>> -n
>>
>> On 26 Sep 2013 04:19, <[email protected]> wrote:
>>>
>>> On Wed, Sep 25, 2013 at 11:05 PM, <[email protected]> wrote:
>>> > On Wed, Sep 25, 2013 at 8:26 PM, Faraz Mirzaei <[email protected]>
>>> > wrote:
>>> >> Hi everyone,
>>> >>
>>> >> I'm using np.ma.corrcoef to compute the correlation coefficients among
>>> >> rows of a masked matrix, where the masked elements are the missing data.
>>> >> I've observed that in some cases, the np.ma.corrcoef gives invalid
>>> >> coefficients that are greater than 1 or less than -1.
>>> >>
>>> >> Here's an example:
>>> >>
>>> >> x = array([[ 7, -4, -1, -7, -3, -2],
>>> >>            [ 6, -3,  0,  4,  0,  5],
>>> >>            [-4, -5,  7,  5, -7, -7],
>>> >>            [-5,  5, -8,  0,  1,  4]])
>>> >>
>>> >> x_ma = np.ma.masked_less_equal(x , -5)
>>> >>
>>> >> C = np.round(np.ma.corrcoef(x_ma), 2)
>>> >>
>>> >> print C
>>> >>
>>> >> [[1.0 0.73 -- -1.68]
>>> >>  [0.73 1.0 -0.86 -0.38]
>>> >>  [-- -0.86 1.0 --]
>>> >>  [-1.68 -0.38 -- 1.0]]
>>> >>
>>> >> As you can see, the [0,3] element is -1.68 which is not a valid
>>> >> correlation coefficient. (Valid correlation coefficients should be
>>> >> between -1 and 1).
>>> >>
>>> >> I looked at the code for np.ma.corrcoef, and this behavior seems to be
>>> >> due to the way that mean values of the rows of the input matrix are
>>> >> computed and subtracted from them. Apparently, the mean value is
>>> >> individually computed for each row, without masking the elements
>>> >> corresponding to the masked elements of the other row of the matrix,
>>> >> with respect to which the correlation coefficient is being computed.
>>> >>
>>> >> I guess the right way should be to recompute the mean value for each
>>> >> row every time that a correlation coefficient is being computed for two
>>> >> rows after propagating pair-wise masked values.
>>> >>
>>> >> Please let me know what you think.
>>> >
>>> > just general comments, I have no experience here
>>> >
>>> > From what you are saying it sounds like np.ma is not doing pairwise
>>> > deletion in calculating the mean (which only requires ignoring
>>> > missings in one array), however it does (correctly) do pairwise
>>> > deletion in calculating the cross product.
>>>
>>> Actually, I think the calculation of the mean is not relevant for
>>> having weird correlation coefficients without clipping.
>>>
>>> With pairwise deletion you use different samples, subsets of the data,
>>> for the variances and the covariances.
>>> It should be easy (?) to construct examples where the pairwise
>>> deletion for the covariance produces a large positive or negative
>>> number, and both variances and standard deviations are small, using
>>> two different subsamples.
>>> Once you calculate the correlation coefficient, it could be all over
>>> the place, independent of the mean calculations.
>>>
>>> conclusion: pairwise deletion requires post-processing if you want a
>>> proper correlation matrix.
>>>
>>> Josef
>>>
>>> >
>>> > covariance or correlation matrices with pairwise deletion are not
>>> > necessarily "proper" covariance or correlation matrices.
>>> > I've read that they don't need to be positive semi-definite, but I've
>>> > never heard of values outside of [-1, 1]. It might only be a problem
>>> > if you have a large fraction of missing values..
>>> >
>>> > I think the current behavior in np.ma makes sense in that it uses all
>>> > the information available in estimating the mean, which should be more
>>> > accurate if we use more information. But it makes cov and corrcoef
>>> > even weirder than they already are with pairwise deletion.
>>> >
>>> > Row-wise deletion (deleting observations that have at least one
>>> > missing), which would create "proper" correlation matrices, wouldn't
>>> > produce much in your example.
>>> >
>>> > I would check what R or other packages are doing and follow their
>>> > lead, or add another option.
>>> > (similar: we had a case in statsmodels where I used initially all
>>> > information for calculating the mean, but then we dropped some
>>> > observations to match the behavior of Stata, and to use the same
>>> > observations for calculating the mean and the follow up statistics.)
>>> >
>>> >
>>> > looks like pandas might be truncating the correlations to [-1, 1] (I
>>> > didn't check)
>>> >
>>> > >>> import pandas as pd
>>> > >>> x_pd = pd.DataFrame(x_ma.T)
>>> > >>> x_pd.corr()
>>> >           0         1         2         3
>>> > 0  1.000000  0.734367 -1.000000 -0.240192
>>> > 1  0.734367  1.000000 -0.856565 -0.378777
>>> > 2 -1.000000 -0.856565  1.000000       NaN
>>> > 3 -0.240192 -0.378777       NaN  1.000000
>>> >
>>> > >>> np.round(np.ma.corrcoef(x_ma), 6)
>>> > masked_array(data =
>>> >  [[1.0 0.734367 -1.190909 -1.681346]
>>> >  [0.734367 1.0 -0.856565 -0.378777]
>>> >  [-1.190909 -0.856565 1.0 --]
>>> >  [-1.681346 -0.378777 -- 1.0]],
>>> >              mask =
>>> >  [[False False False False]
>>> >  [False False False False]
>>> >  [False False False True]
>>> >  [False False True False]],
>>> >        fill_value = 1e+20)
>>> >
>>> >
>>> > Josef
>>> >
>>> >
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Faraz
>>> >>
>>> >>
>>> >>
>>> >> _______________________________________________
>>> >> NumPy-Discussion mailing list
>>> >> [email protected]
>>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>> >>
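
For reference, a minimal sketch of the "recompute everything per pair" variant discussed in this thread: for each pair of rows, the means and both sums of squares are taken over only the columns observed in both rows. This is not the np.ma.corrcoef implementation. The function name pairwise_corrcoef is made up for illustration, and returning NaN for pairs with fewer than two shared observations is an assumption.

```python
import numpy as np


def pairwise_corrcoef(x_ma):
    """Correlation between rows of a masked array with full pairwise deletion.

    Hypothetical helper, not part of NumPy. For every pair of rows, only the
    columns that are unmasked in BOTH rows are used for the means, the
    variances and the cross product.
    """
    x_ma = np.ma.asarray(x_ma, dtype=float)
    data = np.ma.getdata(x_ma)
    valid = ~np.ma.getmaskarray(x_ma)
    n = data.shape[0]
    corr = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i, n):
            both = valid[i] & valid[j]
            if both.sum() < 2:
                # Assumption: too few shared observations, leave as NaN.
                continue
            a = data[i, both] - data[i, both].mean()
            b = data[j, both] - data[j, both].mean()
            denom = np.sqrt((a @ a) * (b @ b))
            if denom > 0:
                corr[i, j] = corr[j, i] = (a @ b) / denom
    return corr


# Faraz's example from the thread.
x = np.array([[ 7, -4, -1, -7, -3, -2],
              [ 6, -3,  0,  4,  0,  5],
              [-4, -5,  7,  5, -7, -7],
              [-5,  5, -8,  0,  1,  4]])
x_ma = np.ma.masked_less_equal(x, -5)
print(np.round(pairwise_corrcoef(x_ma), 6))
```

Because numerator and denominator come from the same subsample, every entry stays inside [-1, 1]; on the example above the [0, 3] entry comes out at about -0.24 rather than -1.68, in line with the pandas output quoted in the thread. The matrix as a whole still need not be positive semi-definite, and cov and corrcoef no longer share one set of variances.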
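
The post-processing step mentioned at the top of this message (clip to [-1, 1], then move to a nearby positive semi-definite matrix) could look roughly like the sketch below. It uses plain eigenvalue clipping followed by rescaling to a unit diagonal, which yields a valid correlation matrix but not necessarily the nearest one, and it is not the statsmodels code. The name clip_to_psd_corr and the eps floor are assumptions for illustration, and the input is assumed to contain no missing entries.

```python
import numpy as np


def clip_to_psd_corr(corr, eps=1e-8):
    """Turn an improper 'correlation' matrix into a proper one (sketch).

    Hypothetical helper: clips entries to [-1, 1], symmetrizes, floors the
    eigenvalues at eps, and rescales so the diagonal is exactly 1.
    """
    corr = np.clip(np.asarray(corr, dtype=float), -1.0, 1.0)
    corr = (corr + corr.T) / 2.0
    vals, vecs = np.linalg.eigh(corr)
    vals = np.clip(vals, eps, None)       # remove negative eigenvalues
    fixed = (vecs * vals) @ vecs.T        # rebuild V diag(vals) V^T
    d = np.sqrt(np.diag(fixed))
    fixed = fixed / np.outer(d, d)        # back to unit diagonal
    np.fill_diagonal(fixed, 1.0)
    return fixed


# np.ma.corrcoef output from the thread; the masked [2, 3] entry is filled
# with 0 here purely as an assumption so that the matrix is complete.
c_raw = np.array([[ 1.0,       0.734367, -1.190909, -1.681346],
                  [ 0.734367,  1.0,      -0.856565, -0.378777],
                  [-1.190909, -0.856565,  1.0,       0.0],
                  [-1.681346, -0.378777,  0.0,       1.0]])
c_fixed = clip_to_psd_corr(c_raw)
print(np.round(c_fixed, 3))
print(np.linalg.eigvalsh(c_fixed).min())
```

Clipping alone is not enough here: after clipping, row 0 is perfectly negatively correlated with rows 2 and 3 while those two are uncorrelated with each other, which no real data set can produce, so the eigenvalue step does the actual repair.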
