On Sat, Jun 16, 2012 at 4:39 PM, Nathaniel Smith <[email protected]> wrote: > On Sat, Jun 16, 2012 at 9:03 PM, Matthew Brett <[email protected]> > wrote: >> Hi, >> >> On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <[email protected]> wrote: >>> On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris >>> <[email protected]> wrote: >>>> >>>> >>>> On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett <[email protected]> >>>> wrote: >>>>> >>>>> Hi, >>>>> >>>>> I noticed that numpy.linalg.matrix_rank sometimes gives full rank for >>>>> matrices that are numerically rank deficient: >>>>> >>>>> If I repeatedly make random matrices, then set the first column to be >>>>> equal to the sum of the second and third columns: >>>>> >>>>> def make_deficient(): >>>>> X = np.random.normal(size=(40, 10)) >>>>> deficient_X = X.copy() >>>>> deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2] >>>>> return deficient_X >>>>> >>>>> then the current numpy.linalg.matrix_rank algorithm returns full rank >>>>> (10) in about 8 percent of cases (see appended script). >>>>> >>>>> I think this is a tolerance problem. The ``matrix_rank`` algorithm >>>>> does this by default: >>>>> >>>>> S = spl.svd(M, compute_uv=False) >>>>> tol = S.max() * np.finfo(S.dtype).eps >>>>> return np.sum(S > tol) >>>>> >>>>> I guess we'd we want the lowest tolerance that nearly always or always >>>>> identifies numerically rank deficient matrices. I suppose one way of >>>>> looking at whether the tolerance is in the right range is to compare >>>>> the calculated tolerance (``tol``) to the minimum singular value >>>>> (``S.min()``) because S.min() in our case should be very small and >>>>> indicate the rank deficiency. The mean value of tol / S.min() for the >>>>> current algorithm, across many iterations, is about 2.8. We might >>>>> hope this value would be higher than 1, but not much higher, otherwise >>>>> we might be rejecting too many columns. >>>>> >>>>> Our current algorithm for tolerance is the same as the 2-norm of M * >>>>> eps. We're citing Golub and Van Loan for this, but now I look at our >>>>> copy (p 261, last para) - they seem to be suggesting using u * |M| >>>>> where u = (p 61, section 2.4.2) eps / 2. (see [1]). I think the Golub >>>>> and Van Loan suggestion corresponds to: >>>>> >>>>> tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2 >>>>> >>>>> This tolerance gives full rank for these rank-deficient matrices in >>>>> about 39 percent of cases (tol / S.min() ratio of 1.7) >>>>> >>>>> We see on p 56 (section 2.3.2) that: >>>>> >>>>> m, n = M.shape >>>>> 1 / sqrt(n) . |M|_{inf} <= |M|_2 >>>>> >>>>> So we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n). Setting: >>>>> >>>>> tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n) >>>>> >>>>> gives about 0.5 percent error (tol / S.min() of 4.4) >>>>> >>>>> Using the Mathworks threshold [2]: >>>>> >>>>> tol = S.max() * np.finfo(M.dtype).eps * max((m, n)) >>>>> >>>>> There are no false negatives (0 percent rank 10), but tol / S.min() is >>>>> around 110 - so conservative, in this case. >>>>> >>>>> So - summary - I'm worrying our current threshold is too small, >>>>> letting through many rank-deficient matrices without detection. I may >>>>> have misread Golub and Van Loan, but maybe we aren't doing what they >>>>> suggest. Maybe what we could use is either the MATLAB threshold or >>>>> something like: >>>>> >>>>> tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n) >>>>> >>>>> - so 2 * the upper bound for the inf norm = 2 * |M|_2 * sqrt(n) . This >>>>> gives 0 percent misses and tol / S.min() of 8.7. >>>>> >>>>> What do y'all think? >>>>> >>>>> Best, >>>>> >>>>> Matthew >>>>> >>>>> [1] >>>>> http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilon >>>>> [2] http://www.mathworks.com/help/techdoc/ref/rank.html >>>>> >>>>> Output from script: >>>>> >>>>> Percent undetected current: 9.8, tol / S.min(): 2.762 >>>>> Percent undetected inf norm: 39.1, tol / S.min(): 1.667 >>>>> Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367 >>>>> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734 >>>>> Percent undetected MATLAB: 0.0, tol / S.min(): 110.477 >>>>> >>>>> <script> >>>>> import numpy as np >>>>> import scipy.linalg as npl >>>>> >>>>> M = 40 >>>>> N = 10 >>>>> >>>>> def make_deficient(): >>>>> X = np.random.normal(size=(M, N)) >>>>> deficient_X = X.copy() >>>>> if M > N: # Make a column deficient >>>>> deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2] >>>>> else: # Make a row deficient >>>>> deficient_X[0] = deficient_X[1] + deficient_X[2] >>>>> return deficient_X >>>>> >>>>> matrices = [] >>>>> ranks = [] >>>>> ranks_inf = [] >>>>> ranks_ub_inf = [] >>>>> ranks_ub2_inf = [] >>>>> ranks_mlab = [] >>>>> tols = np.zeros((1000, 6)) >>>>> for i in range(1000): >>>>> m = make_deficient() >>>>> matrices.append(m) >>>>> # The SVD tolerances >>>>> S = npl.svd(m, compute_uv=False) >>>>> S0 = S.max() >>>>> # u in Golub and Van Loan == numpy eps / 2 >>>>> eps = np.finfo(m.dtype).eps >>>>> u = eps / 2 >>>>> # Current numpy matrix_rank algorithm >>>>> ranks.append(np.linalg.matrix_rank(m)) >>>>> # Which is the same as: >>>>> tol_s0 = S0 * eps >>>>> # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0)) >>>>> # Golub and Van Loan suggestion >>>>> tol_inf = npl.norm(m, np.inf) * u >>>>> ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf)) >>>>> # Upper bound of |X|_{inf} >>>>> tol_ub_inf = tol_s0 * np.sqrt(N) / 2 >>>>> ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf)) >>>>> # Times 2 fudge >>>>> tol_ub2_inf = tol_s0 * np.sqrt(N) >>>>> ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf)) >>>>> # MATLAB algorithm >>>>> tol_mlab = tol_s0 * max(m.shape) >>>>> ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab)) >>>>> # Collect tols >>>>> tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab, S.min() >>>>> >>>>> rel_tols = tols / tols[:, -1][:, None] >>>>> >>>>> fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f' >>>>> max_rank = min(M, N) >>>>> for name, ranks, mrt in zip( >>>>> ('current', 'inf norm', 'upper bound inf norm', >>>>> 'upper bound inf norm * 2', 'MATLAB'), >>>>> (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab), >>>>> rel_tols.mean(axis=0)[:5]): >>>>> pcnt = np.sum(np.array(ranks) == max_rank) / 1000. * 100 >>>>> print fmt % (name, pcnt, mrt) >>>>> </script> >>>> >>>> >>>> The polynomial fitting uses eps times the largest array dimension for the >>>> relative condition number. IIRC, that choice traces back to numerical >>>> recipes. >> >> Chuck - sorry - I didn't understand what you were saying, and now I >> think you were proposing the MATLAB algorithm. I can't find that in >> Numerical Recipes - can you? It would be helpful as a reference. >> >>> This is the same as Matlab, right? >> >> Yes, I believe so, i.e: >> >> tol = S.max() * np.finfo(M.dtype).eps * max((m, n)) >> >> from my original email. >> >>> If the Matlab condition is the most conservative, then it seems like a >>> reasonable choice -- conservative is good so long as your false >>> positive rate doesn't become to high, and presumably Matlab has enough >>> user experience to know whether the false positive rate is too high. >> >> Are we agreeing to go for the Matlab algorithm? >> >> If so, how should this be managed? Just changing it may change the >> output of code using numpy >= 1.5.0, but then again, the threshold is >> probably incorrect. >> >> Fix and break or FutureWarning with something like: >> >> def matrix_rank(M, tol=None): >> >> where ``tol`` can be a string like ``maxdim``? > > I dunno, I don't think we should do a big deprecation dance for every > bug fix. Is this a bug fix, so numpy will simply start producing more > accurate results on a given problem? I guess there isn't really a > right answer here (though claiming that [a, b, a+b] is full-rank is > clearly broken, and the matlab algorithm seems reasonable for > answering the specific question of whether a matrix is full rank), so > we'll have to hope some users speak up...
I don't see a problem changing this as a bugfix. statsmodels still has, I think, the original scipy.stats.models version for rank which is still much higher for any non-huge array and float, cond=1.0e-12. Josef > > -N > _______________________________________________ > NumPy-Discussion mailing list > [email protected] > http://mail.scipy.org/mailman/listinfo/numpy-discussion _______________________________________________ NumPy-Discussion mailing list [email protected] http://mail.scipy.org/mailman/listinfo/numpy-discussion
