On Mon, Jun 27, 2011 at 5:01 PM, Mark Wiebe <mwwi...@gmail.com> wrote:
> On Mon, Jun 27, 2011 at 2:59 PM, <josef.p...@gmail.com> wrote:
>>
>> On Mon, Jun 27, 2011 at 2:24 PM, eat <e.antero.ta...@gmail.com> wrote:
>> >
>> > On Mon, Jun 27, 2011 at 8:53 PM, Mark Wiebe <mwwi...@gmail.com> wrote:
>> >>
>> >> On Mon, Jun 27, 2011 at 12:44 PM, eat <e.antero.ta...@gmail.com> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe <mwwi...@gmail.com> wrote:
>> >>>>
>> >>>> First I'd like to thank everyone for all the feedback you're providing,
>> >>>> clearly this is an important topic to many people, and the discussion
>> >>>> has helped clarify the ideas for me. I've renamed and updated the NEP,
>> >>>> then placed it into the master NumPy repository so it has a more
>> >>>> permanent home here:
>> >>>>
>> >>>> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst
>> >>>>
>> >>>> In the NEP, I've tried to address everything that was raised in the
>> >>>> original thread and in Nathaniel's followup 'Concepts' thread. To deal
>> >>>> with the issue of whether a mask is True or False for a missing value,
>> >>>> I've removed the 'mask' attribute entirely, except for ufunc-like
>> >>>> functions np.ismissing and np.isavail which return the two styles of
>> >>>> masks. Here's a high level summary of how I'm thinking of the topic,
>> >>>> and what I will implement:
>> >>>>
>> >>>> Missing Data Abstraction
>> >>>>
>> >>>> There appear to be two useful ways to think about missing data that are
>> >>>> worth supporting.
>> >>>>
>> >>>> 1) Unknown yet existing data
>> >>>> 2) Data that doesn't exist
>> >>>>
>> >>>> In 1), an NA value causes outputs to become NA except in a small number
>> >>>> of exceptions such as boolean logic, and in 2), operations treat the
>> >>>> data as if there were a smaller array without the NA values.
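As a rough illustration of the two abstractions, here is a minimal sketch that uses only what released NumPy already has: NaN as a stand-in for NA in case 1), and numpy.ma or explicit filtering as a stand-in for case 2). It does not use the np.NA interface proposed in the NEP, and the variable names are made up for the example.

```python
import numpy as np

data = np.array([1.0, np.nan, 3.0])    # NaN stands in for NA here

# 1) Unknown yet existing data: the missing value propagates to the result.
propagated = data.sum()

# 2) Data that doesn't exist: behave as if the array were smaller.
masked = np.ma.masked_invalid(data)    # mask the NaN entry
skipped = masked.sum()                 # sum over the unmasked entries only
smaller = data[~np.isnan(data)].sum()  # same idea via an explicit smaller array

print(propagated, skipped, smaller)
```

The two readings give different answers for the same input, which is why a choice between 1) and 2) has to be made per operation.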
>> >>>>
>> >>>> Temporarily Ignoring Data
>> >>>>
>> >>>> In some cases, it is useful to flag data as NA temporarily, possibly in
>> >>>> several different ways, for particular calculations or testing out
>> >>>> different ways of throwing away outliers. This is independent of the
>> >>>> missing data abstraction, still requiring a choice of 1) or 2) above.
>> >>>>
>> >>>> Implementation Techniques
>> >>>>
>> >>>> There are two mechanisms generally used to implement missing data
>> >>>> abstractions,
>> >>>>
>> >>>> 1) An NA bit pattern
>> >>>> 2) A mask
>> >>>>
>> >>>> I've described a design in the NEP which can include both techniques
>> >>>> using the same interface. The mask approach is strictly more general
>> >>>> than the NA bit pattern approach, except for a few things like the idea
>> >>>> of supporting the dtype 'NA[f8,InfNan]' which you can read about in the
>> >>>> NEP.
>> >>>>
>> >>>> My intention is to implement the mask-based design, and possibly also
>> >>>> implement the NA bit pattern design, but if anything gets cut it will
>> >>>> be the NA bit patterns.
>> >>>>
>> >>>> Thanks again for all your input so far, and thanks in advance for your
>> >>>> suggestions for improving this new revision of the NEP.
>> >>>
>> >>> A very impressive PEP indeed.
>> >
>> > Hi,
>> >>>
>> >>> However, how would corner cases, like
>> >>>
>> >>> >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True)
>> >>> >>> np.mean(a, skipna=True)
>> >>
>> >> This should be equivalent to removing all the NA values, then calling
>> >> mean, like this:
>> >>
>> >> >>> b = np.array([], dtype='f8')
>> >> >>> np.mean(b)
>> >> /home/mwiebe/virtualenvs/dev/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2374:
>> >> RuntimeWarning: invalid value encountered in double_scalars
>> >>   return mean(axis, dtype, out)
>> >> nan
>> >>>
>> >>> >>> np.mean(a)
>> >>
>> >> This would return NA, since NA values are sitting in positions that
>> >> would affect the output result.
>> >
>> > OK.
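The all-missing corner case can be tried with stand-ins, since np.NA, masked=True and skipna= are proposals from the NEP and not something the sketch below relies on. A minimal sketch, assuming an empty array for the "remove the NAs first" reading and a fully masked numpy.ma array for the masked reading (variable names are illustrative only):

```python
import warnings
import numpy as np

# "Remove all the NA values, then call mean": nothing is left to average.
b = np.array([], dtype='f8')
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)  # silence the empty-mean warning
    empty_mean = np.mean(b)            # nan, as in the session quoted above

# numpy.ma makes a different choice for a fully masked array.
a = np.ma.masked_all(2, dtype='f8')    # two elements, both masked
masked_mean = a.mean()                 # the np.ma.masked constant, not nan

print(empty_mean, masked_mean is np.ma.masked)
```

So even existing tools disagree on what the mean of "nothing" should be, which is the kind of context-dependent decision being discussed here.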
>> >>>
>> >>> be handled?
>> >>> My concern here is that there always seems to be such corner cases
>> >>> which can only be handled with specific context knowledge. Thus
>> >>> producing 100% generic code to handle 'missing data' is not doable.
>> >>
>> >> Working out the corner cases for the functions that are already in
>> >> numpy seems tractable to me, how to or whether to support missing data
>> >> is something the author of each new function will have to consider when
>> >> missing data support is in NumPy, but I don't think we can do more than
>> >> provide the mechanisms for people to use.
>> >
>> > Sure. I'll ride up with this and wait when I'll have some tangible to
>> > outperform the 'traditional' NaN handling.
>> > - eat
>>
>> Just a question how things would work with the new model.
>> How can you implement the "use" keyword from R's cov (or cor), with
>> minimal data copying
>>
>> I think the basic masked array version would (or does) just assign 0
>> to the missing values calculate the covariance or correlation and then
>> correct with the correct count.
>>
>> ------------
>> cov(x, y = NULL, use = "everything",
>>     method = c("pearson", "kendall", "spearman"))
>>
>> cor(x, y = NULL, use = "everything",
>>     method = c("pearson", "kendall", "spearman"))
>>
>> cov2cor(V)
>>
>> Arguments
>> x      a numeric vector, matrix or data frame.
>> y      NULL (default) or a vector, matrix or data frame with compatible
>>        dimensions to x. The default is equivalent to y = x (but more
>>        efficient).
>> na.rm  logical. Should missing values be removed?
>>
>> use    an optional character string giving a method for computing
>>        covariances in the presence of missing values. This must be (an
>>        abbreviation of) one of the strings "everything", "all.obs",
>>        "complete.obs", "na.or.complete", or "pairwise.complete.obs".
>> ------------
>>
>> especially I'm interested in the complete.obs (drop any rows that
>> contains a NA) case
>
> I think this is mainly a matter of extending NumPy's equivalent cov function
> with a parameter like this. Implemented in C, I'm sure it could be done with
> minimal copying, I'm not exactly sure how it will have to look implemented
> in Python. Perhaps someone could try it once I have a basic prototype ready
> for testing.
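
For reference, the complete.obs case can be sketched in plain Python with current NumPy, using NaN as a stand-in for NA. The helper name cov_complete_obs is hypothetical, and the boolean indexing copies the kept rows, so this shows the semantics only, not the minimal-copy implementation being asked about:

```python
import numpy as np

def cov_complete_obs(x):
    """Covariance of the columns of x, using only rows that contain no NaN."""
    x = np.asarray(x, dtype=float)
    keep = ~np.isnan(x).any(axis=1)        # True for rows with no missing value
    return np.cov(x[keep], rowvar=False)   # x[keep] is a copy of the kept rows

x = np.array([[1.0, 2.0],
              [2.0, np.nan],               # dropped: one value is missing
              [3.0, 6.0],
              [4.0, 7.0]])
c = cov_complete_obs(x)
print(c)
```

The result matches calling np.cov on rows 0, 2 and 3 alone, which is what "drop any rows that contain a NA" means for this input.
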
This is just a typical example. Going to C doesn't help: whoever is
rewriting scipy.stats.mstats, or writing similar statistical code, will
need to do this all the time.

Josef

> -Mark
>
>> Josef
>>
>> >> -Mark
>> >>
>> >>> Thanks,
>> >>> - eat
>> >>>>
>> >>>> -Mark
>> >>>> _______________________________________________
>> >>>> NumPy-Discussion mailing list
>> >>>> NumPy-Discussion@scipy.org
>> >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion