On Mon, Jun 27, 2011 at 2:24 PM, eat <e.antero.ta...@gmail.com> wrote:
>
> On Mon, Jun 27, 2011 at 8:53 PM, Mark Wiebe <mwwi...@gmail.com> wrote:
>>
>> On Mon, Jun 27, 2011 at 12:44 PM, eat <e.antero.ta...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe <mwwi...@gmail.com> wrote:
>>>>
>>>> First I'd like to thank everyone for all the feedback you're providing,
>>>> clearly this is an important topic to many people, and the discussion has
>>>> helped clarify the ideas for me. I've renamed and updated the NEP, then
>>>> placed it into the master NumPy repository so it has a more permanent home
>>>> here:
>>>> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst
>>>>
>>>> In the NEP, I've tried to address everything that was raised in the
>>>> original thread and in Nathaniel's followup 'Concepts' thread. To deal with
>>>> the issue of whether a mask is True or False for a missing value, I've
>>>> removed the 'mask' attribute entirely, except for ufunc-like functions
>>>> np.ismissing and np.isavail which return the two styles of masks. Here's a
>>>> high level summary of how I'm thinking of the topic, and what I will
>>>> implement:
>>>>
>>>> Missing Data Abstraction
>>>> There appear to be two useful ways to think about missing data that are
>>>> worth supporting.
>>>> 1) Unknown yet existing data
>>>> 2) Data that doesn't exist
>>>> In 1), an NA value causes outputs to become NA except in a small number
>>>> of exceptions such as boolean logic, and in 2), operations treat the data
>>>> as if there were a smaller array without the NA values.
>>>>
>>>> Temporarily Ignoring Data
>>>> In some cases, it is useful to flag data as NA temporarily, possibly in
>>>> several different ways, for particular calculations or testing out
>>>> different ways of throwing away outliers. This is independent of the
>>>> missing data abstraction, still requiring a choice of 1) or 2) above.
>>>>
>>>> Implementation Techniques
>>>> There are two mechanisms generally used to implement missing data
>>>> abstractions,
>>>> 1) An NA bit pattern
>>>> 2) A mask
>>>> I've described a design in the NEP which can include both techniques
>>>> using the same interface. The mask approach is strictly more general than
>>>> the NA bit pattern approach, except for a few things like the idea of
>>>> supporting the dtype 'NA[f8,InfNan]' which you can read about in the NEP.
>>>> My intention is to implement the mask-based design, and possibly also
>>>> implement the NA bit pattern design, but if anything gets cut it will be
>>>> the NA bit patterns.
>>>>
>>>> Thanks again for all your input so far, and thanks in advance for your
>>>> suggestions for improving this new revision of the NEP.
>>>
>>> A very impressive PEP indeed.
>
> Hi,
>>>
>>> However, how would corner cases, like
>>>
>>> >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True)
>>> >>> np.mean(a, skipna=True)
>>
>> This should be equivalent to removing all the NA values, then calling
>> mean, like this:
>>
>> >>> b = np.array([], dtype='f8')
>> >>> np.mean(b)
>> /home/mwiebe/virtualenvs/dev/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2374:
>> RuntimeWarning: invalid value encountered in double_scalars
>>   return mean(axis, dtype, out)
>> nan
>>>
>>> >>> np.mean(a)
>>
>> This would return NA, since NA values are sitting in positions that would
>> affect the output result.
>
> OK.
>>
>>>
>>> be handled?
>>> My concern here is that there always seems to be such corner cases which
>>> can only be handled with specific context knowledge. Thus producing 100%
>>> generic code to handle 'missing data' is not doable.
>>
>> Working out the corner cases for the functions that are already in numpy
>> seems tractable to me, how to or whether to support missing data is
>> something the author of each new function will have to consider when missing
>> data support is in NumPy, but I don't think we can do more than provide the
>> mechanisms for people to use.
>
> Sure. I'll ride up with this and wait when I'll have some tangible to
> outperform the 'traditional' NaN handling.
> - eat
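
For reference, the all-missing corner case quoted above can be tried with tools that already exist in NumPy. This is only a sketch under stated assumptions: it does not use the `np.NA`, `masked=True`, or `skipna=` interface proposed in the NEP, it uses NaN as a stand-in for NA, and the helper names `mean_propagate` and `mean_skip` are hypothetical.

```python
import numpy as np


def mean_propagate(values, missing):
    """Abstraction 1 (unknown but existing data): a missing input poisons the result."""
    values = np.asarray(values, dtype="f8")
    missing = np.asarray(missing, dtype=bool)
    if missing.any():
        return np.nan  # NaN stands in for NA in this sketch
    return values.mean()


def mean_skip(values, missing):
    """Abstraction 2 (data that doesn't exist): behave as if the array were smaller."""
    values = np.asarray(values, dtype="f8")
    missing = np.asarray(missing, dtype=bool)
    kept = values[~missing]
    if kept.size == 0:
        # Explicit choice for the all-missing corner case; calling kept.mean()
        # here would emit a RuntimeWarning and return nan.
        return np.nan
    return kept.mean()


values = np.array([1.0, 2.0])
all_missing = np.array([True, True])

propagated = mean_propagate(values, all_missing)
skipped = mean_skip(values, all_missing)

# numpy.ma makes its own choice for the same corner case: the mean of a
# fully masked array is the special constant np.ma.masked.
ma_result = np.ma.masked_array(values, mask=all_missing).mean()
```

The point of the sketch is that each function has to pick a behaviour for the all-missing input explicitly, which is the context-dependence raised in the quoted exchange.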
Just a question about how things would work with the new model: how can you
implement the "use" keyword from R's cov (or cor) with minimal data copying?

I think the basic masked array version would (or does) just assign 0 to the
missing values, calculate the covariance or correlation, and then correct
with the correct count.

------------
cov(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))

cor(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))

cov2cor(V)

Arguments

x       a numeric vector, matrix or data frame.
y       NULL (default) or a vector, matrix or data frame with compatible
        dimensions to x. The default is equivalent to y = x (but more
        efficient).
na.rm   logical. Should missing values be removed?
use     an optional character string giving a method for computing
        covariances in the presence of missing values. This must be (an
        abbreviation of) one of the strings "everything", "all.obs",
        "complete.obs", "na.or.complete", or "pairwise.complete.obs".
------------

I'm especially interested in the complete.obs case (drop any rows that
contain an NA).

Josef

>>
>> -Mark
>>
>>>
>>> Thanks,
>>> - eat
>>>>
>>>> -Mark

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
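
To make the complete.obs question concrete, here is a minimal sketch of that behaviour with current NumPy. It assumes NaN is used as the missing marker rather than the NEP's NA, and the function name `cov_complete_obs` is hypothetical. Note that the boolean indexing copies the retained rows, so this illustrates the semantics only and is not the minimal-copy implementation asked about.

```python
import numpy as np


def cov_complete_obs(x):
    """Covariance of the columns of x, using only rows with no NaN at all."""
    x = np.asarray(x, dtype="f8")
    # True for each row (observation) in which every variable is present.
    complete = ~np.isnan(x).any(axis=1)
    if complete.sum() < 2:
        # Choice made for this sketch: refuse to estimate from fewer than
        # two complete observations.
        raise ValueError("need at least two complete rows")
    # Boolean indexing makes a copy of the complete rows only.
    return np.cov(x[complete], rowvar=False)


# Rows are observations, columns are variables; the second row is incomplete.
data = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
])
c = cov_complete_obs(data)
```

With the sample data, the second row is dropped and the covariance is computed from the remaining three observations.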