On Thu, Jun 23, 2011 at 5:51 PM, <[email protected]> wrote:

> On Thu, Jun 23, 2011 at 5:37 PM, Mark Wiebe <[email protected]> wrote:
> > On Thu, Jun 23, 2011 at 4:19 PM, Nathaniel Smith <[email protected]> wrote:
> >>
> >> I'd like to see a statement of what the "missing data problem" is, and
> >> how this solves it? Because I don't think this is entirely intuitive,
> >> or that everyone necessarily has the same idea.
> >
> > I agree it represents different problems in different contexts. For NumPy, I
> > think the mechanism for dealing with it needs to be intuitive to work with
> > in a maximum number of contexts, avoiding surprises. Getting feedback from a
> > broad range of people is the only way a general solution can be designed
> > with any level of confidence.
> >>
> >> > Reduction operations like 'sum', 'prod', 'min', and 'max' will operate
> >> > as if the values weren't there
> >>
> >> For context: My experience with missing data is in statistical
> >> analysis; I find R's NA support to be pretty awesome for those
> >> purposes. The conceptual model it's based on is that an NA value is
> >> some number that we just happen not to know. So from this perspective,
> >> I find it pretty confusing that adding an unknown quantity to 3 should
> >> result in 3, rather than another unknown quantity. (Obviously it
> >> should be possible to compute the sum of the known values, but IME
> >> it's important for the default behavior to be to fail loudly when
> >> things are wonky, not to silently patch them up, possibly
> >> incorrectly!)
> >
> > The conceptual model you describe sounds reasonable to me, and I definitely
> > like the idea of consistently following one such model for all default
> > behaviors.
> >
> >>
> >> Also, what should 'dot' do with missing values?
> >
> > A matrix multiplication is defined in terms of sums of products, so it can
> > be implemented to behave consistently with your conceptual model.
> >
> From the perspective of statistical analysis, I don't see much advantage of this.
> What to do with nans depends on the analysis, and needs to be looked
> at for each case.
>
> Only easy descriptive statistics work without problems, nansum, ....
>
> All the other usages require rewriting the algorithm, see scipy.stats
> versus scipy.mstats. In R often the nan handling is remove all
> observations (rows) with at least one nan, or we go to some fancier
> imputation of missing values algorithms.
>
> What happens if I just want to go back and forth between using Lapack
> and minpack, none of them suddenly grow missing values handling, and
> if they would it might not be what we want.
>
> arrays with nans are nice for data handling, but I don't see why we
> should pay for any overhead for number crunching with numpy arrays.
>

I didn't get the impression that there would be noticeable overhead. On the
other points, I think the idea should be to provide a low-level mechanism
that is flexible enough to allow implementation of various use cases at a
higher level. For instance, current masked arrays could be reimplemented if
desired, etc. Not that I think that should be done...
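
To make the competing conventions in this thread concrete, here is a minimal sketch using only existing NumPy calls. It assumes NaN stands in for a missing value, which is not the mechanism being proposed, and the variable names are purely illustrative. It shows a sum that propagates the unknown, a sum that skips it, dropping rows with a missing entry, and 'dot' treated as sums of products.

```python
import numpy as np

# Assumption for illustration: NaN plays the role of a missing value.
x = np.array([3.0, np.nan, 4.0])

# "Unknown quantity" model: the missing value propagates through the sum.
propagated = np.sum(x)

# "As if the values weren't there" model: the missing value is skipped.
skipped = np.nansum(x)

# The existing masked arrays also skip masked entries in reductions.
masked_total = np.ma.masked_invalid(x).sum()

# Row-wise removal: drop every observation (row) with at least one NaN.
data = np.array([[1.0, 2.0],
                 [np.nan, 5.0],
                 [3.0, 4.0]])
complete_rows = data[~np.isnan(data).any(axis=1)]

# 'dot' as sums of products: under the propagating model, any output entry
# whose sum touches a NaN is itself NaN; the others are unaffected.
a = np.array([[1.0, np.nan],
              [2.0, 3.0]])
b = np.array([1.0, 1.0])
product = a.dot(b)

print(propagated, skipped, masked_total)
print(complete_rows)
print(product)
```

Which of these should be the default is exactly the question under discussion; the sketch only shows that each behavior can already be expressed at a higher level.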
<snip>

Chuck
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
