On Wed, Jul 6, 2011 at 3:47 PM, <[email protected]> wrote:

> On Wed, Jul 6, 2011 at 4:38 PM, <[email protected]> wrote:
> > On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire
> > <[email protected]> wrote:
> >>
> >> On Wed, Jul 6, 2011 at 1:08 PM, <[email protected]> wrote:
> >>>
> >>> On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire
> >>> <[email protected]> wrote:
> >>> >
> >>> > On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker
> >>> > <[email protected]> wrote:
> >>> >>
> >>> >> Christopher Jordan-Squire wrote:
> >>> >> > If we follow those rules for IGNORE for all computations, we
> >>> >> > sometimes get some weird output. For example:
> >>> >> > [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
> >>> >> > multiply and not * with broadcasting.) Or should that sort of
> >>> >> > operation throw an error?
> >>> >>
> >>> >> That should throw an error -- matrix computation is heavily
> >>> >> influenced by the shape and size of matrices, so I think IGNORES
> >>> >> really don't make sense there.
> >>> >
> >>> > If the IGNORES don't make sense in basic numpy computations then I'm
> >>> > kinda confused why they'd be included at the numpy core level.
> >>> >
> >>> >> Nathaniel Smith wrote:
> >>> >> > It's exactly this transparency that worries Matthew and me -- we
> >>> >> > feel that the alterNEP preserves it, and the NEP attempts to erase
> >>> >> > it. In the NEP, there are two totally different underlying data
> >>> >> > structures, but this difference is blurred at the Python level.
> >>> >> > The idea is that you shouldn't have to think about which you
> >>> >> > have, but if you work with C/Fortran, then of course you do have
> >>> >> > to be constantly aware of the underlying implementation anyway.
> >>> >>
> >>> >> I don't think this bothers me -- I think it's analogous to things in
> >>> >> numpy like Fortran order and non-contiguous arrays -- you can ignore
> >>> >> all that when working in pure python when performance isn't
> >>> >> critical, but you need a deeper understanding if you want to work
> >>> >> with the data in C or Fortran or to tune performance in python.
> >>> >>
> >>> >> So as long as there is an API to query and control how things work,
> >>> >> I like that it's hidden from simple python code.
> >>> >>
> >>> >> -Chris
> >>> >
> >>> > I'm similarly not too concerned about it. Performance seems finicky
> >>> > when you're dealing with missing data, since a lot of arrays will
> >>> > likely have to be copied over to other arrays containing only
> >>> > complete data before being handed over to BLAS.
> >>>
> >>> Unless you know the neutral value for the computation or you just want
> >>> to do a forward_fill in time series, and you have to ask the user not
> >>> to give you an immutable array with NAs if they don't want extra
> >>> copies.
> >>>
> >>> Josef
> >>
> >> Mean value replacement, or more generally single scalar value
> >> replacement, is generally not a good idea. It biases downward your
> >> standard error estimates if you use mean replacement, and it will bias
> >> both if you use anything other than mean replacement. The bias gets
> >> worse with more missing data. So it's worst in precisely the cases
> >> where you'd want to fill in the data the most. (Though I admit I'm not
> >> too familiar with time series, so maybe this doesn't apply. But it's
> >> true as a general principle in statistics.) I'm not sure why we'd want
> >> to make this use case easier.
>
> Another qualification on this (I cannot help it).
> I think this only applies if you use a prefabricated no-missing-values
> algorithm.
> If I write it myself, I can do the proper correction for
> the reduced number of observations. (similar to the case when we
> ignore correlated information and use statistics based on uncorrelated
> observations which also overestimate the amount of information we have
> available.)

Can you do that sort of technique with longitudinal (panel) data? I'm honestly curious because I haven't looked into such corrections before. I haven't been able to find a reference after a few quick Google searches. I don't suppose you know one off the top of your head?
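
The downward bias from mean replacement, and the correction for the reduced number of observations mentioned above, can be seen in the simplest possible case: the standard error of a sample mean. This is only a sketch under stated assumptions (simulated normal data, values missing completely at random, plain Python rather than numpy so it runs anywhere, and a hypothetical helper named standard_error). It says nothing about the longitudinal case.

```python
import math
import random


def standard_error(values, n_effective=None):
    """Standard error of the mean of `values`.

    `n_effective` (an illustrative parameter, not from any library)
    overrides the sample count used in the denominator.
    """
    n = len(values) if n_effective is None else n_effective
    mean = sum(values) / len(values)
    sum_sq = sum((v - mean) ** 2 for v in values)
    return math.sqrt(sum_sq / (n - 1) / n)


rng = random.Random(0)
data = [rng.gauss(10.0, 2.0) for _ in range(1000)]
# Assumption: each value is missing independently with probability 0.3.
is_missing = [rng.random() < 0.3 for _ in data]

observed = [v for v, m in zip(data, is_missing) if not m]
fill_value = sum(observed) / len(observed)
filled = [fill_value if m else v for v, m in zip(data, is_missing)]

# Complete-case estimate: only the observed values.
se_observed = standard_error(observed)
# Naive estimate: treat the mean-filled array as if it were complete data.
se_filled = standard_error(filled)
# Corrected estimate: mean-filled array, but count only real observations.
se_corrected = standard_error(filled, n_effective=len(observed))

print(se_observed, se_filled, se_corrected)
```

Here se_filled comes out smaller than se_observed, because the filled-in values add nothing to the sum of squares but inflate the count. Dividing by the number of values actually observed recovers se_observed.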
And you're right about the last measurement carried forward. I was just thinking about filling in all missing values with the same value.

-Chris Jordan-Squire

PS--Thanks for mentioning the statsmodels discussion. I'd been keeping track of that on a different email account, and I hadn't realized it wasn't forwarding those messages correctly.

> Josef
>
> > We just discussed a use case for pandas on the statsmodels mailing
> > list, minute data of stock quotes (prices), if the quote is NA then
> > fill it with the last price quote. If it would be necessary for memory
> > usage and performance, this can be handled efficiently and with
> > minimal copying.
> >
> > If you want to fill in a missing value without messing up any result
> > statistics, then there is a large literature in statistics on
> > imputations, repeatedly assigning values to a NA from an underlying
> > distribution. scipy/statsmodels doesn't have anything like this (yet)
> > but R and the others have it available, and it looks more popular in
> > bio-statistics.
> >
> > (But similar to what Dag said, for statistical analysis it will be
> > necessary to keep case specific masks and data arrays around. I
> > haven't actually written any missing values algorithm yet, so I'm
> > quiet again.)
> >
> > Josef
> >
> >> -Chris Jordan-Squire
> >>
> >>> > My primary concern is that the np.NA stuff 'just
> >>> > works'. Especially since I've never run into use cases in statistics
> >>> > where the difference between IGNORE and NA mattered.
> >>> >
> >>> >> --
> >>> >> Christopher Barker, Ph.D.
> >>> >> Oceanographer
> >>> >>
> >>> >> Emergency Response Division
> >>> >> NOAA/NOS/OR&R            (206) 526-6959   voice
> >>> >> 7600 Sand Point Way NE   (206) 526-6329   fax
> >>> >> Seattle, WA  98115       (206) 526-6317   main reception
> >>> >>
> >>> >> [email protected]
> >>> >> _______________________________________________
> >>> >> NumPy-Discussion mailing list
> >>> >> [email protected]
> >>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
