Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

Christopher Jordan-Squire Wed, 06 Jul 2011 16:14:37 -0700

On Wed, Jul 6, 2011 at 3:47 PM, <[email protected]> wrote:

> On Wed, Jul 6, 2011 at 4:38 PM,  <[email protected]> wrote:
> > On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire
> > <[email protected]> wrote:
> >>
> >>
> >> On Wed, Jul 6, 2011 at 1:08 PM, <[email protected]> wrote:
> >>>
> >>> On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire
> >>> <[email protected]> wrote:
> >>> >
> >>> >
> >>> > On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker
> >>> > <[email protected]>
> >>> > wrote:
> >>> >>
> >>> >> Christopher Jordan-Squire wrote:
> >>> >> > If we follow those rules for IGNORE for all computations, we
> >>> >> > sometimes get some weird output. For example:
> >>> >> > [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
> >>> >> > multiply and not * with broadcasting.) Or should that sort of
> >>> >> > operation throw an error?
> >>> >>
> >>> >> That should throw an error -- matrix computation is heavily
> >>> >> influenced by the shape and size of matrices, so I think IGNORES
> >>> >> really don't make sense there.
> >>> >>
> >>> >>
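To spell out where [ 15, 31 ] comes from, here is a minimal sketch in plain Python. It assumes the rule in question is that any binary operation with IGNORE returns the other operand. The IGNORE sentinel and the mul, add, and matvec helpers are made up for illustration; they are not part of numpy or of either proposal.

```python
class _Ignore:
    """Hypothetical sentinel standing in for IGNORE; not a NumPy object."""

    def __repr__(self):
        return "IGNORE"


IGNORE = _Ignore()


def mul(a, b):
    # Assumed rule: a binary op with IGNORE returns the other operand.
    if a is IGNORE:
        return b
    if b is IGNORE:
        return a
    return a * b


def add(a, b):
    # Same assumed rule for addition.
    if a is IGNORE:
        return b
    if b is IGNORE:
        return a
    return a + b


def matvec(matrix, vector):
    """Matrix-vector product built only from the mul/add rules above."""
    result = []
    for row in matrix:
        total = IGNORE  # empty running sum
        for a, b in zip(row, vector):
            total = add(total, mul(a, b))
        result.append(total)
    return result


print(matvec([[1, 2], [3, 4]], [IGNORE, 7]))  # [15, 31]
print(matvec([[1, 2], [3, 4]], [0, 7]))       # [14, 28]
```

Under that assumption the IGNORE entry does not drop its term from each row's sum. It behaves like a 1 in the product, which is why the result differs from the [14, 28] you get when that entry contributes nothing.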
> >>> >
> >>> > If the IGNORES don't make sense in basic numpy computations then I'm
> >>> > kinda confused why they'd be included at the numpy core level.
> >>> >
> >>> >>
> >>> >> Nathaniel Smith wrote:
> >>> >> > It's exactly this transparency that worries Matthew and me -- we
> >>> >> > feel that the alterNEP preserves it, and the NEP attempts to erase
> >>> >> > it. In the NEP, there are two totally different underlying data
> >>> >> > structures, but this difference is blurred at the Python level. The
> >>> >> > idea is that you shouldn't have to think about which you have, but
> >>> >> > if you work with C/Fortran, then of course you do have to be
> >>> >> > constantly aware of the underlying implementation anyway.
> >>> >>
> >>> >> I don't think this bothers me -- I think it's analogous to things in
> >>> >> numpy like Fortran order and non-contiguous arrays -- you can ignore
> >>> >> all that when working in pure python when performance isn't
> >>> >> critical, but you need a deeper understanding if you want to work
> >>> >> with the data in C or Fortran or to tune performance in python.
> >>> >>
> >>> >> So as long as there is an API to query and control how things work,
> >>> >> I like that it's hidden from simple python code.
> >>> >>
> >>> >> -Chris
> >>> >>
> >>> >>
> >>> >
> >>> > I'm similarly not too concerned about it. Performance seems finicky
> >>> > when you're dealing with missing data, since a lot of arrays will
> >>> > likely have to be copied over to other arrays containing only
> >>> > complete data before being handed over to BLAS.
> >>>
> >>> Unless you know the neutral value for the computation or you just want
> >>> to do a forward_fill in time series, and you have to ask the user not
> >>> to give you an immutable array with NAs if they don't want extra
> >>> copies.
> >>>
> >>> Josef
> >>>
> >>
> >> Mean value replacement, or more generally single scalar value
> >> replacement, is generally not a good idea. It biases downward your
> >> standard error estimates if you use mean replacement, and it will bias
> >> both if you use anything other than mean replacement. The bias gets
> >> worse with more missing data. So it's worst in precisely the cases
> >> where you'd want to fill in the data the most. (Though I admit I'm not
> >> too familiar with time series, so maybe this doesn't apply. But it's
> >> true as a general principle in statistics.) I'm not sure why we'd want
> >> to make this use case easier.
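A small illustration of the standard-error point, as a sketch only. The numbers are made up, None stands in for a missing value, and standard_error is a helper defined here, not a numpy or statsmodels function.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample std / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Made-up measurements; None marks a missing value.
data = [2.0, None, 4.0, 7.0, None, 5.0, None, 6.0]

# Complete-case analysis: use only the observed values.
observed = [x for x in data if x is not None]

# Mean replacement: fill every gap with the observed mean.
fill = statistics.mean(observed)
mean_filled = [fill if x is None else x for x in data]

se_observed = standard_error(observed)
se_filled = standard_error(mean_filled)

print("n observed:", len(observed), "SE:", se_observed)
print("n filled:  ", len(mean_filled), "SE:", se_filled)
print("SE shrinks after mean filling:", se_filled < se_observed)
```

The filled values sit exactly at the mean, so they add nothing to the sum of squared deviations while n goes up. An off-the-shelf routine that treats the filled array as complete data therefore reports a smaller standard error than the observed data supports.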
>
> Another qualification on this (I cannot help it).
> I think this only applies if you use a prefabricated no-missing-values
> algorithm. If I write it myself, I can do the proper correction for
> the reduced number of observations. (similar to the case when we
> ignore correlated information and use statistics based on uncorrelated
> observations which also overestimate the amount of information we have
> available.)
>
>
Can you do that sort of technique with longitudinal (panel) data? I'm
honestly curious because I haven't looked into such corrections before. I
haven't been able to find a reference after a few quick google searches. I
don't suppose you know one off the top of your head?


And you're right about the last measurement carried forward. I was just
thinking about filling in all missing values with the same value.
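
To make the distinction concrete, here is a rough sketch in plain Python with made-up quotes. None stands in for NA, and forward_fill and constant_fill are hypothetical helpers written for this example, not pandas or numpy functions.

```python
def forward_fill(values):
    """Replace each None with the most recent non-missing value.

    Leading missing values stay None because there is nothing to carry
    forward yet.
    """
    filled = []
    last = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def constant_fill(values, fill_value):
    """Replace every None with the same scalar."""
    return [fill_value if v is None else v for v in values]


quotes = [10.0, None, None, 10.5, None, 11.0]

print(forward_fill(quotes))         # [10.0, 10.0, 10.0, 10.5, 10.5, 11.0]
print(constant_fill(quotes, 10.5))  # [10.0, 10.5, 10.5, 10.5, 10.5, 11.0]
```

Carrying the last value forward gives each gap its own fill value depending on where it falls in the series, whereas the scalar fill is what my comment above was about.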

-Chris Jordan-Squire

PS--Thanks for mentioning the statsmodels discussion. I'd been keeping track
of that on a different email account, and I hadn't realized it wasn't
forwarding those messages correctly.




> Josef
>
>
>
> > We just discussed a use case for pandas on the statsmodels mailing
> > list: minute data of stock quotes (prices), where if the quote is NA
> > you fill it with the last price quote. If it is necessary for memory
> > usage and performance, this can be handled efficiently and with
> > minimal copying.
> >
> > If you want to fill in a missing value without messing up any result
> > statistics, then there is a large literature in statistics on
> > imputations, repeatedly assigning values to a NA from an underlying
> > distribution. scipy/statsmodels doesn't have anything like this (yet)
> > but R and the others have it available, and it looks more popular in
> > bio-statistics.
> >
> > (But similar to what Dag said, for statistical analysis it will be
> > necessary to keep case specific masks and data arrays around. I
> > haven't actually written any missing values algorithm yet, so I'm
> > quiet again.)
> >
> > Josef
> >
> >> -Chris Jordan-Squire
> >>
> >>>
> >>> > My primary concern is that the np.NA stuff 'just works'. Especially
> >>> > since I've never run into use cases in statistics where the
> >>> > difference between IGNORE and NA mattered.
> >>> >
> >>> >
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Christopher Barker, Ph.D.
> >>> >> Oceanographer
> >>> >>
> >>> >> Emergency Response Division
> >>> >> NOAA/NOS/OR&R            (206) 526-6959   voice
> >>> >> 7600 Sand Point Way NE   (206) 526-6329   fax
> >>> >> Seattle, WA  98115       (206) 526-6317   main reception
> >>> >>
> >>> >> [email protected]
> >>> >> _______________________________________________
> >>> >> NumPy-Discussion mailing list
> >>> >> [email protected]
> >>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion

