Re: [Numpy-discussion] NA/Missing Data Conference Call Summary

josef . pktd Wed, 06 Jul 2011 17:47:31 -0700

On Wed, Jul 6, 2011 at 7:14 PM, Christopher Jordan-Squire
<[email protected]> wrote:
>
>
> On Wed, Jul 6, 2011 at 3:47 PM, <[email protected]> wrote:
>>
>> On Wed, Jul 6, 2011 at 4:38 PM,  <[email protected]> wrote:
>> > On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire
>> > <[email protected]> wrote:
>> >>
>> >>
>> >> On Wed, Jul 6, 2011 at 1:08 PM, <[email protected]> wrote:
>> >>>
>> >>> On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire
>> >>> <[email protected]> wrote:
>> >>> >
>> >>> >
>> >>> > On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker
>> >>> > <[email protected]>
>> >>> > wrote:
>> >>> >>
>> >>> >> Christopher Jordan-Squire wrote:
>> >>> >> > If we follow those rules for IGNORE for all computations, we
>> >>> >> > sometimes
>> >>> >> > get some weird output. For example:
>> >>> >> > [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is
>> >>> >> > matrix
>> >>> >> > multiply and not * with broadcasting.) Or should that sort of
>> >>> >> > operation
>> >>> >> > throw an error?
>> >>> >>
>> >>> >> That should throw an error -- matrix computation is heavily
>> >>> >> influenced
>> >>> >> by the shape and size of matrices, so I think IGNORES really don't
>> >>> >> make
>> >>> >> sense there.
>> >>> >>
>> >>> >>
>> >>> >
>> >>> > If the IGNORES don't make sense in basic numpy computations then I'm
>> >>> > kinda
>> >>> > confused why they'd be included at the numpy core level.
>> >>> >
>> >>> >>
>> >>> >> Nathaniel Smith wrote:
>> >>> >> > It's exactly this transparency that worries Matthew and me -- we
>> >>> >> > feel
>> >>> >> > that the alterNEP preserves it, and the NEP attempts to erase it.
>> >>> >> > In
>> >>> >> > the NEP, there are two totally different underlying data
>> >>> >> > structures,
>> >>> >> > but this difference is blurred at the Python level. The idea is
>> >>> >> > that
>> >>> >> > you shouldn't have to think about which you have, but if you work
>> >>> >> > with
>> >>> >> > C/Fortran, then of course you do have to be constantly aware of
>> >>> >> > the
>> >>> >> > underlying implementation anyway.
>> >>> >>
>> >>> >> I don't think this bothers me -- I think it's analogous to things
>> >>> >> in
>> >>> >> numpy like Fortran order and non-contiguous arrays -- you can
>> >>> >> ignore
>> >>> >> all
>> >>> >> that when working in pure python when performance isn't critical,
>> >>> >> but
>> >>> >> you need a deeper understanding if you want to work with the data
>> >>> >> in C
>> >>> >> or Fortran or to tune performance in python.
>> >>> >>
>> >>> >> So as long as there is an API to query and control how things work,
>> >>> >> I
>> >>> >> like that it's hidden from simple python code.
>> >>> >>
>> >>> >> -Chris
>> >>> >>
>> >>> >>
>> >>> >
>> >>> > I'm similarly not too concerned about it. Performance seems finicky
>> >>> > when
>> >>> > you're dealing with missing data, since a lot of arrays will likely
>> >>> > have
>> >>> > to
>> >>> > be copied over to other arrays containing only complete data before
>> >>> > being
>> >>> > handed over to BLAS.
>> >>>
>> >>> Unless you know the neutral value for the computation or you just want
>> >>> to do a forward_fill in time series, and you have to ask the user not
>> >>> to give you an immutable array with NAs if they don't want extra
>> >>> copies.
>> >>>
>> >>> Josef
>> >>>
>> >>
>> >> Mean value replacement, or more generally single scalar value
>> >> replacement,
>> >> is generally not a good idea. It biases downward your standard error
>> >> estimates if you use mean replacement, and it will bias both if you use
>> >> anything other than mean replacement. The bias gets worse with more
>> >> missing data. So it's worst in precisely the cases where you'd want
>> >> to
>> >> fill in the data the most. (Though I admit I'm not too familiar with
>> >> time
>> >> series, so maybe this doesn't apply. But it's true as a general
>> >> principle in
>> >> statistics.) I'm not sure why we'd want to make this use case easier.
>>
>> Another qualification on this (I cannot help it).
>> I think this only applies if you use a prefabricated no-missing-values
>> algorithm. If I write it myself, I can do the proper correction for
>> the reduced number of observations. (similar to the case when we
>> ignore correlated information and use statistics based on uncorrelated
>> observations which also overestimate the amount of information we have
>> available.)
>>
>
> Can you do that sort of technique with longitudinal (panel) data? I'm
> honestly curious because I haven't looked into such corrections before. I
> haven't been able to find a reference after a few quick google searches. I
> don't suppose you know one off the top of your head?


I was thinking mainly of simple cases where the correction only
requires correctly counting the number of observations in order to
adjust the degrees of freedom. For example, statistical tests that are
based on relatively simple statistics, or ANOVA, which just needs a
correct count of the number of observations by group. (This might
be partially covered by any NA ufunc implementation that does mean,
var and cov correctly, and maybe sorting like the current NaN sort.)
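
As a rough illustration of the counting point, here is a minimal
sketch that uses NaN as a stand-in for NA (the proposed np.NA is not
something this code relies on, and the helper name nan_mean_var is
made up for the example). It computes the mean and variance from the
observed entries only and uses the observed count, not the array
length, for the degrees of freedom.

```python
import numpy as np

def nan_mean_var(x, ddof=1):
    """Count, mean and variance over the observed (non-NaN) entries of x.

    Hypothetical helper: NaN plays the role of NA here.
    """
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    n_obs = int(observed.sum())          # number of observations actually present
    mean = x[observed].sum() / n_obs
    # Degrees of freedom are based on n_obs, not on len(x).
    var = ((x[observed] - mean) ** 2).sum() / (n_obs - ddof)
    return n_obs, mean, var

x = np.array([1.0, np.nan, 3.0, 5.0, np.nan])
n_obs, mean, var = nan_mean_var(x)
```

The same numbers should come out of np.nanmean(x) and
np.nanvar(x, ddof=1), which do the counting internally.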

In the panel data case it might be possible to do this, if it can just
be treated like an unbalanced panel. I guess it depends on the details
of the model.

For regression, one way to remove an observation is to include a dummy
variable for that observation, or use X'X with rows zeroed out. R has
a package for multivariate normal with missing values that allows
calculation of expected values for the missing ones.
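
A small sketch of the regression remark, on simulated data and with a
hypothetical index for the missing observation. It compares three ways
of taking one observation out of an OLS fit: dropping the row, adding a
dummy variable for it, and zeroing out its row so that it contributes
nothing to X'X and X'y. The placeholder value 0.0 for the missing y is
arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=n)

missing = 3                      # hypothetical index of the observation to remove
keep = np.arange(n) != missing
y_filled = y.copy()
y_filled[missing] = 0.0          # arbitrary placeholder for the missing value

# (a) Drop the row (this makes a clean copy of the data).
beta_drop = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]

# (b) Keep all rows, add a dummy variable that is 1 only for that observation.
dummy = (~keep).astype(float)
X_dummy = np.column_stack([X, dummy])
beta_dummy = np.linalg.lstsq(X_dummy, y_filled, rcond=None)[0][:2]

# (c) Zero out the row in X and y, then solve the normal equations.
X_zero = X * keep[:, None]
y_zero = y_filled * keep
beta_zero = np.linalg.solve(X_zero.T @ X_zero, X_zero.T @ y_zero)
```

All three give the same slope and intercept. In (b) the residual
degrees of freedom also match (a), since one extra parameter is spent
on the one extra row.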

But in many of these cases, getting a clean (no-NA) copy of the data
will be simpler to implement.
(Leave-one-out cross validation as an IGNORE problem, instead of slicing?)

Then there are cases where the missingness contains information. If
observations are not randomly missing, then dropping the observations
with missing values will bias the estimation results, and proper
treatment would require separately modeling the fact that data is
missing, e.g. with a first-step binomial model. Another case is
censored observations, e.g. when no measurements below a machine
threshold are observed (maybe a Tobit model), ...

Filling forward might create mass points in the distribution, which
(to be "clean") would also have to be taken into account if they are a
sizable fraction of the data. However, dropping observations (or
outliers) might not be possible with time series data (or time series
panel data) if it screws up the interpretation of equally spaced time
periods. (E.g. electricity or weather forecasts, if your hours or
seasons get shifted all the time.)
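
For concreteness, one way to forward fill a 1-D series in plain numpy,
again with NaN standing in for NA. This is only a sketch, and the
function name forward_fill is made up here. It keeps the array length
(so the time spacing is untouched) and returns a new array.

```python
import numpy as np

def forward_fill(x):
    """Replace each NaN by the most recent observed value before it.

    Leading NaNs have nothing to carry forward and stay NaN.
    """
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    # Position of each observed entry, 0 where the entry is missing ...
    idx = np.where(observed, np.arange(x.size), 0)
    # ... then a running maximum gives the index of the last observed entry.
    idx = np.maximum.accumulate(idx)
    return x[idx]                # fancy indexing returns a copy

quotes = np.array([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])
filled = forward_fill(quotes)
n_filled = int(np.isnan(quotes).sum() - np.isnan(filled).sum())
```

Here n_filled counts how many entries were carried forward, which is
the number one would look at to judge whether the repeated values are
a sizable fraction of the data.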

Then it's starting to get messy, and I haven't looked at any details.
I'm just making up stories at this point. But, I guess, in these cases
one will often end up working with a data array and a mask array.
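
A sketch of what "a data array and a mask array" can look like with
the existing numpy.ma module. The data values and the two masks are
invented for the example. The point is that one data array can be
combined with different case-specific masks without copying or
altering the data.

```python
import numpy as np

data = np.array([2.0, 0.0, 4.0, 60.0, 6.0])
# Hypothetical masks, True means "leave this entry out".
missing = np.array([False, True, False, False, False])   # value not recorded
outlier = np.array([False, False, False, True, False])   # value judged suspicious

# Same data, two different views of which observations count.
m_missing = np.ma.masked_array(data, mask=missing)
m_both = np.ma.masked_array(data, mask=missing | outlier)

n_missing, mean_missing = m_missing.count(), m_missing.mean()
n_both, mean_both = m_both.count(), m_both.mean()

# A clean (no-NA) copy is available when an algorithm needs one.
clean = m_both.compressed()
```

count() gives the number of unmasked entries for each case, so the
degrees of freedom can be adjusted per mask.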

> And you're right about the last measurement carried forward. I was just
> thinking about filling in all missing values with the same value.

If I remember correctly, forward filling also showed up several times
on the mailing lists from scikits.timeseries users.

Josef

> -Chris Jordan-Squire
> PS--Thanks for mentioning the statsmodels discussion. I'd been keeping track
> of that on a different email account, and I hadn't realized it wasn't
> forwarding those messages correctly.
>
>
>>
>> Josef
>>
>>
>> >
>> > We just discussed a use case for pandas on the statsmodels mailing
>> > list: minute data of stock quotes (prices), where a quote that is NA
>> > is filled with the last price quote. If it is necessary for memory
>> > usage and performance, this can be handled efficiently and with
>> > minimal copying.
>> >
>> > If you want to fill in a missing value without messing up any result
>> > statistics, then there is a large literature in statistics on
>> > imputations, repeatedly assigning values to a NA from an underlying
>> > distribution. scipy/statsmodels doesn't have anything like this (yet)
>> > but R and the others have it available, and it looks more popular in
>> > bio-statistics.
>> >
>> > (But similar to what Dag said, for statistical analysis it will be
>> > necessary to keep case specific masks and data arrays around. I
>> > haven't actually written any missing values algorithm yet, so I'm
>> > quiet again.)
>> >
>> > Josef
>> >
>> >> -Chris Jordan-Squire
>> >>
>> >>>
>> >>> > My primary concern is that the np.NA stuff 'just
>> >>> > works'. Especially since I've never run into use cases in statistics
>> >>> > where
>> >>> > the difference between IGNORE and NA mattered.
>> >>> >
>> >>> >
>> >>> >>
>> >>> >>
>> >>> >> --
>> >>> >> Christopher Barker, Ph.D.
>> >>> >> Oceanographer
>> >>> >>
>> >>> >> Emergency Response Division
>> >>> >> NOAA/NOS/OR&R            (206) 526-6959   voice
>> >>> >> 7600 Sand Point Way NE   (206) 526-6329   fax
>> >>> >> Seattle, WA  98115       (206) 526-6317   main reception
>> >>> >>
>> >>> >> [email protected]
>> >>> >> _______________________________________________
>> >>> >> NumPy-Discussion mailing list
>> >>> >> [email protected]
>> >>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >>> >
>> >>> >
>> >
>
>
