I haven't commented on the mailing list yet because of time pressures, although
I have spoken to Mark as often as I can and have encouraged him to pursue his
ideas and discuss them with the community.  The Numeric Python discussion list
has a long history of great dialogue that tries to bring out as many
perspectives as possible as we wrestle with improving the code base.  It is
very encouraging to see that tradition continuing.

Because Enthought was mentioned earlier in this thread, I would like to
clarify a few things about my employer and the company's interest.  Enthought
has been very interested in the development of the NumPy and SciPy stack (and
the broader SciPy community) for some time.  With its limited resources,
Enthought helped significantly to form the SciPy community and continues to
sponsor it as much as it can.  Many developers who work at Enthought
(including me) also have personal interests in the NumPy / SciPy community and
codebase that go beyond Enthought's ability to invest directly.

While Enthought has limited resources to invest directly in pursuing this
goal, it is very interested in improving Python's use as a data analysis
environment.  Because of that interest, Enthought sponsored a "data-array"
summit in May.  There is an inSCIght podcast that summarizes some of the
event, which you can listen to at http://inscight.org/2011/05/18/episode_13/.
The purpose of the event was to bring together a few people who have been
working on different aspects of the problem (particularly the labeled-array,
or data-array, problem).  We also wanted to jump-start the activity of our
interns and make sure that some of the use cases we have seen during the past
several years of client projects were brought to light.

The event was successful in that it generated *a lot* of ideas.  Some of
these ideas were summarized in notes that are linked from this Convore thread:
https://convore.com/python-scientific-computing/data-array-in-numpy/
One of the major ideas that emerged during the discussion is that NumPy needs
to be able to handle missing data in a more integrated way (i.e. there need to
be functions that do the "right" thing in the face of missing data).  One
approach suggested during the discussion was to handle missing data by
introducing special NA dtypes.
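
To make the two semantics under discussion concrete, here is a small sketch
that uses only tools that already exist (np.nan, np.nansum, and numpy.ma);
np.NA and the proposed NA dtypes are not implemented, so this merely
illustrates the behaviors being debated:

    import numpy as np

    # Propagating semantics (R-style NA): one missing value poisons the result.
    a = np.array([1.0, np.nan, 3.0])
    print(a.sum())       # nan -- np.nan propagates through the reduction

    # Ignoring semantics: skip missing values and reduce over the rest.
    print(np.nansum(a))  # 4.0 -- nan entries are skipped

    # numpy.ma implements the ignoring semantics with an explicit mask.
    m = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
    print(m.sum())       # 4.0 -- masked entries are ignored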

Mark is one of two interns that we have this summer who are tasked, at a high
level, with taking what was learned at the summit and implementing critical
pieces as their skills and interests allow.  I have been talking with them
individually to map out specific work targets for the summer.  Christopher
Jordan-Squires is one of our interns; he is pursuing a PhD in Mathematics at
the University of Washington, has a strong interest in statistics, and wants
to make Python as easy to use as R for certain statistical workflows.  Mark
Wiebe is known on this list because of his recent success working on the NumPy
code base.  As a result of that success, Mark is working on the improvements
to NumPy that are seen as most critical to solving some of the problems we
keep seeing in our projects (labeled arrays being one of them).  We are also
very interested in the Pandas project, as it brings a data structure like R's
successful DataFrame to the Python space (and it helps solve some of the
problems our clients are seeing).  It would be good to make sure that the core
functionality Pandas needs is available in NumPy where appropriate.
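
For anyone on the list who has not tried Pandas, here is a minimal sketch of
the kind of R-like workflow it enables (this uses the public pandas API; the
column names are invented for illustration):

    import numpy as np
    import pandas as pd

    # A DataFrame with missing observations, analogous to an R data.frame
    # that contains NA values.
    df = pd.DataFrame({'height': [1.5, np.nan, 1.7],
                       'weight': [60.0, 72.0, np.nan]})

    # Pandas reductions skip missing values by default, much like R with
    # na.rm=TRUE.
    print(df.mean())    # per-column means with NaN entries ignored
    print(df.dropna())  # rows containing any missing value removed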

The date-time work that Mark did was the first piece of low-hanging fruit
that needed to be finished.  The second project Mark is involved with is
creating an approach for missing data in NumPy.  I suggested the missing-data
dtypes, in part because Mark had expressed some concerns about the way dtypes
are handled in NumPy, and I would love for the user-defined data-type
mechanism and the whole data-type infrastructure to be improved as needed.
Mark spent some time thinking about it, felt more comfortable with the
masked-array solution, and that is where we are now.
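
As a rough sketch of the difference between the two designs (conceptual only;
this is not either proposal's actual API): an NA dtype would encode
missingness inside each element's bit pattern, while the masked-array approach
keeps a separate boolean mask alongside an ordinary array, which is what
numpy.ma already does:

    import numpy as np

    # Masked-array approach: the payload and its validity travel as two
    # separate arrays.
    m = np.ma.array([2.0, 4.0], mask=[False, True])
    print(m.data)  # [ 2.  4.] -- the underlying values are still there
    print(m.mask)  # [False  True] -- missingness lives in a side array

    # Dtype-style approach (conceptual): missingness is a sentinel stored in
    # the value itself; NaN is the closest existing analogue for floats.
    a = np.array([2.0, np.nan])
    print(np.isnan(a))  # [False  True] -- no side array needed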

Enthought's main interest remains in seeing how much of the data array can
and should be moved into low-level NumPy, as well as in implementing
functionality (wherever it may live) that makes data analysis easier and more
productive in Python.  Again, though, this is something Enthought as a company
can only invest limited resources in, and we want to make sure that the time
we sponsor Mark for is spent on work that is seen as valuable by the community
and, more importantly, matches our own internal needs.

I will post a follow-on message that provides my current views on the subject 
of missing data and masked arrays.    

-Travis



On Jun 25, 2011, at 2:09 PM, Benjamin Root wrote:

> 
> 
> On Sat, Jun 25, 2011 at 1:57 PM, Nathaniel Smith <n...@pobox.com> wrote:
> On Sat, Jun 25, 2011 at 11:50 AM, Eric Firing <efir...@hawaii.edu> wrote:
> > On 06/25/2011 07:05 AM, Nathaniel Smith wrote:
> >> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett<matthew.br...@gmail.com>  
> >> wrote:
> >>> To clarify, you're proposing for:
> >>>
> >>> a = np.sum(np.array([np.NA, np.NA]))
> >>>
> >>> 1) ->  np.NA
> >>> 2) ->  0.0
> >>
> >> Yes -- and in R you actually do get NA, while in numpy.ma you
> >> actually do get 0. I don't think this is a coincidence; I think it's
> >
> > No, you don't:
> >
> > In [2]: np.ma.array([2, 4], mask=[True, True]).sum()
> > Out[2]: masked
> >
> > In [4]: np.sum(np.ma.array([2, 4], mask=[True, True]))
> > Out[4]: masked
> 
> Huh. So in numpy.ma, sum([10, NA]) and sum([10]) are the same, but
> sum([NA]) and sum([]) are different? Sounds to me like you should file
> a bug on numpy.ma...
> 
> Actually, no... I should have tested this before replying earlier:
> 
> >>> a = np.ma.array([2, 4], mask=[True, True])
> >>> a
> masked_array(data = [-- --],
>              mask = [ True  True],
>        fill_value = 999999)
> 
> >>> a.sum()
> masked
> >>> a = np.ma.array([], mask=[])
> >>> a
> masked_array(data = [],
>              mask = [],
>        fill_value = 1e+20)
> >>> a.sum()
> masked
> 
> They are the same.
> 
> 
> Anyway, the general point is that in R, NAs propagate, and in
> numpy.ma, masked values are ignored (except, apparently, if all values
> are masked). Here, I actually checked these:
> 
> Python: np.ma.array([2, 4], mask=[True, False]).sum() -> 4
> R: sum(c(NA, 4)) -> NA
> 
> 
> If you want NaN behavior, then use NaNs.  If you want masked behavior, then 
> use masks.
> 
> Ben Root
> 

---
Travis Oliphant
Enthought, Inc.
oliph...@enthought.com
1-512-536-1057
http://www.enthought.com



