Re: [Numpy-discussion] using the same vocabulary for missing value ideas

Matthew Brett Wed, 06 Jul 2011 17:02:00 -0700

Hi,

On Wed, Jul 6, 2011 at 7:10 PM, Christopher Jordan-Squire
<[email protected]> wrote:
>
>
> On Wed, Jul 6, 2011 at 10:44 AM, Matthew Brett <[email protected]>
> wrote:
>>
>> Hi,
>>
>> On Wed, Jul 6, 2011 at 6:11 PM, Benjamin Root <[email protected]> wrote:
>> >
>> >
>> > On Wed, Jul 6, 2011 at 12:01 PM, Matthew Brett <[email protected]>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> On Wed, Jul 6, 2011 at 5:48 PM, Peter
>> >> <[email protected]> wrote:
>> >> > On Wed, Jul 6, 2011 at 5:38 PM, Matthew Brett
>> >> > <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> On Wed, Jul 6, 2011 at 4:40 PM, Mark Wiebe <[email protected]>
>> >> >> wrote:
>> >> >>> It appears to me that one of the biggest reasons some of us have
>> >> >>> been talking past each other in the discussions is that different
>> >> >>> people have different definitions for the terms being used. Until
>> >> >>> this is thoroughly cleared up, I feel the design process is tilting
>> >> >>> at windmills.
>> >> >>>
>> >> >>> In the interests of clarity in our discussions, here is a starting
>> >> >>> point which is consistent with the NEP. These definitions have been
>> >> >>> added in a glossary within the NEP. If there are any ideas for
>> >> >>> amendments to these definitions that we can agree on, I will update
>> >> >>> the NEP with those amendments. Also, if I missed any important terms
>> >> >>> which need to be added, please propose definitions for them.
>> >> >>>
>> >> >>> NA (Not Available)
>> >> >>>     A placeholder for a value which is unknown to computations. That
>> >> >>>     value may be temporarily hidden with a mask, may have been lost
>> >> >>>     due to hard drive corruption, or gone for any number of reasons.
>> >> >>>     This is the same as NA in the R project.
>> >> >>
>> >> >> Really?  Can one implement NA with a mask in R?  I thought an NA was
>> >> >> always bitpattern in R?
>> >> >
>> >> > I don't think that was what Mark was saying, see this bit later in
>> >> > this email:
>> >>
>> >> I think it would make a difference if there was an implementation that
>> >> had conflated masking with bitpatterns in terms of API.  I don't think
>> >> R is an example.
>> >>
>> >
>> > Of course R is not an example of that.  Nothing is.  This is merely
>> > conceptual.  Separate NA from np.NA in Mark's NEP, and you will see his
>> > point.  Consider it the logical intersection of NA in Mark's NEP and the
>> > aNEP.
>>
>> I am trying to work out what you feel the points of discussion are.
>> There's surely no point in continuing to debate things we agree on.
>>
>> I don't think anyone disputes (or has ever disputed) that:
>>
>> There can be missing data implemented with bitpatterns
>> There can be missing data implemented with masks
>> Missing data can have propagate semantics
>> Missing data can have ignore semantics.
>> The implementation does not in itself constrain the semantics.
>>
> So, to be clear, is your concern that you want to be able to tell the
> difference between whether an np.NA comes from the bit pattern or the mask
> in its implementation? But why would you have both the parameterized dtype
> and the mask implementation at the same time? They implement the same
> abstraction.


In Mark's mind they implement the same abstraction.  In my mind, and
Nathaniel's, and I think Pierre's, and others', they are not the same
abstraction.  You can treat them the same if you want, even by
default, but they are two different ideas, with two different
implementations.

A bitpattern NA value is absolutely, completely missing.  It's a value
that says 'missing'.
A masked-out value is temporarily or provisionally missing.   When you
take away the mask, the previous value is there.  These are two
different things.  They are each very easy to explain.

> Is your desire that the np.NA's are implemented solely through bit patterns
> and np.IGNORE is implemented solely through masks? So that you can think of
> the masks as being IGNORE flags? What if you want multiple types of IGNORE?
> (To ignore certain values because they're outliers, others because the data
> wouldn't make sense, and others because you're just focusing on a particular
> subgroup, for instance.)

Forgive me, I have been at dinner and had several glasses of wine.
So, what I'm about to say might be dumber than usual.  With that
rider:

I agree with Mark, we should avoid np.IGNORE because it conflates
ignore semantics with the masking implementation.
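
A minimal sketch of why the semantics and the implementation are
separate choices.  Assumptions: np.NA exists only as a proposal in the
NEP, so NaN stands in for a bitpattern NA here, and the 'propagate'
rule applied to the masked array is a hypothetical one written out by
hand, not something numpy.ma does by default:

```python
import numpy as np

# Bitpattern-style storage (NaN as a stand-in for the proposed NA).
b = np.array([1.0, np.nan, 3.0])
bit_propagate = np.sum(b)      # propagate semantics: result is NaN
bit_ignore = np.nansum(b)      # ignore semantics: NaN is skipped

# Mask-style storage of the same situation.
m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
mask_ignore = m.sum()          # ignore semantics: masked element is skipped
# Hypothetical propagate rule: any masked element makes the result missing.
mask_propagate = np.ma.masked if np.ma.is_masked(m) else m.sum()
```

Either storage scheme can carry either behaviour, which is why a name
that ties 'ignore' to masks is misleading.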

The idea of several different missings seems to me orthogonal.  There
can be different missings with bitpatterns and different missings with
masks.

My fundamental point, that I accept I am not getting across with much
success, is the following:

In general, as Dag has pointed out elsewhere, numpy is close to the
metal - you can almost feel the C array underneath the python numpy
object.  This is its strength.  It doesn't try to hide the C array from
you, it gives you the whole machinery, open kimono.

I can see an open kimono way of dealing with missing values.  There's
the bitpattern way.  If I do a[3] = np.NA, what I mean is 'store an NA
in the array memory'.  Exactly the same as when I do a[3] = 2, I mean
'store a 2 in the array memory'.   It's obvious and transparent, easy
to explain.
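
A minimal sketch of the bitpattern idea.  Assumption: np.NA is only
proposed in the NEP, so NaN stands in for it here; the point is just
that the assignment writes a special bit pattern into the array memory
and the old value is gone:

```python
import numpy as np

a = np.arange(5, dtype=np.float64)   # [0., 1., 2., 3., 4.]
a[3] = 2.0       # stores the bits of 2.0 in the array memory
a[3] = np.nan    # stores a NaN bit pattern in the same 8 bytes

# View the same buffer as raw 64-bit unsigned integers.
bits = a.view(np.uint64)
stored_is_missing = bool(np.isnan(a[3]))
# The 2.0 written earlier has been overwritten; nothing is hidden underneath.
two_bits = np.array([2.0]).view(np.uint64)[0]
old_value_gone = bool(bits[3] != two_bits)
```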

I can see an open kimono way of doing masking.   I make a masked
array.  The masked array has a 'mask'.   I can set the mask values to
"True" or "False".  I can get the array from underneath the mask.
It's obvious and transparent, easy to explain.
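
A minimal sketch of the masking idea, using the existing numpy.ma
module rather than the NEP's proposed machinery (the array contents are
made up for illustration).  In numpy.ma a mask value of True means
'masked out':

```python
import numpy as np

data = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
m = np.ma.masked_array(data, mask=[False, False, False, True, False])

masked_mean = m.mean()     # element 3 is left out of the computation
underlying = m.data[3]     # the value under the mask is still there

m.mask[3] = False          # take the mask away
unmasked_mean = m.mean()   # element 3 takes part again
```

The value at index 3 was never destroyed, only hidden, which is the
difference from the bitpattern case.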

I can see that you might want, for practical purposes, to treat these
two 'missing' signals as being equivalent.    I can even see that
you might not expose machinery to distinguish between them.  But, it
seems ugly and confusing to me, and to others, to try and make the
bitpattern and the masked missing value appear to be exactly the same.
If I do this:

a[3] = np.NA

I want an NA in a[3].  I don't want you to make it look as if there's
an NA in a[3], I want there to be an NA in a[3].    I want to know
what I did.

So, maybe I want to 'mask' a[3].  Well then I make a masked array, and then I do

a.mask[3] = False # or True.

It's obvious.  It's explicit.  It does what I want.  I can feel the C
array and the mask array underneath.  I know what I did.

On the other hand, trying to conceal these implementation
differences seems to me to break my feeling for numpy arrays, and
makes me feel I have an object that is rather magic, that I don't fully
understand, and for which clever stuff is going on, under the hood,
that I worry about but have to trust.

I think this is not the numpy way.   I think I fully understand why
it's attractive, but I continue to think that it's a mistake, and one
that may take some time to become clear. It will become clear only
after a few years of trying to teach people, and noticing that when
they get to this stuff, they start switching off, and getting a bit
confused, and concluding it's all too hard for them.

I can see that we're starting to go round in circles again, and that
writing when drunk is unlikely to help that, so at this point, I will
drop out of the conversation and let y'all get on with it.

Thanks for the substantial question by the way, it was helpful,

Cheers,

Matthew
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
