[Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary

Matthew Brett Wed, 06 Jul 2011 05:46:14 -0700

Hi,

Sorry, I hope you don't mind, I moved this to its own thread, trying
to separate comments on the NA debate from the discussion yesterday.


On Wed, Jul 6, 2011 at 1:27 PM, Dag Sverre Seljebotn
<[email protected]> wrote:
> On 07/06/2011 02:05 PM, Matthew Brett wrote:
>> Hi,
>>
>> Just for reference, I am using this as the latest version of the NEP -
>> I hope it's current:
>>
>> https://github.com/m-paradox/numpy/blob/7b10c9ab1616b9100e98dd2ab80cef639d5b5735/doc/neps/missing-data.rst
>>
>> I'm mostly relaying stuff I said, although generally (please do
>> correct me if I am wrong) I am just re-expressing points that
>> Nathaniel has already made in the alterNEP text and the emails.
>>
>> On Wed, Jul 6, 2011 at 12:46 AM, Christopher Jordan-Squire
>> <[email protected]>  wrote:
>> ...
>>> Since Mark is only around Austin until early August, there's
>>> also broad agreement that we need to get something done quickly.
>>
>> I think I might have missed that part of the discussion :)
>>
>> I feel the need to emphasize the centrality of the assertion by
>> Nathaniel, and agreement by (at least) me, that the NA case (there
>> really is no data) and the IGNORE case (there is data but I'm
>> concealing it from you) are conceptually different, and come from
>> different use-cases.
>>
>> The underlying disagreement returned many times to this fundamental
>> difference between the NEP and alterNEP:
>>
>> In the NEP - by design - it is impossible to distinguish between na.NA
>> and na.IGNORE
>> The alterNEP insists you should be able to distinguish.
>>
>> Mark says something like "it's all missing data, there's no reason you
>> should want to distinguish".  Nathaniel and I were saying "the two
>> types of missing do have different use-cases, and it should be
>> possible to distinguish.  You might want to choose to treat them the
>> same, but you should be able to see what they are."
>>
>> I returned several times to this (original point by Nathaniel):
>>
>> a[3] = np.NA
>>
>> (what does this mean?   I am altering the underlying array, or a mask?
>>    How would I explain this to someone?)
>>
>> We confirmed that, in order to make it difficult to know what your NA
>> is (masked or bit-pattern), Mark has to a) hinder access to the data
>> below the mask and b) prevent direct API access to the masking array.
>> I described this as 'hobbling the API' and Mark thought of it as
>> 'generic programming' (missing is always missing).
>
> Here's an HPC perspective...:
>
> If you, say, want to off-load array processing with a mask to some code
> running on a GPU, you really can't have the GPU go through some NumPy
> API. Or if you want to implement a masked array on a cluster with MPI,
> you similarly really, really want raw access.
>
> At least I feel that the transparency of NumPy is a huge part of its
> current success. Many more than me spend half their time in C/Fortran
> and half their time in Python.
>
> I tend to look at NumPy this way: Assuming you have some data in memory
> (possibly loaded by a C or Fortran library). (Almost) no matter how it
> is allocated, ordered, packed, aligned -- there's a way to find strides
> and dtypes to put a nice NumPy wrapper around it and use the memory from
> Python.
>
> So, my view on Mark's NEP was: With a reasonable amount of flexibility
> in how you decided to implement masking for your data, you can create a
> NumPy wrapper that will understand that. Whether your Fortran library
> exposes NAs in its 40GB buffer as bit patterns, or using a separate
> mask, both will work.
>
> And IMO Mark's NEP comes rather close to this; you just need an
> additional NEP later to give raw access to the implementation details,
> once those are settled :-)

I was a little puzzled as to what you were trying to say, but I
suspect that's my ignorance about Numpy internals.

Superficially, I would have assumed that making masked and bit-pattern
NAs behave the same in numpy would take you away from the raw data, in
the sense that you need not only the dtype but also the mask machinery
in order to know whether you have an NA.  Later I realized that you
probably weren't saying that.  So, just to relieve my unhappy
ignorance - how does the HPC perspective relate to the debate about
"can / can't distinguish NA from ignore"?
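
For concreteness, here is a minimal sketch of what I had in mind. It
uses only things that exist in released numpy: np.nan as a stand-in for
a bit-pattern NA and np.ma as a stand-in for a masked NA. Neither is
the np.NA proposed in the NEP, and the variable names are made up for
illustration.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]

# Bit-pattern style: the missing marker is written into the data buffer,
# so the original 4.0 is overwritten.
bit = np.array(values)
bit[3] = np.nan

# Mask style: a separate boolean array flags the element; the data
# buffer underneath still holds 4.0.
masked = np.ma.array(values)
masked[3] = np.ma.masked

# What external code (C, Fortran, a GPU kernel) would see if handed
# only the raw bytes plus the dtype:
raw_bit = np.frombuffer(bit.tobytes(), dtype=np.float64)
raw_masked = np.frombuffer(masked.data.tobytes(), dtype=np.float64)

bit_missing = np.isnan(raw_bit)        # missing element is visible in the data
masked_missing = np.isnan(raw_masked)  # nothing is visible without the mask
mask = np.ma.getmaskarray(masked)      # the extra array you would also need
```

With the NaN version the raw buffer alone tells you which element is
missing; with the masked version you have to hand over the mask as
well, which is the "mask machinery" I was referring to above.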

Sorry, thanks,

Matthew