On Tue, Jul 15, 2014 at 7:40 PM, Aldcroft, Thomas
<aldcr...@head.cfa.harvard.edu> wrote:
>
> On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>
>> OTOH, fixed length nul padded latin1 would be useful for various flat file
>> reading tasks.
>
> As one of the original agitators for this, let me re-iterate that what the
> astronomical community *really* wants is the original proposal as described
> by Chris Barker [1] and essentially what Charles said.  We have large data
> archives that have ASCII string data in binary formats like FITS and HDF5.
> The current readers for those datasets present users with numpy S data
> types, which in Python 3 cannot be compared to str (unicode) literals.  In
> many cases those datasets are large, and in my case I regularly deal with
> multi-Gb sized bytestring arrays.  Converting those to a U dtype is not
> practical.

This feedback is *super* useful, thanks. Can you elaborate a bit
more on your requirements?

I get that:
- You have data that is treated as text, so it is convenient to be
able to use Python strings for things like equality tests, np.sum(arr
== "green") etc.
- Your data uses only ASCII characters, and you don't want to spend
more than 1 byte of memory per character.
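
To make those two points concrete, here is a minimal sketch, assuming a tiny made-up array (the names `raw` and `as_text` are only illustrative). It compares an S array against a bytes literal, which works, and shows the per-element cost of converting to a U dtype:

```python
import numpy as np

# Hypothetical stand-in for a column read from FITS/HDF5: ASCII bytes, dtype S5.
raw = np.array([b"green", b"red", b"green"])

# Comparing against a bytes literal works on an S array.
n_green_bytes = int(np.sum(raw == b"green"))

# Converting to a U dtype allows comparison with str literals,
# but each character is stored in 4 bytes instead of 1.
as_text = raw.astype("U5")
n_green_text = int(np.sum(as_text == "green"))

s_itemsize = raw.dtype.itemsize      # bytes per element for S5
u_itemsize = as_text.dtype.itemsize  # bytes per element for U5
print(n_green_bytes, n_green_text, s_itemsize, u_itemsize)
```

For a multi-gigabyte S array, that fourfold increase is the reason converting to U is impractical.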

Do you ever have 8 bit characters, and if so, what encoding do you use?

Does it matter to you that the memory layout for these 1-byte-per-char
strings remain fixed-width nul-padded concatenated strings (e.g.,
because you are mmap'ing files that have this format)? Or do FITS/HDF5
handle layout details internally and you don't care so long as the
above requirements are met?
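
By "fixed-width nul-padded concatenated strings" I mean a layout like the one in this rough sketch, where numpy views an existing buffer in place without copying. The buffer contents and the 8-byte width are made up for illustration:

```python
import numpy as np

# Three records, each padded with nul bytes to exactly 8 bytes (assumed width).
buf = (b"green\x00\x00\x00"
       b"red\x00\x00\x00\x00\x00"
       b"blue\x00\x00\x00\x00")

# Interpret the buffer directly as an array of 8-byte strings; no copy is made.
arr = np.frombuffer(buf, dtype="S8")

values = arr.tolist()           # trailing nul padding is stripped on access
total_bytes = arr.nbytes        # 3 records * 8 bytes
owns_data = arr.flags.owndata   # False: the array is a view onto buf
print(values, total_bytes, owns_data)
```

The same idea would apply to a memory-mapped file via `np.memmap` with an S dtype, provided the file really uses this layout.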

Does the fixed-width nature of numpy strings cause problems in the
above setting?

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org