[Numpy-discussion] Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Hi all,

Take Python's builtin `int` type, for example: it can be as large as memory allows.

np.ndarray, on the other hand, is optimized for vectorization via strides, memory 
layout and many other things that I probably don't know about. The point is that it 
is convenient and efficient to use for many things in comparison to Python's 
built-in list of integers.

So, I am wondering whether something in between exists? (And obviously something 
more clever than np.array(dtype=object).)

Probably something similar to `StringDType`, but for integers and floats. (That's 
just a guess; I don't know anything about `StringDType`, but I assume it must be 
better than np.array(dtype=object) combined with np.vectorize.)
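
As a small illustration of the np.array(dtype=object) baseline mentioned above (each slot in the buffer holds just a pointer to a full Python int on the heap):

import numpy as np

a = np.array([10, 2**200, 7], dtype=object)
print(a + 1)       # works, but loops over Python objects element by element
print(a.itemsize)  # 8 on 64-bit builds: the buffer stores pointers, not the digits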

Regards,
dgpb

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Nathan
It is possible to do this using the new DType system.

Sebastian wrote a sketch for a DType backed by the GNU multiprecision float
library:
https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype

Storing data outside the array buffer adds a significant amount of complexity and
introduces the possibility of use-after-free and dangling-reference errors that
are impossible if the array does not use embedded references; that's the main
reason it hasn't been done much.

On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis  wrote:

> Hi all,
>
> Say python’s builtin `int` type. It can be as large as memory allows.
>
> np.ndarray on the other hand is optimized for vectorization via strides,
> memory structure and many things that I probably don’t know. Well the point
> is that it is convenient and efficient to use for many things in comparison
> to python’s built-in list of integers.
>
> So, I am thinking whether something in between exists? (And obviously
> something more clever than np.array(dtype=object))
>
> Probably something similar to `StringDType`, but for integers and floats.
> (It’s just my guess. I don’t know anything about `StringDType`, but just
> guessing it must be better than np.array(dtype=object) in combination
> with np.vectorize)
>
> Regards,
> dgpb
>
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: nathan12...@gmail.com
>
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Thank you for this.

I am just starting to think about these things, so I appreciate your patience.

But isn't it still true that all elements of an array are of the same size in 
memory?

I am thinking along the lines of per-element dynamic memory management, such 
that if I had the array [0, 1e1], the first element would default to a reasonably 
small size in memory.

> On 13 Mar 2024, at 16:29, Nathan  wrote:
> 
> It is possible to do this using the new DType system. 
> 
> Sebastian wrote a sketch for a DType backed by the GNU multiprecision float 
> library: 
> https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype 
> 
> 
> It adds a significant amount of complexity to store data outside the array 
> buffer and introduces the possibility of use-after-free and dangling 
> reference errors that are impossible if the array does not use embedded 
> references, so that’s the main reason it hasn’t been done much.
> 
> On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis  > wrote:
> Hi all,
> 
> Say python’s builtin `int` type. It can be as large as memory allows.
> 
> np.ndarray on the other hand is optimized for vectorization via strides, 
> memory structure and many things that I probably don’t know. Well the point 
> is that it is convenient and efficient to use for many things in comparison 
> to python’s built-in list of integers.
> 
> So, I am thinking whether something in between exists? (And obviously 
> something more clever than np.array(dtype=object))
> 
> Probably something similar to `StringDType`, but for integers and floats. 
> (It’s just my guess. I don’t know anything about `StringDType`, but just 
> guessing it must be better than np.array(dtype=object) in combination with 
> np.vectorize)
> 
> Regards,
> dgpb
> 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org 
> 
> To unsubscribe send an email to numpy-discussion-le...@python.org 
> 
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
> 
> Member address: nathan12...@gmail.com 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Nathan
Yes, an array of references still has a fixed-size slot in the array
buffer. You can think of each entry in the array as a pointer to some other
memory on the heap, which can be a dynamic memory allocation.

There's no way in NumPy to support variable-sized array elements in the
array buffer, since the fixed-itemsize assumption is key to how NumPy implements
strided ufuncs and broadcasting.
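
For concreteness, the strided addressing that relies on every element occupying exactly itemsize bytes:

import numpy as np

a = np.arange(12, dtype=np.uint16).reshape(3, 4)
print(a.itemsize, a.strides)  # 2 (8, 2): element (i, j) lives at byte offset 8*i + 2*j
# That offset arithmetic only works because every element is exactly itemsize bytes wide.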

On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis  wrote:

> Thank you for this.
>
> I am just starting to think about these things, so I appreciate your
> patience.
>
> But isn’t it still true that all elements of an array are still of the
> same size in memory?
>
> I am thinking along the lines of per-element dynamic memory management.
> Such that if I had array [0, 1e1], the first element would default to
> reasonably small size in memory.
>
> On 13 Mar 2024, at 16:29, Nathan  wrote:
>
> It is possible to do this using the new DType system.
>
> Sebastian wrote a sketch for a DType backed by the GNU multiprecision
> float library:
> https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype
>
> It adds a significant amount of complexity to store data outside the array
> buffer and introduces the possibility of use-after-free and dangling
> reference errors that are impossible if the array does not use embedded
> references, so that’s the main reason it hasn’t been done much.
>
> On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis 
> wrote:
>
>> Hi all,
>>
>> Say python’s builtin `int` type. It can be as large as memory allows.
>>
>> np.ndarray on the other hand is optimized for vectorization via strides,
>> memory structure and many things that I probably don’t know. Well the point
>> is that it is convenient and efficient to use for many things in comparison
>> to python’s built-in list of integers.
>>
>> So, I am thinking whether something in between exists? (And obviously
>> something more clever than np.array(dtype=object))
>>
>> Probably something similar to `StringDType`, but for integers and
>> floats. (It’s just my guess. I don’t know anything about `StringDType`,
>> but just guessing it must be better than np.array(dtype=object) in
>> combination with np.vectorize)
>>
>> Regards,
>> dgpb
>>
>> ___
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: nathan12...@gmail.com
>>
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com
>
>
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: nathan12...@gmail.com
>
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
By the way, I think I am referring to integer arrays. (Or integer part of 
floats.)

I don’t think what I am saying sensibly applies to floats as they are.

Although, a new float type could base its integer part on such a concept.

—

Where I am coming from is that I started to hit maximum bounds on integer 
arrays, where most of the values are very small and some become very large. And I 
am hitting memory limits. And I don't have many zeros, so sparse arrays aren't 
an option.

Approximately:
90% of my arrays could fit into `np.uint8`
1% require `np.uint64`
the remaining 9% are in between.

And there is no predictable order to where the large values occur, so splitting is 
not an option either.
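
Rough arithmetic on that mix (assuming the in-between group fits in 4 bytes each; the figures are only illustrative):

# average bytes per element for the 90% / 9% / 1% mix, vs. 8 bytes for plain uint64
avg = 0.90 * 1 + 0.09 * 4 + 0.01 * 8
print(avg)  # 1.34, i.e. roughly a 6x saving if per-element sizing were free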


> On 13 Mar 2024, at 17:53, Nathan  wrote:
> 
> Yes, an array of references still has a fixed size width in the array buffer. 
> You can think of each entry in the array as a pointer to some other memory on 
> the heap, which can be a dynamic memory allocation.
> 
> There's no way in NumPy to support variable-sized array elements in the array 
> buffer, since that assumption is key to how numpy implements strided ufuncs 
> and broadcasting.,
> 
> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis  > wrote:
> Thank you for this.
> 
> I am just starting to think about these things, so I appreciate your patience.
> 
> But isn’t it still true that all elements of an array are still of the same 
> size in memory?
> 
> I am thinking along the lines of per-element dynamic memory management. Such 
> that if I had array [0, 1e1], the first element would default to 
> reasonably small size in memory.
> 
>> On 13 Mar 2024, at 16:29, Nathan > > wrote:
>> 
>> It is possible to do this using the new DType system. 
>> 
>> Sebastian wrote a sketch for a DType backed by the GNU multiprecision float 
>> library: 
>> https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype 
>> 
>> 
>> It adds a significant amount of complexity to store data outside the array 
>> buffer and introduces the possibility of use-after-free and dangling 
>> reference errors that are impossible if the array does not use embedded 
>> references, so that’s the main reason it hasn’t been done much.
>> 
>> On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis > > wrote:
>> Hi all,
>> 
>> Say python’s builtin `int` type. It can be as large as memory allows.
>> 
>> np.ndarray on the other hand is optimized for vectorization via strides, 
>> memory structure and many things that I probably don’t know. Well the point 
>> is that it is convenient and efficient to use for many things in comparison 
>> to python’s built-in list of integers.
>> 
>> So, I am thinking whether something in between exists? (And obviously 
>> something more clever than np.array(dtype=object))
>> 
>> Probably something similar to `StringDType`, but for integers and floats. 
>> (It’s just my guess. I don’t know anything about `StringDType`, but just 
>> guessing it must be better than np.array(dtype=object) in combination with 
>> np.vectorize)
>> 
>> Regards,
>> dgpb
>> 
>> ___
>> NumPy-Discussion mailing list -- numpy-discussion@python.org 
>> 
>> To unsubscribe send an email to numpy-discussion-le...@python.org 
>> 
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>> 
>> Member address: nathan12...@gmail.com 
>> ___
>> NumPy-Discussion mailing list -- numpy-discussion@python.org 
>> 
>> To unsubscribe send an email to numpy-discussion-le...@python.org 
>> 
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>> 
>> Member address: dom.grigo...@gmail.com 
> 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org 
> 
> To unsubscribe send an email to numpy-discussion-le...@python.org 
> 
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
> 
> Member address: nathan12...@gmail.com 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com


[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Kevin Sheppard
Does the new DType system in NumPy 2 make something like this more
possible?  I would suspect that the user would have to write a lot of code
to have reasonable performance if it was.

Kevin


On Wed, Mar 13, 2024 at 3:55 PM Nathan  wrote:

> Yes, an array of references still has a fixed size width in the array
> buffer. You can think of each entry in the array as a pointer to some other
> memory on the heap, which can be a dynamic memory allocation.
>
> There's no way in NumPy to support variable-sized array elements in the
> array buffer, since that assumption is key to how numpy implements strided
> ufuncs and broadcasting.,
>
> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis 
> wrote:
>
>> Thank you for this.
>>
>> I am just starting to think about these things, so I appreciate your
>> patience.
>>
>> But isn’t it still true that all elements of an array are still of the
>> same size in memory?
>>
>> I am thinking along the lines of per-element dynamic memory management.
>> Such that if I had array [0, 1e1], the first element would default to
>> reasonably small size in memory.
>>
>> On 13 Mar 2024, at 16:29, Nathan  wrote:
>>
>> It is possible to do this using the new DType system.
>>
>> Sebastian wrote a sketch for a DType backed by the GNU multiprecision
>> float library:
>> https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype
>>
>> It adds a significant amount of complexity to store data outside the
>> array buffer and introduces the possibility of use-after-free and dangling
>> reference errors that are impossible if the array does not use embedded
>> references, so that’s the main reason it hasn’t been done much.
>>
>> On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis 
>> wrote:
>>
>>> Hi all,
>>>
>>> Say python’s builtin `int` type. It can be as large as memory allows.
>>>
>>> np.ndarray on the other hand is optimized for vectorization via strides,
>>> memory structure and many things that I probably don’t know. Well the point
>>> is that it is convenient and efficient to use for many things in comparison
>>> to python’s built-in list of integers.
>>>
>>> So, I am thinking whether something in between exists? (And obviously
>>> something more clever than np.array(dtype=object))
>>>
>>> Probably something similar to `StringDType`, but for integers and
>>> floats. (It’s just my guess. I don’t know anything about `StringDType`,
>>> but just guessing it must be better than np.array(dtype=object) in
>>> combination with np.vectorize)
>>>
>>> Regards,
>>> dgpb
>>>
>>> ___
>>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>>> To unsubscribe send an email to numpy-discussion-le...@python.org
>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>>> Member address: nathan12...@gmail.com
>>>
>> ___
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: dom.grigo...@gmail.com
>>
>>
>> ___
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: nathan12...@gmail.com
>>
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: kevin.k.shepp...@gmail.com
>
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Matti Picus
I am not sure what kind of a scheme would support various-sized native 
ints. Any scheme that puts pointers in the array is going to be worse: 
the pointers will be 64-bit. You could store offsets to data, but then 
you would need to store both the offsets and the contiguous data, nearly 
doubling your storage. What shape are your arrays? That would determine the 
minimum size of the offsets.
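
A toy sketch of such an offsets-plus-packed-bytes layout (purely illustrative; nothing like this exists in NumPy):

import numpy as np

values = [3, 70_000, 12, 2**40]
chunks = [v.to_bytes((v.bit_length() + 7) // 8 or 1, "little") for v in values]
data = np.frombuffer(b"".join(chunks), dtype=np.uint8)   # packed variable-width payloads
offsets = np.cumsum([0] + [len(c) for c in chunks])      # where each value starts

def get(i):
    return int.from_bytes(data[offsets[i]:offsets[i + 1]].tobytes(), "little")

print([get(i) for i in range(len(values))])  # [3, 70000, 12, 1099511627776]
# With mostly 1-byte payloads, the 4- or 8-byte offsets dominate the total storage.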


Matti


On 13/3/24 18:15, Dom Grigonis wrote:
By the way, I think I am referring to integer arrays. (Or integer part 
of floats.)


I don’t think what I am saying sensibly applies to floats as they are.

Although, new float type could base its integer part on such concept.

—

Where I am coming from is that I started to hit maximum bounds on 
integer arrays, where most of values are very small and some become 
very large. And I am hitting memory limits. And I don’t have many 
zeros, so sparse arrays aren’t an option.


Approximately:
90% of my arrays could fit into `np.uint8`
1% requires `np.uint64`
the rest 9% are in between.

And there is no predictable order where is what, so splitting is not 
an option either.




On 13 Mar 2024, at 17:53, Nathan  wrote:

Yes, an array of references still has a fixed size width in the array 
buffer. You can think of each entry in the array as a pointer to some 
other memory on the heap, which can be a dynamic memory allocation.


There's no way in NumPy to support variable-sized array elements in 
the array buffer, since that assumption is key to how numpy 
implements strided ufuncs and broadcasting.,


On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis  
wrote:


Thank you for this.

I am just starting to think about these things, so I appreciate
your patience.

But isn’t it still true that all elements of an array are still
of the same size in memory?

I am thinking along the lines of per-element dynamic memory
management. Such that if I had array [0, 1e1], the first
element would default to reasonably small size in memory.


On 13 Mar 2024, at 16:29, Nathan  wrote:

It is possible to do this using the new DType system.

Sebastian wrote a sketch for a DType backed by the GNU
multiprecision float library:
https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype

It adds a significant amount of complexity to store data outside
the array buffer and introduces the possibility of
use-after-free and dangling reference errors that are impossible
if the array does not use embedded references, so that’s the
main reason it hasn’t been done much.

On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis
 wrote:

Hi all,

Say python’s builtin `int` type. It can be as large as
memory allows.

np.ndarray on the other hand is optimized for vectorization
via strides, memory structure and many things that I
probably don’t know. Well the point is that it is convenient
and efficient to use for many things in comparison to
python’s built-in list of integers.

So, I am thinking whether something in between exists? (And
obviously something more clever than np.array(dtype=object))

Probably something similar to `StringDType`, but for
integers and floats. (It’s just my guess. I don’t know
anything about `StringDType`, but just guessing it must be
better than np.array(dtype=object) in combination with
np.vectorize)

Regards,
dgpb

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to
numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: nathan12...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: dom.grigo...@gmail.com


___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: nathan12...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: dom.grigo...@gmail.com



___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: matti.pi...@gmail.com


[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Jim Pivarski
This might be a good application of Awkward Array (https://awkward-array.org),
which applies a NumPy-like interface to arbitrary tree-like data, or ragged
(https://github.com/scikit-hep/ragged), a restriction of that to only
variable-length lists, but satisfying the Array API standard.

The variable-length data in Awkward Array hasn't been used to represent
arbitrary precision integers, though. It might be a good application of
"behaviors," which are documented here:
https://awkward-array.org/doc/main/reference/ak.behavior.html In principle,
it would be possible to define methods and overload NumPy ufuncs to
interpret variable-length lists of int8 as integers with arbitrary
precision. Numba might be helpful in accelerating that if normal
NumPy-style vectorization is insufficient.
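
A minimal illustration of Awkward's variable-length lists (the arbitrary-precision behavior itself would be layered on top of this and is not shown):

import awkward as ak

arr = ak.Array([[1, 2, 3], [4], [5, 6, 7, 8]])  # ragged lists of small ints
print(ak.num(arr))  # [3, 1, 4] -- per-entry lengths, no fixed itemsize required
print(arr[2])       # random access to one variable-length entry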

If you're interested in following this route, I can help with first
implementations of that arbitrary precision integer behavior. (It's an
interesting application!)

Jim



On Wed, Mar 13, 2024, 12:28 PM Matti Picus  wrote:

> I am not sure what kind of a scheme would support various-sized native
> ints. Any scheme that puts pointers in the array is going to be worse:
> the pointers will be 64-bit. You could store offsets to data, but then
> you would need to store both the offsets and the contiguous data, nearly
> doubling your storage. What shape are your arrays, that would be the
> minimum size of the offsets?
>
> Matti
>
>
> On 13/3/24 18:15, Dom Grigonis wrote:
> > By the way, I think I am referring to integer arrays. (Or integer part
> > of floats.)
> >
> > I don’t think what I am saying sensibly applies to floats as they are.
> >
> > Although, new float type could base its integer part on such concept.
> >
> > —
> >
> > Where I am coming from is that I started to hit maximum bounds on
> > integer arrays, where most of values are very small and some become
> > very large. And I am hitting memory limits. And I don’t have many
> > zeros, so sparse arrays aren’t an option.
> >
> > Approximately:
> > 90% of my arrays could fit into `np.uint8`
> > 1% requires `np.uint64`
> > the rest 9% are in between.
> >
> > And there is no predictable order where is what, so splitting is not
> > an option either.
> >
> >
> >> On 13 Mar 2024, at 17:53, Nathan  wrote:
> >>
> >> Yes, an array of references still has a fixed size width in the array
> >> buffer. You can think of each entry in the array as a pointer to some
> >> other memory on the heap, which can be a dynamic memory allocation.
> >>
> >> There's no way in NumPy to support variable-sized array elements in
> >> the array buffer, since that assumption is key to how numpy
> >> implements strided ufuncs and broadcasting.,
> >>
> >> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis 
> >> wrote:
> >>
> >> Thank you for this.
> >>
> >> I am just starting to think about these things, so I appreciate
> >> your patience.
> >>
> >> But isn’t it still true that all elements of an array are still
> >> of the same size in memory?
> >>
> >> I am thinking along the lines of per-element dynamic memory
> >> management. Such that if I had array [0, 1e1], the first
> >> element would default to reasonably small size in memory.
> >>
> >>> On 13 Mar 2024, at 16:29, Nathan 
> wrote:
> >>>
> >>> It is possible to do this using the new DType system.
> >>>
> >>> Sebastian wrote a sketch for a DType backed by the GNU
> >>> multiprecision float library:
> >>> https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype
> >>>
> >>> It adds a significant amount of complexity to store data outside
> >>> the array buffer and introduces the possibility of
> >>> use-after-free and dangling reference errors that are impossible
> >>> if the array does not use embedded references, so that’s the
> >>> main reason it hasn’t been done much.
> >>>
> >>> On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis
> >>>  wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Say python’s builtin `int` type. It can be as large as
> >>> memory allows.
> >>>
> >>> np.ndarray on the other hand is optimized for vectorization
> >>> via strides, memory structure and many things that I
> >>> probably don’t know. Well the point is that it is convenient
> >>> and efficient to use for many things in comparison to
> >>> python’s built-in list of integers.
> >>>
> >>> So, I am thinking whether something in between exists? (And
> >>> obviously something more clever than np.array(dtype=object))
> >>>
> >>> Probably something similar to `StringDType`, but for
> >>> integers and floats. (It’s just my guess. I don’t know
> >>> anything about `StringDType`, but just guessing it must be
> >>> better than np.array(dtype=object) in combination with
> >>> np.vectorize)
> >>>
> >>> Regards,
> >>> dgpb
> >>>

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Jim Pivarski
After sending that email, I realize that I have to take it back: your
motivation is to minimize memory use. The variable-length lists in Awkward
Array (and therefore in ragged as well) are implemented using offset
arrays, and they're at minimum 32-bit. The scheme is more cache-coherent
(less "pointer chasing"), but doesn't reduce the size.

These offsets are 32-bit so that individual values can be selected from the
array in constant time. If you use a smaller integer size, like uint8, then
they have to be the number of elements in the lists, rather than offsets (the
cumulative sum of the number of elements in the lists). Then, to find a single
value, you have to add up counts from the beginning of the array.

A standard way to store variable-length integers is to put the indicator of
whether you've seen the whole integer yet in a high bit (so each byte
effectively contributes 7 bits). That's also inherently non-random access.
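
A sketch of that high-bit (LEB128-style) encoding, to make the trade-off concrete:

def varint_encode(n: int) -> bytes:
    # each byte carries 7 payload bits; the high bit means "more bytes follow"
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(buf: bytes) -> int:
    n, shift = 0, 0
    for byte in buf:
        n |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return n

print(varint_encode(300).hex())           # ac02 -- two bytes instead of four or eight
print(varint_decode(varint_encode(300)))  # 300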

But if random access is not a requirement, how about Blosc and bcolz?
That's a library that uses a very lightweight compression algorithm on the
arrays and uncompresses them on the fly (fast enough to be practical). That
sounds like it would fit your use-case better...

Jim




On Wed, Mar 13, 2024, 12:47 PM Jim Pivarski  wrote:

> This might be a good application of Awkward Array (
> https://awkward-array.org), which applies a NumPy-like interface to
> arbitrary tree-like data, or ragged (https://github.com/scikit-hep/ragged),
> a restriction of that to only variable-length lists, but satisfying the
> Array API standard.
>
> The variable-length data in Awkward Array hasn't been used to represent
> arbitrary precision integers, though. It might be a good application of
> "behaviors," which are documented here:
> https://awkward-array.org/doc/main/reference/ak.behavior.html In
> principle, it would be possible to define methods and overload NumPy ufuncs
> to interpret variable-length lists of int8 as integers with arbitrary
> precision. Numba might be helpful in accelerating that if normal
> NumPy-style vectorization is insufficient.
>
> If you're interested in following this route, I can help with first
> implementations of that arbitrary precision integer behavior. (It's an
> interesting application!)
>
> Jim
>
>
>
> On Wed, Mar 13, 2024, 12:28 PM Matti Picus  wrote:
>
>> I am not sure what kind of a scheme would support various-sized native
>> ints. Any scheme that puts pointers in the array is going to be worse:
>> the pointers will be 64-bit. You could store offsets to data, but then
>> you would need to store both the offsets and the contiguous data, nearly
>> doubling your storage. What shape are your arrays, that would be the
>> minimum size of the offsets?
>>
>> Matti
>>
>>
>> On 13/3/24 18:15, Dom Grigonis wrote:
>> > By the way, I think I am referring to integer arrays. (Or integer part
>> > of floats.)
>> >
>> > I don’t think what I am saying sensibly applies to floats as they are.
>> >
>> > Although, new float type could base its integer part on such concept.
>> >
>> > —
>> >
>> > Where I am coming from is that I started to hit maximum bounds on
>> > integer arrays, where most of values are very small and some become
>> > very large. And I am hitting memory limits. And I don’t have many
>> > zeros, so sparse arrays aren’t an option.
>> >
>> > Approximately:
>> > 90% of my arrays could fit into `np.uint8`
>> > 1% requires `np.uint64`
>> > the rest 9% are in between.
>> >
>> > And there is no predictable order where is what, so splitting is not
>> > an option either.
>> >
>> >
>> >> On 13 Mar 2024, at 17:53, Nathan  wrote:
>> >>
>> >> Yes, an array of references still has a fixed size width in the array
>> >> buffer. You can think of each entry in the array as a pointer to some
>> >> other memory on the heap, which can be a dynamic memory allocation.
>> >>
>> >> There's no way in NumPy to support variable-sized array elements in
>> >> the array buffer, since that assumption is key to how numpy
>> >> implements strided ufuncs and broadcasting.,
>> >>
>> >> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis 
>> >> wrote:
>> >>
>> >> Thank you for this.
>> >>
>> >> I am just starting to think about these things, so I appreciate
>> >> your patience.
>> >>
>> >> But isn’t it still true that all elements of an array are still
>> >> of the same size in memory?
>> >>
>> >> I am thinking along the lines of per-element dynamic memory
>> >> management. Such that if I had array [0, 1e1], the first
>> >> element would default to reasonably small size in memory.
>> >>
>> >>> On 13 Mar 2024, at 16:29, Nathan 
>> wrote:
>> >>>
>> >>> It is possible to do this using the new DType system.
>> >>>
>> >>> Sebastian wrote a sketch for a DType backed by the GNU
>> >>> multiprecision float library:
>> >>> https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype
>> >>>
>> >>> It adds a significant amount of complexity to store data outside
>> >>> the arra

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Yup yup, good point.

So my array sizes in this case are 3e8. Thus, 32-bit ints would be needed for the 
offsets, so it is not a solution for this case.

Nevertheless, such a concept would still be worthwhile for cases where integers 
are, say, max 256 bits (or unlimited), even if memory addresses or offsets 
are 64-bit. This would both:
a) save memory if many of the values in the array are much smaller than 256 bits
b) provide a standard for dynamically unlimited-size values

—

For now, what could be a temporary solution for me is a type which saturates at 
its minimum/maximum when it goes below/above its bounds.

Integer types don't work here at all - np.uint8(255) + 2 = 1. Totally 
unacceptable.
Floats are a bit better: np.float16(65500) + 100 = np.float16(inf). At least it 
didn't wrap around, and it went the right way (just a bit too far).
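
A workaround along those lines today is to do the arithmetic in a wider dtype and clip back to the target range (a sketch, with uint8 chosen only for illustration):

import numpy as np

def saturating_add_uint8(a, b):
    res = a.astype(np.int64) + b  # widen so the sum itself cannot overflow
    return np.clip(res, 0, np.iinfo(np.uint8).max).astype(np.uint8)

a = np.array([250, 3], dtype=np.uint8)
print(a + np.uint8(10))             # wraps around: [ 4 13]
print(saturating_add_uint8(a, 10))  # saturates:    [255  13]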


> On 13 Mar 2024, at 18:26, Matti Picus  wrote:
> 
> I am not sure what kind of a scheme would support various-sized native ints. 
> Any scheme that puts pointers in the array is going to be worse: the pointers 
> will be 64-bit. You could store offsets to data, but then you would need to 
> store both the offsets and the contiguous data, nearly doubling your storage. 
> What shape are your arrays, that would be the minimum size of the offsets?
> 
> Matti
> 
> 
> On 13/3/24 18:15, Dom Grigonis wrote:
>> By the way, I think I am referring to integer arrays. (Or integer part of 
>> floats.)
>> 
>> I don’t think what I am saying sensibly applies to floats as they are.
>> 
>> Although, new float type could base its integer part on such concept.
>> 
>> —
>> 
>> Where I am coming from is that I started to hit maximum bounds on integer 
>> arrays, where most of values are very small and some become very large. And 
>> I am hitting memory limits. And I don’t have many zeros, so sparse arrays 
>> aren’t an option.
>> 
>> Approximately:
>> 90% of my arrays could fit into `np.uint8`
>> 1% requires `np.uint64`
>> the rest 9% are in between.
>> 
>> And there is no predictable order where is what, so splitting is not an 
>> option either.
>> 
>> 
>>> On 13 Mar 2024, at 17:53, Nathan  wrote:
>>> 
>>> Yes, an array of references still has a fixed size width in the array 
>>> buffer. You can think of each entry in the array as a pointer to some other 
>>> memory on the heap, which can be a dynamic memory allocation.
>>> 
>>> There's no way in NumPy to support variable-sized array elements in the 
>>> array buffer, since that assumption is key to how numpy implements strided 
>>> ufuncs and broadcasting.,
>>> 
>>> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis  wrote:
>>> 
>>>Thank you for this.
>>> 
>>>I am just starting to think about these things, so I appreciate
>>>your patience.
>>> 
>>>But isn’t it still true that all elements of an array are still
>>>of the same size in memory?
>>> 
>>>I am thinking along the lines of per-element dynamic memory
>>>management. Such that if I had array [0, 1e1], the first
>>>element would default to reasonably small size in memory.
>>> 
On 13 Mar 2024, at 16:29, Nathan  wrote:
 
It is possible to do this using the new DType system.
 
Sebastian wrote a sketch for a DType backed by the GNU
multiprecision float library:
https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype
 
It adds a significant amount of complexity to store data outside
the array buffer and introduces the possibility of
use-after-free and dangling reference errors that are impossible
if the array does not use embedded references, so that’s the
main reason it hasn’t been done much.
 
On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis
 wrote:
 
Hi all,
 
Say python’s builtin `int` type. It can be as large as
memory allows.
 
np.ndarray on the other hand is optimized for vectorization
via strides, memory structure and many things that I
probably don’t know. Well the point is that it is convenient
and efficient to use for many things in comparison to
python’s built-in list of integers.
 
So, I am thinking whether something in between exists? (And
obviously something more clever than np.array(dtype=object))
 
Probably something similar to `StringDType`, but for
integers and floats. (It’s just my guess. I don’t know
anything about `StringDType`, but just guessing it must be
better than np.array(dtype=object) in combination with
np.vectorize)
 
Regards,
dgpb
 
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to
numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Thanks for this.

Random access is unfortunately a requirement.

By the way, what is the difference between awkward and ragged?

> On 13 Mar 2024, at 18:59, Jim Pivarski  wrote:
> 
> After sending that email, I realize that I have to take it back: your 
> motivation is to minimize memory use. The variable-length lists in Awkward 
> Array (and therefore in ragged as well) are implemented using offset arrays, 
> and they're at minimum 32-bit. The scheme is more cache-coherent (less 
> "pointer chasing"), but doesn't reduce the size.
> 
> These offsets are 32-bit so that individual values can be selected from the 
> array in constant time. If you use a smaller integer size, like uint8, then 
> they have to be number of elements in the lists, rather than offsets (the 
> cumsum of number of elements in the lists). Then, to find a single value, you 
> have to add counts from the beginning of the array.
> 
> A standard way to store variable-length integers is to put the indicator of 
> whether you've seen the whole integer yet in a high bit (so each byte 
> effectively contributes 7 bits). That's also inherently non-random access.
> 
> But if random access is not a requirement, how about Blosc and bcolz? That's 
> a library that uses a very lightweight compression algorithm on the arrays 
> and uncompresses them on the fly (fast enough to be practical). That sounds 
> like it would fit your use-case better...
> 
> Jim
> 
> 
> 
> 
> On Wed, Mar 13, 2024, 12:47 PM Jim Pivarski  > wrote:
> This might be a good application of Awkward Array (https://awkward-array.org 
> ), which applies a NumPy-like interface to 
> arbitrary tree-like data, or ragged (https://github.com/scikit-hep/ragged 
> ), a restriction of that to only 
> variable-length lists, but satisfying the Array API standard.
> 
> The variable-length data in Awkward Array hasn't been used to represent 
> arbitrary precision integers, though. It might be a good application of 
> "behaviors," which are documented here: 
> https://awkward-array.org/doc/main/reference/ak.behavior.html 
>  In principle, 
> it would be possible to define methods and overload NumPy ufuncs to interpret 
> variable-length lists of int8 as integers with arbitrary precision. Numba 
> might be helpful in accelerating that if normal NumPy-style vectorization is 
> insufficient.
> 
> If you're interested in following this route, I can help with first 
> implementations of that arbitrary precision integer behavior. (It's an 
> interesting application!)
> 
> Jim
> 
> 
> 
> On Wed, Mar 13, 2024, 12:28 PM Matti Picus  > wrote:
> I am not sure what kind of a scheme would support various-sized native 
> ints. Any scheme that puts pointers in the array is going to be worse: 
> the pointers will be 64-bit. You could store offsets to data, but then 
> you would need to store both the offsets and the contiguous data, nearly 
> doubling your storage. What shape are your arrays, that would be the 
> minimum size of the offsets?
> 
> Matti
> 
> 
> On 13/3/24 18:15, Dom Grigonis wrote:
> > By the way, I think I am referring to integer arrays. (Or integer part 
> > of floats.)
> >
> > I don’t think what I am saying sensibly applies to floats as they are.
> >
> > Although, new float type could base its integer part on such concept.
> >
> > —
> >
> > Where I am coming from is that I started to hit maximum bounds on 
> > integer arrays, where most of values are very small and some become 
> > very large. And I am hitting memory limits. And I don’t have many 
> > zeros, so sparse arrays aren’t an option.
> >
> > Approximately:
> > 90% of my arrays could fit into `np.uint8`
> > 1% requires `np.uint64`
> > the rest 9% are in between.
> >
> > And there is no predictable order where is what, so splitting is not 
> > an option either.
> >
> >
> >> On 13 Mar 2024, at 17:53, Nathan  >> > wrote:
> >>
> >> Yes, an array of references still has a fixed size width in the array 
> >> buffer. You can think of each entry in the array as a pointer to some 
> >> other memory on the heap, which can be a dynamic memory allocation.
> >>
> >> There's no way in NumPy to support variable-sized array elements in 
> >> the array buffer, since that assumption is key to how numpy 
> >> implements strided ufuncs and broadcasting.,
> >>
> >> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis  >> > 
> >> wrote:
> >>
> >> Thank you for this.
> >>
> >> I am just starting to think about these things, so I appreciate
> >> your patience.
> >>
> >> But isn’t it still true that all elements of an array are still
> >> of the same size in memory?
> >>
> >> I am thinking along the lines of per-element dynamic memory
> >> management. Such that if I had array [0, 1e1], the 

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Jim Pivarski
Awkward is more general: it has all the same data types as (and is zero-copy
compatible with) Apache Arrow.

ragged is only lists (of lists) of numbers, so that it's possible to
describe it with a shape and dtype. ragged adheres to the Array API, like NumPy
2.0 (am I right in that)? So, ragged is a useful subset.




On Wed, Mar 13, 2024, 1:17 PM Dom Grigonis  wrote:

> Thanks for this.
>
> Random access is unfortunately a requirement.
>
> By the way, what is the difference between awkward and ragged?
>
> On 13 Mar 2024, at 18:59, Jim Pivarski  wrote:
>
> After sending that email, I realize that I have to take it back: your
> motivation is to minimize memory use. The variable-length lists in Awkward
> Array (and therefore in ragged as well) are implemented using offset
> arrays, and they're at minimum 32-bit. The scheme is more cache-coherent
> (less "pointer chasing"), but doesn't reduce the size.
>
> These offsets are 32-bit so that individual values can be selected from
> the array in constant time. If you use a smaller integer size, like uint8,
> then they have to be number of elements in the lists, rather than offsets
> (the cumsum of number of elements in the lists). Then, to find a single
> value, you have to add counts from the beginning of the array.
>
> A standard way to store variable-length integers is to put the indicator
> of whether you've seen the whole integer yet in a high bit (so each byte
> effectively contributes 7 bits). That's also inherently non-random access.
>
> But if random access is not a requirement, how about Blosc and bcolz?
> That's a library that uses a very lightweight compression algorithm on the
> arrays and uncompresses them on the fly (fast enough to be practical). That
> sounds like it would fit your use-case better...
>
> Jim
>
>
>
>
> On Wed, Mar 13, 2024, 12:47 PM Jim Pivarski  wrote:
>
>> This might be a good application of Awkward Array (
>> https://awkward-array.org), which applies a NumPy-like interface to
>> arbitrary tree-like data, or ragged (https://github.com/scikit-hep/ragged),
>> a restriction of that to only variable-length lists, but satisfying the
>> Array API standard.
>>
>> The variable-length data in Awkward Array hasn't been used to represent
>> arbitrary precision integers, though. It might be a good application of
>> "behaviors," which are documented here:
>> https://awkward-array.org/doc/main/reference/ak.behavior.html In
>> principle, it would be possible to define methods and overload NumPy ufuncs
>> to interpret variable-length lists of int8 as integers with arbitrary
>> precision. Numba might be helpful in accelerating that if normal
>> NumPy-style vectorization is insufficient.
>>
>> If you're interested in following this route, I can help with first
>> implementations of that arbitrary precision integer behavior. (It's an
>> interesting application!)
>>
>> Jim
>>
>>
>>
>> On Wed, Mar 13, 2024, 12:28 PM Matti Picus  wrote:
>>
>>> I am not sure what kind of a scheme would support various-sized native
>>> ints. Any scheme that puts pointers in the array is going to be worse:
>>> the pointers will be 64-bit. You could store offsets to data, but then
>>> you would need to store both the offsets and the contiguous data, nearly
>>> doubling your storage. What shape are your arrays, that would be the
>>> minimum size of the offsets?
>>>
>>> Matti
>>>
>>>
>>> On 13/3/24 18:15, Dom Grigonis wrote:
>>> > By the way, I think I am referring to integer arrays. (Or integer part
>>> > of floats.)
>>> >
>>> > I don’t think what I am saying sensibly applies to floats as they are.
>>> >
>>> > Although, new float type could base its integer part on such concept.
>>> >
>>> > —
>>> >
>>> > Where I am coming from is that I started to hit maximum bounds on
>>> > integer arrays, where most of values are very small and some become
>>> > very large. And I am hitting memory limits. And I don’t have many
>>> > zeros, so sparse arrays aren’t an option.
>>> >
>>> > Approximately:
>>> > 90% of my arrays could fit into `np.uint8`
>>> > 1% requires `np.uint64`
>>> > the rest 9% are in between.
>>> >
>>> > And there is no predictable order where is what, so splitting is not
>>> > an option either.
>>> >
>>> >
>>> >> On 13 Mar 2024, at 17:53, Nathan  wrote:
>>> >>
>>> >> Yes, an array of references still has a fixed size width in the array
>>> >> buffer. You can think of each entry in the array as a pointer to some
>>> >> other memory on the heap, which can be a dynamic memory allocation.
>>> >>
>>> >> There's no way in NumPy to support variable-sized array elements in
>>> >> the array buffer, since that assumption is key to how numpy
>>> >> implements strided ufuncs and broadcasting.,
>>> >>
>>> >> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis 
>>>
>>> >> wrote:
>>> >>
>>> >> Thank you for this.
>>> >>
>>> >> I am just starting to think about these things, so I appreciate
>>> >> your patience.
>>> >>
>>> >> But isn’t it still true that all elements of an array are 

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Homeier, Derek
On 13 Mar 2024, at 6:01 PM, Dom Grigonis  wrote:

So my array sizes in this case are 3e8. Thus, 32bit ints would be needed. So it 
is not a solution for this case.

Nevertheless, such concept would still be worthwhile for cases where integers 
are say max 256bits (or unlimited), then even if memory addresses or offsets 
are 64bit. This would both:
a) save memory if many of values in array are much smaller than 256bits
b) provide a standard for dynamically unlimited size values

In principle one could encode individual offsets in a smarter way, using just 
the minimal number of bits required,
but again that would make random access impossible or very expensive – probably 
more or less amounting to
what smart compression algorithms are already doing.
Another approach might be to use the mask approach after all (or just flag 
all your uint8 data valued 2**8 - 1 as 
overflows) and store the correct (uint64 or whatever) values and their indices 
in a second array.
May still not vectorise very efficiently with just numpy if your typical 
operations are non-local.
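
A minimal sketch of that second-array idea (the 255 sentinel and the helper names are just illustrative choices):

import numpy as np

SENTINEL = np.uint8(255)
values = np.array([3, 7, 2**40, 12, 2**50], dtype=np.uint64)

overflow = values >= SENTINEL
small = np.where(overflow, SENTINEL, values).astype(np.uint8)  # compact main array
overflow_idx = np.flatnonzero(overflow)                        # sidecar: positions ...
overflow_val = values[overflow_idx]                            # ... and their true values

def lookup(i):
    if small[i] == SENTINEL:
        return overflow_val[np.searchsorted(overflow_idx, i)]
    return np.uint64(small[i])

print([int(lookup(i)) for i in range(len(values))])  # the original values, recovered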

Derek

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
My array is growing in a manner of:
array[slice] += values

so for now I will just clip values:
res = np.add(array[slice], values, dtype=np.int64)  # widen so the sum itself cannot overflow
array[slice] = res
mask = res > MAX_UINT16
array[slice][mask] = MAX_UINT16  # clip the overflowed positions back to the maximum

For this case, these large values do not have that much impact, and the extra 
operation overhead is acceptable.

---

And adding more involved project to my TODOs for the future.

After all, it would be good to have an array which (at as minimal a cost as 
possible) could handle anything you throw at it, with near-optimal 
memory consumption and sensible precision handling, while keeping all the 
benefits of numpy.

Time will tell if that is achievable. If anyone had any good ideas regarding 
this I am all ears.

Much thanks to you all for information and ideas.
dgpb

> On 13 Mar 2024, at 21:00, Homeier, Derek  wrote:
> 
> On 13 Mar 2024, at 6:01 PM, Dom Grigonis  wrote:
>> 
>> So my array sizes in this case are 3e8. Thus, 32bit ints would be needed. So 
>> it is not a solution for this case.
>> 
>> Nevertheless, such concept would still be worthwhile for cases where 
>> integers are say max 256bits (or unlimited), then even if memory addresses 
>> or offsets are 64bit. This would both:
>> a) save memory if many of values in array are much smaller than 256bits
>> b) provide a standard for dynamically unlimited size values
> 
> In principle one could encode individual offsets in a smarter way, using just 
> the minimal number of bits required,
> but again that would make random access impossible or very expensive – 
> probably more or less amounting to
> what smart compression algorithms are already doing.
> Another approach might be to to use the mask approach after all (or just flag 
> all you uint8 data valued 2**8 as
> overflows) and store the correct (uint64 or whatever) values and their 
> indices in a second array.
> May still not vectorise very efficiently with just numpy if your typical 
> operations are non-local.
> 
> Derek
> 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Jim Pivarski
So that this doesn't get lost amid the discussion:
https://www.blosc.org/python-blosc2/python-blosc2.html

Blosc is on-the-fly compression, which is a more extreme way of making
variable-sized integers. The compression is in small chunks that fit into
CPU cachelines, such that it's random access per chunk. The compression is
lightweight enough that it can be faster to decompress, edit, and
recompress a chunk than it is to copy from RAM, edit, and copy back to RAM.
(The extra cost of compression is paid for by moving less data between RAM
and CPU. That's why I say "can be," because it depends on the entropy of
the data.) Since you have to copy data from RAM to CPU and back anyway, as
a part of any operation on an array, this can be a net win.
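
Roughly what that looks like through python-blosc2 (a sketch assuming its top-level compress/decompress helpers; see the docs linked above):

import numpy as np
import blosc2  # pip install blosc2

rng = np.random.default_rng(0)
a = rng.integers(0, 200, size=1_000_000, dtype=np.uint64)  # mostly-small values in a wide dtype
packed = blosc2.compress(a, typesize=a.itemsize)
print(len(packed) / a.nbytes)                              # compression ratio, well below 1 here
b = np.frombuffer(blosc2.decompress(packed), dtype=a.dtype)
print(np.array_equal(a, b))                                # True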

What you're trying to do with variable-length integers is a kind of
compression algorithm, an extremely lightweight one. That's why I think
that Blosc would fit your use-case, because it's doing the same kind of
thing, but with years of development behind it.

(Earlier, I recommended bcolz, which was a Python array based on Blosc, but
now I see that it has been deprecated. However, the link above goes to the
current version of the Python interface to Blosc, so I'd expect it to cover
the same use-cases.)

-- Jim





On Wed, Mar 13, 2024 at 4:45 PM Dom Grigonis  wrote:

> My array is growing in a manner of:
> array[slice] += values
>
> so for now will just clip values:
> res = np.add(array[slice], values, dtype=np.int64)
> array[slice] = res
> mask = res > MAX_UINT16
> array[slice][mask] = MAX_UINT16
>
> For this case, these large values do not have that much impact. And extra
> operation overhead is acceptable.
>
> ---
>
> And adding more involved project to my TODOs for the future.
>
> After all, it would be good to have an array, which (at preferably as
> minimal cost as possible) could handle anything you throw at it with
> near-optimal memory consumption and sensible precision handling, while
> keeping all the benefits of numpy.
>
> Time will tell if that is achievable. If anyone had any good ideas
> regarding this I am all ears.
>
> Much thanks to you all for information and ideas.
> dgpb
>
> On 13 Mar 2024, at 21:00, Homeier, Derek  wrote:
>
> On 13 Mar 2024, at 6:01 PM, Dom Grigonis  wrote:
>
>
> So my array sizes in this case are 3e8. Thus, 32bit ints would be needed.
> So it is not a solution for this case.
>
> Nevertheless, such concept would still be worthwhile for cases where
> integers are say max 256bits (or unlimited), then even if memory addresses
> or offsets are 64bit. This would both:
> a) save memory if many of values in array are much smaller than 256bits
> b) provide a standard for dynamically unlimited size values
>
>
> In principle one could encode individual offsets in a smarter way, using
> just the minimal number of bits required,
> but again that would make random access impossible or very expensive –
> probably more or less amounting to
> what smart compression algorithms are already doing.
> Another approach might be to to use the mask approach after all (or just
> flag all you uint8 data valued 2**8 as
> overflows) and store the correct (uint64 or whatever) values and their
> indices in a second array.
> May still not vectorise very efficiently with just numpy if your typical
> operations are non-local.
>
> Derek
>
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com
>
>
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: jpivar...@gmail.com
>
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Thanks for reiterating, this looks promising!

> On 13 Mar 2024, at 23:22, Jim Pivarski  wrote:
> 
> So that this doesn't get lost amid the discussion: 
> https://www.blosc.org/python-blosc2/python-blosc2.html 
> 
> 
> Blosc is on-the-fly compression, which is a more extreme way of making 
> variable-sized integers. The compression is in small chunks that fit into CPU 
> cachelines, such that it's random access per chunk. The compression is 
> lightweight enough that it can be faster to decompress, edit, and recompress 
> a chunk than it is to copy from RAM, edit, and copy back to RAM. (The extra 
> cost of compression is paid for by moving less data between RAM and CPU. 
> That's why I say "can be," because it depends on the entropy of the data.) 
> Since you have to copy data from RAM to CPU and back anyway, as a part of any 
> operation on an array, this can be a net win.
> 
> What you're trying to do with variable-length integers is a kind of 
> compression algorithm, an extremely lightweight one. That's why I think that 
> Blosc would fit your use-case, because it's doing the same kind of thing, but 
> with years of development behind it.
> 
> (Earlier, I recommended bcolz, which was a Python array based on Blosc, but 
> now I see that it has been deprecated. However, the link above goes to the 
> current version of the Python interface to Blosc, so I'd expect it to cover 
> the same use-cases.)
> 
> -- Jim
> 
> 
> 
> 
> 
> On Wed, Mar 13, 2024 at 4:45 PM Dom Grigonis  > wrote:
> My array is growing in a manner of:
> array[slice] += values
> 
> so for now will just clip values:
> res = np.add(array[slice], values, dtype=np.int64)
> array[slice] = res
> mask = res > MAX_UINT16
> array[slice][mask] = MAX_UINT16
> 
> For this case, these large values do not have that much impact. And extra 
> operation overhead is acceptable.
> 
> ---
> 
> And adding more involved project to my TODOs for the future.
> 
> After all, it would be good to have an array, which (at preferably as minimal 
> cost as possible) could handle anything you throw at it with near-optimal 
> memory consumption and sensible precision handling, while keeping all the 
> benefits of numpy.
> 
> Time will tell if that is achievable. If anyone had any good ideas regarding 
> this I am all ears.
> 
> Much thanks to you all for information and ideas.
> dgpb
> 
>> On 13 Mar 2024, at 21:00, Homeier, Derek > > wrote:
>> 
>> On 13 Mar 2024, at 6:01 PM, Dom Grigonis > > wrote:
>>> 
>>> So my array sizes in this case are 3e8. Thus, 32bit ints would be needed. 
>>> So it is not a solution for this case.
>>> 
>>> Nevertheless, such concept would still be worthwhile for cases where 
>>> integers are say max 256bits (or unlimited), then even if memory addresses 
>>> or offsets are 64bit. This would both:
>>> a) save memory if many of values in array are much smaller than 256bits
>>> b) provide a standard for dynamically unlimited size values
>> 
>> In principle one could encode individual offsets in a smarter way, using 
>> just the minimal number of bits required,
>> but again that would make random access impossible or very expensive – 
>> probably more or less amounting to
>> what smart compression algorithms are already doing.
>> Another approach might be to to use the mask approach after all (or just 
>> flag all you uint8 data valued 2**8 as
>> overflows) and store the correct (uint64 or whatever) values and their 
>> indices in a second array.
>> May still not vectorise very efficiently with just numpy if your typical 
>> operations are non-local.
>> 
>> Derek
>> 
>> ___
>> NumPy-Discussion mailing list -- numpy-discussion@python.org 
>> 
>> To unsubscribe send an email to numpy-discussion-le...@python.org 
>> 
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>> 
>> Member address: dom.grigo...@gmail.com 
> 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org 
> 
> To unsubscribe send an email to numpy-discussion-le...@python.org 
> 
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
> 
> Member address: jpivar...@gmail.com 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com