[Numpy-discussion] Re: Add to NumPy a function to compute cumulative sums from 0.

2023-08-19 Thread Dom Grigonis
Unfortunately, I don’t have a good answer.

For now, I can only tell you what I think might benefit from improvement.

1. Verbosity. I appreciate that bracket syntax such as the one in Julia or MATLAB, 
`[A B C ...]`, is not possible, so a functional interface is the only option. E.g. Julia has 
functions named ‘cat’, ‘vcat’, ‘hcat’, ‘hvcat’. I myself have recently 
aliased np.concatenate to `np_c`. For simple operations, it would surely be 
nice to have methods, e.g. `arr.append(axis)`/`arr.prepend(axis)`.

2. Excessive number of functions. There seem to be very many functions for 
concatenating and stacking. Many operations can be done using different 
functions and approaches, and usually one of them is several times faster than 
the rest. I will give an example. Stacking two 1-d vectors as columns of a 2-d 
array:

arr = np.arange(100)
TIMER.repeat([
lambda: np.array([arr, arr]).T,
lambda: np.vstack([arr, arr]).T,
lambda: np.stack([arr, arr]).T,
lambda: np.c_[arr, arr],
lambda: np.column_stack((arr, arr)),
lambda: np.concatenate([arr[:, None], arr[:, None]], axis=1)
]).print(3)
# mean [[0.012 0.044 0.052 0.13  0.032 0.024]]
Instead, having fewer, but more intuitive, flexible and well-optimised functions 
would be more convenient.

3. The flattening and reshaping API is not very intuitive. E.g. torch.flatten is an 
example of a function with the desired level of flexibility, in contrast to 
`np.ndarray.flatten`: https://pytorch.org/docs/stable/generated/torch.flatten.html. 
I had similar issues with multidimensional searching, sorting, multi-dimensional 
overlaps and custom unique functions. In other words, all the functionality is there 
already, but in the more custom multi-dimensional cases (although the requirement is 
often very simple from the perspective of how it looks in my mind) there is no easy 
API, and I end up writing my own numpy functions and benchmarking numerous ways to 
achieve the same thing. By now I have my own multi-dimensional unique, sort, search, 
flatten and a more flexible ix_, which are not well tested, but are already more 
convenient, flexible and often several times faster than the numpy ones (although 
all they do is reuse existing numpy functionality).

I think these are more along the lines of numpy 2.0 rather than a simple 
extension. It feels that the API could generally be more flexible and intuitive, and 
there is enough existing numpy material and external examples to draw from to make a 
next-level API happen, although I appreciate the required effort and the difficulties 
involved.

Having said all that, implementing Julia’s equivalents ‘cat’, ‘vcat’, ‘hcat’, 
‘hvcat’, together with `arr.append(others, axis)`/`arr.prepend(others, axis)`, 
while ensuring that they use the most optimised approaches, could potentially make 
life easier for the time being.
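
A rough sketch of the kind of thin wrappers I have in mind (illustrative names, not an 
existing NumPy API; real methods would need more care with shapes and dtypes):

import numpy as np

def append_along(arr, other, axis=0):
    # concatenate `other` after `arr` along `axis`
    return np.concatenate([arr, np.asarray(other)], axis=axis)

def prepend_along(arr, other, axis=0):
    # concatenate `other` before `arr` along `axis`
    return np.concatenate([np.asarray(other), arr], axis=axis)

a = np.zeros((2, 3))
b = np.ones((1, 3))
print(append_along(a, b, axis=0).shape)   # (3, 3)
print(prepend_along(a, b, axis=0).shape)  # (3, 3)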


—Nothing ever dies, just enters the state of deferred evaluation—
Dg

> On 19 Aug 2023, at 17:39, Ronald van Elburg  
> wrote:
> 
> I think ultimately the copy is unnecessary.
> 
> That being said introducing prepend and append functions concentrates the 
> complexity of the mapping in one place. Trying to avoid the extra copy would 
> probably lead to a more complex implementation of accumulate.  
> 
> How would in your view the prepend interface differ from concatenation or 
> stacking?
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com



[Numpy-discussion] Re: Add to NumPy a function to compute cumulative sums from 0.

2023-08-22 Thread Dom Grigonis
I don’t have an issue with cumsum0 if it is approached as a request for a 
useful utility function.

But arguing that this is what a cumulative sum function should be doing is a 
very big stretch. Cumulative sum has a foundational meaning and purpose which is 
clearly reflected in its name: not to solve the fencepost error, but to 
accumulate the summation sequence. Prepending 0 as part of it feels very 
unnatural. It is simply an extra operation.

diff0, in my opinion, makes a bit more intuitive sense, but obviously there 
is no need to add it if no one else needs/uses it.


> On 22 Aug 2023, at 17:36, john.daw...@camlingroup.com wrote:
> 
> Dom Grigonis wrote:
>> 1. Dimension length stays constant, while cumsum0 extends length to n+1, 
>> then np.diff, truncates it back. This adds extra complexity, while things 
>> are very convenient to work with when dimension length stays constant 
>> throughout the code.
> 
> For n values there are n-1 differences. Equivalently, for k differences there 
> are k+1 values. Therefore, `diff` ought to reduce length by 1 and `cumsum` 
> ought to increase it by 1. Returning arrays of the same length is a fencepost 
> error. This is a problem in the current behaviour of `cumsum` and the 
> proposed behaviour of `diff0`.

diff0 doesn’t solve the error in a strict sense. However, the first value of the 
diff0 result becomes the starting point from which to count the remaining 
differences, so with the right approach it does solve the issue: if the starting 
values are subtracted first, it does the same thing, just in a different order. 
See below:

> 
> 
> EXAMPLE
> 
> Consider a path given by a list of points, say (101, 203), (102, 205), (107, 
> 204) and (109, 202). What are the positions at fractions, say 1/3 and 2/3, 
> along the path (linearly interpolating)?
> 
> The problem is naturally solved with `diff` and `cumsum0`:
> 
> ```
> import numpy as np
> from scipy import interpolate
> 
> positions = np.array([[101, 203], [102, 205], [107, 204], [109, 202]], 
> dtype=float)
> steps_2d = np.diff(positions, axis=0)
> steps_1d = np.linalg.norm(steps_2d, axis=1)
> distances = np.cumsum0(steps_1d)
> fractions = distances / distances[-1]
> interpolate_at = interpolate.make_interp_spline(fractions, positions, 1)
> interpolate_at(1/3)
> interpolate_at(2/3)
> ```
> 
> Please show how to solve the problem with `diff0` and `cumsum`.
> 

positions = np.array([[101, 203], [102, 205], [107, 204], [109, 202]], 
dtype=float)
positions_rel = positions - positions[0, None]
steps_2d = diff0(positions_rel, axis=0)
steps_1d = np.linalg.norm(steps_2d, axis=1)
distances = np.cumsum(steps_1d)
fractions = distances / distances[-1]
interpolate_at = interpolate.make_interp_spline(fractions, positions, 1)
print(interpolate_at(1/3))
print(interpolate_at(2/3))
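
(For clarity, diff0 above is not an existing NumPy function. A minimal sketch of what 
I mean by it, assuming it is simply a difference with an implicit leading zero so that 
the length along the axis is preserved:)

import numpy as np

def diff0(a, axis=-1):
    # like np.diff, but with a zero prepended along `axis`, so the output keeps
    # the input's length and its first value is the starting point itself
    return np.diff(a, axis=axis, prepend=0)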
> 
> EXAMPLE
> 
> Money is invested on 2023-01-01. The annualized rate is 4% until 2023-02-04 
> and 5% thence until 2023-04-02. By how much does the money multiply in this 
> time?
> 
> The problem is naturally solved with `diff`:
> 
> ```
> import numpy as np
> 
> percents = np.array([4, 5], dtype=float)
> times = np.array(["2023-01-01", "2023-02-04", "2023-04-02"], 
> dtype=np.datetime64)
> durations = np.diff(times)
> YEAR = np.timedelta64(365, "D")
> multipliers = (1 + percents / 100) ** (durations / YEAR)
> multipliers.prod()
> ```
> 
> Please show how to solve the problem with `diff0`. It makes sense to divide 
> `np.diff(times)` by `YEAR`, but it would not make sense to divide the output 
> of `np.diff0(times)` by `YEAR` because of its incongruous initial value.
> 
In my experience it is more sensible to use a time-series approach, where the 
whole path of the investment is calculated. A single piece of code can then be used 
for modelling, analysis and presentation to clients. I would do it like:

import numpy as np
import matplotlib.pyplot as plt

r = np.log(1 + np.array([0, 0.04, 0.05]))
start_date = np.array("2023-01-01", dtype=np.datetime64)
times = np.array(["2023-01-01", "2023-02-04", "2023-04-02"], dtype=np.datetime64)
t = (times - start_date).astype(float) / 365
dt = diff0(t)
normalised = np.exp(np.cumsum(r * dt))
# PLOT
s0 = 1000
plt.plot(s0 * normalised)

Apart from the responses above, diff0 is useful in data analysis. Indices and 
observations usually have the same length. It is always convenient to keep it 
that way, and it makes for nice, clean and simple code.
t = dates
s = observations
# Plot changes:
ds = diff0(s)
plt.plot(dates, ds)
# 2nd order changes
plt.plot(dates, diff0(ds))

[Numpy-discussion] Find location of slice in its base

2023-08-31 Thread Dom Grigonis
Hi everyone,

I am working with shared arrays and their slices, and I am trying to subclass 
ndarray in such a way that instances remap to memory on unpickling. I am using the 
package SharedArray, which doesn’t micro-manage memory locations, but rather 
stores the whole map via shm using a unique name. One issue I ran into is that 
when I pickle & unpickle a slice of a memory-mapped array, I need to store the spec 
of the subset so that I can retrieve the same portion.

The issue I am working on has been briefly discussed on Stack Overflow: 
https://stackoverflow.com/questions/12421770/find-location-of-slice-in-numpy-array. 
Although I am not sure if it is the exact same one.

So, in short, is there a way to determine the location of a slice in the original 
array? Ideally, it would be an indexing spec which works for any non-copy slice/subset 
possible.

Kind regards,
Dg

—Nothing ever dies, just enters the state of deferred evaluation—



[Numpy-discussion] Re: Find location of slice in its base

2023-08-31 Thread Dom Grigonis
I am looking for something that would work for stride tricks too. Essentially, 
any possible view.

> On 31 Aug 2023, at 23:03, Aaron Meurer  wrote:
> 
> In principle this can be reconstructed from the strides, shape, and
> base memory address (a.ctypes.data) of the view and base arrays.
> However, not all views can be reconstructed using slices alone, for
> example, views from reshape operations or stride tricks. I don't know
> if it's possible to just construct a view directly given a base,
> offset, shape, and strides, but I think ideally that's what you'd
> want.
> 
> Aaron Meurer
> 
> On Thu, Aug 31, 2023 at 1:25 PM Dom Grigonis  wrote:
>> 
>> Hi everyone,
>> 
>> I am working with shared arrays and their slices. And trying to subclass 
>> ndarray in such way so that they remap to memory on unpickling. I am using 
>> package SharedArray, which doesn’t micro-manage memory locations, but rather 
>> stores the whole map via shm using unique name. One issue I ran into is that 
>> when I pickle & unpickle slice of memory mapped array, I need to store the 
>> spec of the subset so that I can retrieve the same portion.
>> 
>> The issue I am working on has been briefly discussed on stack: 
>> https://stackoverflow.com/questions/12421770/find-location-of-slice-in-numpy-array.
>>  Although I am not sure if it is the exact same one.
>> 
>> So, in short, is there a way to determine location of a slice in original 
>> array. Ideally, it would be indexing which works for any non-copy 
>> slice/subset possible.
>> 
>> Kind regards,
>> Dg
>> 
>> —Nothing ever dies, just enters the state of deferred evaluation—
>> 
>> ___
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: asmeu...@gmail.com
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com



[Numpy-discussion] Re: Find location of slice in its base

2023-08-31 Thread Dom Grigonis
Thank you, sounds just what I need. Will work on this in due time.
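
For the record, a rough sketch of the bookkeeping described below, with an ordinary 
array standing in for the shared mapping (helper names are mine, and this simple form 
assumes non-negative strides):

import numpy as np

def view_spec(view, base):
    # byte offset of the view into the base buffer, plus shape/strides/dtype
    offset = view.__array_interface__['data'][0] - base.__array_interface__['data'][0]
    return offset, view.shape, view.strides, view.dtype

def rebuild_view(base, spec):
    # recreate the view on top of a (possibly re-mapped) base array
    offset, shape, strides, dtype = spec
    return np.ndarray(shape, dtype=dtype, buffer=base, offset=offset, strides=strides)

base = np.arange(100.0)        # stands in for the full shared-memory array
v = base[10:60:2]              # any non-copy view
v2 = rebuild_view(base, view_spec(v, base))
assert np.shares_memory(v, v2) and np.array_equal(v, v2)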

> On 1 Sep 2023, at 00:38, Robert Kern  wrote:
> 
> On Thu, Aug 31, 2023 at 3:25 PM Dom Grigonis  <mailto:dom.grigo...@gmail.com>> wrote:
> Hi everyone,
> 
> I am working with shared arrays and their slices. And trying to subclass 
> ndarray in such way so that they remap to memory on unpickling. I am using 
> package SharedArray, which doesn’t micro-manage memory locations, but rather 
> stores the whole map via shm using unique name. One issue I ran into is that 
> when I pickle & unpickle slice of memory mapped array, I need to store the 
> spec of the subset so that I can retrieve the same portion. 
> 
> The issue I am working on has been briefly discussed on stack: 
> https://stackoverflow.com/questions/12421770/find-location-of-slice-in-numpy-array
>  
> <https://stackoverflow.com/questions/12421770/find-location-of-slice-in-numpy-array>.
>  Although I am not sure if it is the exact same one.
> 
> So, in short, is there a way to determine location of a slice in original 
> array. Ideally, it would be indexing which works for any non-copy 
> slice/subset possible.
> 
> If you chase the `.base` chain all the way down, you get to a 
> `SharedArray.map_owner` object with its `.addr` attribute giving the memory 
> address that's assigned to it in the current process. This will be the same 
> address that's in the `'data'` key of that first `ndarray` returned from 
> `SharedArray.create()` that you are making your views from. In your view 
> `ndarray` (which may be a view of views, with slices, transposes, reshapes, 
> and arbitrary `stride_tricks` manipulations in between). If you store the 
> difference between the view `ndarray`'s `'data'` address from the 
> `map_owner.addr` address, I think that's all you need to recreate the array 
> in a different process which gets assigned a different memory address for the 
> same `shm` name. Just add that offset to the `map_owner.addr` and restore the 
> rest of the information in `__array_interface__`, and I think you should be 
> good to go. I don't think you'll need to infer or represent the precise path 
> of Python-level operations (slices, transposes, reshapes, etc.) to which it 
> got to that point.
> 
> -- 
> Robert Kern
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org 
> <mailto:numpy-discussion@python.org>
> To unsubscribe send an email to numpy-discussion-le...@python.org 
> <mailto:numpy-discussion-le...@python.org>
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
> Member address: dom.grigo...@gmail.com <mailto:dom.grigo...@gmail.com>


[Numpy-discussion] Re: Proposal to accept NEP 52 (Python API cleanup for 2.0)

2023-09-15 Thread Dom Grigonis
Hello,

I have a couple of questions:
1. What is the equivalent of np.byte_bounds? I have recently started using it.
2. Why is the business-day functionality being removed? Are there faster methods in 
Python space for it? As far as I remember, for performance-critical applications I 
have always resorted to numpy, including its business-day functionality. (See the 
example calls after this list.)
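
For context, the calls I have in mind are along these lines (np.byte_bounds returns 
the (low, high) memory bounds of an array; the busday functions do business-day 
arithmetic):

import numpy as np

a = np.arange(10)
low, high = np.byte_bounds(a)      # pointers just below and above the array's data
print(high - low == a.nbytes)      # True for a contiguous array

print(np.busday_count('2023-01-01', '2023-04-02'))         # business days in the interval
print(np.busday_offset('2023-01-01', 10, roll='forward'))  # date 10 business days later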

Regards,
DG

> On 15 Sep 2023, at 21:12, Ralf Gommers  wrote:
> 
> Hi all,
> 
> A lot of work has been happening to implement NEP 52 
> (https://numpy.org/neps/nep-0052-python-api-cleanup.html 
> ) over the past 1.5 
> months - mostly work by Mateusz Sokol, and review effort of Sebastian, Nathan 
> and myself. The majority of API changes have been made. There's more to do of 
> course and there are pending PRs for a good fraction of that. These two 
> tracking issues cover a lot of ground and discussion around decision on 
> individual APIs:
> 
> - main namespace: https://github.com/numpy/numpy/issues/24306 
> 
> - numpy.lib namespace: https://github.com/numpy/numpy/issues/24507 
> 
> 
> This PR with a migration guide will give a good sense of what has been 
> removed or changed so far: https://github.com/numpy/numpy/pull/24693 
> .
> 
> In https://github.com/numpy/numpy/pull/24620 
>  the NEP itself is being updated 
> for changes that have been made. And it will mark the NEP as Accepted, which 
> seems about time given that a lot of the work has already been merged. 
> 
> If there are no substantive objections within 7 days from this email, then 
> the NEP will be accepted; see NEP 0 for more details.
> 
> Cheers,
> Ralf
> 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com



[Numpy-discussion] Preserving subclass on concatenation

2023-10-14 Thread Dom Grigonis
Hi everyone,

Quick question: is there a way to concatenate arrays while preserving 
subclasses, or is it left to custom implementations due to the ambiguities of this 
operation?

Regards,
DG
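
To illustrate what I mean, with a trivial ndarray subclass (the re-view at the end is 
the workaround I currently use; it obviously ignores any extra state the subclass may 
carry):

import numpy as np

class MyArray(np.ndarray):
    pass

a = np.arange(3).view(MyArray)
b = np.arange(3).view(MyArray)

c = np.concatenate([a, b])
print(type(c))                                  # numpy.ndarray - the subclass is dropped

c_sub = np.concatenate([a, b]).view(type(a))    # workaround: re-view into the subclass
print(type(c_sub))                              # MyArray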


[Numpy-discussion] Re: Seeking feedback: design doc for `namedarray`, a lightweight array data structure with named dimensions

2023-10-19 Thread Dom Grigonis
I would make use of it if it also supported pure-NumPy indices. A pure-NumPy 
n-dim array with indices is what I have been after for a while now. The reason is 
exactly that: to shed heavy dependencies such as pandas and keep the performance of 
pure NumPy.

Regards,
DG

> On 20 Oct 2023, at 00:51, Anderson Banihirwe  wrote:
> 
> folks, [there has been growing interest in a lightweight array 
> structure](https://github.com/pydata/xarray/issues/3981) that's in the same 
> vein as [xarray's 
> Variable](https://docs.xarray.dev/en/stable/generated/xarray.Variable.html). 
> we've put together a design doc for `namedarray`, and we could use your 
> feedback/input. 
> 
> ## what is `namedarray`?
> 
> in essence, `namedarray` aims to be a lighter version of xarray's 
> Variable—shedding some of the heavier dependencies (e.g. Pandas) but still 
> retaining the goodness of named dimensions. 
> 
> ## what makes it special?
> 
> * **Array Protocol Compatibility**: we are planning to make it compatible 
> with existing array protocols and the new [Python array API 
> standard](https://data-apis.org/array-api/latest/).
> * **Duck-Array Objects**: designed to wrap around multiple duck-array 
> objects, like NumPy, Dask, Sparse, Pint, CuPy, and PyTorch.
> 
> ## why are we doing this?
> 
> the goal is to bridge the gap between power and simplicity, providing a 
> lightweight alternative for scientific computing tasks that don't require the 
> full firepower of Xarray (`DataArray` and `Dataset`).
> 
> ## share your thoughts
> 
> We've put together a design doc that goes into the nitty-gritty of 
> `namedarray`. your insights could be invaluable in making this initiative a 
> success. please give it a read and share your thoughts 
> [here](https://github.com/pydata/xarray/discussions/8080)
> 
> * **Design Doc**: [namedarray Design 
> Document](https://github.com/pydata/xarray/blob/main/design_notes/named_array_design_doc.md)
> 
> cross posting from [Scientifc Python 
> Discourse](https://discuss.scientific-python.org/t/seeking-feedback-design-doc-for-namedarray-a-lightweight-array-data-structure-with-named-dimensions/841)
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com



[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

2023-10-26 Thread Dom Grigonis
Could you please give a concise example? I know you have provided one, but it 
is ingrained deep in verbose text and has some typos in it, which makes it hard to 
understand exactly what inputs should result in what output.

Regards,
DG

> On 25 Oct 2023, at 22:59, rosko37  wrote:
> 
> I know this question has been asked before, both on this list as well as 
> several threads on Stack Overflow, etc. It's a common issue. I'm NOT asking 
> for how to do this using existing Numpy functions (as that information can be 
> found in any of those sources)--what I'm asking is whether Numpy would accept 
> inclusion of a function that does this, or whether (possibly more likely) 
> such a proposal has already been considered and rejected for some reason.
> 
> The task is this--there's a large array and you want to find the next element 
> after some index that satisfies some condition. Such elements are common, and 
> the typical number of elements to be searched through is small relative to 
> the size of the array. Therefore, it would greatly improve performance to 
> avoid testing ALL elements against the conditional once one is found that 
> returns True. However, all built-in functions that I know of test the entire 
> array. 
> 
> One can obviously jury-rig some ways, like for instance create a "for" loop 
> over non-overlapping slices of length slice_length and call something like 
> np.where(cond) on each--that outer "for" loop is much faster than a loop over 
> individual elements, and the inner loop at most will go slice_length-1 
> elements past the first "hit". However, needing to use such a convoluted 
> piece of code for such a simple task seems to go against the Numpy spirit of 
> having one operation being one function of the form func(arr)".
> 
> A proposed function for this, let's call it "np.first_true(arr, start_idx, 
> [stop_idx])" would be best implemented at the C code level, possibly in the 
> same code file that defines np.where. I'm wondering if I, or someone else, 
> were to write such a function, if the Numpy developers would consider merging 
> it as a standard part of the codebase. It's possible that the idea of such a 
> function is bad because it would violate some existing broadcasting or fancy 
> indexing rules. Clearly one could make it possible to pass an "axis" argument 
> to np.first_true() that would select an axis to search over in the case of 
> multi-dimensional arrays, and then the result would be an array of indices of 
> one fewer dimension than the original array. So 
> np.first_true(np.array([1,5],[2,7],[9,10],cond) would return [1,1,0] for 
> cond(x): x>4. The case where no elements satisfy the condition would need to 
> return a "signal value" like -1. But maybe there are some weird cases where 
> there isn't a sensible return value, hence why such a function has not been added.
> 
> -Andrew Rosko
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com



[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

2023-10-26 Thread Dom Grigonis
If such an issue is at the Python level,
e.g. an xor-like check that tests whether the number of truthy values is equal to n:
xor([1, 1, 0], 2) == True
xor([1, 0, 0], 2) == False

I try to use builtin iterator functions for efficiency, such as a combination of 
filter + next.
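
(A couple of illustrative helpers in that spirit, just to show the idiom; the names 
are mine:)

from itertools import islice

def first_index(iterable, pred):
    # index of the first element satisfying pred, with early exit; -1 if none
    return next((i for i, x in enumerate(iterable) if pred(x)), -1)

def count_is(iterable, n):
    # True if there are exactly n truthy values; stops consuming after n + 1 hits
    return len(list(islice(filter(None, iterable), n + 1))) == n

print(first_index([0, 3, 7, 1], lambda x: x > 4))       # 2
print(count_is([1, 1, 0], 2), count_is([1, 0, 0], 2))   # True False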


If, however, the problem is at the numpy level, I find `numba` does a pretty good 
job. I had a similar issue and I couldn’t beat numba’s performance with Cython. 
Most likely that is because I don’t know how to use Cython optimally, but in my 
experience numba is good enough.

import numba as nb
import numpy as np

@nb.njit
def inner(x, func):
    result = np.full(x.shape[0], -1, dtype=np.int32)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if func(x[i, j]):
                result[i] = j
                break
    return result

def first_true_nb_func(arr, cond):
    func = nb.njit(cond)
    return inner(arr, func)


@nb.njit
def first_true_nb(arr):
    result = np.full(arr.shape[0], -1, dtype=np.int32)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            if arr[i, j] > 4:
                result[i] = j
                break
    return result


def first_true(arr, cond):
    result = np.full(arr.shape[0], -1, dtype=np.int32)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            if cond(arr[i, j]):
                result[i] = j
                break
    return result


arr = np.array([[1, 5], [2, 7], [9, 10]])
print(first_true_nb_func(arr, lambda x: x > 4))  # [1, 1, 0] 163 ms
print(first_true(arr, lambda x: x > 4))          # [1, 1, 0] 4.48 µs
print(first_true_nb(arr))                        # [1, 1, 0] 1.02 µs

# LARGER ARRAY
arr = 4 + np.random.normal(0, 1, (100, 5))
print(first_true_nb_func(arr, lambda x: x > 4))  # 152 ms
print(first_true(arr, lambda x: x > 4))          # 69.7 µs
print(first_true_nb(arr))                        # 1.02 µs
So numba is a very good option when there is no need to source a callable, although I 
think at a certain size numba with a callable would outperform the pure-python 
solution.

Having said that, I completely support the idea of an optimised mechanism for 
such situations being part of numpy. Maybe np.where_first_n(arr, op, value, n=1, 
axis=None), where op is a selection of standard comparison operators (a reference 
sketch of the intended semantics follows the argument notes below).

Args:
* Obviously having `cond` be a callable would be most flexible, but I am not sure it 
is easy to achieve good performance with it (same as in the example above).
* `first`/`last` args are not needed, as the input can be a slice view.
* where_last_n is not needed, as the input can be a reversed view.
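
A reference (non-optimised) sketch of the semantics I have in mind for the flattened 
case, just to pin down the behaviour; the whole point of the proposal is that a real 
implementation would stop scanning after n hits:

import operator
import numpy as np

_OPS = {'>': operator.gt, '>=': operator.ge, '<': operator.lt,
        '<=': operator.le, '==': operator.eq, '!=': operator.ne}

def where_first_n(arr, op, value, n=1):
    # indices (into the flattened array) of the first n elements satisfying `arr op value`
    mask = _OPS[op](np.ravel(arr), value)
    return np.flatnonzero(mask)[:n]

arr = np.array([[1, 5], [2, 7], [9, 10]])
print(where_first_n(arr, '>', 4, n=2))   # [1 3]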

Regards,
DG


> On 26 Oct 2023, at 16:07, Ilhan Polat  wrote:
> 
> It's typically called short-circuiting or quick exit when the target 
> condition is met. 
> 
> if you have an array a = np.array([-1, 2, 3, 4, , 1]) and you are 
> looking for a true/false result whether anything is negative or not (a < 
> 0).any() will generate a bool array equal to and then check all entries of 
> that bool array just to reach the conclusion true which was already true at 
> the first entry. Instead it spends 1 units of time for all entries.
> 
> We did similar things on SciPy side Cython level, but they are not really 
> competitive, instead more of a convenience. More general discussion I opened 
> is in https://github.com/data-apis/array-api/issues/675 
> <https://github.com/data-apis/array-api/issues/675>
> 
> 
> 
> 
> 
> On Thu, Oct 26, 2023 at 2:52 PM Dom Grigonis  <mailto:dom.grigo...@gmail.com>> wrote:
> Could you please give a concise example? I know you have provided one, but it 
> is engrained deep in verbose text and has some typos in it, which makes hard 
> to understand exactly what inputs should result in what output.
> 
> Regards,
> DG
> 
> > On 25 Oct 2023, at 22:59, rosko37  > <mailto:rosk...@gmail.com>> wrote:
> > 
> > I know this question has been asked before, both on this list as well as 
> > several threads on Stack Overflow, etc. It's a common issue. I'm NOT asking 
> > for how to do this using existing Numpy functions (as that information can 
> > be found in any of those sources)--what I'm asking is whether Numpy would 
> > accept inclusion of a function that does this, or whether (possibly more 
> > likely) such a proposal has already been considered and rejected for some 
> > reason.
> > 
> > The task is this--there's a large array and you want to find the next 
> > element after some index that satisfies some condition. Such elements are 
> > common, and the typical number of elements to be searched through is small 
> > relative to the size of the array. Therefore, it would greatly improve 
> > performance to avoid testing ALL elements against the conditional once one 
> > is found that retu

[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

2023-10-30 Thread Dom Grigonis
I juggled a bit and found a pretty nice solution using numba, which is probably 
not very robust, but it proves that such a thing can be optimised while retaining 
flexibility. Check whether it works for your use cases and let me know if anything 
fails or if it is slow compared to what you used.

import ast

import numba as nb
import numpy as np

first_true_str = """
def first_true(arr, n):
    result = np.full((n, arr.shape[1]), -1, dtype=np.int32)
    for j in range(arr.shape[1]):
        k = 0
        for i in range(arr.shape[0]):
            x = arr[i:i + 1, j]
            if cond(x):
                result[k, j] = i
                k += 1
                if k >= n:
                    break
    return result
"""


class FirstTrue:
    CONTEXT = {'np': np}

    def __init__(self, expr):
        self.expr = expr
        self.expr_ast = ast.parse(expr, mode='exec').body[0].value
        self.func_ast = ast.parse(first_true_str, mode='exec')
        self.func_ast.body[0].body[1].body[1].body[1].test = self.expr_ast
        self.func_cmp = compile(self.func_ast, filename="", mode="exec")
        exec(self.func_cmp, self.CONTEXT)
        self.func_nb = nb.njit(self.CONTEXT[self.func_ast.body[0].name])

    def __call__(self, arr, n=1, axis=None):
        # PREPARE INPUTS
        in_1d = False
        if axis is None:
            arr = np.ravel(arr)[:, None]
            in_1d = True
        elif axis == 0:
            if arr.ndim == 1:
                in_1d = True
                arr = arr[:, None]
        else:
            raise ValueError('axis ~in (None, 0)')
        res = self.func_nb(arr, n)
        if in_1d:
            res = res[:, 0]
        return res


if __name__ == '__main__':
    arr = np.arange(125).reshape((5, 5, 5))
    ft = FirstTrue('np.sum(x) > 30')
    print(ft(arr, n=2, axis=0))

[[1 0 0 0 0]
 [2 1 1 1 1]]

In [16]: %timeit ft(arr, 2, axis=0)
1.31 µs ± 3.94 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Regards,
DG

> On 29 Oct 2023, at 23:18, rosko37  wrote:
> 
> An example with a 1-D array (where it is easiest to see what I mean) is the 
> following. I will follow Dom Grigonis's suggestion that the range not be 
> provided as a separate argument, as it can be just as easily "folded into" 
> the array by passing a slice. So it becomes just:
> idx = first_true(arr, cond)
> 
> As Dom also points out, the "cond" would likely need to be a "function 
> pointer" (i.e., the name of a function defined elsewhere, turning first_true 
> into a higher-order function), unless there's some way to pass a parseable 
> expression for simple cases. A few special cases like the first zero/nonzero 
> element could be handled with dedicated options (sort of like matplotlib 
> colors), but for anything beyond that it gets unwieldy fast.
> 
> So let's say we have this:
> **
> def cond(x):
> return x>50
> 
> search_arr = np.exp(np.arange(0,1000))
> 
> print(np.first_true(search_arr, cond))
> ***
> 
> This should print 4, because the element of search_arr at index 4 (i.e. the 
> 5th element) is e^4, which is slightly greater than 50 (while e^3 is less 
> than 50). It should return this without testing the 6th through 1000th 
> elements of the array at all to see whether they exceed 50 or not. This 
> example is rather contrived, because simply taking the natural log of 50 and 
> rounding up is far superior, not even evaluating the array of exponentials 
> (which my example clearly still does--and in the use cases I've had for such 
> a function, I can't predict the array elements like this--they come from 
> loaded data, the output of a simulation, etc., and are all already in a numpy 
> array). And in this case, since the values are strictly increasing, 
> search_sorted() would work as well. But it illustrates the idea.
> 
> 
> 
> 
> 
> On Thu, Oct 26, 2023 at 5:54 AM Dom Grigonis  <mailto:dom.grigo...@gmail.com>> wrote:
> Could you please give a concise example? I know you have provided one, but it 
> is engrained deep in verbose text and has some typos in it, which makes hard 
> to understand exactly what inputs should result in what output.
> 
> Regards,
> DG
> 
> > On 25 Oct 2023, at 22:59, rosko37  > <mailto:rosk...@gmail.com>> wrote:
> > 
> > I know this question has been asked before, both on this list as well as 
> > several threads on Stack Overflow, etc. It's a common issue. I'm NOT asking 
> > for how to do this using existing Numpy functions (as that information can 
> > be found in any of those sources)--what I'm asking is whether Numpy would 
> > accept inclusion of a function that does this, or whether (poss

[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

2023-10-31 Thread Dom Grigonis
This results in very slow code. The function call

_pred = numba.njit(pred)

is expensive, and this sort of approach will be comparable to pure-python functions.

This is only recommended for sourcing functions that are not called frequently, 
but rather have a large computational content within them. In other words, it is not 
suitable for predicates.

Regards,
DG

> On 1 Nov 2023, at 01:05, Juan Nunez-Iglesias  wrote:
> 
> If you add a layer of indirection with Numba you can get a *very* nice API:
> 
> @numba.njit
> def _first(arr, pred):
> for i, elem in enumerate(arr):
> if pred(elem):
> return i
> 
> def first(arr, pred):
> _pred = numba.njit(pred)
> return _first(arr, _pred)
> 
> This even works with lambdas! (TIL, thanks Numba devs!)
> 
> >>> first(np.random.random(10_000_000), lambda x: x > 0.99)
> 215
> 
> Since Numba has ufunc support I don't suppose it would be hard to make it 
> work with an axis= argument, but I've never played with that API myself.
> 
> On Tue, 31 Oct 2023, at 6:49 PM, Lev Maximov wrote:
>> I've implemented such functions in Cython and packaged them into a library 
>> called numpy_illustrated <https://pypi.org/project/numpy-illustrated/>
>> 
>> It exposes the following functions:
>> 
>> find(a, v)  # returns the index of the first occurrence of v in a
>> first_above(a, v)   # returns the index of the first element in a that is 
>> strictly above v
>> first_nonzero(a)   # returns the index of the first nonzero element
>> 
>> They scan the array and bail out immediately once the match is found. Have a 
>> significant performance gain if the element to be
>> found is closer to the beginning of the array. Have roughly the same speed 
>> as alternative methods if the value is missing.
>> 
>> The complete signatures of the functions look like this:
>> 
>> find(a, v, rtol=1e-05, atol=1e-08, sorted=False, default=-1, raises=False)
>> first_above(a, v, sorted=False, missing=-1, raises=False)
>> first_nonzero(a, missing=-1, raises=False)
>> 
>> This covers the most common use cases and does not accept Python callbacks 
>> because accepting them would nullify any speed gain
>> one would expect from such a function. A Python callback can be implemented 
>> with Numba, but anyone who can write the callback
>> in Numba has no need for a library that wraps it into a dedicated function.
>> 
>> The library has a 100% test coverage. Code style 'black'. It should be easy 
>> to add functions like 'first_below' if necessary.
>> 
>> A more detailed description of these functions can be found here 
>> <https://betterprogramming.pub/the-numpy-illustrated-library-7531a7c43ffb?sk=8dd60bfafd6d49231ac76cb148a4d16f>.
>> 
>> Best regards,
>>   Lev Maximov
>> 
>> On Tue, Oct 31, 2023 at 3:50 AM Dom Grigonis > <mailto:dom.grigo...@gmail.com>> wrote:
>> I juggled a bit and found pretty nice solution using numba. Which is 
>> probably not very robust, but proves that such thing can be optimised while 
>> retaining flexibility. Check if it works for your use cases and let me know 
>> if anything fails or if it is slow compared to what you used.
>> 
>> 
>> 
>> first_true_str = """
>> def first_true(arr, n):
>> result = np.full((n, arr.shape[1]), -1, dtype=np.int32)
>> for j in range(arr.shape[1]):
>> k = 0
>> for i in range(arr.shape[0]):
>> x = arr[i:i + 1, j]
>> if cond(x):
>> result[k, j] = i
>> k += 1
>> if k >= n:
>> break
>> return result
>> """
>> 
>> 
>> class FirstTrue:
>> CONTEXT = {'np': np}
>> 
>> def __init__(self, expr):
>> self.expr = expr
>> self.expr_ast = ast.parse(expr, mode='exec').body[0].value
>> self.func_ast = ast.parse(first_true_str, mode='exec')
>> self.func_ast.body[0].body[1].body[1].body[1].test = self.expr_ast
>> self.func_cmp = compile(self.func_ast, filename="", mode="exec")
>> exec(self.func_cmp, self.CONTEXT)
>> self.func_nb = nb.njit(self.CONTEXT[self.func_ast.body[0].name])
>> 
>> def __call__(self, arr, n=1, axis=None):
>> # PREPARE INPUTS
>> in_1d = False
>> if axis is None:
>> arr = np.ravel(arr)[:, None]
>> in_1d = True

[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

2023-11-01 Thread Dom Grigonis

Your comparisons do not paint the correct picture.
a) Most of the time here is spent on array allocation, and the actual time of the 
loop gets swallowed in the noise.
b) You do not test early stopping - your benchmarks effectively test the full loop, 
as the condition is almost never hit (5 standard deviations...).

Here is a comparison with Cython (`first` is the numba-based function from Juan's message):

import npi
arr = np.random.random(100_000)
%timeit npi.first_above(arr, 5) # 66.7 µs
%timeit first(arr, lambda x: x > 5) # 83.8 ms
arr[100] += 10
%timeit npi.first_above(arr, 5) # 16.2 µs
%timeit first(arr, lambda x: x > 5) # 86.9 ms

It is on the order of 1000x slower.

Regards,
DG

> On 1 Nov 2023, at 14:07, Juan Nunez-Iglesias  wrote:
> 
> Have you tried timing things? Thankfully this is easy to test because the 
> Python source of numba-jitted functions is available at jitted_func.py_func.
> 
> In [23]: @numba.njit
> ...: def _first(arr, pred):
> ...: for i, elem in enumerate(arr):
> ...: if pred(elem):
> ...: return i
> ...:
> ...: def first(arr, pred):
> ...: _pred = numba.njit(pred)
> ...: return _first(arr, _pred)
> ...:
> 
> In [24]: arr = np.random.random(100_000_000)
> 
> In [25]: %timeit first(arr, lambda x: x > 5)
> 72 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> 
> In [26]: %timeit arr + 5
> 90.3 ms ± 762 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> 
> In [27]: %timeit _first.py_func(arr, lambda x: x > 5)
> 7.8 s ± 46.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> 
> So numba gives a >100x speedup. It's still not as fast as a NumPy function 
> call that doesn't have an allocation overhead:
> 
> In [30]: arr2 = np.empty_like(arr, dtype=bool)
> 
> In [32]: %timeit np.greater(arr, 5, out=arr2)
> 13.9 ms ± 69.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> 
> But it's certainly much better than pure Python! And it's not a huge cost for 
> the flexibility.
> 
> Juan.
> 
> On Wed, 1 Nov 2023, at 10:42 AM, Dom Grigonis wrote:
>> This results in a very slow code. The function calls of 
>> 
>> 
>> _pred = numba.njit(pred)
>> 
>> are expensive and this sort of approach will be comparable to pure python 
>> functions.
>> 
>> This is only recommended for sourcing functions that are not called 
>> frequently, but rather have a large computational content within them. In 
>> other words not suitable for predicates.
>> 
>> Regards,
>> DG
>> 
>>> On 1 Nov 2023, at 01:05, Juan Nunez-Iglesias >> <mailto:j...@fastmail.com>> wrote:
>>> 
>>> If you add a layer of indirection with Numba you can get a *very* nice API:
>>> 
>>> @numba.njit
>>> def _first(arr, pred):
>>> for i, elem in enumerate(arr):
>>> if pred(elem):
>>> return i
>>> 
>>> def first(arr, pred):
>>> _pred = numba.njit(pred)
>>> return _first(arr, _pred)
>>> 
>>> This even works with lambdas! (TIL, thanks Numba devs!)
>>> 
>>> >>> first(np.random.random(10_000_000), lambda x: x > 0.99)
>>> 215
>>> 
>>> Since Numba has ufunc support I don't suppose it would be hard to make it 
>>> work with an axis= argument, but I've never played with that API myself.
>>> 
>>> On Tue, 31 Oct 2023, at 6:49 PM, Lev Maximov wrote:
>>>> I've implemented such functions in Cython and packaged them into a library 
>>>> called numpy_illustrated <https://pypi.org/project/numpy-illustrated/>
>>>> 
>>>> It exposes the following functions:
>>>> 
>>>> find(a, v)  # returns the index of the first occurrence of v in a
>>>> first_above(a, v)   # returns the index of the first element in a that is 
>>>> strictly above v
>>>> first_nonzero(a)   # returns the index of the first nonzero element
>>>> 
>>>> They scan the array and bail out immediately once the match is found. Have 
>>>> a significant performance gain if the element to be
>>>> found is closer to the beginning of the array. Have roughly the same speed 
>>>> as alternative methods if the value is missing.
>>>> 
>>>> The complete signatures of the functions look like this:
>>>> 
>>>> find(a, v, rtol=1e-05, atol=1e-08, sorted=False, default=-1, raises=False)
>>>> first_above(a, v, sorted=False, missing=-1, raises=False)
>>>> first_nonzero(a, missing=-1, raises=False)
>>>> 
>>>> This co

[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

2023-11-01 Thread Dom Grigonis
OK, I think I didn’t do it full justice.

Numba is actually fairly fast, given everything is precompiled. Predicates still 
make things slower, but not as much as I initially thought.

However, the way your code is structured, it has to re-compile both the predicate 
and the target function (with the new predicate) on every call.

If you source the same already-compiled predicate, it does not need to 
recompile the target function. See the timings below.

Full scan results:
* Pre-compiled numba function (hardcoded predicate) is fastest
* Pre-compiled numba function (with flexible predicate) is second in speed and 
in line with Cython
* Pure-python function is significantly slower, but in 3rd place
* Any need to compile numba for a one-off call makes it the slowest option

Early stopping results:
* Pre-compiled numba function (hardcoded predicate) is significantly faster 
than anything else
* Pre-compiled numba function (with flexible predicate), cython and pure python 
perform roughly the same

Conclusions:
* Use pre-compiled hardcoded numba when it is re-used many times
* Use pre-compiled numba with predicates when it is re-used many times with 
several variations (so flexibility justifies the decrease in performance)
* Use Cython for one-off cases whenever it can handle it
* Use pure python for one-off cases, when Cython isn’t flexible enough.

Finally, just use numpy when prototyping and optimise later.

These tests are done on smaller arrays (100K elements) than you tested. With large 
arrays such as yours there is an extra dimension to take into account, which I 
don’t think is relevant for the majority of use cases, but it obviously needs to be 
explored when working with arrays of such size.


Regards,
DG

import numpy as np
import numba as nb
import npi


def first_pure_py(arr, pred):
    for i, elem in enumerate(arr):
        if pred(elem):
            return i

_first = nb.njit(first_pure_py)


def first_nb(arr, pred):
    return _first(arr, nb.njit(pred))

@nb.njit
def first_pure_nb(arr):
    for i, elem in enumerate(arr):
        if elem > 5:
            return i


pred = lambda x: x > 5
pred_cmp = nb.njit(pred)

# FULL SCAN
arr = np.random.random(100_000)
TIMER.repeat([
    lambda: first_pure_py(arr, pred),               # 0.15640
    lambda: nb.njit(first_pure_py)(arr, pred_cmp),  # 0.54549 cmp target
    lambda: nb.njit(pred)(1),                       # 0.32500 cmp predicate
    lambda: first_nb(arr, pred),                    # 0.84759 = sum of previous 2
    lambda: _first(arr, pred_cmp),                  # 0.00069 pre-compiled
    lambda: first_pure_nb(arr),                     # 0.00052
    lambda: npi.first_above(arr, 5),                # 0.00071
]).print(5, idx=1, t=True)  # {'reps': 5, 'n': 10}

# EARLY STOPPING
arr2 = np.random.random(100_000)
arr2[100] += 10
TIMER.repeat([
    lambda: first_pure_py(arr2, pred),              # 0.00014
    lambda: nb.njit(first_pure_py)(arr, pred_cmp),  # 0.55114 cmp target
    lambda: nb.njit(pred)(1),                       # 0.31801 cmp predicate
    lambda: first_nb(arr2, pred),                   # 0.83330 = sum of previous 2
    lambda: _first(arr2, pred_cmp),                 # 0.00013 pre-compiled
    lambda: first_pure_nb(arr2),                    # 0.1
    lambda: npi.first_above(arr2, 5),               # 0.00021
]).print(5, idx=1, t=True)  # {'reps': 5, 'n': 10}


> On 1 Nov 2023, at 16:19, Dom Grigonis  wrote:
> 
> 
> Your comparisons do not paint correct picture.
> a) Most of time here is spent for array allocation and the actual time of the 
> loop gets swallowed in the noise
> a) You do not test early stop - your benchmarks simply test full loop as 
> condition is almost never hit - 5 standard deviations...
> 
> Here is comparison with Cython:
> 
> import npi
> arr = np.random.random(100_000)
> %timeit npi.first_above(arr, 5) # 66.7 µs
> %timeit first(arr, lambda x: x > 5) # 83.8 ms
> arr[100] += 10
> %timeit npi.first_above(arr, 5) # 16.2 µs
> %timeit first(arr, lambda x: x > 5) # 86.9 ms
> 
> It is in the magnitude of 1000 x slower.
> 
> Regards,
> DG
> 
>> On 1 Nov 2023, at 14:07, Juan Nunez-Iglesias > <mailto:j...@fastmail.com>> wrote:
>> 
>> Have you tried timing things? Thankfully this is easy to test because the 
>> Python source of numba-jitted functions is available at jitted_func.py_func.
>> 
>> In [23]: @numba.njit
>> ...: def _first(arr, pred):
>> ...: for i, elem in enumerate(arr):
>> ...: if pred(elem):
>> ...: return i
>> ...:
>> ...: def first(arr, pred):
>> ...: _pred = numba.njit(pred)
>> ...: return _first(arr, _pred)
>> ...:
>> 
>> In [24]: arr = np.random.random(100_000_000)
>> 

[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Dom Grigonis
Is a numba-based solution an option for you?
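
(Otherwise, for the flattened-index route you mention: as long as the flat size fits 
in int64, a sketch like this already works:)

import numpy as np

rng = np.random.default_rng()
shape = (4, 5, 6)     # d0 x d1 x ... x dN-1
k = 10

flat = rng.choice(np.prod(shape), size=k, replace=False)   # k unique flat indices
idx = np.unravel_index(flat, shape)                        # tuple of N index arrays
print(np.stack(idx, axis=1))                               # k unique N-dim indices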

> On 17 Nov 2023, at 20:49, Stefan van der Walt via NumPy-Discussion 
>  wrote:
> 
> Hi all,
> 
> I am trying to sample k N-dimensional vectors from a uniform distribution 
> without replacement.
> It seems like this should be straightforward, but I can't seem to pin it down.
> 
> Specifically, I am trying to get random indices in an d0 x d1 x d2.. x dN-1 
> array.
> 
> I thought about sneaking in a structured dtype into `rng.integers`, but of 
> course that doesn't work.
> 
> If we had a string sampler, I could sample k unique words (consisting of 
> digits), and convert them to indices.
> 
> I could over-sample and filter out the non-unique indices. Or iteratively 
> draw blocks of samples until I've built up my k unique indices.
> 
> The most straightforward solution would be to flatten indices, and to sample 
> from those. The integers get large quickly, though. The rng.integers 
> docstring suggests that it can handle object arrays for very large integers:
> 
>> When using broadcasting with uint64 dtypes, the maximum value (2**64)
>> cannot be represented as a standard integer type.
>> The high array (or low if high is None) must have object dtype, e.g., 
>> array([2**64]).
> 
> But, that doesn't work:
> 
> In [35]: rng.integers(np.array([2**64], dtype=object))
> ValueError: high is out of bounds for int64
> 
> Is there an elegant way to handle this problem?
> 
> Best regards,
> Stéfan
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com



[Numpy-discussion] Re: Seeking feedback: design doc for `namedarray`, a lightweight array data structure with named dimensions

2023-12-01 Thread Dom Grigonis
I think this is the right place to mention the `scipp` library.

> On 1 Dec 2023, at 17:36, Adrin  wrote:
> 
> Some historical discussions on a namedarray on the scikit-learn side: 
> https://github.com/scikit-learn/enhancement_proposals/pull/25 
> <https://github.com/scikit-learn/enhancement_proposals/pull/25>
> 
> Might be useful to y'all.
> 
> On Fri, Oct 20, 2023 at 8:49 AM Dom Grigonis  <mailto:dom.grigo...@gmail.com>> wrote:
> I would make use of it if it was also supporting pure-numpy indices too. 
> Pure-numpy n-dim array with indices is what I was after for a while now. The 
> reason is exactly that - to shed heavy dependencies as pandas and have 
> performance of pure numpy.
> 
> Regards,
> DG
> 
> > On 20 Oct 2023, at 00:51, Anderson Banihirwe  > <mailto:axbanihi...@gmail.com>> wrote:
> > 
> > :wave:t5: folks, [there has been growing interest in a lightweight array 
> > structure](https://github.com/pydata/xarray/issues/3981 
> > <https://github.com/pydata/xarray/issues/3981>) that's in the same vein as 
> > [xarray's 
> > Variable](https://docs.xarray.dev/en/stable/generated/xarray.Variable.html 
> > <https://docs.xarray.dev/en/stable/generated/xarray.Variable.html>). we've 
> > put together a design doc for `namedarray`, and we could use your 
> > feedback/input. 
> > 
> > ## what is `namedarray`?
> > 
> > in essence, `namedarray` aims to be a lighter version of xarray's 
> > Variable—shedding some of the heavier dependencies (e.g. Pandas) but still 
> > retaining the goodness of named dimensions. 
> > 
> > ## what makes it special?
> > 
> > * **Array Protocol Compatibility**: we are planning to make it compatible 
> > with existing array protocols and the new [Python array API 
> > standard](https://data-apis.org/array-api/latest/ 
> > <https://data-apis.org/array-api/latest/>).
> > * **Duck-Array Objects**: designed to wrap around multiple duck-array 
> > objects, like NumPy, Dask, Sparse, Pint, CuPy, and PyTorch.
> > 
> > ## why are we doing this?
> > 
> > the goal is to bridge the gap between power and simplicity, providing a 
> > lightweight alternative for scientific computing tasks that don't require 
> > the full firepower of Xarray (`DataArray` and `Dataset`).
> > 
> > ## share your thoughts
> > 
> > We've put together a design doc that goes into the nitty-gritty of 
> > `namedarray`. your insights could be invaluable in making this initiative a 
> > success. please give it a read and share your thoughts 
> > [here](https://github.com/pydata/xarray/discussions/8080 
> > <https://github.com/pydata/xarray/discussions/8080>)
> > 
> > * **Design Doc**: [namedarray Design 
> > Document](https://github.com/pydata/xarray/blob/main/design_notes/named_array_design_doc.md
> >  
> > <https://github.com/pydata/xarray/blob/main/design_notes/named_array_design_doc.md>)
> > 
> > cross posting from [Scientifc Python 
> > Discourse](https://discuss.scientific-python.org/t/seeking-feedback-design-doc-for-namedarray-a-lightweight-array-data-structure-with-named-dimensions/841
> >  
> > <https://discuss.scientific-python.org/t/seeking-feedback-design-doc-for-namedarray-a-lightweight-array-data-structure-with-named-dimensions/841>)
> > ___
> > NumPy-Discussion mailing list -- numpy-discussion@python.org 
> > <mailto:numpy-discussion@python.org>
> > To unsubscribe send an email to numpy-discussion-le...@python.org 
> > <mailto:numpy-discussion-le...@python.org>
> > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
> > <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
> > Member address: dom.grigo...@gmail.com <mailto:dom.grigo...@gmail.com>
> 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org 
> <mailto:numpy-discussion@python.org>
> To unsubscribe send an email to numpy-discussion-le...@python.org 
> <mailto:numpy-discussion-le...@python.org>
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
> Member address: adrin.jal...@gmail.com <mailto:adrin.jal...@gmail.com>
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com



[Numpy-discussion] Re: Change definition of complex sign (and use it in copysign)

2023-12-23 Thread Dom Grigonis
To me this sounds like a reasonable change.

It does seem like there is a return value which is more sensible than the 
alternatives.

And the fact that sympy is already doing this indicates that the same conclusion 
has been reached more than once.

I am not dealing much with complex numbers at the moment, but I see this being 
of non-trivial convenience when I do.

Regards,
DG
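
(For reference, a small sketch of the definition under discussion, sign(z) = z/|z| 
with sign(0) = 0, written with plain NumPy just as an illustration:)

import numpy as np

def complex_sign(z):
    z = np.asarray(z, dtype=complex)
    mag = np.abs(z)
    out = np.zeros_like(z)
    np.divide(z, mag, out=out, where=(mag != 0))   # z/|z| where defined, 0 elsewhere
    return out

print(complex_sign(1 + 1j))   # ~0.7071+0.7071j
print(complex_sign(0))        # 0j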

> On 22 Dec 2023, at 15:48, Oscar Benjamin  wrote:
> 
> On Fri, 22 Dec 2023 at 13:25,  wrote:
>> 
>> Anyway, to me the main question would be whether this would break any 
>> workflows (though it is hard to see how it could, given that the previous 
>> definition was really rather useless...).
> 
> SymPy already defines sign(z) as z/abs(z) (with sign(0) = 0) as proposed here.
> 
> Checking this I see that the current mismatch causes a bug when
> SymPy's lambdify function is used to evaluate the sign function with
> NumPy:
> 
> In [12]: sign(z).subs(z, 1+1j)
> Out[12]: 0.707106781186548 + 0.707106781186548⋅ⅈ
> 
> In [13]: lambdify(z, sign(z))(1+1j) # uses numpy
> Out[13]: (1+0j)
> 
> The proposed change to NumPy's sign function would fix this bug.
> 
> --
> Oscar
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com



[Numpy-discussion] Re: Change definition of complex sign (and use it in copysign)

2024-01-04 Thread Dom Grigonis
I think this suggestion regarding sign is solid, from both theoretical and 
practical points of view.

And I agree with all of Aaron’s points as well.

Regards,
DG

> On 4 Jan 2024, at 22:58, Robert Kern  wrote:
> 
> On Wed, Jan 3, 2024 at 4:09 PM Aaron Meurer  > wrote:
> sign(z) = z/|z| is a fairly standard definition. See
> https://oeis.org/wiki/Sign_function  and
> https://en.wikipedia.org/wiki/Sign_function 
> . It's also implemented
> this way in MATLAB and Mathematica (see
> https://www.mathworks.com/help/symbolic/sign.html 
>  and
> https://reference.wolfram.com/language/ref/Sign.html 
> ). The function
> z/|z| is useful because it represents a normalization of z as a vector
> in the complex plane onto the unit circle.
> 
> With that being said, I'm not so sure about the suggestion about
> extending copysign(x1, x2) as |x1|*sign(x2). I generally think of
> copysign as a function to manipulate the floating-point representation
> of a number. It literally copies the sign *bit* from x2 into x1. It's
> useful because of things like -0.0, which is otherwise difficult to
> work with since it compares equal to 0.0. I would find it surprising
> for copysign to do a numeric calculation on complex numbers. Also,
> your suggested definition would be wrong for 0.0 and -0.0, since
> sign(0) is 0, and this is precisely where copysign matters.
> 
> Agreed on all points.
> 
> -- 
> Robert Kern
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com



[Numpy-discussion] Re: API: make numpy.lib._arraysetops.intersect1d work on multiple arrays #25688

2024-02-02 Thread Dom Grigonis
Just curious, how much faster is it compared to the currently recommended `reduce` 
approach (sketched below)?
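
(For context, the `reduce` approach I mean is the one suggested in the intersect1d 
docs:)

from functools import reduce
import numpy as np

arrays = [np.array([1, 2, 3, 4]), np.array([2, 3, 4]), np.array([3, 4, 5])]
print(reduce(np.intersect1d, arrays))   # [3 4]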

DG

> On 2 Feb 2024, at 17:31, Marten van Kerkwijk  wrote:
> 
>> For my own work, I required the intersect1d function to work on multiple 
>> arrays while returning the indices (using `return_indices=True`). 
>> Consequently I changed the function in numpy and now I am seeking 
>> feedback from the community.
>> 
>> This is the corresponding PR: https://github.com/numpy/numpy/pull/25688
> 
> 
> 
> To me this looks like a very sensible generalization.  In terms of numpy
> API, the only real change is that, effectively, the assume_unique and
> return_indices arguments become keyword-only, i.e., in the unlikely case
> that someone passed those as positional, a trivial backward-compatible
> change will fix it.
> 
> -- Marten
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: API: make numpy.lib._arraysetops.intersect1d work on multiple arrays #25688

2024-02-02 Thread Dom Grigonis
Also, I don’t know if this could be of value, but my use case for this is to 
find overlaps, then split arrays into overlapping and non-overlapping segments.

Thus, it might be useful for `return_indices=True` to return indices of all 
instances, not only the first.

Also, in my case I need both overlapping and non-overlapping indices, but this 
would become ambiguous with more than 2 arrays.
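
For the 2-array case, a rough sketch of the split I mean (using `in1d`, as
elsewhere in this thread; the concrete arrays are only for illustration):

import numpy as np

arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([3, 4, 9])

mask = np.in1d(arr1, arr2)
overlapping, non_overlapping = arr1[mask], arr1[~mask]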

If it was left with 2 array input, then it can be extended to return both 
overlapping and non-overlapping parts. I think it could be another potential 
path to consider.

E.g. what would be the speed comparison:
intr = intersect1d(arr1, arr2, assume_unique=False)
intr = intersect1d(intr, np.unique(arr3), assume_unique=True)

# VS new

intr = intersect1d(arr1, arr2, arr3, assume_unique=False)
Then, does the gain from such generalisation justify the constriction it introduces?

Regards,
DG

> On 2 Feb 2024, at 17:31, Marten van Kerkwijk  wrote:
> 
>> For my own work, I required the intersect1d function to work on multiple 
>> arrays while returning the indices (using `return_indices=True`). 
>> Consequently I changed the function in numpy and now I am seeking 
>> feedback from the community.
>> 
>> This is the corresponding PR: https://github.com/numpy/numpy/pull/25688
> 
> 
> 
> To me this looks like a very sensible generalization.  In terms of numpy
> API, the only real change is that, effectively, the assume_unique and
> return_indices arguments become keyword-only, i.e., in the unlikely case
> that someone passed those as positional, a trivial backward-compatible
> change will fix it.
> 
> -- Marten
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: API: make numpy.lib._arraysetops.intersect1d work on multiple arrays #25688

2024-02-03 Thread Dom Grigonis

> Seems reasonable. I don't know if it is faster, but NumPy is all about 
> vectorization.
Surely speed needs to fit into the equation somehow?

Another point to consider: is the function `in1d` able to achieve everything that 
`intersect1d` does?

For 2 arrays of length 1600:
def intersect1d_via_in1d(x, y):
    return np.unique(x[np.in1d(x, y)])

%timeit np.intersect1d(r1, r2)           # 927 µs ± 10.2
%timeit intersect1d_via_in1d(r1, r2)     # 163 µs ± 1.73
Retrieving indices and other outputs that `intersect1d` provides also seems 
trivial.
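
For instance, a rough sketch of recovering the indices as well (assuming unique
inputs, as with `assume_unique=True`; note the values come back in x's order
rather than sorted, unlike intersect1d):

import numpy as np

def intersect1d_via_in1d_with_indices(x, y):
    mask_x = np.in1d(x, y)
    common = x[mask_x]                # common values, in x's order
    idx_x = np.nonzero(mask_x)[0]     # their positions in x
    order = np.argsort(y)
    idx_y = order[np.searchsorted(y, common, sorter=order)]  # positions in y
    return common, idx_x, idx_y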

If it is faster and doesn’t consume more memory, then maybe it is more 
worthwhile re-using it in `intersect1d`?

Regards,
DG
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: API: make numpy.lib._arraysetops.intersect1d work on multiple arrays #25688

2024-02-06 Thread Dom Grigonis
Hi Stephan,

Thanks for replying to my concerns.

I have looked a bit more into it and `intersect1d` is not the right concept for 
my problem anyway. And it is great that it works faster than reduce approach.

Also, I did more speed tests and, after all, `in1d` approach is not generally 
faster. It is faster in some cases and slower in others. Also, it uses more 
memory in all cases.

Regards,
DG

> On 5 Feb 2024, at 14:30, Stephan Kuschel via NumPy-Discussion 
>  wrote:
> 
> Dear Dom,
> 
> thanks for bringing up the possible constriction. I agree that this would be 
> a serious argument against the change.
> 
> However, as you said, the overlapping/non-overlapping indices would become 
> ambiguous with more than two arrays. And calling the function with only two 
> arrays at a time would still be possible. So we will be unable to generalize 
> in the future towards a problem that only has ambiguous solutions. So I fail 
> to see what exactly the other use case would be.
> 
> The point of this change is not the luxury of allowing multiple arrays to 
> calculate the intersection. It's all about getting the indices in the original 
> arrays, using `return_indices=True`.
> 
> All the Best
> Stephan
> 
> On 02.02.24 at 17:36, Dom Grigonis wrote:
>> Also, I don’t know if this could be of value, but my use case for this is to 
>> find overlaps, then split arrays into overlapping and non-overlapping 
>> segments.
>> Thus, it might be useful for `return_indices=True` to return indices of all 
>> instances, not only the first.
>> Also, in my case I need both overlapping and non-overlapping indices, but 
>> this would become ambiguous with more than 2 arrays.
>> If it was left with 2 array input, then it can be extended to return both 
>> overlapping and non-overlapping parts. I think it could be another potential 
>> path to consider.
>> E.g. what would be the speed comparison:
>> intr=  intersect1d(arr1, arr2, assume_unique=False)
>> intr=  intersect1d(intr, np.unique(arr3), assume_unique=True)
>> # VS new
>> intr=  intersect1d(arr1, arr2, arr3, assume_unique=False)
>> Then, does the gain from such generalisation justify constriction it 
>> introduces?
>> Regards,
>> DG
>>> On 2 Feb 2024, at 17:31, Marten van Kerkwijk wrote:
>>> 
>>>> For my own work, I required the intersect1d function to work on multiple
>>>> arrays while returning the indices (using `return_indices=True`).
>>>> Consequently I changed the function in numpy and now I am seeking
>>>> feedback from the community.
>>>> 
>>>> This is the corresponding PR: https://github.com/numpy/numpy/pull/25688
>>> 
>>> 
>>> 
>>> To me this looks like a very sensible generalization.  In terms of numpy
>>> API, the only real change is that, effectively, the assume_unique and
>>> return_indices arguments become keyword-only, i.e., in the unlikely case
>>> that someone passed those as positional, a trivial backward-compatible
>>> change will fix it.
>>> 
>>> -- Marten
>>> ___
>>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>>> To unsubscribe send an email to numpy-discussion-le...@python.org
>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>>> Member address: dom.grigo...@gmail.com
>> ___
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/

[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Dom Grigonis
What were your conclusions after experimenting with chained ufuncs?

If the speed is comparable to numexpr, wouldn’t it be `nicer` to have 
non-string input format?

It would feel a bit less like a black-box.

Regards,
DG

> On 15 Feb 2024, at 22:52, Marten van Kerkwijk  wrote:
> 
> Hi Oyibo,
> 
>> I'm proposing the introduction of a `pipe` method for NumPy arrays to 
>> enhance their usability and expressiveness.
> 
> I think it is an interesting idea, but agree with Robert that it is
> unlikely to fly on its own.  Part of the logic of even frowning on
> methods like .mean() and .sum() is that ndarray is really a data
> container, and should have methods related to that, as much as possible
> independent of the meaning of those data (which is given by the dtype).
> 
> A bit more generally, your example is nice, but a pipe can have just one
> input, while of course many operations require two or more.
> 
>> - Optimization: While NumPy may not currently optimize chained
>> expressions, the introduction of pipe lays the groundwork for
>> potential future optimizations with lazy evaluation.
> 
> Optimization might indeed be made possible, though I would think that
> for that one may be better off with something like dask.
> 
> That said, I've been playing with the ability to chain ufuncs to
> optimize their execution, by applying the ufuncs in series on small
> pieces of larger arrays, thus avoiding large temporaries (a bit like
> numexpr but with the idea of defining a fast function rather than giving
> an expression as a string); see https://github.com/mhvk/chain_ufunc
> 
> All the best,
> 
> Marten
> 
> 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Dom Grigonis
Just to clarify, I am not the one who suggested pipes. :)

Read the issue. My 2 cents:

From my experience, calling methods is generally faster than functions. I 
figure it is due to having less overhead in figuring out the input. Maybe it is 
not significant for large data, but it does make a difference even when working 
with medium-sized arrays - say, a float array of size 5000.

%timeit a.sum()
3.17 µs
%timeit np.sum(a)
5.18 µs

(In my experience, `sum` for medium size arrays often becomes a bottleneck in 
greedy optimisation algorithms where distances are calculated over and over for 
partial space.)

In short, all I want to say is that it would be great if such speed 
considerations were addressed if/when developing piping or anything similar.

E.g. the pipe implementation could allow optimisations to be plugged in.

Then numexpr could make a plugin.

At the top, the user writes:
np.pipe_use_plugin(numexpr.plug_pipe)   # or something similar

Then numexpr would kick in whenever appropriate when using pipes.
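
Something along these lines (all names here are hypothetical, not an existing
NumPy or numexpr API):

_pipe_backend = None

def pipe_use_plugin(backend):
    global _pipe_backend
    _pipe_backend = backend

def pipe(arr, *funcs):
    if _pipe_backend is not None:
        return _pipe_backend(arr, funcs)   # e.g. a numexpr-style fused evaluation
    for f in funcs:                        # fallback: plain eager chaining
        arr = f(arr)
    return arr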

Regards,
DG

> On 16 Feb 2024, at 00:12, Marten van Kerkwijk  wrote:
> 
>> What were your conclusions after experimenting with chained ufuncs?
>> 
>> If the speed is comparable to numexpr, wouldn’t it be `nicer` to have
>> non-string input format?
>> 
>> It would feel a bit less like a black-box.
> 
> I haven't gotten further than it yet, it is just some toying around I've
> been doing.  But I'd indeed prefer not to go via strings -- possibly
> numexpr could use a similar mechanism to what I did to construct the
> function that is being evaluated.
> 
> Aside: your suggestion of the pipe led to some further discussion at
> https://github.com/numpy/numpy/issues/25826#issuecomment-1947342581
> -- as a more general way of passing arrays to functions.
> 
> -- Marten
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Dom Grigonis
Thanks for this, every little helps.

One more thing to mention on this topic.

From a certain size, the dot product becomes faster than sum (due to parallelisation, 
I guess?).

E.g.
def dotsum(arr):
    a = arr.reshape(1000, 100)
    return a.dot(np.ones(100)).sum()

a = np.ones(100_000)   # 100_000 elements, to match the (1000, 100) reshape above

In [45]: %timeit np.add.reduce(a, axis=None)
42.8 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [43]: %timeit dotsum(a)
26.1 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

But theoretically, sum should be faster than the dot product by a fair bit.

Isn’t parallelisation implemented for it?

Regards,
DG


> On 16 Feb 2024, at 01:37, Marten van Kerkwijk  wrote:
> 
> It is more that np.sum checks if there is a .sum() method and if so
> calls that.  And then `ndarray.sum()` calls `np.add.reduce(array)`.

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-16 Thread Dom Grigonis
Good to know it is not only on my PC.

I have done a fair bit of work trying to find a more efficient sum.

The only faster option that I have found was PyTorch (although, thinking about 
it now, maybe that is because it was using MKL, I don't remember).

MKL is faster, but I use OpenBLAS.

The Scipp library is parallelized, and its performance becomes similar to `dotsum` 
for large arrays, but it is slower than numpy or dotsum for sizes below 
(somewhere towards) ~200k elements.

Apart from these I ran out of options and simply implemented my own sum, where 
it uses either `np.sum` or `dotsum` depending on which is faster.
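
Roughly, a minimal sketch of such a dispatching sum (the exact crossover point
and the chunk width here are placeholders, not the exact implementation):

import numpy as np

_CROSSOVER = 200_000   # around where dotsum starts to win on my machine

def fast_sum(a):
    a = np.ravel(a)
    if a.size < _CROSSOVER or a.size % 1000:
        return a.sum()
    # BLAS-backed dot against a vector of ones; note the result can differ
    # slightly from a.sum() in floating point
    return a.reshape(-1, 1000).dot(np.ones(1000)).sum()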

This is the chart, where the point at which dotsum becomes faster than np.sum 
can be seen:
https://gcdnb.pbrd.co/images/j8n3EsRz5g5v.png?o=1


I am not sure how much (and for how many people) the improvement is needed / 
essential, but I have found several Stack Overflow posts regarding this when I was 
looking into it. It definitely is for me, though.

Theoretically, if implemented with the same optimisations, sum should be ~2x faster 
than dotsum. Or am I missing something?

Regards,
DG


> On 16 Feb 2024, at 04:54, Homeier, Derek  wrote:
> 
> 
> 
>> On 16 Feb 2024, at 2:48 am, Marten van Kerkwijk  
>> wrote:
>> 
>>> In [45]: %timeit np.add.reduce(a, axis=None)
>>> 42.8 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> 
>>> In [43]: %timeit dotsum(a)
>>> 26.1 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> 
>>> But theoretically, sum, should be faster than dot product by a fair bit.
>>> 
>>> Isn’t parallelisation implemented for it?
>> 
>> I cannot reproduce that:
>> 
>> In [3]: %timeit np.add.reduce(a, axis=None)
>> 19.7 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>> 
>> In [4]: %timeit dotsum(a)
>> 47.2 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>> 
>> But almost certainly it is indeed due to optimizations, since .dot uses
>> BLAS which is highly optimized (at least on some platforms, clearly
>> better on yours than on mine!).
>> 
>> I thought .sum() was optimized too, but perhaps less so?
> 
> 
> I can confirm at least it does not seem to use multithreading – with the 
> conda-installed numpy+BLAS
> I almost exactly reproduce your numbers, whereas linked against my own 
> OpenBLAS build
> 
> In [3]: %timeit np.add.reduce(a, axis=None)
> 19 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
> 
> # OMP_NUM_THREADS=1
> In [4]: %timeit dots(a)
> 20.5 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
> 
> # OMP_NUM_THREADS=8
> In [4]: %timeit dots(a)
> 9.84 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
> 
> add.reduce shows no difference between the two and always remains at <= 100 % 
> CPU usage.
> dotsum is scaling still better with larger matrices, e.g. ~4 x for 1000x1000.
> 
> Cheers,
>   Derek
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Converting array to string

2024-02-24 Thread Dom Grigonis
Hello,

I am seeking a bit of help.

I need a fast way to transfer numpy arrays in json format.

Thus, converting array to list is not an option and I need something similar to:
a = np.ones(1000)
%timeit a.tobytes()
17.6 ms
This is quick and portable. In other words, I am very happy with this, 
but...

Json does not support bytes.

Any method of subsequent conversion from bytes to string is several times 
slower than the actual serialisation.
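
For context, the kind of conversions I mean (each of these string encodings adds
a full extra pass over the buffer; the sizes are approximate):

import base64
import numpy as np

a = np.ones(1000)

raw = a.tobytes()                      # fast, but bytes are not valid JSON
b64 = base64.b64encode(raw).decode()   # JSON-safe, ~33% larger
hx = raw.hex()                         # JSON-safe, 2x larger

back = np.frombuffer(base64.b64decode(b64), dtype=a.dtype)   # round trip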

So my question is: Is there any way to serialise directly to string?

I remember there used to be 2 methods: tobytes and tostring. However, I see 
that tostring is deprecated and its functionality is equivalent to `tobytes`. 
Is it completely gone? Or is there a chance I can still find a performant 
way of converting to and back from a non-human-readable `str` form?

Regards,
DG

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Converting array to string

2024-02-25 Thread Dom Grigonis
Thank you for your answer,

Yeah, I was unsure if it ever existed in the first place.

Space is less of an issue and something like `a.data.hex()` would be fine as 
long as its speed was on par with `a.tobytes()`. However, it is 10x slower on 
my machine.

This e-mail is pretty much a final check (after a fair bit of research and 
attempts) that it cannot be done, so that I can rule this possibility out as 
infeasible and concentrate on other options.

Regards,
DG

> On 25 Feb 2024, at 20:30, Robert Kern  wrote:
> 
> On Sat, Feb 24, 2024 at 7:17 PM Dom Grigonis wrote:
> Hello,
> 
> I am seeking a bit of help.
> 
> I need a fast way to transfer numpy arrays in json format.
> 
> Thus, converting array to list is not an option and I need something similar 
> to:
> a = np.ones(1000)
> %timeit a.tobytes()
> 17.6 ms
> This is quick, fast and portable. In other words I am very happy with this, 
> but...
> 
> Json does not support bytes.
> 
> Any method of subsequent conversion from bytes to string is number of times 
> slower than the actual serialisation.
> 
> So my question is: Is there any way to serialise directly to string?
> 
> I remember there used to be 2 methods: tobytes and tostring. However, I see 
> that tostring is deprecated and its functionality is equivalent to `tobytes`. 
> Is it completely gone? Or is there a chance I can still find a performant 
> version of converting to and back from `str` type of non-human readable form?
>  
> The old `tostring` was actually the same as `tobytes`. In Python 2, the `str` 
> type was what `bytes` is now, a string of octets. In Python 3, `str` became a 
> string of Unicode characters (what you want) and the `bytes` type was 
> introduced for the string of octets, so `tostring` was merely _renamed_ to 
> `tobytes` to match. `tostring` never returned a string of Unicode characters 
> suitable for inclusion in JSON.
> 
> AFAICT, there is not likely to be a much more efficient way to convert from 
> an array to a reasonable JSONable encoding (e.g. base64). The time you are 
> seeing is the time it takes to encode that amount of data, period. That said, 
> if you want to use a quite inefficient hex encoding, `a.data.hex()` is 
> somewhat faster than the base64 encoding, but it's less than ideal in terms 
> of space usage.
> 
> -- 
> Robert Kern
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Converting array to string

2024-02-25 Thread Dom Grigonis
Does it have native bytes support? To me, it's either having fast conversion to 
`string` or a data format with native bytes support.

Sometimes readability is important, sometimes speed takes priority. Even with a 
good, unified data structure for arrays, indexed arrays, etc., it is always 
good to know that if it becomes a bottleneck I can substitute an array value with 
its bytes representation with little to no extra work.

DG

> On 25 Feb 2024, at 21:02, Robert Kern  wrote:
> 
> On Sun, Feb 25, 2024 at 1:52 PM Dom Grigonis wrote:
> Thank you for your answer,
> 
> Yeah, I was unsure if it ever existed in the first place.
> 
> Space is less of an issue and something like `a.data.hex()` would be fine as 
> long as its speed was on par with `a.tobytes()`. However, it is 10x slower on 
> my machine.
> 
> This e-mail is pretty much a final check (after a fair bit of research and 
> attempts) that it can not be done so I can eliminate this possibility as 
> feasible and concentrate on other options.
> 
> I think that mostly runs the gamut for pure JSON. Consider looking at BJData, 
> but it's "JSON-based" and not quite pure JSON.
> 
> https://neurojson.org/
> 
> -- 
> Robert Kern
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: JSON format for multi-dimensional data

2024-02-28 Thread Dom Grigonis
I sure like the idea of this.

I agree that this should be external to numpy. At least until it becomes a 
standard in the sense that json itself is.

And the n-dimensional array should ideally be extended to indexed structures 
with named dimensions and attributes, to accommodate a superset of xarray & 
scipp.

Regards,
DG

> On 28 Feb 2024, at 08:34, fangqq--- via NumPy-Discussion 
>  wrote:
> 
> aside from the previously mentioned ticket 
> https://github.com/numpy/numpy/issues/12481, I also made a similar proposal, 
> posted in 2021
> 
> https://github.com/numpy/numpy/issues/20461
> https://mail.python.org/archives/list/numpy-discussion@python.org/message/EVQW2PO64464JEN3RQXSCDP32RQDIQFW/
> 
> 
> lightweight JSON annotations for various data structures (trees, tables, 
> graphs), especially ND-arrays, are defined in the JData spec
> 
> https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md#data-annotation-keywords
> 
> 
> JSON/binary JSON annotation encoders/decoders have been implemented for 
> Python (https://pypi.org/project/jdata/), MATLAB/Octave 
> (https://github.com/fangq/jsonlab), JavaScript/NodeJS 
> (https://www.npmjs.com/package/jda), as well as C++ (JSON for Modern C++, 
> https://json.nlohmann.me/features/binary_formats/bjdata/)
> 
> 
> I have been extensively used this annotation in JSON/binary JSON in my 
> neuroimaging data portal, https://neurojson.io/, for example, for 3D data
> 
> https://neurojson.org/db/fieldtrip(atlas)/FieldTrip--Brainnetome--Brainnetome
> https://neurojson.org/db/fieldtrip(atlas)/FieldTrip--Brainnetome--Brainnetome#preview=$.tissue
> 
> for mesh data
> 
> https://neurojson.org/db/brainmeshlibrary/BrainWeb--Subject04--gm--surf
> https://neurojson.org/db/brainmeshlibrary/BrainWeb--Subject04--gm--surf#preview=$
> 
> the ND array supports binary data with loss-less compression. I've also 
> implemented
> 
> 
> in a renewed thread posted in 2022, I also tested the blosc2 
> (https://www.blosc.org/) compression codecs and got excellent read/write speed
> 
> https://mail.python.org/archives/list/numpy-discussion@python.org/thread/JIT4AIVEYJLSSHTSA7GOUBIVQLT3WPRU/#U33R5GL34OTL7EZX2VRQGOO4KUWED56M
> https://mail.python.org/archives/list/numpy-discussion@python.org/message/TUO7CKTQL2GGH2MIWSBH6YCO3GX4AV2O/
> 
> the blosc2 compression codecs are supported in my python and matlab/C parsers.
> 
> 
> Qianqian
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: JSON format for multi-dimensional data

2024-02-29 Thread Dom Grigonis
Could be of interest to scipp. They already put work to link to 
https://www.nexusformat.org. Maybe they would be interested in JData as well.

—

Also, what could be interesting is a "format-independent standard".


So that there is a general standard of data structures,
which is adapted by different data formats,
which are in turn implemented in different languages.

Something like the sans-I/O concept in the world of protocols.

—

Benefits:
a) This way, if a format obeys the standard, I would be certain that a directed 
graph has node, edge, and graph attributes.
b) Also, selling this to different libraries would be easier. Numpy would only 
need to implement 2 methods (to/from), which can be used by different formats 
and libraries.
c) Another benefit of this would be that it would be possible to skip the “format” 
step when converting between different libraries. E.g. xarray and scipp.
d) Finally, if I decide to make my own data class, say a graph, I only need 2 
methods to be able to convert it to any other library (see the sketch below).
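
For instance, item d) could look roughly like this (the names and the exact
payload are hypothetical, just to illustrate the two-method idea):

import numpy as np

def to_interchange(arr):
    return {"type": "ndarray", "dtype": str(arr.dtype),
            "shape": list(arr.shape), "data": arr.tobytes()}

def from_interchange(obj):
    return np.frombuffer(obj["data"], dtype=obj["dtype"]).reshape(obj["shape"])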

—

I would happily be part of such a project.

Regards,
DG
> On 29 Feb 2024, at 15:04, phili...@loco-labs.io wrote:
> 
> Thank you dom for this encouraging comment !
> 
> I agree with these remarks.
> I will indeed integrate the extensions made by scipp to Xarray.
> 
> Note: I am also looking for feedback regarding the analysis of tabular 
> structures (e.g. to identify the hidden multidimensional structure): 
> https://github.com/loco-philippe/tab-analysis/blob/main/docs/tabular_analysis.pdf
> Do you think this might be of interest to scipp or Xarray?
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] ENH: tobytes / frombuffer extension

2024-03-05 Thread Dom Grigonis
I suggest adding a flag - `full: bool`.

By default it is False. And if it is True, then it uses the full .npy format: 
https://github.com/numpy/numpy/blob/main/doc/neps/nep-0001-npy-format.rst

The whole point is that it should be implemented in C.

Currently:
```
import io
import numpy as np

def save(a):
    m = io.BytesIO()
    np.save(m, a)   # or np.lib.format.write_array
    return m

a = np.ones((1000, 100))

%timeit save(a)      # 90.8 µs
%timeit a.tobytes()  # 37 µs
```

So... for a few extra bytes, performance decreases almost 3 times, mostly due to 
copy operations. There are Stack Overflow questions regarding this. Also, various 
libraries implement custom types in messagepack and json serialisers, etc...

Such an improvement would benefit performance and improve simplicity in various 
places.
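
For illustration, the manual round-trip today (not the proposed API) looks 
roughly like this - tobytes() drops dtype and shape, so they have to be carried 
separately, which is what a `full=True` could fold into the buffer itself:
```
import numpy as np

a = np.ones((1000, 100))
payload = {"dtype": str(a.dtype), "shape": a.shape, "data": a.tobytes()}

b = np.frombuffer(payload["data"], dtype=payload["dtype"]).reshape(payload["shape"])
assert (a == b).all()
```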

Regards,
DG
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Automatic Clipping of array to upper / lower bounds of dtype

2024-03-09 Thread Dom Grigonis
Hello,

Can't find an answer to this anywhere.

What I would like is to automatically clip the values if they breach the bounds.

I have done a simple clipping, and overwritten __iadd__, __isub__, __setitem__, 
…

But I am wondering if there is a prescribed way to do this. Or maybe at least a 
centralised place exists to do such a thing? E.g. only 1 method to override?
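
A minimal sketch of the kind of thing I mean (bounds taken from the dtype itself; 
only a couple of the overridden methods shown):

import numpy as np

class ClippedArray(np.ndarray):

    def __iadd__(self, other):
        info = np.iinfo(self.dtype)
        res = np.add(self, other, dtype=np.int64)   # widen so the addition cannot overflow
        np.copyto(self, np.clip(res, info.min, info.max), casting="unsafe")
        return self

    def __setitem__(self, key, value):
        info = np.iinfo(self.dtype)
        super().__setitem__(key, np.clip(value, info.min, info.max))

a = np.zeros(3, dtype=np.uint8).view(ClippedArray)
a += 300   # stays at 255 instead of wrapping around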

Regards and thanks,
dgpb
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Automatic Clipping of array to upper / lower bounds of dtype

2024-03-10 Thread Dom Grigonis
Much thanks!

Another related question while I am at it. It is said that clip is supposed to be 
faster than np.maximum(np.minimum(arr, max), min). However:
a = np.arange(100)
%timeit a.clip(4, 20)                        # 8.48 µs
%timeit np.maximum(np.minimum(a, 20), 4)     # 2.09 µs
Is this expected?

Regards,
dg


> On 10 Mar 2024, at 09:59, Ralf Gommers  wrote:
> 
> 
> 
> On Sat, Mar 9, 2024 at 11:23 PM Dom Grigonis wrote:
> Hello,
> 
> Can't find answer to this anywhere.
> 
> What I would like is to automatically clip the values if they breach the 
> bounds.
> 
> I have done a simple clipping, and overwritten __iadd__, __isub__, 
> __setitem__, …
> 
> But I am wandering if there is a specified way to do this. Or maybe at least 
> a centralised place exists to do such thing? E.g. Only 1 method to override?
> 
> That centralized method is `__array_wrap__`; a subclass that implements 
> `__array_wrap__` by applying `np.clip` and then returning self should do this 
> I think.
> 
> Cheers,
> Ralf
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Automatic Clipping of array to upper / lower bounds of dtype

2024-03-10 Thread Dom Grigonis
Thanks,

True, clip does get faster, but the threshold is around 10k elements on my PC.

Also, I can't get __array_wrap__ to work. The arguments it receives after 
__iadd__ are all post-operation. I decided not to do it this way this time, so as 
not to hardcode such functionality into the class, but if there is a way to 
achieve this robustly it would be good to know.

Regards,
dg

> On 10 Mar 2024, at 18:43, Ralf Gommers  wrote:
> 
> 
> 
> On Sun, Mar 10, 2024 at 9:14 AM Dom Grigonis wrote:
> Much thanks!
> 
> Another related question while I am at it. It says clip is supposed to be 
> faster than np.maximum(mp.minumum(arr, max), min). However:
> a = np.arange(100)
> %timeit a.clip(4, 20)# 8.48 µs
> %timeit np.maximum(np.minimum(a, 20), 4)# 2.09 µs
> Is this expected?
> 
> Make sure that you're not benchmarking with very small arrays (2 us is on the 
> order of function call overhead) and that the timing are reproducible. `clip` 
> is more efficient:
> 
> >>> %timeit np.clip(a, 4, 20)
> 70 µs ± 304 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
> >>> %timeit np.clip(a, 4, 20)
> 72.8 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
> >>> %timeit np.maximum(np.minimum(a, 20), 4)
> 742 µs ± 8.45 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
> 
> Ralf
> 
>  
> 
> Regards,
> dg
> 
> 
>> On 10 Mar 2024, at 09:59, Ralf Gommers wrote:
>> 
>> 
>> 
>> On Sat, Mar 9, 2024 at 11:23 PM Dom Grigonis wrote:
>> Hello,
>> 
>> Can't find answer to this anywhere.
>> 
>> What I would like is to automatically clip the values if they breach the 
>> bounds.
>> 
>> I have done a simple clipping, and overwritten __iadd__, __isub__, 
>> __setitem__, …
>> 
>> But I am wandering if there is a specified way to do this. Or maybe at least 
>> a centralised place exists to do such thing? E.g. Only 1 method to override?
>> 
>> That centralized method is `__array_wrap__`; a subclass that implements 
>> `__array_wrap__` by applying `np.clip` and then returning self should do 
>> this I think.
>> 
>> Cheers,
>> Ralf
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Hi all,

Say python’s builtin `int` type. It can be as large as memory allows.

np.ndarray on the other hand is optimized for vectorization via strides, memory 
structure and many things that I probably don’t know. Well the point is that it 
is convenient and efficient to use for many things in comparison to python’s 
built-in list of integers.

So, I am thinking whether something in between exists? (And obviously something 
more clever than np.array(dtype=object))

Probably something similar to `StringDType`, but for integers and floats. (It’s 
just my guess. I don’t know anything about `StringDType`, but just guessing it 
must be better than np.array(dtype=object) in combination with np.vectorize)

Regards,
dgpb

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Thank you for this.

I am just starting to think about these things, so I appreciate your patience.

But isn’t it still true that all elements of an array are of the same 
size in memory?

I am thinking along the lines of per-element dynamic memory management. Such 
that if I had an array [0, 1e1], the first element would default to a reasonably 
small size in memory.

> On 13 Mar 2024, at 16:29, Nathan  wrote:
> 
> It is possible to do this using the new DType system. 
> 
> Sebastian wrote a sketch for a DType backed by the GNU multiprecision float 
> library: 
> https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype
> 
> It adds a significant amount of complexity to store data outside the array 
> buffer and introduces the possibility of use-after-free and dangling 
> reference errors that are impossible if the array does not use embedded 
> references, so that’s the main reason it hasn’t been done much.
> 
> On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis wrote:
> Hi all,
> 
> Say python’s builtin `int` type. It can be as large as memory allows.
> 
> np.ndarray on the other hand is optimized for vectorization via strides, 
> memory structure and many things that I probably don’t know. Well the point 
> is that it is convenient and efficient to use for many things in comparison 
> to python’s built-in list of integers.
> 
> So, I am thinking whether something in between exists? (And obviously 
> something more clever than np.array(dtype=object))
> 
> Probably something similar to `StringDType`, but for integers and floats. 
> (It’s just my guess. I don’t know anything about `StringDType`, but just 
> guessing it must be better than np.array(dtype=object) in combination with 
> np.vectorize)
> 
> Regards,
> dgpb
> 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: nathan12...@gmail.com
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
By the way, I think I am referring to integer arrays. (Or the integer part of 
floats.)

I don’t think what I am saying sensibly applies to floats as they are.

Although a new float type could base its integer part on such a concept.

—

Where I am coming from is that I started to hit maximum bounds on integer 
arrays, where most of the values are very small and some become very large. And I 
am hitting memory limits. And I don’t have many zeros, so sparse arrays aren’t 
an option.

Approximately:
90% of my arrays could fit into `np.uint8`
1% requires `np.uint64`
the rest 9% are in between.

And there is no predictable order where is what, so splitting is not an option 
either.


> On 13 Mar 2024, at 17:53, Nathan  wrote:
> 
> Yes, an array of references still has a fixed size width in the array buffer. 
> You can think of each entry in the array as a pointer to some other memory on 
> the heap, which can be a dynamic memory allocation.
> 
> There's no way in NumPy to support variable-sized array elements in the array 
> buffer, since that assumption is key to how numpy implements strided ufuncs 
> and broadcasting.,
> 
> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis wrote:
> Thank you for this.
> 
> I am just starting to think about these things, so I appreciate your patience.
> 
> But isn’t it still true that all elements of an array are still of the same 
> size in memory?
> 
> I am thinking along the lines of per-element dynamic memory management. Such 
> that if I had array [0, 1e1], the first element would default to 
> reasonably small size in memory.
> 
>> On 13 Mar 2024, at 16:29, Nathan wrote:
>> 
>> It is possible to do this using the new DType system. 
>> 
>> Sebastian wrote a sketch for a DType backed by the GNU multiprecision float 
>> library: 
>> https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype
>> 
>> It adds a significant amount of complexity to store data outside the array 
>> buffer and introduces the possibility of use-after-free and dangling 
>> reference errors that are impossible if the array does not use embedded 
>> references, so that’s the main reason it hasn’t been done much.
>> 
>> On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis wrote:
>> Hi all,
>> 
>> Say python’s builtin `int` type. It can be as large as memory allows.
>> 
>> np.ndarray on the other hand is optimized for vectorization via strides, 
>> memory structure and many things that I probably don’t know. Well the point 
>> is that it is convenient and efficient to use for many things in comparison 
>> to python’s built-in list of integers.
>> 
>> So, I am thinking whether something in between exists? (And obviously 
>> something more clever than np.array(dtype=object))
>> 
>> Probably something similar to `StringDType`, but for integers and floats. 
>> (It’s just my guess. I don’t know anything about `StringDType`, but just 
>> guessing it must be better than np.array(dtype=object) in combination with 
>> np.vectorize)
>> 
>> Regards,
>> dgpb
>> 

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Yup yup, good point.

So my array sizes in this case are 3e8. Thus, 32bit ints would be needed. So it 
is not a solution for this case.

Nevertheless, such a concept would still be worthwhile for cases where integers 
are, say, max 256 bits (or unlimited), even if memory addresses or offsets 
are 64-bit. This would both:
a) save memory if many of the values in the array are much smaller than 256 bits
b) provide a standard for dynamically unlimited-size values

—

For now, what could be a temporary solution for me is a saturating type, which 
stays at the minimum/maximum when it goes below/above the bounds.

Integer types don’t work here at all: np.uint8(255) + 2 = 1. Totally 
unacceptable.
Floats are a bit better: np.float16(65500) + 100 = np.float16(inf). At least it 
didn’t wrap around, and it went the right way (just a bit too much).
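
A rough sketch of the saturating behaviour I mean, using a wider intermediate
dtype and a clip (assumes the widened values still fit in int64):

import numpy as np

def sat_add(a, b, dtype=np.uint8):
    info = np.iinfo(dtype)
    res = np.add(a, b, dtype=np.int64)   # widen so the addition itself cannot overflow
    return np.clip(res, info.min, info.max).astype(dtype)

sat_add(np.uint8(255), 2)   # -> 255 instead of wrapping around to 1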


> On 13 Mar 2024, at 18:26, Matti Picus  wrote:
> 
> I am not sure what kind of a scheme would support various-sized native ints. 
> Any scheme that puts pointers in the array is going to be worse: the pointers 
> will be 64-bit. You could store offsets to data, but then you would need to 
> store both the offsets and the contiguous data, nearly doubling your storage. 
> What shape are your arrays, that would be the minimum size of the offsets?
> 
> Matti
> 
> 
> On 13/3/24 18:15, Dom Grigonis wrote:
>> By the way, I think I am referring to integer arrays. (Or integer part of 
>> floats.)
>> 
>> I don’t think what I am saying sensibly applies to floats as they are.
>> 
>> Although, new float type could base its integer part on such concept.
>> 
>> —
>> 
>> Where I am coming from is that I started to hit maximum bounds on integer 
>> arrays, where most of values are very small and some become very large. And 
>> I am hitting memory limits. And I don’t have many zeros, so sparse arrays 
>> aren’t an option.
>> 
>> Approximately:
>> 90% of my arrays could fit into `np.uint8`
>> 1% requires `np.uint64`
>> the rest 9% are in between.
>> 
>> And there is no predictable order where is what, so splitting is not an 
>> option either.
>> 
>> 
>>> On 13 Mar 2024, at 17:53, Nathan  wrote:
>>> 
>>> Yes, an array of references still has a fixed size width in the array 
>>> buffer. You can think of each entry in the array as a pointer to some other 
>>> memory on the heap, which can be a dynamic memory allocation.
>>> 
>>> There's no way in NumPy to support variable-sized array elements in the 
>>> array buffer, since that assumption is key to how numpy implements strided 
>>> ufuncs and broadcasting.,
>>> 
>>> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis  wrote:
>>> 
>>>Thank you for this.
>>> 
>>>I am just starting to think about these things, so I appreciate
>>>your patience.
>>> 
>>>But isn’t it still true that all elements of an array are still
>>>of the same size in memory?
>>> 
>>>I am thinking along the lines of per-element dynamic memory
>>>management. Such that if I had array [0, 1e1], the first
>>>element would default to reasonably small size in memory.
>>> 
>>>>On 13 Mar 2024, at 16:29, Nathan  wrote:
>>>> 
>>>>It is possible to do this using the new DType system.
>>>> 
>>>>Sebastian wrote a sketch for a DType backed by the GNU
>>>>multiprecision float library:
>>>>https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype
>>>> 
>>>>It adds a significant amount of complexity to store data outside
>>>>the array buffer and introduces the possibility of
>>>>use-after-free and dangling reference errors that are impossible
>>>>if the array does not use embedded references, so that’s the
>>>>main reason it hasn’t been done much.
>>>> 
>>>>On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis
>>>> wrote:
>>>> 
>>>>Hi all,
>>>> 
>>>>Say python’s builtin `int` type. It can be as large as
>>>>memory allows.
>>>> 
>>>>np.ndarray on the other hand is optimized for vectorization
>>>>via strides, memory structure and many things that I
>>>>probably don’t know. Well the point is that it is convenient
>>>>and efficient to use for many things in comparison to
>>>>python’s built-in list of integers.
>>>> 
>>>>So, I am thinking whe

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Thanks for this.

Random access is unfortunately a requirement.

By the way, what is the difference between awkward and ragged?

> On 13 Mar 2024, at 18:59, Jim Pivarski  wrote:
> 
> After sending that email, I realize that I have to take it back: your 
> motivation is to minimize memory use. The variable-length lists in Awkward 
> Array (and therefore in ragged as well) are implemented using offset arrays, 
> and they're at minimum 32-bit. The scheme is more cache-coherent (less 
> "pointer chasing"), but doesn't reduce the size.
> 
> These offsets are 32-bit so that individual values can be selected from the 
> array in constant time. If you use a smaller integer size, like uint8, then 
> they have to be number of elements in the lists, rather than offsets (the 
> cumsum of number of elements in the lists). Then, to find a single value, you 
> have to add counts from the beginning of the array.
> 
> A standard way to store variable-length integers is to put the indicator of 
> whether you've seen the whole integer yet in a high bit (so each byte 
> effectively contributes 7 bits). That's also inherently non-random access.
> 
> But if random access is not a requirement, how about Blosc and bcolz? That's 
> a library that uses a very lightweight compression algorithm on the arrays 
> and uncompresses them on the fly (fast enough to be practical). That sounds 
> like it would fit your use-case better...
> 
> Jim
> 
> 
> 
> 
> On Wed, Mar 13, 2024, 12:47 PM Jim Pivarski wrote:
> This might be a good application of Awkward Array (https://awkward-array.org/), 
> which applies a NumPy-like interface to arbitrary tree-like data, or ragged 
> (https://github.com/scikit-hep/ragged), a restriction of that to only 
> variable-length lists, but satisfying the Array API standard.
> 
> The variable-length data in Awkward Array hasn't been used to represent 
> arbitrary precision integers, though. It might be a good application of 
> "behaviors," which are documented here: 
> https://awkward-array.org/doc/main/reference/ak.behavior.html In principle, 
> it would be possible to define methods and overload NumPy ufuncs to interpret 
> variable-length lists of int8 as integers with arbitrary precision. Numba 
> might be helpful in accelerating that if normal NumPy-style vectorization is 
> insufficient.
> 
> If you're interested in following this route, I can help with first 
> implementations of that arbitrary precision integer behavior. (It's an 
> interesting application!)
> 
> Jim
> 
> 
> 
> On Wed, Mar 13, 2024, 12:28 PM Matti Picus wrote:
> I am not sure what kind of a scheme would support various-sized native 
> ints. Any scheme that puts pointers in the array is going to be worse: 
> the pointers will be 64-bit. You could store offsets to data, but then 
> you would need to store both the offsets and the contiguous data, nearly 
> doubling your storage. What shape are your arrays, that would be the 
> minimum size of the offsets?
> 
> Matti
> 
> 
> On 13/3/24 18:15, Dom Grigonis wrote:
> > By the way, I think I am referring to integer arrays. (Or integer part 
> > of floats.)
> >
> > I don’t think what I am saying sensibly applies to floats as they are.
> >
> > Although, new float type could base its integer part on such concept.
> >
> > —
> >
> > Where I am coming from is that I started to hit maximum bounds on 
> > integer arrays, where most of values are very small and some become 
> > very large. And I am hitting memory limits. And I don’t have many 
> > zeros, so sparse arrays aren’t an option.
> >
> > Approximately:
> > 90% of my arrays could fit into `np.uint8`
> > 1% requires `np.uint64`
> > the rest 9% are in between.
> >
> > And there is no predictable order where is what, so splitting is not 
> > an option either.
> >
> >
> >> On 13 Mar 2024, at 17:53, Nathan wrote:
> >>
> >> Yes, an array of references still has a fixed size width in the array 
> >> buffer. You can think of each entry in the array as a pointer to some 
> >> other memory on the heap, which can be a dynamic memory allocation.
> >>
> >> There's no way in NumPy to support variable-sized array elements in 
> >> the array buffer, since that assumption is key to how numpy 
> >> implemen

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
My array is growing in a manner of:
array[slice] += values

so for now will just clip values:
res = np.add(array[slice], values, dtype=np.int64)
array[slice] = res
mask = res > MAX_UINT16
array[slice][mask] = MAX_UINT16

For this case, these large values do not have that much impact. And extra 
operation overhead is acceptable.

---

And adding a more involved project to my TODOs for the future.

After all, it would be good to have an array which (preferably at as minimal a 
cost as possible) could handle anything you throw at it with near-optimal 
memory consumption and sensible precision handling, while keeping all the 
benefits of numpy.

Time will tell if that is achievable. If anyone has any good ideas regarding 
this, I am all ears.

Much thanks to you all for information and ideas.
dgpb

> On 13 Mar 2024, at 21:00, Homeier, Derek  wrote:
> 
> On 13 Mar 2024, at 6:01 PM, Dom Grigonis  wrote:
>> 
>> So my array sizes in this case are 3e8. Thus, 32bit ints would be needed. So 
>> it is not a solution for this case.
>> 
>> Nevertheless, such concept would still be worthwhile for cases where 
>> integers are say max 256bits (or unlimited), then even if memory addresses 
>> or offsets are 64bit. This would both:
>> a) save memory if many of values in array are much smaller than 256bits
>> b) provide a standard for dynamically unlimited size values
> 
> In principle one could encode individual offsets in a smarter way, using just 
> the minimal number of bits required,
> but again that would make random access impossible or very expensive – 
> probably more or less amounting to
> what smart compression algorithms are already doing.
> Another approach might be to use the mask approach after all (or just flag 
> all your uint8 data valued 2**8 as
> overflows) and store the correct (uint64 or whatever) values and their 
> indices in a second array.
> May still not vectorise very efficiently with just numpy if your typical 
> operations are non-local.
> 
> Derek
> 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Thanks for reiterating, this looks promising!

> On 13 Mar 2024, at 23:22, Jim Pivarski  wrote:
> 
> So that this doesn't get lost amid the discussion: 
> https://www.blosc.org/python-blosc2/python-blosc2.html
> 
> Blosc is on-the-fly compression, which is a more extreme way of making 
> variable-sized integers. The compression is in small chunks that fit into CPU 
> cachelines, such that it's random access per chunk. The compression is 
> lightweight enough that it can be faster to decompress, edit, and recompress 
> a chunk than it is to copy from RAM, edit, and copy back to RAM. (The extra 
> cost of compression is paid for by moving less data between RAM and CPU. 
> That's why I say "can be," because it depends on the entropy of the data.) 
> Since you have to copy data from RAM to CPU and back anyway, as a part of any 
> operation on an array, this can be a net win.
> 
> What you're trying to do with variable-length integers is a kind of 
> compression algorithm, an extremely lightweight one. That's why I think that 
> Blosc would fit your use-case, because it's doing the same kind of thing, but 
> with years of development behind it.
> 
> (Earlier, I recommended bcolz, which was a Python array based on Blosc, but 
> now I see that it has been deprecated. However, the link above goes to the 
> current version of the Python interface to Blosc, so I'd expect it to cover 
> the same use-cases.)
> 
> -- Jim
> 
> 
> 
> 
> 
> On Wed, Mar 13, 2024 at 4:45 PM Dom Grigonis  wrote:
> My array is growing in a manner of:
> array[slice] += values
> 
> so for now will just clip values:
> res = np.add(array[slice], values, dtype=np.int64)
> array[slice] = res
> mask = res > MAX_UINT16
> array[slice][mask] = MAX_UINT16
> 
> For this case, these large values do not have that much impact. And extra 
> operation overhead is acceptable.
> 
> ---
> 
> And adding more involved project to my TODOs for the future.
> 
> After all, it would be good to have an array, which (at preferably as minimal 
> cost as possible) could handle anything you throw at it with near-optimal 
> memory consumption and sensible precision handling, while keeping all the 
> benefits of numpy.
> 
> Time will tell if that is achievable. If anyone had any good ideas regarding 
> this I am all ears.
> 
> Much thanks to you all for information and ideas.
> dgpb
> 
>> On 13 Mar 2024, at 21:00, Homeier, Derek  wrote:
>> 
>> On 13 Mar 2024, at 6:01 PM, Dom Grigonis  wrote:
>>> 
>>> So my array sizes in this case are 3e8. Thus, 32bit ints would be needed. 
>>> So it is not a solution for this case.
>>> 
>>> Nevertheless, such concept would still be worthwhile for cases where 
>>> integers are say max 256bits (or unlimited), then even if memory addresses 
>>> or offsets are 64bit. This would both:
>>> a) save memory if many of values in array are much smaller than 256bits
>>> b) provide a standard for dynamically unlimited size values
>> 
>> In principle one could encode individual offsets in a smarter way, using 
>> just the minimal number of bits required,
>> but again that would make random access impossible or very expensive – 
>> probably more or less amounting to
>> what smart compression algorithms are already doing.
>> Another approach might be to use the mask approach after all (or just 
>> flag all your uint8 data valued 2**8 as
>> overflows) and store the correct (uint64 or whatever) values and their 
>> indices in a second array.
>> May still not vectorise very efficiently with just numpy if your typical 
>> operations are non-local.
>> 
>> Derek
>> 
>> ___
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: dom.grigo...@gmail.com
> 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.or

[Numpy-discussion] Re: Polyfit error in displacement

2024-03-25 Thread Dom Grigonis
Hello Luca,

I am a bit confused by the output of the VB script.

Equation is: y = f(x), where
x is in the order of 0-2K
y is in the order of 5-10K

The output of the fitted polynomial is in y-space, thus I would expect the fitted 
values to be similar to those of y.

Now, the SC values are very small and alternating in sign.

I suspect that the VB script is doing something other than 3rd-degree 
polynomial fitting.

I would suggest producing scatter plots with fitted lines so you can inspect the 
situation visually.
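Something along these lines should work (data copied from your message; assumes 
matplotlib is available):

import numpy as np
import matplotlib.pyplot as plt

radious = [1821, 1284, 957, 603, 450, 245]
y = [6722, 6940, 7227, 7864, 8472, 10458]

p = np.polyfit(radious, y, 3)
xs = np.linspace(min(radious), max(radious), 200)

plt.scatter(radious, y, label="data")
plt.plot(xs, np.polyval(p, xs), label="3rd-degree numpy fit")
plt.xlabel("radius")
plt.ylabel("y")
plt.legend()
plt.show()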

This is all I can say without inspecting the VB code.

Correct me if I am missing something obvious - this does happen.

Regards,
dgpb

> On 25 Mar 2024, at 18:04, Luca Bertolotti  wrote:
> 
> Hello
> in a VB program they use a 3rd-degree approximation and get this value, 
> including displacement (SC)
> 
> 
> I think that I'm doing the same with numpy but I get different values; can 
> anyone help me, please?
> 
> radious = [1821, 1284, 957, 603,450, 245]
> y = [6722, 6940, 7227, 7864,8472, 10458]
> p = np.polyfit(radious, y, 3,)
> t = np.polyval(p, radious)
> [ 6703.33694696  7061.23784145  7051.49974149  7838.84623289
>   8654.47847319 10373.60076402]
> You can see the polyval output is different from the SC values in the table
> Any help is really appreciated
> 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: NumPy-Financial: RFC: Dropping decimal.Decimal support

2024-03-29 Thread Dom Grigonis
Hi Kai,

I am not using numpy financial, but from what I know:

Given that numpy 2.0 has support for MPFD types, I think it is very reasonable to 
drop Decimal support and remove unnecessary complexity.

Regards,
dgpb

> On 30 Mar 2024, at 02:20, kaistri...@gmail.com wrote:
> 
> Hi all,
> 
> I'm currently working NumPy-Financial (npf) and would like to ask for 
> comments on dropping decimal.Decimal support in npf.  I'm proposing to drop 
> Decimal support for npf. There are a few reasons for this:
> 
> 1. No one appears to use decimal types. Since its release in 2019, not a 
> single issue has been raised that uses decimal support. Furthermore, I have 
> taken a quick survey of dependent projects, none of which seems to use 
> decimals.
> 
> 2. It's not actually supported that thoroughly. Npf claims that: “Functions 
> support the :class:`decimal.Decimal` type unless otherwise stated.”. This 
> isn't technically true. Many functions that claim to support decimal classes 
> don't. I think it would be better to have a consistent interface that would 
> either entirely support decimal or not support it at all.
> 
> 3. I'm currently working on mimicking broadcasting behaviour for functions. 
> This involves lots of for loops. To speed these up I will use either numba or 
> Cython. Neither of these support decimal classes natively.
> 
> If you have any comments on whether this seems to be the right option, I 
> would be very interested in hearing it. 
> 
> Kind regards 
> 
> Kai
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: JSON format for multi-dimensional data

2024-04-16 Thread Dom Grigonis
Looks good,

I am currently using very low-complexity data structures. Could you post some 
examples of how the data would look for simple cases, like a 2d array with 2 
coordinate arrays and no extra attributes?

Regards,
dgpb

> On 15 Apr 2024, at 15:38, phili...@loco-labs.io wrote:
> 
> Hello, 
> 
> I created a first version of a neutral format for multi-dimensional data 
> (https://nbviewer.org/github/loco-philippe/ntv-numpy/blob/main/example/example_ntv_numpy.ipynb
>  ) and I made available a first version of a package 
> (https://github.com/loco-philippe/ntv-numpy/blob/main/README.md) with:
> - a reversible (lossless round-trip) Xarray interface,
> - a reversible scipp interface
> - a reversible astropy.NDData interface
> - a reversible JSON interface
> 
> The previous Notebook shows that we can, thanks to this neutral format, share 
> any dataset with any tool.
> 
> I will integrate in a second version the existing structure for tabular data 
> (https://github.com/loco-philippe/ntv-pandas/blob/main/README.md) and the 
> associated reversible interface .
> 
> If you have examples of other tools to integrate or validation datasets, I'm 
> interested!
> 
> Have a nice day
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: feature request: N-D Gaussian function (not random distribution)

2024-07-20 Thread Dom Grigonis
For statistics functions there is the `scipy` package.

If you are referring to pdf of n-dimensional gaussian distribution, 
`scipy.stats.multivariate_normal.pdf` should do the trick.
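For example, evaluating a 2-D pdf on a meshgrid (the mean and covariance below are 
made up for illustration):

import numpy as np
from scipy.stats import multivariate_normal

x, y = np.meshgrid(np.linspace(-3, 3, 201), np.linspace(-3, 3, 201))
pos = np.stack([x, y], axis=-1)          # (..., 2): last axis holds the coordinates
pdf = multivariate_normal.pdf(pos, mean=[0.0, 0.5], cov=[[1.0, 0.3], [0.3, 2.0]])
print(pdf.shape)                          # (201, 201)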

If you are referring to something else, then a bit of clarification would be 
helpful.

Regards,
dg

> On 20 Jul 2024, at 09:04, tomnewton...@gmail.com wrote:
> 
> Hello,
> 
> Apologies if either (a) this is the wrong place to post this or (b) this 
> functionality already exists and I didn't manage to find it.
> 
> I have found myself many times in the past wishing that some sort of N-D 
> Gaussian function exists in NumPy. For example, when I wish to test that some 
> plot or data analysis method is working correctly, being able to call 
> `np.gauss()` on a (M,N) array of coordinates or passing it the arrays 
> generated by a meshgrid, along with tuples for sigma and mu, would be very 
> convenient.
> 
> I could write such a function myself, but that would not be convenient to 
> reuse (copying the function to every program/project I want to use it in), 
> and many other mathematical functions have "convenience" functions in NumPy 
> (such as the square root). More importantly, I imagine that any such function 
> that was written into NumPy by the people who regularly contribute to the 
> project would be far better than one I wrote myself, as I am not tremendously 
> good at programming.
> 
> Regards,
> Tom
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: A better syntax for using ufunc.at?

2024-07-25 Thread Dom Grigonis
It would be nicer, yes.

However, at least in my experience, it does not seem to be used often enough to 
justify going the extra mile for the sake of convenience alone.
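That said, the proposed syntax can be prototyped in pure Python without touching 
numpy internals; a rough sketch (class names are made up):

import numpy as np

class _PendingAdd:
    """Sentinel returned by UnbufferedView.__getitem__; += triggers np.add.at."""
    def __init__(self, arr, idxs):
        self._arr, self._idxs = arr, idxs
    def __iadd__(self, vals):
        np.add.at(self._arr, self._idxs, vals)   # unbuffered accumulation
        return self

class UnbufferedView:
    """Wrap an array so that `view[idxs] += vals` runs np.add.at(arr, idxs, vals)."""
    def __init__(self, arr):
        self._arr = arr
    def __getitem__(self, idxs):
        return _PendingAdd(self._arr, idxs)
    def __setitem__(self, idxs, value):
        if not isinstance(value, _PendingAdd):   # plain assignment still works
            self._arr[idxs] = value

arr = np.zeros(5)
UnbufferedView(arr)[[0, 0, 1]] += [1.0, 1.0, 3.0]
print(arr)   # [2. 3. 0. 0. 0.]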

Regards,
dg

> On 24 Jul 2024, at 13:49, Oras P.  wrote:
> 
> I am aware that to do unbuffered addition operation, I can use `np.add.at` 
> like this:
> ```
> np.add.at(arr, idxs, vals)
> ```
> I think this syntax looks a bit unnatural, and it is not obvious what it does 
> at first glance. An idea I have is to use a custom accessor, like
> ```
> arr.at[idxs] += vals
> # or 
> arr.unbuffered[idxs] += vals
> ```
> While I'm not fluent in Numpy's working mechanisms, this seems possible to 
> implement by just having the method `.at`/`.unbuffered` return a reference to 
> the original array with a special flag set, then make the `+=` operator, etc 
> check this flag and use the unbuffered operation accordingly. 
> 
> Has this kind of feature been proposed at all? I did try to search Github and 
> this mailing list for something similar, but I'm quite new to numpy 
> development, so apologies in advance if this is not the right place to ask.
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Applying a function on local environments

2024-07-26 Thread Dom Grigonis
Hello Thomas,

It seems that what you are after is the application of custom kernels.

Similar existing things to look at:
* For linear kernels, have a look at `np.correlate` function.
* For custom kernels, see `numba` stencils. See: 
https://numba.pydata.org/numba-doc/latest/user/stencil.html

One reason why it might not already exist is that the performance of such a thing 
would be poor. In C++ this makes perfect sense, but doing it this way in `numpy` is 
fairly slow compared to tailored solutions for various problems that use 
np.correlate, np.convolve and similar.

Also, it is not that hard to implement a naive version of this.
E.g. a quick implementation for the 2d case:
def correlate_func2d(a, func):
    """
    Examples:
    >>> a = np.ones((4, 4))
    >>> def func(a, i, j):
    ...     sub = a[max(i-1, 0):i+2, max(j-1, 0):j+2]
    ...     return sub.sum()
    >>> correlate_func2d(a, func)
    array([[4., 6., 6., 4.],
           [6., 9., 9., 6.],
           [6., 9., 9., 6.],
           [4., 6., 6., 4.]])
    """
    vfunc = np.vectorize(func, excluded={0})
    n, m = a.shape
    return vfunc(a, np.arange(n)[:, None], np.arange(m))
Maybe something flexible that works for an arbitrary number of dimensions could be 
useful. It wouldn't be very performant, but it would be very convenient (same 
spirit as np.vectorize).

Subarray creation could be factored out of the input function (for better 
convenience, but a loss of flexibility). In that case it would be good if the `mode` 
argument were modelled after `np.correlate`.
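Another existing building block worth mentioning is 
`np.lib.stride_tricks.sliding_window_view`, which materialises the K x K x K 
environments as a zero-copy view and covers reductions such as the mean/std example 
directly. A quick sketch (interior points only; pad the input to keep the boundary):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.random.rand(10, 10, 10)
K = 3
win = sliding_window_view(a, (K, K, K))        # view of shape (8, 8, 8, 3, 3, 3), no copy
w = win.mean(axis=(-3, -2, -1)) - win.std(axis=(-3, -2, -1))
print(w.shape)                                  # (8, 8, 8)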

Regards,
DG
> On 25 Jul 2024, at 19:13, langt...@forwiss.uni-passau.de wrote:
> 
> Dear all,
> 
> my goal would be to apply some function on a local environment of size K x K 
> x K where K is bigger than 1 and odd. For example, if K = 3, it would be nice 
> to apply some function "func" on such an environment around each 
> n-dimensional position within the n-dimensional array. So, for K = 3 and the 
> position (1,1,1) if a 3D array, one would collect the array values at the 
> positions X = [(0,0,0),(0,0,1),(0,0,2),(0,1,0),...(2,2,2)] (K**3 = 27 in 
> total in this example) and yield those values to "func". The result value at 
> position (1,1,1) in the output array would be y = func(X). The same would 
> apply for all entries excluding the padding area (or according to some 
> padding policy).
> 
> While I coded this many times on plain buffers in C++, I was wondering if 
> there would be an *efficient* way to do this in numpy? Up to now I relied on 
> the ndenumerate way of iterating n-dimensional arrays and aggregating the 
> values using slices, but this turns out to be unbearably slow even for quite 
> small arrays :-/
> Is there a better "numpy-isque" way to do this? I thought of writing a custom 
> view or subclassing, but the efficient aggregation of local environments 
> using the ndenumerate and slice approach is slow, yet in C/C++ random access 
> and native support for parallelism (by means of OpenMP) would drastically 
> accelerate this.
> 
> Or would it even be preferred to add this functionality to the library? I 
> could imagine a special view, e.g., "LocalEnvironmentArrayView" or just a 
> simple function with the intended usage something like the following:
> 
> * a = ... # some n-dimensional numpy array
> * func = lambda env: np.mean(env) - np.std(env)
> * w = numpy.apply_on_local_environments(a, func, environment_size=3)
> 
> So, each entry in the interior (or also the boundary, depending on the 
> padding policy) of w would correspond to the evaluation of func on that local 
> environment.
> 
> Best regards,
> Thomas
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Add to NumPy a function to compute cumulative sums from 0.

2023-08-12 Thread Dom Grigonis

From my point of view, such a function is a bit of a corner case to be added to 
numpy. And it doesn't justify its naming anymore: it is not one operation anymore, 
it is a cumsum plus prepending 0. And it is very difficult to argue why prepending 
0 to cumsum is part of cumsum.

What I would rather vouch for is adding an argument to `np.diff` so that it 
leaves the first row unmodified.
def diff0(a, axis=-1):
    """Differencing which retains the first item along the axis (difference from origin)."""
    a0 = np.take(a, [0], axis=axis)
    return np.concatenate([a0, np.diff(a, n=1, axis=axis)], axis=axis)
This would be more sensible from a conceptual point of view. Where a difference 
cannot be taken, the result is the difference from the absolute origin, with the 
recognition that the first non-origin value in a sequence is the one after it. And 
if the first row is the origin in a specific case, then that origin is correctly 
defined in relation to the absolute origin.

Then, if an origin row is needed, it can be prepended at the beginning of a 
procedure, and np.diff and np.cumsum are inverses throughout the sequential code.
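For instance, with diff0 as defined above:

import numpy as np

x = np.array([3, 1, 4, 1, 5])
d = diff0(x)        # array([ 3, -2,  3, -3,  4])
np.cumsum(d)        # array([3, 1, 4, 1, 5]) - recovers x, same length throughout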

np.diff0 was one of the first functions I added to my numpy utils, and I have been 
using it instead of np.diff quite a lot.

I think a general flag to prevent fencepost errors could be added to all functions 
where required, so that the flow is seamless and retains the initial dimension 
length. Taking some time to ensure consistency across numpy in this dimension 
could be of long-term value.

E.g. rolling functions in numbagg and bottleneck leave NaNs, because there is no 
other sensible value to put there. In this case a sensible value exists - just not 
in the `cumsum` function.

> On 11 Aug 2023, at 15:53, Juan Nunez-Iglesias  wrote:
> 
> I'm very sensitive to the issues of adding to the already bloated numpy API, 
> but I would definitely find use in this function. I literally made this error 
> (thinking that the first element of cumsum should be 0) just a couple of days 
> ago! What are the plans for the "extended" NumPy API after 2.0? Is there a 
> good place for these variants?
> 
> On Fri, 11 Aug 2023, at 2:07 AM, john.daw...@camlingroup.com wrote:
>> `cumsum` computes the sum of the first k summands for every k from 1. 
>> Judging by my experience, it is more often useful to compute the sum of 
>> the first k summands for every k from 0, as `cumsum`'s behaviour leads 
>> to fencepost-like problems.
>> https://en.wikipedia.org/wiki/Off-by-one_error#Fencepost_error
>> For example, `cumsum` is not the inverse of `diff`. I propose adding a 
>> function to NumPy to compute cumulative sums beginning with 0, that is, 
>> an inverse of `diff`. It might be called `cumsum0`. The following code 
>> is probably not the best way to implement it, but it illustrates the 
>> desired behaviour.
>> 
>> ```
>> def cumsum0(a, axis=None, dtype=None, out=None):
>>"""
>>Return the cumulative sum of the elements along a given axis,
>>beginning with 0.
>> 
>>cumsum0 does the same as cumsum except that cumsum computes the sum
>>of the first k summands for every k from 1 and cumsum, from 0.
>> 
>>Parameters
>>--
>>a : array_like
>>Input array.
>>axis : int, optional
>>Axis along which the cumulative sum is computed. The default
>>(None) is to compute the cumulative sum over the flattened
>>array.
>>dtype : dtype, optional
>>Type of the returned array and of the accumulator in which the
>>elements are summed. If `dtype` is not specified, it defaults to
>>the dtype of `a`, unless `a` has an integer dtype with a
>>precision less than that of the default platform integer. In
>>that case, the default platform integer is used.
>>out : ndarray, optional
>>Alternative output array in which to place the result. It must
>>have the same shape and buffer length as the expected output but
>>the type will be cast if necessary. See
>>:ref:`ufuncs-output-type` for more details.
>> 
>>Returns
>>---
>>cumsum0_along_axis : ndarray.
>>A new array holding the result is returned unless `out` is
>>specified, in which case a reference to `out` is returned. If
>>`axis` is not None the result has the same shape as `a` except
>>along `axis`, where the dimension is smaller by 1.
>> 
>>See Also
>>
>>cumsum : Cumulatively sum array elements, beginning with the first.
>>sum : Sum array elements.
>>trapz : Integration of array values using the composite trapezoidal rule.
>>diff : Calculate the n-th discrete difference along given axis.
>> 
>>Notes
>>-
>>Arithmetic is modular when using integer types, and no error is
>>raised on overflow.
>> 
>>``cumsum0(a)[-1]`` may not be equal to ``sum(a)`` for floating-point
>>values since ``sum`` may use a pairwise summation routine, reducing
>>the roundoff-error. See `sum` for more inf

[Numpy-discussion] Re: Add to NumPy a function to compute cumulative sums from 0.

2023-08-15 Thread Dom Grigonis


> On 14 Aug 2023, at 15:22, john.daw...@camlingroup.com wrote:
> 
>> From my point of view, such function is a bit of a corner-case to be added 
>> to numpy. And it doesn’t justify it’s naming anymore. It is not one 
>> operation anymore. It is a cumsum and prepending 0. And it is very difficult 
>> to argue why prepending 0 to cumsum is a part of cumsum.
> 
> That is backwards. Consider the array [x0, x1, x2].
> 
> The sum of the first 0 elements is 0.
> The sum of the first 1 elements is x0.
> The sum of the first 2 elements is x0+x1.
> The sum of the first 3 elements is x0+x1+x2.
> 
> Hence, the array of partial sums is [0, x0, x0+x1, x0+x1+x2].
> 
> Thus, the operation [x0, x1, x2] -> [0, x0, x0+x1, x0+x1+x2] is a natural and 
> primitive one.
> 
> The current behaviour of numpy.cumsum is the composition of two basic 
> operations, computing the partial sums and omitting the initial value:
> 
> [x0, x1, x2] -> [0, x0, x0+x1, x0+x1+x2] -> [x0, x0+x1, x0+x1+x2].
In reality both of these functions do exactly what they need to do. But the 
issue, as I understand it, is to have one of them defined in such a way that they 
are inverses of each other. The only question is which one is better suited for 
that and provides the most benefit.

Arguments for np.diff0:
1. Dimension length stays constant, while cumsum0 extends the length to n+1 and 
np.diff then truncates it back. This adds extra complexity, whereas things are very 
convenient to work with when the dimension length stays constant throughout the 
code.
2. I see your argument about element 0, but the fact is that it doesn't exist at 
all. In the np.diff0 case at least half of it exists and the other half has a 
half-decent rationale. In the cumsum0 case it just appears out of nowhere, and in 
your example above you are applying very different logic from what np.cumsum 
intrinsically does. Ilhan accurately pointed this out in his e-mail.

For now, I only see my point of view, and I can list a number of cases from data 
analysis and modelling where I found np.diff0 to be a fairly optimal choice and 
where it made things smoother. I haven't seen any real-life examples where 
np.cumsum0 would be useful, so I am naturally biased. I would appreciate it if 
anyone provided some examples that justify np.cumsum0 - for now I just can't think 
of any case where it would actually be useful or more convenient/sensible than 
np.diff0.

>> What I would rather vouch for is adding an argument to `np.diff` so that it 
>> leaves first row unmodified.
>> def diff0(a, axis=-1):
>>"""Differencing which appends first item along the axis"""
>>a0 = np.take(a, [0], axis=axis)
>>return np.concatenate([a0, np.diff(a, n=1, axis=axis)], axis=axis)
>> This would be more sensible from conceptual point of view. As difference can 
>> not be made, the result is the difference from absolute origin. With 
>> recognition that first non-origin value in a sequence is the one after it. 
>> And if the first row is the origin in a specific case, then that origin is 
>> correctly defined in relation to absolute origin.
>> Then, if origin row is needed, then it can be prepended in the beginning of 
>> a procedure. And np.diff and np.cumsum are inverses throughout the 
>> sequential code.
>> np.diff0 was one the first functions I had added to my numpy utils and been 
>> using it instead of np.diff quite a lot.
> 
> This suggestion is bad: diff0 is conceptually confused. numpy.diff changes an 
> array of numpy.datetime64s to an array of numpy.timedelta64s, but numpy.diff0 
> changes an array of numpy.datetime64s to a heterogeneous array where one 
> element is a numpy.datetime64 and the rest are numpy.timedelta64s. In 
> general, whereas numpy.diff changes an array of positions to an array of 
> displacements, diff0 changes an array of positions to a heterogeneous array 
> where one element is a position and the rest are displacements.


This isn't really an argument against np.diff0, just one aspect of it which would 
have to be dealt with. If, instead of just prepending, the difference from 0 were 
taken, it would result in numpy.timedelta64s. So not a big issue.
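Concretely, a small sketch of what I mean:

import numpy as np

t = np.array(['2023-01-01', '2023-01-03', '2023-01-10'], dtype='datetime64[D]')
np.diff(t)                   # array([2, 7], dtype='timedelta64[D]')
t - np.datetime64(0, 'D')    # differences from the origin: all timedelta64, no mixed types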


___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Add to NumPy a function to compute cumulative sums from 0.

2023-08-15 Thread Dom Grigonis
I agree with this; it sounds like a more radical (in a good way) solution.

> So I think this is a feature request of "prepend", "append" in a convenient 
> fashion not to ufuncs but to ndarray. Because concatenation is just pain in 
> NumPy and ubiquitous operation all around. Hence probably we should get a 
> decision on that instead of discussing each case separately.

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Proposing a flattening functionality for deeply nested lists in NumPy

2024-12-30 Thread Dom Grigonis via NumPy-Discussion
Hi Mark,

I think this has already been implemented.

For Iterables (lists included), there is `more_itertools.collapse`, which covers 
the examples in your code via `list(more_itertools.collapse(obj))`. See: 
https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.collapse
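For example:

import more_itertools

obj = [1, [2], [[3]], [[[4]]]]
print(list(more_itertools.collapse(obj)))   # [1, 2, 3, 4]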

If you don’t mind intermediate conversions:
a) For well-behaved (rectangular) arrays, such as `numpy.ndarray`, you can do 
`np.ravel(obj).tolist()`, or there is the somewhat more flexible `torch.flatten` (I 
think `numpy` could add this functionality too - it is quite useful).
b) For ragged arrays, there is `awkward.flatten`.

Given that your need is not related to `numpy.ndarray` objects, I don't think 
`numpy` is the right place for it. I would instead suggest exploring 
`more_itertools` and, if something is missing, raising `Iterator`-related 
proposals over there.


> On 31 Dec 2024, at 06:37, Mark via NumPy-Discussion 
>  wrote:
> 
> Hello all,
> 
> Many people have asked how to flatten a nested list into a one-dimensional 
> list (e.g., see this StackOverflow thread). While flattening a 2D list is 
> relatively straightforward, deeply nested 
> lists can become cumbersome to handle. To address this challenge, I propose 
> adding a built-in list-flattening functionality to NumPy.
>  
> By adding this feature to NumPy, the library would not only simplify a 
> frequently used task but also enhance its overall usability, making it an 
> even more powerful tool for data manipulation and scientific computing.
>  
> The code snippet below demonstrates how a nested list can be flattened, 
> enabling conversion into a NumPy array. I believe this would be a valuable 
> addition to the package. See also this issue.
> 
> from collections.abc import Iterable
> 
> def flatten_list(iterable):
>     """
>     Flatten a (nested) list into a one-dimensional list.
> 
>     Parameters
>     ----------
>     iterable : iterable
>         The input collection.
> 
>     Returns
>     -------
>     flattened_list : list
>         A one-dimensional list containing all the elements from the input,
>         with any nested structures flattened.
> 
>     Examples
>     --------
>     Flattening a list containing nested lists:
> 
>     >>> obj = [[1, 2, 3], [1, 2, 3]]
>     >>> flatten_list(obj)
>     [1, 2, 3, 1, 2, 3]
> 
>     Flattening a list with sublists of different lengths:
> 
>     >>> obj = [1, [7, 4], [8, 1, 5]]
>     >>> flatten_list(obj)
>     [1, 7, 4, 8, 1, 5]
> 
>     Flattening a deeply nested list:
> 
>     >>> obj = [1, [2], [[3]], [[[4]]],]
>     >>> flatten_list(obj)
>     [1, 2, 3, 4]
> 
>     Flattening a list with various types of elements:
> 
>     >>> obj = [1, [2], (3), (4,), {5}, np.array([1,2,3]), range(3), 'Hello']
>     >>> flatten_list(obj)
>     [1, 2, 3, 4, 5, 1, 2, 3, 0, 1, 2, 'Hello']
>     """
>     if not isinstance(iterable, Iterable) or isinstance(iterable, str):
>         return [iterable]
> 
>     def flatten_generator(iterable):
>         for item in iterable:
>             if isinstance(item, Iterable) and not isinstance(item, str):
>                 yield from flatten_generator(item)
>             else:
>                 yield item
> 
>     return list(flatten_generator(iterable))
> 
> 
> 
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com