[Numpy-discussion] Enhancement: np.convolve(..., mode="normalized")

2023-11-22 Thread filip . dominec
Convolution is often used for smoothing noisy data; a typical use will keep the 
'same' length of data and may look like this:

>convol = 2**-np.linspace(-2,2,100)**2; 
>y2 = np.convolve(y,convol/np.sum(convol), mode='same') ## simple smoothing
>ax.plot(x, y2, label="simple smoothing", color='g')

However, when the curve being smoothed has a nonzero background value at its 
edges, this convolution mode internally pads it with zeros, leaving the 
curve looking like Frank Zappa's moustache.

I made an example plot illustrating this here: 
https://www.fzu.cz/~dominecf/misc/numpy_smoothing_example.png. Such a result, 
i.e. the green curve, is not publication ready!

1) One workaround is to call np.pad(..., mode='edge'), convolve, and then 
truncate the curve back to its original length. This is not a correct approach, 
however: it makes the curve edges smooth, but their actual shape becomes 
very sensitive to pointwise noise. Moreover, it artificially removes the 
curve's slope at its edges.
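
A minimal runnable sketch of approach 1 may make the padding and truncation 
step concrete. The synthetic x and y data and the names kernel, pad_left and 
pad_right are assumptions for illustration, not part of the original post. 
The padding widths are chosen so that the interior samples should line up 
with what mode='same' returns.

```python
import numpy as np

# Synthetic data (an assumption for illustration): sloped background plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 500)
y = 5.0 + 0.3 * x + rng.normal(scale=0.2, size=x.size)

# Same kernel shape as in the post, normalized to unit sum.
kernel = 2.0 ** -(np.linspace(-2, 2, 100) ** 2)
kernel /= kernel.sum()

# Repeat the edge values on both sides so the kernel never sees zeros.
pad_left = kernel.size // 2
pad_right = kernel.size - 1 - pad_left
y_padded = np.pad(y, (pad_left, pad_right), mode="edge")

# mode='valid' keeps only fully overlapping positions, which truncates the
# padded result back to the original length.
y_edge = np.convolve(y_padded, kernel, mode="valid")
```

Because each edge is padded with copies of a single noisy sample, that one 
sample has a large influence on the smoothed edge, which is the sensitivity 
described above.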

2) Another workaround is to generate an auxiliary "Zappa's moustache" by 
applying the same convolution to a fresh array of np.ones_like(y), and then to 
normalize the convolved curve by this auxiliary function. Its only downside is 
that the curve stays noisier at its edges, which nevertheless seems more 
scientifically honest to me: at the dataset edges one simply has fewer points 
with which to filter out noise.

>convol = 2**-np.linspace(-2,2,100)**2; 
>norm = np.convolve(np.ones_like(y),convol/np.sum(convol), mode='same')
>y2 = np.convolve(y,convol/np.sum(convol), mode='same')/norm ## approach 2
>ax.plot(x, y2, label="approach 2", color='k')
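
For reference, approach 2 can be wrapped in a small self-contained function. 
This is only a sketch: convolve_normalized is a hypothetical helper name, not 
an existing NumPy function, and the constant test signal is an assumption for 
illustration.

```python
import numpy as np

def convolve_normalized(y, kernel):
    """Hypothetical helper (not part of NumPy): 'same'-mode smoothing divided
    by the kernel weight that actually overlaps the data at each point."""
    y = np.asarray(y, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    smoothed = np.convolve(y, kernel, mode="same")
    # Convolving an array of ones gives the overlapping kernel weight,
    # i.e. the auxiliary "moustache" used for normalization.
    weight = np.convolve(np.ones_like(y), kernel, mode="same")
    return smoothed / weight

kernel = 2.0 ** -(np.linspace(-2, 2, 100) ** 2)
flat = np.full(300, 7.0)  # constant background, chosen for illustration

plain = np.convolve(flat, kernel / kernel.sum(), mode="same")
fixed = convolve_normalized(flat, kernel)
```

On a constant signal, plain sags toward zero at both ends while fixed stays 
at the background value. The kernel does not need to be normalized 
beforehand, since any overall scale cancels in the ratio.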

In my experimental work, I miss having this behaviour of np.convolve available 
in a single function. I suggest making it accessible in numpy through a 
mode="normalized" option. (Actually I believe this could have been the 
default, but changing that now would break compatibility.) 

Thanks for consideration,
Filip
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] NEP 55 Updates and call for testing

2023-11-22 Thread Nathan
Hi all,

This week I updated NEP 55 to reflect the changes I made to the prototype
since I initially sent out the NEP. The updated NEP is available on the
NumPy website: https://numpy.org/neps/nep-0055-string_dtype.html.

Updates to the NEP
++++++++++++++++++

The changes since the original version of the NEP focus on fully defining
the C API surface we would like to add to the NumPy C API and an
implementation of a per-dtype-instance arena allocator to manage heap
allocations. This enabled major improvements to the prototype, including
implementing the small string optimization and locking all access to heap
memory behind a fine-grained mutex, which should prevent seg faults or
memory corruption in a multithreaded context. Thanks to Warren Weckesser
for his proof of concept code and help with the small string optimization
implementation; he has been added as an author to reflect his
contributions.

With these changes the stringdtype prototype is feature complete.

Call to Review NEP 55
+++++++++++++++++++++

I'm requesting another round of review on the NEP with an eye toward
acceptance before the NumPy 2.0 release branch is created from main. If I
can manage it, my plan is to have a pull request open that merges the
stringdtype codebase into NumPy before the branch is created. That said, if
we decide that we need more time, or if some issue comes up, I'm happy with
this going into main after the NumPy 2.0 release branch is created.

The most significant feedback we have not addressed from the last round of
review was Warren's suggestion to add a default missing data sentinel to
NumPy itself. For reasons outlined in the NEP and in my reply to Warren
from earlier this year, we do not want to add a missing data singleton to
NumPy, instead leaving it to users to choose the missing data semantics
they prefer. Otherwise I believe the current draft addresses all
outstanding feedback from the last round of review.

Help me Test the Prototype!
+++++++++++++++++++++++++++

If anyone has time and interest, I would also very much appreciate some
testing and tire-kicking on the stringdtype prototype, available at
https://github.com/numpy/numpy-user-dtypes.

There is a README with build instructions here:
https://github.com/numpy/numpy-user-dtypes/blob/main/stringdtype/README.md

If you have a Python development environment with a C compiler, it should be
straightforward to build, install, and test the prototype. Note that you
must have `NUMPY_EXPERIMENTAL_DTYPE_API=1` set in your shell environment or
via `os.environ` to import stringdtype without error.
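
As a rough sketch of the `os.environ` route: the variable has to be in place
before the `import stringdtype` statement runs. The try/except fallback is an
assumption added so the snippet also runs on a machine where the prototype
has not been built and installed yet.

```python
import os

# Set the flag before the prototype's extension module is imported.
os.environ["NUMPY_EXPERIMENTAL_DTYPE_API"] = "1"

try:
    import stringdtype  # the prototype package from numpy-user-dtypes
except ImportError:
    # Assumption for illustration: fall back gracefully if it is not installed.
    stringdtype = None

print("stringdtype available:", stringdtype is not None)
```

The shell equivalent is to export `NUMPY_EXPERIMENTAL_DTYPE_API=1` before
starting Python.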

I'm particularly interested to hear experiences converting code to use
stringdtype. This could be code using fixed-width strings in a situation
where a variable-length string array makes more sense or code using object
string arrays. Are there pain points that aren't discussed in the NEP or
existing workflows that cannot be adapted to use stringdtype? As far as I'm
aware there aren't, but more testing will help catch issues before we've
stabilized everything.

My fork of pandas might be a source of inspiration for porting an existing
non-trivial codebase that used object string arrays:

https://github.com/pandas-dev/pandas/compare/main...ngoldbaum:pandas:stringdtype

Thanks all for your time, attention, and help reviewing the NEP!

-Nathan


[Numpy-discussion] Re: Enhancement: np.convolve(..., mode="normalized")

2023-11-22 Thread Ronald van Elburg
I wonder whether you are looking for the solution in the right direction. Is 
there a theory for the shape of the curve? In that case it might be better to 
treat the problem as a fitting problem.

Other than that, I think option 2 is too ad hoc for scientific work. I would opt 
for simply not showing the smoothed curve where it is not available. The convol 
function you specified here is a very narrow Gaussian; is that the function you 
actually used?
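
One possible reading of "not showing the smoothed curve where it is not 
available" is mode='valid', which only returns points where the kernel fully 
overlaps the data. The following is a sketch under that assumption; the 
synthetic data and the names kernel, start and x_valid are illustrative only.

```python
import numpy as np

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 500)
y = 5.0 + 0.3 * x + rng.normal(scale=0.2, size=x.size)

kernel = 2.0 ** -(np.linspace(-2, 2, 100) ** 2)
kernel /= kernel.sum()

# Only positions with full kernel overlap are returned, so the edges are
# simply absent instead of being padded or renormalized.
y_valid = np.convolve(y, kernel, mode="valid")

# Matching x values, aligned the same way mode='same' aligns its output.
start = kernel.size // 2
x_valid = x[start:start + y_valid.size]
```

Plotting y_valid against x_valid leaves a gap of roughly half a kernel width 
at each end of the data range.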

Note: The code you provided cannot be executed.