[Numpy-discussion] Re: Percentile/Quantile "interpolation" refactor

2021-10-18 Thread Abel AOUN
Hello,
Thanks for the summary of the PR Sebastian.


Regarding the default method of Python's `quantiles` being "exclusive":
the developers give some explanation of why it is the default, but only in a
comment block above the code of `quantiles`, not in the actual documentation.
You can see it here: 
https://github.com/python/cpython/blob/main/Lib/statistics.py#L615
For reference the documentation is here: 
https://docs.python.org/3/library/statistics.html#statistics.quantiles

By the way, it is method 6 that Python names "exclusive" and method 7 that
it names "inclusive".
The "inclusive" and "exclusive" names themselves seem to come from Excel.


Sincerely,
Abel


- Original Message -
From: "Sebastian Berg" 
To: "Discussion of Numerical Python" 
Sent: Wednesday, October 13, 2021 17:25:19
Subject: [Numpy-discussion] Percentile/Quantile "interpolation" refactor

Hi all,

after a long time, Abel has helped us out and refactored the quantile and
percentile functions' `interpolation` keyword.

This was long overdue, since the three non-default interpolation methods
NumPy implements appear to be very much non-standard.  On the other hand,
NumPy currently has no unbiased methods (i.e. population estimates).

There are two main questions right now with respect to the API: first,
which names to use for the methods, and second, how to deal with
"outliers".


The PR https://github.com/numpy/numpy/pull/19857#issuecomment-939852134
adds the methods and gives them (currently) the following names, numbered
to match the corresponding R methods; the names will be used as string
identifiers:

1. inverted cdf
2. averaged inverted cdf
3. closest observation
4. interpolated inverted cdf
5. hazen  (name from Wolfram)
6. weibull  (name from Wolfram)
7. linear  (default!  Better name deferred)
8. median unbiased
9. normal unbiased

And additionally the four we currently have:

* lower
* higher
* nearest
* midpoint
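
For orientation, a minimal sketch of how these identifiers are exercised,
assuming the spelling that eventually shipped in NumPy 1.22 (spaces became
underscores, e.g. 'inverted_cdf'):

    import numpy as np

    data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

    # All nine R-style estimators plus the four original discrete variants
    # are selected by a string identifier.
    for m in ['inverted_cdf', 'averaged_inverted_cdf', 'closest_observation',
              'interpolated_inverted_cdf', 'hazen', 'weibull', 'linear',
              'median_unbiased', 'normal_unbiased',
              'lower', 'higher', 'nearest', 'midpoint']:
        print(f"{m:26s} {np.quantile(data, 0.25, method=m)}")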

Numbers 5 and 6 are named "exclusive" and "inclusive" by Python in
its `method` keyword argument.  While I like the name `method=` and
may want to move to it, I am not sure I like "inclusive" and
"exclusive".
The current plan is to defer the kwarg rename to a follow-up, although it
should be discussed before the next release.


The second main question is how to deal with outliers (this does not
affect the default method 7, which finds the sample quantiles and not a
population estimate).  Wikipedia says this:

Packages differ in how they estimate quantiles beyond the lowest
and highest values in the sample, i.e. p < 1/N and p > (N − 1)/N.
Choices include returning an error value, computing linear
extrapolation, or assuming a constant value.

The current choice is clipping (assuming a constant value), but this
could be modified.
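
To make the clipping concrete, a small sketch (again assuming the NumPy >=
1.22 spelling; 'weibull' is R method 6, whose plotting position runs out of
data at p < 1/(N+1) and p > N/(N+1)):

    import numpy as np

    data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # N = 5

    # Probabilities beyond the first/last plotting position are clipped to
    # the sample minimum/maximum instead of extrapolated or rejected.
    print(np.quantile(data, 0.01, method='weibull'))  # 10.0 (clipped to min)
    print(np.quantile(data, 0.99, method='weibull'))  # 50.0 (clipped to max)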


Any feedback is appreciated!  Otherwise, this will probably move forward
in its current state for the next release.

Cheers,

Sebastian



[Numpy-discussion] Re: Ways to achieve faster np.nanpercentile() calculation?

2022-09-19 Thread Abel AOUN
Hi Aron,

There is an implementation of nanpercentile for ndarrays in xclim[0]; see
`calc_perc`[1].
Bear in mind it is not exposed in the public API, so I would only use it as
an example implementation.
You may also find a performance script and report (quite poorly written,
sorry) in a gist[2]. I'm not sure how accurate it is against the latest
NumPy version.

Also, the interpolation configuration is limited to the nine R methods,
selected through alpha and beta parameters (similar to SciPy's
`mquantiles`).
Thus, you can't use the "nearest", "lower", "higher", or "midpoint"
methods, but, as the sketch below illustrates:
- "linear" (the NumPy default) would be alpha=1 and beta=1;
- "median_unbiased" (the recommended default [3]) would be alpha=1/3 and
  beta=1/3.


Cheers,
Abel Aoun

[0] https://github.com/Ouranosinc/xclim
[1] https://github.com/Ouranosinc/xclim/blob/master/xclim/core/utils.py#L240
[2] https://gist.github.com/bzah/2a84d050b8a1aed1b40a2ed1526e1f12 
[3] https://www.researchgate.net/publication/222105754_Sample_Quantiles_in_Statistical_Packages

- Original Message -
From: "Aron Gergely" 
To: "Discussion of Numerical Python" 
Sent: Friday, September 16, 2022 10:56:28
Subject: [Numpy-discussion] Ways to achieve faster np.nanpercentile() calculation?

Hi all,

On my system, np.nanpercentile() is orders of magnitude (>100x) slower
than np.percentile().
I am using NumPy 1.23.1.

I am wondering if there is a way to speed it up.

I came across this workaround for 3D arrays:
https://krstn.eu/np.nanpercentile()-there-has-to-be-a-faster-way/

But I would need a generalized solution that works in N dimensions.
So I started adapting the above - but am I reinventing the wheel here?

Is there already a Python package that implements a speedier
nanpercentile for NumPy (a similar idea to the 'Bottleneck' package)?
Or are there other known workarounds to achieve the same result?
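
For illustration, a minimal sketch of the sorting-based idea generalized to
N dimensions might look like the following. Note this is an
assumption-laden sketch, not a drop-in replacement: it implements only the
nearest-rank "lower" method, with no interpolation, and the helper name is
hypothetical.

    import numpy as np

    def nanpercentile_sketch(a, q, axis=-1):
        """Hypothetical helper: percentile along `axis`, ignoring NaNs,
        using the 'lower' (nearest-rank) method only."""
        a = np.moveaxis(a, axis, -1)           # put the reduction axis last
        srt = np.sort(a, axis=-1)              # NaNs sort to the end
        valid = (~np.isnan(a)).sum(axis=-1)    # non-NaN count per slice
        # Rank of the q-th percentile within the valid part of each slice.
        idx = np.clip((q / 100.0) * (valid - 1), 0, None).astype(np.intp)
        out = np.take_along_axis(srt, idx[..., None], axis=-1)[..., 0]
        return np.where(valid > 0, out, np.nan)  # all-NaN slices give NaN

    rng = np.random.default_rng(0)
    x = rng.random((4, 5, 6))
    x[x < 0.2] = np.nan
    print(np.allclose(nanpercentile_sketch(x, 50, axis=1),
                      np.nanpercentile(x, 50, axis=1, method='lower'),
                      equal_nan=True))  # True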

Best regards,
Aron
