[Numpy-discussion] Proposal: Automatic estimation of number of histogram bins for weighted data

2021-12-12 Thread Jonathan Crall
Hi all, this is my first post on this mailing list.

I'm writing to propose a method for extending the histogram bandwidth
estimators to work with weighted data. I originally submitted this proposal
to seaborn: https://github.com/mwaskom/seaborn/issues/2710 and mwaskom
suggested I take it here.

Currently the unweighted 'auto' heuristic is a combination of
the Freedman-Diaconis and Sturges estimators. Both estimate a bin width, from
which the bin count follows as ceil(ptp / width). For reference, the rules
are as follows:

Sturges: take the peak-to-peak range ptp = x.max() - x.min() and the number
of data points total = x.size, then divide ptp by one plus the base-2 log of
the number of data points:

ptp / (log2(total) + 1)

Freedman-Diaconis: find the interquartile range of the data,
iqr = np.subtract(*np.percentile(x, [75, 25])), and the number of data
points total = x.size, then apply the formula:

2.0 * iqr * total ** (-1.0 / 3.0)
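
For context, here is my reading of how the 'auto' rule combines the two (a
sketch, not the literal numpy source; numpy also handles edge cases I'm
glossing over):

import numpy as np

def auto_bin_width(x):
    # Take the smaller of the two candidate bin widths, falling back to
    # Sturges when the FD width is zero (e.g. when the IQR is zero).
    ptp = x.max() - x.min()
    total = x.size
    sturges_bw = ptp / (np.log2(total) + 1.0)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    fd_bw = 2.0 * iqr * total ** (-1.0 / 3.0)
    return min(fd_bw, sturges_bw) if fd_bw > 0 else sturges_bw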

Taking a look at these, it seems (please correct me if I'm missing something
that makes this not work) that there is a simple extension to weighted data:
if we can find weighted replacements for ptp, total, and iqr, the formulas
should work exactly the same in the weighted case.

The ptp case seems easy. Even if the data points are weighted, that doesn't
change the min and max. Nothing changes here.

For total, instead of taking the size of the array (which implicitly
assumes each data point has a weight of 1), just sum the weights to get
total = weights.sum().
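
For example (toy data, just to illustrate the replacement):

import numpy as np

x = np.array([1.0, 2.0, 5.0])
w = np.array([0.5, 2.0, 1.0])

ptp = x.max() - x.min()  # 4.0, unaffected by the weights
total = w.sum()          # 3.5, replaces x.size (which assumes unit weights)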

I believe the IQR is also computable in the weighted case.

import numpy as np

n = 10
rng = np.random.RandomState(12554)

x = rng.rand(n)
w = rng.rand(n)

# Sort the data and carry the weights along.
sorted_idxs = x.argsort()
x_sort = x[sorted_idxs]
w_sort = w[sorted_idxs]

# The cumulative weight fraction plays the role of the empirical CDF.
cumtotal = w_sort.cumsum()
quantiles = cumtotal / cumtotal[-1]
idx2, idx1 = np.searchsorted(quantiles, [0.75, 0.25])
iqr_weighted = x_sort[idx2] - x_sort[idx1]
print('iqr_weighted = {!r}'.format(iqr_weighted))

# Test that this is roughly the same as the unweighted case when all
# weights are 1 (it won't match exactly because this method does not
# interpolate between data points).
w = np.ones_like(x)
w_sort = w[sorted_idxs]
cumtotal = w_sort.cumsum()
quantiles = cumtotal / cumtotal[-1]
idx2, idx1 = np.searchsorted(quantiles, [0.75, 0.25])
iqr_unweighted_repo = x_sort[idx2] - x_sort[idx1]
print('iqr_unweighted_repo = {!r}'.format(iqr_unweighted_repo))

iqr_unweighted_orig = np.subtract(*np.percentile(x, [75, 25]))
print('iqr_unweighted_orig = {!r}'.format(iqr_unweighted_orig))


This quick-and-dirty method of weighted quantiles gives a close result
(which is probably fine for a bandwidth estimator):

iqr_weighted = 0.21964093625695036
iqr_unweighted_repo = 0.36649977003903755
iqr_unweighted_orig = 0.30888312408540963

And I do see there is an open issue and PR for weighted
quantiles/percentiles (https://github.com/numpy/numpy/issues/8935,
https://github.com/numpy/numpy/pull/9211), so this code could make use of
that once it lands.
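
In the meantime, here is a sketch of an interpolated weighted quantile (a
hypothetical helper, not an existing numpy API) that would smooth out the
step behavior noted above; it assumes positive weights and uses a midpoint
convention for the cumulative weights:

import numpy as np

def weighted_quantile(x, w, q):
    # Place each sorted sample at the midpoint of its cumulative-weight
    # interval, then linearly interpolate between samples.
    sorted_idxs = x.argsort()
    x_sort = x[sorted_idxs]
    w_sort = w[sorted_idxs]
    cum_w = (w_sort.cumsum() - 0.5 * w_sort) / w_sort.sum()
    return np.interp(q, cum_w, x_sort)

With unit weights this agrees closely (though not exactly) with
np.percentile's default linear interpolation.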

Lastly, I think the most common case (or at least my case) for using a
weighted histogram is combining multiple histograms. In that case the
estimated number of bins might exceed the number of weighted data points,
and taking the minimum of the estimated bin count and the (weighted) number
of data points should take care of that.
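
Putting the pieces together, here is a sketch of what a weighted 'auto'
estimator could look like (hypothetical helper names; a real implementation
would differ in details):

import numpy as np

def weighted_iqr(x, w):
    # Weighted IQR via the cumulative-weight method above.
    sorted_idxs = x.argsort()
    x_sort = x[sorted_idxs]
    w_sort = w[sorted_idxs]
    quantiles = w_sort.cumsum() / w_sort.sum()
    idx2, idx1 = np.searchsorted(quantiles, [0.75, 0.25])
    return x_sort[idx2] - x_sort[idx1]

def weighted_auto_num_bins(x, w):
    ptp = x.max() - x.min()  # unchanged by weights
    total = w.sum()          # weighted replacement for x.size
    fd_bw = 2.0 * weighted_iqr(x, w) * total ** (-1.0 / 3.0)
    sturges_bw = ptp / (np.log2(total) + 1.0)
    bw = min(fd_bw, sturges_bw) if fd_bw > 0 else sturges_bw
    num_bins = int(np.ceil(ptp / bw)) if bw > 0 else 1
    # The min condition: never estimate more bins than (weighted) points.
    return min(num_bins, max(int(np.ceil(total)), 1))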

Please let me know your thoughts / opinions / ideas on this topic. I did
search for related discussion, but I may have missed it, so please point me
to it if so. Also, if the reason this feature does not exist is some
theoretical problem with estimating bandwidth for weighted data that I'm
unaware of, I'd be interested to learn about it (although I can't see that
being the case: these are just heuristics after all, and I have validated
that this approach works well in my own use cases).

-- 
-Dr. Jon Crall (him)


[Numpy-discussion] Re: Proposal: Automatic estimation of number of histogram bins for weighted data

2021-12-23 Thread Jonathan Crall
While it does feel like this might be more scipy-ish than numpy-ish, numpy
has an existing histogram function, with existing heuristics for choosing a
number of bins automatically, and existing support for weights. What it
lacks is support for weights and a heuristic jointly. This proposal is not a
massive new feature for numpy; it just plugs a hole in the cross product of
possible argument combinations for np.histogram.
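
Concretely, the hole looks like this (behavior as of recent numpy versions;
the exact error text may differ):

import numpy as np

x = np.random.rand(100)
w = np.random.rand(100)

np.histogram(x, bins='auto')         # works
np.histogram(x, bins=10, weights=w)  # works
# np.histogram(x, bins='auto', weights=w)
# -> TypeError: automated estimation of the number of bins is not
#    supported for weighted data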

Thank you for the pointer about interpretation of weights. That was
something I felt was going to be a nuance of this, but I didn't have the
words to describe it.

Within pure numpy, I think it should be possible to compute multiple
histograms and then aggregate them. That lends itself towards frequency
weights, but it seems to me that probability weights would use the same
procedure to estimate bandwidth.

https://stats.stackexchange.com/questions/354689/are-frequency-weights-and-sampling-weights-in-practice-the-same-thing
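
A quick way to check the frequency-weight reading (a sketch reusing the
step-function IQR from my first message):

import numpy as np

def step_weighted_iqr(x, w):
    sorted_idxs = x.argsort()
    x_sort = x[sorted_idxs]
    w_sort = w[sorted_idxs]
    quantiles = w_sort.cumsum() / w_sort.sum()
    idx2, idx1 = np.searchsorted(quantiles, [0.75, 0.25])
    return x_sort[idx2] - x_sort[idx1]

rng = np.random.RandomState(0)
x = rng.rand(20)
w = rng.randint(1, 5, size=x.size).astype(float)

# Under frequency weights, (x, w) is equivalent to repeating each sample
# w[i] times, so the weighted estimator should agree with the same
# estimator run on the expanded data with unit weights.
x_rep = np.repeat(x, w.astype(int))
assert np.isclose(step_weighted_iqr(x, w),
                  step_weighted_iqr(x_rep, np.ones_like(x_rep)))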

And ultimately, this is just an estimator used as a convenience for
programmers. Most real applications will need to define their bins with
respect to their problem, but if it makes sense for numpy to provide a
heuristic baseline for unweighted data, then it is natural for it to do so
for weighted data as well.




On Mon, Dec 13, 2021 at 4:03 AM Kevin Sheppard wrote:

> To me, this feels like it might be a better fit for SciPy or possibly
> statsmodels (but maybe not, since neither has histogram functions
> anymore). The challenge with weighted estimators is how the weights
> should be interpreted. Stata covers the most important cases of weights:
> https://www.reed.edu/psychology/stata/gs/tutorials/weights.html. Would
> these be frequency weights? Stata supports only frequency weights:
> https://www.stata.com/manuals/u11.pdf#u11.1.6weight.
>
> Kevin
>
>
> On Sun, Dec 12, 2021 at 9:45 AM Jonathan Crall  wrote:
>
>> [original proposal quoted in full; trimmed]

[Numpy-discussion] Re: Proposal: Automatic estimation of number of histogram bins for weighted data

2021-12-24 Thread Jonathan Crall
Yes, #9211 <https://github.com/numpy/numpy/pull/9211> is the open PR for
weighted quantiles. Is this something I should make an issue for on the
numpy GitHub, or is this mailing list the correct place to discuss it? I'd
like to link to this conversation from two other places on GitHub, which is
difficult when the discussion lives on the mailing list, but if it's more
appropriate to talk here, let me know.

On Thu, Dec 23, 2021 at 2:29 PM Joseph Fox-Rabinovitz <jfoxrabinov...@gmail.com> wrote:

> For what it's worth, I looked into this a long time ago. The missing
> ingredient has always been weighted quantiles. If I'm not mistaken, the
> interface already exists, but raises an error. I've had it on my back
> burner to provide an O(n) C implementation of weighted introselect, but
> never quite got around to it. I think there has been work to add an
> O(n log n) implementation recently.
>
> - Joe
>
> On Thu, Dec 23, 2021 at 1:19 PM Jonathan Crall  wrote:
>
>> [earlier messages quoted in full; trimmed]