[ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031756#comment-17031756
 ] 

Munendra S N commented on SOLR-11725:
-------------------------------------

Thanks [~hossman] for the review. I will share updated patch with changes entry 
including upgrade notes.

{quote}Wait... the conversation from 2017 wasn't resoved? What do we want to do 
about stddev of singleton sets? Solr currently returns 0.0, and Hoss seemed to 
think this was the right behavior. But the patch here would seem to change the 
behavior to return NaN (but I didn't test it...) . After a quick glance, it 
doesn't look like existing tests cover this case either?{quote}
Thanks [~ysee...@gmail.com] for the review. There are no tests to cover 
singleton case, I'm not sure if sample size is 0 is covered too.
I think changing the current behavior of singleton case should be taken up in 
separate issue as it concerns both the Classical stats and JSON aggregations. 
This patch doesn't change the current behavior, I will add tests to cover these 
cases

> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11725
>                 URL: https://issues.apache.org/jira/browse/SOLR-11725
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Facet Module
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-11725.patch, SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 
> 1.0D)))}}
> Since I"m not really a math guy, I consulting with a bunch of smart math/stat 
> nerds I know online to help me sanity check if these equations (some how) 
> reduced to eachother (In which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking the diff between the calculations doesn't tend to 
> differ significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to