[
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053387#comment-17053387
]
Munendra S N commented on SOLR-11725:
-------------------------------------
I'm planning to commit this weekend (only to master). Let me know if there are
any concerns.
> json.facet's stddev() function should be changed to use the "Corrected sample
> stddev" formula
> ---------------------------------------------------------------------------------------------
>
> Key: SOLR-11725
> URL: https://issues.apache.org/jira/browse/SOLR-11725
> Project: Solr
> Issue Type: Sub-task
> Components: Facet Module
> Reporter: Chris M. Hostetter
> Priority: Major
> Attachments: SOLR-11725.patch, SOLR-11725.patch, SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}}
> calculations done between the two code paths can be measurably different, and
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count -
> 1.0D)))}}
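
For illustration only, here is a minimal Java sketch that evaluates the two expressions listed above on the same inputs. The class name `StddevCompare`, the method names, and the sample values are hypothetical and are not taken from the Solr code base; only the two `Math.sqrt(...)` expressions are copied from the description.

```java
// Hypothetical standalone class, not part of Solr.
final class StddevCompare {

    // Expression quoted for json.facet (StddevAgg.java).
    static double uncorrected(double sum, double sumSq, long count) {
        return Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
    }

    // Expression quoted for stats.field (StatsValuesFactory.java).
    static double corrected(double sum, double sumOfSquares, long count) {
        return Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)));
    }

    public static void main(String[] args) {
        // Made-up sample values, chosen so the arithmetic is easy to check by hand.
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9};
        double sum = 0;
        double sumSq = 0;
        for (double v : values) {
            sum += v;
            sumSq += v * v;
        }
        long count = values.length;

        // sum = 40, sumSq = 232, count = 8
        System.out.println("uncorrected = " + uncorrected(sum, sumSq, count));
        System.out.println("corrected   = " + corrected(sum, sumSq, count));
    }
}
```

With these eight values the first expression gives 2.0 and the second gives roughly 2.138, so the two results differ even though they come from the same sum, sum of squares, and count.
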
> Since I'm not really a math guy, I consulted a bunch of smart math/stat
> nerds I know online to help me sanity check whether these equations (somehow)
> reduced to each other (in which case the discrepancies I was seeing in my
> results might have just been due to the order of intermediate operation
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other,
> and explained that the code JSON Faceting is using is equivalent to the
> "Uncorrected sample stddev" formula, while StatsComponent's code is
> equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and
> pressed them to explain which one was the "most canonical" (or "most
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking, the two calculations don't tend to differ
> significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two
> distributions_
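
As a worked sketch of why the two results converge for large counts: the symbols n (count), S_1 (sum), and S_2 (sum of squares) are shorthand introduced here and do not appear in the issue. Both quoted expressions share the numerator n S_2 - S_1^2 and differ only in the denominator.

```latex
\sigma_{\text{uncorrected}}
  = \sqrt{\frac{S_2}{n} - \left(\frac{S_1}{n}\right)^{2}}
  = \sqrt{\frac{n S_2 - S_1^{2}}{n^{2}}},
\qquad
s_{\text{corrected}}
  = \sqrt{\frac{n S_2 - S_1^{2}}{n\,(n-1)}}
\\[1ex]
\Longrightarrow\quad
s_{\text{corrected}} = \sigma_{\text{uncorrected}} \cdot \sqrt{\frac{n}{n-1}}
```

The factor sqrt(n/(n-1)) is about 1.414 for n = 2, about 1.054 for n = 10, and about 1.0005 for n = 1000, so the gap matters mostly for small sets of values.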
> Given that:
> * the primary usage of computing the stddev of a field/function against a
> Solr result set (or against a sub-set of results defined by a facet
> constraint) is probably to compare that distribution to a different Solr
> result set (or to compare N sub-sets of results defined by N facet
> constraints)
> * the size of the sets of documents (values) can be relatively small when
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected
> sample stddev" equation.