Hi Yonik,

I am aware that hll is only an estimate, but we don't use the hll value as the baseline for our comparison. We first request the counts for a single facet (for example Gender) and store the count of each bucket. Next we send a second request, this time for a facet with a sub-facet (for example Gender x Type). We sum the counts of all Type buckets that fall under the same Gender bucket and compare these sums with the counts from the first request. In some cases the numbers differ by about 60%, which is quite worrying. It does not always happen and it depends on the facet, but still. Have you had any reports like this?
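To make the comparison concrete, here is a rough Python sketch of the check described above. It assumes the usual JSON Facet API response layout (a "buckets" list of {"val", "count"} entries plus an optional "missing" entry carrying its own "count"). The helper names and the sample bucket values ("A", "x", "y") are made up for illustration; only the 16226 and 16273 totals come from our results.

```python
def bucket_counts(facet):
    """Map bucket value -> count for one terms facet.

    The optional "missing" bucket is stored under the key None.
    """
    counts = {b["val"]: b["count"] for b in facet.get("buckets", [])}
    if "missing" in facet:
        counts[None] = facet["missing"]["count"]
    return counts


def subfacet_sums(facet, sub_name):
    """For each parent bucket, sum the counts of its sub-facet buckets."""
    sums = {}
    for b in facet.get("buckets", []):
        sums[b["val"]] = sum(bucket_counts(b.get(sub_name, {})).values())
    if "missing" in facet:
        sub = facet["missing"].get(sub_name, {})
        sums[None] = sum(bucket_counts(sub).values())
    return sums


def compare(flat_facet, nested_facet, sub_name):
    """Return (value, flat count, sub-facet sum, difference, difference in %)."""
    sums = subfacet_sums(nested_facet, sub_name)
    rows = []
    for val, count in bucket_counts(flat_facet).items():
        total = sums.get(val, 0)
        diff = count - total
        pct = 100.0 * diff / count if count else 0.0
        rows.append((val, count, total, diff, pct))
    return rows


# Illustrative data only: the bucket names and the split over sub-buckets
# are hypothetical, the totals match one row of the Gender x Status table.
flat = {"buckets": [{"val": "A", "count": 16226}]}
nested = {
    "buckets": [
        {
            "val": "A",
            "count": 16226,
            "Status_sf": {
                "buckets": [{"val": "x", "count": 16000}],
                "missing": {"count": 273},
            },
        }
    ]
}

rows = compare(flat, nested, "Status_sf")
for val, count, total, diff, pct in rows:
    print(f"{val}: {count} - {total} = {diff} ({pct}%)")
```

The first argument is the facet object from the single-facet response, the second is the same facet from the nested response. The percentage is taken relative to the single-facet count, which is how the figures in our tables were computed.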

Thanks
Kenny

On 11 Nov 2017 01:47, "Yonik Seeley" <ysee...@gmail.com> wrote:

> I do notice you are using hll (hyper-log-log) which is a distributed
> cardinality *estimate* : https://en.wikipedia.org/wiki/HyperLogLog
>
> -Yonik
>
>
> On Fri, Nov 10, 2017 at 11:32 AM, kenny <ke...@ontoforce.com> wrote:
> > Hi all,
> >
> > We are doing some tests in solr 6.6 with json facet api and we get
> > completely wrong counts for some combination of facets
> >
> > Setting: We have a set of fields for 376k documents in our query (total 120M
> > documents). We work with 2 shards. When doing first a faceting over the
> > first facet and keeping these numbers, we subsequently do a nested faceting
> > over both facets.
> >
> > Then we add the numbers of sub-facet and expect to get (approximately)
> > the same numbers back. Sometimes we get rounding errors of about 1%
> > difference. But on other occasions it seems to be way off
> >
> > for example
> >
> > Gender (3 values) Country (211 values)
> > 16226 - 18424 = -2198 (-13.5461604832%)
> > 282854 - 464387 = -181533 (-64.1790464338%)
> > 40489 - 47902 = -7413 (-18.3086764306%)
> > 36672 - 49749 = -13077 (-35.6593586387%)
> >
> > Gender (3 values) Status (17 Values)
> > 16226 - 16273 = -47 (-0.289658572661%)
> > 282854 - 435974 = -153120 (-54.1339348215%)
> > 40489 - 49925 = -9436 (-23.305095211%)
> > 36672 - 54019 = -17347 (-47.3031195462%)
> >
> > ...
> >
> > These are the typical requests we submit. So note that we have refine and an
> > overrequest, but in the case of Gender vs Request we should query all the
> > buckets anyway.
> >
> > {"wt":"json","rows":0,"json.facet":"{\"Status_sfhll\":\"hll(Status_sf)\",\"Status_sf\":{\"type\":\"terms\",\"field\":\"Status_sf\",\"missing\":true,\"refine\":true,\"overrequest\":50,\"limit\":50,\"offset\":0}}","q":"*:*","fq":["type:\"something\""]}
> >
> > {"wt":"json","rows":0,"json.facet":"{\"Gender_sf\":{\"type\":\"terms\",\"field\":\"Gender_sf\",\"missing\":true,\"refine\":true,\"overrequest\":10,\"limit\":10,\"offset\":0,\"facet\":{\"Status_sf\":{\"type\":\"terms\",\"field\":\"Status_sf\",\"missing\":true,\"refine\":true,\"overrequest\":50,\"limit\":50,\"offset\":0}}},\"Gender_sfhll\":\"hll(Gender_sf)\"}","q":"*:*","fq":["type:\"something\""]}
> >
> > Is this a known bug? Would switching to old facet api resolve this? Are
> > there other parameters we miss?
> >
> >
> > Thanks
> >
> >
> > kenny