Hi Yonik,

I am aware that hll is only an estimate, but we don't use the hll value as a
baseline for the comparison. We first request the counts for one facet (for
example Gender) and store the count of each bucket. Next we do another
request, this time for a facet with a sub-facet (for example Gender x Type).
We sum the counts of all Type buckets under the same Gender bucket and
compare these sums with the counts from the previous request. The numbers
differ by up to about 60%, which is quite worrying. It does not always
happen, it depends on the facet, but still.
Did you get any reports like this?
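
To make the comparison concrete, here is a minimal Python sketch of what we
do with the two responses. It assumes the usual JSON Facet API response shape
(each bucket has "val" and "count", plus an optional "missing" count). The
function names, the bucket label "A" and the split of the Status counts are
made up for illustration; only the totals 16226 and 16273 come from our
numbers. It also assumes the sub-facet field is single-valued, since a
multi-valued field can put one document in several sub-buckets.

```python
def sum_subfacet_counts(nested_buckets, sub_name):
    """Sum the sub-facet counts (including 'missing') per parent bucket."""
    sums = {}
    for bucket in nested_buckets:
        sub = bucket.get(sub_name, {})
        total = sum(b["count"] for b in sub.get("buckets", []))
        # With "missing": true the docs without a value are reported separately.
        total += sub.get("missing", {}).get("count", 0)
        sums[bucket["val"]] = total
    return sums


def compare(flat_buckets, nested_buckets, sub_name):
    """Return {val: (difference, percent)} of flat count minus sub-facet sum."""
    sums = sum_subfacet_counts(nested_buckets, sub_name)
    report = {}
    for bucket in flat_buckets:
        count = bucket["count"]
        diff = count - sums.get(bucket["val"], 0)
        pct = 100.0 * diff / count if count else 0.0
        report[bucket["val"]] = (diff, pct)
    return report


# Hypothetical buckets: first request (Gender only), second request (Gender x Status).
flat = [{"val": "A", "count": 16226}]
nested = [{
    "val": "A",
    "count": 16226,
    "Status_sf": {
        "buckets": [{"val": "x", "count": 16000}, {"val": "y", "count": 200}],
        "missing": {"count": 73},
    },
}]
report = compare(flat, nested, "Status_sf")
```

In practice flat and nested would be the "buckets" lists under
facets.Gender_sf in the first and second response.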

Thanks

Kenny

On 11 Nov 2017 01:47, "Yonik Seeley" <ysee...@gmail.com> wrote:

> I do notice you are using hll (hyper-log-log) which is a distributed
> cardinality *estimate* : https://en.wikipedia.org/wiki/HyperLogLog
>
> -Yonik
>
>
> On Fri, Nov 10, 2017 at 11:32 AM, kenny <ke...@ontoforce.com> wrote:
> > Hi all,
> >
> > We are doing some tests in Solr 6.6 with the JSON Facet API and we get
> > completely wrong counts for some combinations of facets.
> >
> > Setting: We have a set of fields for 376k documents in our query (total
> > 120M documents). We work with 2 shards. We first facet over the first
> > facet and keep these numbers; we subsequently do a nested faceting over
> > both facets.
> >
> > Then we add up the numbers of the sub-facet and expect to get
> > (approximately) the same numbers back. Sometimes we get rounding errors
> > of about 1% difference, but on other occasions it seems to be way off.
> >
> > for example
> >
> > Gender (3 values) Country (211 values)
> > 16226 - 18424 = -2198 (-13.5461604832%)
> > 282854 - 464387 = -181533 (-64.1790464338%)
> > 40489 - 47902 = -7413 (-18.3086764306%)
> > 36672 - 49749 = -13077 (-35.6593586387%)
> >
> > Gender (3 values)  Status (17 Values)
> > 16226 - 16273 = -47 (-0.289658572661%)
> > 282854 - 435974 = -153120 (-54.1339348215%)
> > 40489 - 49925 = -9436 (-23.305095211%)
> > 36672 - 54019 = -17347 (-47.3031195462%)
> >
> > ...
> >
> > These are the typical requests we submit. Note that we have refine and an
> > overrequest, but in the case of Gender vs Request we should query all the
> > buckets anyway.
> >
> > {"wt":"json","rows":0,"json.facet":"{\"Status_sfhll\":\"hll(Status_sf)\",\"Status_sf\":{\"type\":\"terms\",\"field\":\"Status_sf\",\"missing\":true,\"refine\":true,\"overrequest\":50,\"limit\":50,\"offset\":0}}","q":"*:*","fq":["type:\"something\""]}
> >
> > {"wt":"json","rows":0,"json.facet":"{\"Gender_sf\":{\"type\":\"terms\",\"field\":\"Gender_sf\",\"missing\":true,\"refine\":true,\"overrequest\":10,\"limit\":10,\"offset\":0,\"facet\":{\"Status_sf\":{\"type\":\"terms\",\"field\":\"Status_sf\",\"missing\":true,\"refine\":true,\"overrequest\":50,\"limit\":50,\"offset\":0}}},\"Gender_sfhll\":\"hll(Gender_sf)\"}","q":"*:*","fq":["type:\"something\""]}
> >
> > Is this a known bug? Would switching to the old facet API resolve this?
> > Are there other parameters we are missing?
> >
> >
> > Thanks
> >
> >
> > kenny
> >
> >
>
