Solr json facet API contains option
Hi,

I can't seem to find a 'contains' option (with or without ignoreCase) in the available descriptions of the JSON Facet API. Is that because there is none, or is it just not adequately documented? For example, the official reference guides for 6.6 and 7.0 make no mention of this feature. Is it production ready? Where can I find an up-to-date description? Right now my only resource is http://yonik.com/json-facet-api/

Thanks,
Kenny
Solr facets counts deep paged returns inconsistent counts
Hi all,

When we run some 'deep' facet counts (e.g. facet values from 0 to 500 and then from 500 to 1000), we see small but disturbing differences in counts between the two batches (for example, the last count in the first batch is 165 and the first count in the second batch is 167).

We run this on Solr 5.3.1 in cloud mode (3 shards) with the non-JSON facet module.

Has anyone seen this before? I could not find any bug reported like this.

Thanks,
Kenny
Re: Solr facets counts deep paged returns inconsistent counts
Thanks for the clear explanation. A couple of follow-up questions:

- Can we tune overrequesting in the JSON API?
- We do see conflicting counts, but only when we use offsets different from 0. We have now also tested this in Solr 6.6 with the JSON API. We sometimes get the same value at different offsets: for example, the ranges of constraints [0,500] and [500,1000] might contain the same constraint.

Kenny

On 20-10-17 17:12, Yonik Seeley wrote:

> Facet refinement in Solr guarantees that counts for returned constraints are correct, but does not guarantee that the top N returned isn't missing a constraint.
>
> Consider the following shard counts (3 shards) for the following constraints (aka facet values):
>
> constraintA: 2 0 0
> constraintB: 0 2 0
> constraintC: 0 0 2
> constraintD: 1 1 1
>
> Now for simplicity consider facet.limit=1:
>
> Phase 1: retrieve the top 1 facet counts from all 3 shards (this gets back A=2, B=2, C=2)
> Phase 2: refinement: retrieve counts for A, B, C from any shard that did not contribute to the count in Phase 1 (for example, we ask shard2 and shard3 for the count of A)
>
> The counts are all correct, but we missed "D" because it never appeared in Phase 1.
>
> Solr actually has overrequesting in the first phase to reduce the chances of this happening (i.e. it won't actually happen with the exact scenario above), but it can still happen.
> You can increase the overrequest amount (see https://lucene.apache.org/solr/guide/6_6/faceting.html),
> or use streaming expressions or the SQL that goes on top of that in the latest Solr releases.
>
> -Yonik

--
Kenny Knecht, PhD
CTO and technical lead, ONTOFORCE
+32 486 75 66 16
ke...@ontoforce.com
www.ontoforce.com <http://www.ontoforce.com/>
Meetdistrict, Ottergemsesteenweg-Zuid 808, 9000 Gent, Belgium
CIC, One Broadway, MA 02142 Cambridge, United States
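The two-phase behaviour described in Yonik's example can be reproduced with a small simulation. This is only a sketch of the idea: the function names and the `overrequest` handling are assumptions made for illustration, and Solr's real overrequest formula is more involved than adding a fixed number to the limit.

```python
from collections import Counter

# Per-shard counts from the example above (shard1, shard2, shard3).
SHARD_COUNTS = [
    {"A": 2, "D": 1},
    {"B": 2, "D": 1},
    {"C": 2, "D": 1},
]


def top_n(counts, n):
    """Return the n highest-count constraints (ties broken by name)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])


def distributed_facet(shards, limit, overrequest=0):
    """Simplified two-phase faceting: per-shard top N, then refinement."""
    # Phase 1: each shard reports only its own top (limit + overrequest) values.
    candidates = set()
    for shard in shards:
        candidates.update(top_n(shard, limit + overrequest))
    # Phase 2: refinement collects exact counts for the candidates from every shard.
    refined = Counter()
    for shard in shards:
        for value in candidates:
            refined[value] += shard.get(value, 0)
    return top_n(refined, limit)


def exact_facet(shards, limit):
    """Reference answer: merge all shard counts before taking the top N."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return top_n(total, limit)


print(distributed_facet(SHARD_COUNTS, limit=1))                 # {'A': 2}
print(distributed_facet(SHARD_COUNTS, limit=1, overrequest=1))  # {'D': 3}
print(exact_facet(SHARD_COUNTS, limit=1))                       # {'D': 3}
```

Without overrequesting, the refined count for A is correct but D (total 3) is never considered. The same effect can make page boundaries inconsistent when each offset range is computed by a separate distributed request.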
Nested facet complete wrong counts
Hi all,

We are doing some tests in Solr 6.6 with the JSON Facet API and we get completely wrong counts for some combinations of facets.

Setting: we have a set of fields for 376k documents in our query (120M documents in total). We work with 2 shards. We first facet over the first facet and keep these numbers; we subsequently do a nested faceting over both facets.

Then we add up the numbers of the sub-facet and expect to get (approximately) the same numbers back. Sometimes we get rounding errors of about 1% difference. But on other occasions it seems to be way off, for example:

Gender (3 values) Country (211 values)
16226 - 18424 = -2198 (-13.5461604832%)
282854 - 464387 = -181533 (-64.1790464338%)
40489 - 47902 = -7413 (-18.3086764306%)
36672 - 49749 = -13077 (-35.6593586387%)

Gender (3 values) Status (17 values)
16226 - 16273 = -47 (-0.289658572661%)
282854 - 435974 = -153120 (-54.1339348215%)
40489 - 49925 = -9436 (-23.305095211%)
36672 - 54019 = -17347 (-47.3031195462%)

...

These are the typical requests we submit. Note that we use refine and an overrequest, but in the case of Gender vs Request we should be querying all the buckets anyway.

{"wt":"json","rows":0,"json.facet":"{\"Status_sfhll\":\"hll(Status_sf)\",\"Status_sf\":{\"type\":\"terms\",\"field\":\"Status_sf\",\"missing\":true,\"refine\":true,\"overrequest\":50,\"limit\":50,\"offset\":0}}","q":"*:*","fq":["type:\"something\""]}

{"wt":"json","rows":0,"json.facet":"{\"Gender_sf\":{\"type\":\"terms\",\"field\":\"Gender_sf\",\"missing\":true,\"refine\":true,\"overrequest\":10,\"limit\":10,\"offset\":0,\"facet\":{\"Status_sf\":{\"type\":\"terms\",\"field\":\"Status_sf\",\"missing\":true,\"refine\":true,\"overrequest\":50,\"limit\":50,\"offset\":0}}},\"Gender_sfhll\":\"hll(Gender_sf)\"}","q":"*:*","fq":["type:\"something\""]}

Is this a known bug? Would switching to the old facet API resolve this? Are there other parameters we are missing?

Thanks,
Kenny
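For readability, the nested request above can be rebuilt from plain dictionaries and serialized once, rather than written as an escaped string. This is a minimal sketch: the helper name `terms_facet` is hypothetical, while the field names and option values are copied from the second request.

```python
import json


def terms_facet(field, limit, overrequest, sub_facets=None):
    """Build one terms facet with the same options as the requests above."""
    facet = {
        "type": "terms",
        "field": field,
        "missing": True,
        "refine": True,
        "overrequest": overrequest,
        "limit": limit,
        "offset": 0,
    }
    if sub_facets:
        # Sub-facets are nested under the "facet" key of the parent facet.
        facet["facet"] = sub_facets
    return facet


nested = {
    "Gender_sf": terms_facet(
        "Gender_sf", limit=10, overrequest=10,
        sub_facets={"Status_sf": terms_facet("Status_sf", limit=50, overrequest=50)},
    ),
    "Gender_sfhll": "hll(Gender_sf)",
}

params = {
    "wt": "json",
    "rows": 0,
    "q": "*:*",
    "fq": ['type:"something"'],
    # json.facet is sent as a JSON string, hence the escaping in the raw request.
    "json.facet": json.dumps(nested),
}

print(json.dumps(params, indent=2))
```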
Re: Nested facet complete wrong counts
Thank you. But as I showed in my example, we used refine, and overrequest is not strictly needed because we need all buckets anyway. That can hardly explain an error of 60%, right?

On 10 Nov 2017 19:29, "Amrit Sarkar" wrote:

> Kenny,
>
> This is known behavior in a multi-sharded collection where the field values
> belonging to the same facet don't reside in the same shard. Yonik Seeley has
> improved the JSON Facet feature by introducing the "overrequest" and "refine"
> parameters.
>
> Kindly check out Jira:
> https://issues.apache.org/jira/browse/SOLR-7452
> https://issues.apache.org/jira/browse/SOLR-9432
>
> Relevant blog: https://medium.com/@abb67cbb46b/1acfa77cd90c
Re: Nested facet complete wrong counts
Hi Yonik,

I am aware that hll is an estimate. But we don't use the hll as a baseline for comparison. We ask for the values of one facet (for example Gender) and store the counts for each bucket. Next we do another request, this time for a facet and a sub-facet (for example Gender x Type). We sum all the values of Type with the same Gender and compare these sums with the numbers from the previous request. These numbers differ by 60%, which is quite worrying. Not always; it depends on the facet, but still.

Did you get any reports like this?

Thanks,
Kenny

On 11 Nov 2017 01:47, "Yonik Seeley" wrote:

> I do notice you are using hll (hyper-log-log), which is a distributed
> cardinality *estimate*: https://en.wikipedia.org/wiki/HyperLogLog
>
> -Yonik
Re: Nested facet complete wrong counts
RRGG - [banging my head against the wall]

Of course. You are absolutely right about the multi-valuedness.

Thanks for the 7.0 hint. Gives us a reason to upgrade. Do we need to re-index when upgrading?

Kenny

On 11 November 2017 at 15:52, Yonik Seeley wrote:

> Also, if you're looking at all constraints, you shouldn't need refine:true.
> But if you do need it, it was only added in Solr 7.0 (and I see you're
> using 6.6).
>
> -Yonik
>
> On Sat, Nov 11, 2017 at 9:48 AM, Yonik Seeley wrote:
> > On Sat, Nov 11, 2017 at 9:18 AM, Kenny Knecht wrote:
> >> These numbers differ by 60% which is quite worrying. Not always it
> >> depends on the facet, but still.
> >> Did you get any reports like this?
> >
> > Nope. The counts for the scenario you describe should add up exactly
> > for single-valued fields. Are you sure you're adding in the "missing"
> > bucket?
> >
> > When you sum up the sub-facets on Type, do you get a value under or
> > over the counts on the parent facet?
> > Verify that Type is single-valued. One would not expect facets on a
> > multi-valued field to add up in the same way.
> > Verify that you're getting all of the Type constraints by using a
> > limit of -1 on that sub-facet.
> >
> > -Yonik
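The check Yonik describes (sum the sub-facet buckets, add the "missing" bucket, compare with the parent count) can be sketched as follows. The response layout used here (a "buckets" list plus a "missing" object holding a "count") is an assumption about the JSON Facet API response, and the sample numbers are invented purely for illustration.

```python
def sub_facet_sum(bucket, sub_name):
    """Sum all sub-facet bucket counts for one parent bucket, including 'missing'."""
    sub = bucket.get(sub_name, {})
    total = sum(b["count"] for b in sub.get("buckets", []))
    # Documents without a value for the sub-facet field land in "missing".
    total += sub.get("missing", {}).get("count", 0)
    return total


def compare(parent_facet, sub_name):
    """Return (value, parent_count, summed_sub_count) for every parent bucket."""
    return [
        (b["val"], b["count"], sub_facet_sum(b, sub_name))
        for b in parent_facet["buckets"]
    ]


# Invented example 1: a single-valued sub-facet adds up exactly.
single_valued = {"buckets": [{
    "val": "female", "count": 5,
    "Status_sf": {
        "buckets": [{"val": "active", "count": 3}, {"val": "closed", "count": 1}],
        "missing": {"count": 1},
    },
}]}

# Invented example 2: a multi-valued sub-facet counts some documents twice.
multi_valued = {"buckets": [{
    "val": "female", "count": 5,
    "Type_sf": {
        "buckets": [{"val": "a", "count": 4}, {"val": "b", "count": 3}],
        "missing": {"count": 0},
    },
}]}

print(compare(single_valued, "Status_sf"))  # [('female', 5, 5)]
print(compare(multi_valued, "Type_sf"))     # [('female', 5, 7)]
```

A summed value above the parent count points towards a multi-valued field; a value below it points towards a truncated bucket list or a forgotten "missing" bucket.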
Individual query limits for each search field value
Hi guys,

Let's say I need a query to get the cheapest beef prices in US cities:

select?q=cities=NYC,LAS,MIA,SFO&limit=50

The problem is that there may be more than 50 prices in NYC alone, so results for the other cities (LAS, MIA, SFO) might not be returned at all.

Is there a way to limit the results for each city to, let's say, 10 via a Solr/Lucene query, or would I have to make a separate call for each city with a limit of 10, like this:

select?q=cities=NYC&limit=10
select?q=cities=LAS&limit=10
select?q=cities=MIA&limit=10
select?q=cities=SFO&limit=10

Best wishes,
Kenny
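One option that may fit is Solr's result grouping, which returns up to `group.limit` documents per distinct value of `group.field` in a single request. The sketch below only builds the query string. The field names `city` and `price` and the helper `grouped_query` are hypothetical, and it assumes the city field is single-valued; whether grouping suits this schema is not something the thread confirms.

```python
from urllib.parse import urlencode


def grouped_query(cities, per_city_limit=10, city_field="city", price_field="price"):
    """Build a select URL that returns the cheapest N documents per city."""
    params = {
        # Match any of the requested cities (hypothetical field name).
        "q": "{}:({})".format(city_field, " OR ".join(cities)),
        "group": "true",
        "group.field": city_field,
        # Maximum number of documents returned inside each group.
        "group.limit": per_city_limit,
        # With grouping enabled, rows limits the number of groups.
        "rows": len(cities),
        # Cheapest first; documents within a group follow this sort by default.
        "sort": price_field + " asc",
    }
    return "select?" + urlencode(params)


print(grouped_query(["NYC", "LAS", "MIA", "SFO"]))
```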