: My question specifically has to do with Facets in a SOLR
: cloud/collection (distributed environment). The core I am working with ...
: I am using the following facet query, which works fine in my Core-based
: index:
:
: http://localhost:8983/solr/gamra/select?q=*:*&rows=0&facet=true&facet.field=dataSourceName
:
: It returns counts for each distinct dataSourceName as follows (which is
: the desired behavior). ...
: I am wondering if this should work fine in SOLR Cloud as well?
: Will this method give me accurate counts out of the box in a SOLR Cloud
: configuration?
Yes it will. Solr uses a two-pass approach for faceting -- in pass #1 the
"top" constraints are determined from each shard (overrequesting based on
your original facet.limit), and then aggregated together. Pass #2 is a
"refinement" step: every term in the aggregated "top" constraints is
checked to see which shards (if any) did not include it in their per-shard
"top" constraints, and those shards are asked to compute a count for that
term as needed -- these counts are then added into the aggregated counts,
and the terms are re-sorted. This means that under some pathological term
distributions a term may be excluded from the list of "top" terms if it
isn't returned by *any* shard in pass #1, but for any term that is
returned to the end client, the count is 100% accurate. (NOTE: this info
applies to Solr's default faceting and to pivot faceting -- but the
relatively new "JSON faceting" does not support this multi-pass refinement
of the facet counts.)

: PS: The reason I ask is because I know there is some estimating
: performed in certain cases for the Facet "unique" function (as is
: outlined here: http://yonik.com/solr-count-distinct/ ). So I guess I am
: wondering why folks wouldn't just do what I have done vs going through
: the trouble of using the unique(dataSourceName) function?

What you linked to addresses a different problem than simple facet counts.
In your case you are getting the "top" terms along with their document
counts; what that blog post is referring to is counting the total number
of unique *terms* (i.e., in your data set: what is the total number of
distinct values in the "dataSourceName" field?).

Distributed counting of unique values over a high-cardinality set is a
"hard" problem, because the only way to be 100% accurate is to aggregate
all terms from all shards onto a single node to be hashed (or sorted) ...
for "batch" style analytics this is a trivial map-reduce style job that
can offload to disk, but in "real time" situations, statistical sampling
approaches like HyperLogLog (used in Solr) make more sense for getting
approximations without exploding RAM usage.

-Hoss
http://www.lucidworks.com/
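
PS: In case it helps to see the mechanics, here is a rough, self-contained
Python sketch of the two-pass idea described above. The shard data, term
names, and limits are all invented for illustration -- this is a toy
simulation of the approach, not Solr's actual code:

    # Per-shard facet counts for a field (all values invented).
    shard_counts = [
        {"alpha": 50, "beta": 40, "delta": 20, "gamma": 2},   # shard 1
        {"alpha": 45, "gamma": 30, "delta": 25},              # shard 2
        {"beta": 35, "delta": 30, "gamma": 3},                # shard 3
    ]

    FACET_LIMIT = 2   # what the client asked for (facet.limit)
    OVERREQUEST = 1   # each shard returns a few extra candidates

    def top_terms(counts, n):
        """Top-n terms for one shard, highest count first."""
        return sorted(counts, key=counts.get, reverse=True)[:n]

    # Pass #1: each shard reports its (overrequested) "top" constraints,
    # and the coordinator aggregates whatever counts it has seen so far.
    candidates = {}
    per_shard_tops = []
    for counts in shard_counts:
        tops = top_terms(counts, FACET_LIMIT + OVERREQUEST)
        per_shard_tops.append(set(tops))
        for term in tops:
            candidates[term] = candidates.get(term, 0) + counts[term]

    # Pass #2 (refinement): for each candidate term, ask every shard
    # that did NOT report it in pass #1 for its count, then re-sort.
    for term in candidates:
        for counts, tops in zip(shard_counts, per_shard_tops):
            if term not in tops and term in counts:
                candidates[term] += counts[term]

    final = sorted(candidates.items(), key=lambda kv: -kv[1])[:FACET_LIMIT]
    print(final)   # every returned count is now exact

Note how "gamma" misses shard 1's pass-#1 list (it is only that shard's
4th biggest term) but gets its missing count filled in during refinement;
a term that *no* shard returned in pass #1 would never make the candidate
list at all, which is the pathological case mentioned above.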
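
And for contrast with the count-distinct problem from that blog post, a
tiny sketch (again with invented data) of why distinct counts are harder
to merge than facet counts: per-shard facet counts combine by simple
addition, but per-shard distinct counts do not, because the same value
can live on several shards:

    # Distinct field values present on each shard (invented data).
    shard_terms = [
        {"alpha", "beta", "gamma"},     # shard 1
        {"alpha", "delta"},             # shard 2
        {"beta", "delta", "epsilon"},   # shard 3
    ]

    naive = sum(len(s) for s in shard_terms)    # 8 -- double-counts overlaps
    exact = len(set().union(*shard_terms))      # 5 -- but required shipping
                                                #      every term to one node
    print(naive, exact)

    # A HyperLogLog-style approach avoids shipping every term: each shard
    # keeps a small fixed-size sketch of hashed values, the sketches merge
    # cheaply, and the merged sketch yields an *approximate* distinct count.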