: My question specifically has to do with Facets in a SOLR
: cloud/collection (distributed environment). The core I am working with ...
: I am using the following facet query, which works fine in my Core-based
: index:
:
: http://localhost:8983/solr/gamra/select?q=*:*&rows=0&facet=true&facet.field=dataSourceName
:
: It returns counts for each distinct dataSourceName as follows (which is
: the desired behavior). ...
: I am wondering if this should work fine in SOLR Cloud as well?
: Will this method give me accurate counts out of the box in a SOLR Cloud
: configuration?
Yes it will. Solr uses a two-pass approach for faceting -- in pass #1 the
"top" constraints are determined from each shard (overrequesting based on
your original facet.limit), and then aggregated together. Pass #2 is a
"refinement" step: every term in the aggregated "top" constraints is
checked to see which shards (if any) did not include it in their per-shard
"top" constraints, and those shards are asked to compute a count for that
term as needed -- these counts are then added into the aggregated counts,
and the terms are re-sorted. This means that under some pathological term
distributions a term may be excluded from the list of "top" terms if it
isn't returned by *any* shard in pass #1, but for any term that is
returned to the end client, the count is 100% accurate. (NOTE: this info
applies to Solr's default faceting and to pivot faceting -- but the
relatively new "JSON faceting" does not support this multi-pass refinement
of the facet counts.)

: PS: The reason I ask is because I know there is some estimating
: performed in certain cases for the Facet "unique" function (as is
: outlined here: http://yonik.com/solr-count-distinct/ ). So I guess I am
: wondering why folks wouldn't just do what I have done vs going through
: the trouble of using the unique(dataSourceName) function?

What you linked to addresses a different problem than simple facet counts.
In your case you are getting the "top" terms along with their document
counts; what that blog post is referring to is counting the total number
of unique *terms* (i.e., in your data set: what is the total number of
distinct values in the "dataSourceName" field?).

Distributed counting of unique values over a high-cardinality set is a
"hard" problem, because the only way to be 100% accurate is to aggregate
all terms from all shards onto a single node to be hashed (or sorted) ...
for "batch" style analytics this is a trivial map-reduce style job that
can offload to disk, but in "real time" situations, statistical sampling
approaches like HyperLogLog (used in Solr) make more sense for getting
approximations without exploding RAM usage.

-Hoss
http://www.lucidworks.com/
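
PS: In case it helps to see the mechanics, here is a rough, self-contained
Python sketch of the two-pass idea described above. The shard data, term
names, and limits are all invented for illustration -- this is a toy
simulation of the approach, not Solr's actual code:

    # Per-shard facet counts for a field (all values invented).
    shard_counts = [
        {"alpha": 50, "beta": 40, "delta": 20, "gamma": 2},   # shard 1
        {"alpha": 45, "gamma": 30, "delta": 25},              # shard 2
        {"beta": 35, "delta": 30, "gamma": 3},                # shard 3
    ]

    FACET_LIMIT = 2   # what the client asked for (facet.limit)
    OVERREQUEST = 1   # each shard returns a few extra candidates

    def top_terms(counts, n):
        """Top-n terms for one shard, highest count first."""
        return sorted(counts, key=counts.get, reverse=True)[:n]

    # Pass #1: each shard reports its (overrequested) "top" constraints,
    # and the coordinator aggregates whatever counts it has seen so far.
    candidates = {}
    per_shard_tops = []
    for counts in shard_counts:
        tops = top_terms(counts, FACET_LIMIT + OVERREQUEST)
        per_shard_tops.append(set(tops))
        for term in tops:
            candidates[term] = candidates.get(term, 0) + counts[term]

    # Pass #2 (refinement): for each candidate term, ask every shard
    # that did NOT report it in pass #1 for its count, then re-sort.
    for term in candidates:
        for counts, tops in zip(shard_counts, per_shard_tops):
            if term not in tops and term in counts:
                candidates[term] += counts[term]

    final = sorted(candidates.items(), key=lambda kv: -kv[1])[:FACET_LIMIT]
    print(final)   # every returned count is now exact

Note how "gamma" misses shard 1's pass-#1 list (it is only that shard's
4th biggest term) but gets its missing count filled in during refinement;
a term that *no* shard returned in pass #1 would never make the candidate
list at all, which is the pathological case mentioned above.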
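
And for contrast with the count-distinct problem from that blog post, a
tiny sketch (again with invented data) of why distinct counts are harder
to merge than facet counts: per-shard facet counts combine by simple
addition, but per-shard distinct counts do not, because the same value
can live on several shards:

    # Distinct field values present on each shard (invented data).
    shard_terms = [
        {"alpha", "beta", "gamma"},     # shard 1
        {"alpha", "delta"},             # shard 2
        {"beta", "delta", "epsilon"},   # shard 3
    ]

    naive = sum(len(s) for s in shard_terms)    # 8 -- double-counts overlaps
    exact = len(set().union(*shard_terms))      # 5 -- but required shipping
                                                #      every term to one node
    print(naive, exact)

    # A HyperLogLog-style approach avoids shipping every term: each shard
    # keeps a small fixed-size sketch of hashed values, the sketches merge
    # cheaply, and the merged sketch yields an *approximate* distinct count.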