A little bit of history:

We built a Solr-like solution on Lucene.NET and C# about 5 years ago, which 
included faceted search.  In order to get really good facet performance, what 
we did was pre-cache all the facet fields in RAM as efficient compressed data 
structures (either a variable byte encoded list of doc IDs as integers, or as a 
bit array, depending on how many docs that field matches).  Then we sorted 
those sets of facet fields by total document frequency so we enumerate the more 
frequent facet fields first, and we stop looking when we reach a facet field 
whose total document frequency is lower than the smallest of the top N facet 
counts found so far.  We also used a more efficient intersection between the 
facet field and the matched doc set, ANDing the internal uint words of the 
bit arrays directly when possible.  This works well mainly because we package 
all of these cached data structures into a binary file on the master server 
and distribute that file with each new index snapshot to the slaves.  (So the 
heavy lifting of reading the TermEnum and TermDocs happens only on the master 
servers).  The slaves just pre-load that binary structure directly into RAM in 
one shot in the background when opening a new snapshot for search.  We have 
200 million docs, 10 shards, about 20 facet fields, some of which contain about 
20,000 unique values.  We show the top 10 facets for about 10 different fields 
on the results page.  We provide search results with lots of facets and date 
counts in around 200-300ms using this technique.
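
Here is a rough Python sketch of the technique described above (it is not the 
production C# code).  The delta encoding of doc IDs, the dense_ratio threshold, 
and all function names are assumptions made for illustration, and a Python int 
stands in for the bit array.

```python
import heapq


def vbyte_encode(doc_ids):
    """Variable-byte encode a sorted doc ID list (gaps are an assumption)."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # high bit set: more bytes follow
            gap >>= 7
        out.append(gap)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode: returns the sorted doc ID list."""
    doc_ids = []
    value = shift = prev = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            doc_ids.append(prev)
            value = shift = 0
    return doc_ids


def to_bits(doc_ids):
    """Pack doc IDs into an int used as a bit array."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def build_facet_cache(value_to_docs, num_docs, dense_ratio=0.125):
    """Store dense values as bit arrays, sparse ones as vbyte lists,
    sorted by document frequency, highest first."""
    cache = []
    for value, docs in value_to_docs.items():
        docs = sorted(docs)
        if len(docs) >= dense_ratio * num_docs:  # hypothetical threshold
            cache.append((value, len(docs), "bits", to_bits(docs)))
        else:
            cache.append((value, len(docs), "vbyte", vbyte_encode(docs)))
    cache.sort(key=lambda entry: entry[1], reverse=True)
    return cache


def count_matches(kind, payload, matched_bits):
    """Count docs of one facet value that are also in the matched doc set."""
    if kind == "bits":
        return bin(payload & matched_bits).count("1")  # word-level AND
    return sum(1 for d in vbyte_decode(payload) if (matched_bits >> d) & 1)


def top_n_facets(cache, matched_bits, n):
    """Scan values by descending frequency; stop once a value's total
    frequency is below the smallest count currently in the top N."""
    heap = []  # min-heap of (count, value)
    for value, df, kind, payload in cache:
        if len(heap) == n and df < heap[0][0]:
            break  # no remaining value can enter the top N
        count = count_matches(kind, payload, matched_bits)
        if count == 0:
            continue
        if len(heap) < n:
            heapq.heappush(heap, (count, value))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, value))
    return sorted(heap, reverse=True)
```

The early exit is safe because a facet count can never exceed that value's 
total document frequency, so once the sorted list drops below the current Nth 
best count nothing further down can qualify.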

Currently, we are porting this entire system to Solr.  For a single core index 
of 8 million docs, using similar documents and facet fields from our production 
indexes, I can't get faceted search to perform anywhere close to 300ms for 
general searches.  It's more like 1.5-3 seconds.  I raised the filter cache 
size to 10,000 and tried different facet.method values (enum and fc), but it 
is still very slow.  I'm running on a server with 2 cores and 3.7 GB of RAM, 
with the JVM allowed up to 2.5 GB.  I see that Solr takes quite some time to 
pre-load the filter cache for some of these facet fields when opening a new 
searcher.
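
For reference, the kind of request being tested looks roughly like the query 
string built below.  This is only a sketch: the helper name and the field 
names are hypothetical, and only the standard facet, facet.field, facet.limit 
and facet.method parameters are used.

```python
from urllib.parse import urlencode


def facet_query_params(query, facet_fields, method="fc", limit=10):
    """Build a Solr select query string with field faceting enabled.

    `method` is passed as facet.method ("enum" or "fc").
    """
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.limit", limit),    # top N values per facet field
        ("facet.method", method),
    ]
    # One facet.field parameter per field (field names are placeholders).
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


# Compare the two methods on the same hypothetical fields.
for method in ("enum", "fc"):
    print(facet_query_params("lucene", ["category", "source"], method=method))
```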

Is there anything else that I should look into for getting better facet 
performance?  Given these metrics (200m docs, 20 facet fields, some fields with 
20,000 unique values), what kind of facet search performance should I expect?  
Also, we need to issue frequent commits since we are constantly streaming new 
content into the system.

Thanks
Bob
