A data structure like:

fieldId -> BitArray  (for fields with docFreq > 1/9 of total docs)
fieldId -> VIntList  (variable-byte-encoded array of ints, for fields
with docFreq < 1/9 of total docs)

And the list is sorted top to bottom by descending docFreq, so the most
frequent fields come first.
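
Roughly, in C# (a minimal sketch, not our actual code; the names are
made up, and since System.Collections.BitArray does not expose its
backing array, the sketch stores the raw words directly):

using System.Collections.Generic;

// One entry per fieldId, in exactly one of the two representations.
class FacetFieldEntry
{
    public int FieldId;
    public int DocFreq;      // total number of docs containing this value
    public uint[] BitWords;  // dense form: used when DocFreq > totalDocs / 9
    public byte[] VIntData;  // sparse form: variable-byte-encoded doc-id deltas
}

class FacetStructure
{
    // Kept sorted top to bottom: highest DocFreq first.
    public List<FacetFieldEntry> Entries = new List<FacetFieldEntry>();

    public void SortByDocFreq() =>
        Entries.Sort((a, b) => b.DocFreq.CompareTo(a.DocFreq));
}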

We enumerate top to bottom, intersecting each entry against the set of
docs matching the search (sketched below):
* If BitArray, do a specialized intersection using bit counting of the
internal uint[] structure of the BitArray vs. the BitArray of the doc
result set
* If VIntList, enumerate the VIntList against the BitArray of the doc
result set
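
Something like this in C# (again a sketch; the SWAR PopCount is a
stand-in since 2013-era .NET had no System.Numerics.BitOperations, and
it assumes the result-set words cover the full doc-id range):

using System;

static class FacetCounting
{
    // Dense case: AND the raw words of the two bitsets and popcount.
    public static int CountBitArray(uint[] fieldBits, uint[] resultBits)
    {
        int count = 0;
        int n = Math.Min(fieldBits.Length, resultBits.Length);
        for (int i = 0; i < n; i++)
            count += PopCount(fieldBits[i] & resultBits[i]);
        return count;
    }

    static int PopCount(uint v)
    {
        v = v - ((v >> 1) & 0x55555555u);
        v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
        return (int)((((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24);
    }

    // Sparse case: decode each variable-byte delta, test the result-set bit.
    public static int CountVIntList(byte[] vints, uint[] resultBits)
    {
        int count = 0, docId = 0, pos = 0;
        while (pos < vints.Length)
        {
            int delta = 0, shift = 0;
            byte b;
            do
            {
                b = vints[pos++];
                delta |= (b & 0x7F) << shift;   // low 7 bits carry payload
                shift += 7;
            } while ((b & 0x80) != 0);          // high bit set = more bytes
            docId += delta;                     // doc ids stored as ascending deltas
            if ((resultBits[docId >> 5] & (1u << (docId & 31))) != 0)
                count++;
        }
        return count;
    }
}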

We push the counts into a priority queue of size "facet.limit", and once
it fills up, if the next fieldId in the structure has docFreq < the
minimum count in the priority queue, we break out of the loop.  Since a
lot of our facet fields have a "power curve" distribution (a long tail
of less frequent values), breaking out early helps a lot.
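
Put together, the loop looks roughly like this (building on the
sketches above; PriorityQueue<TElement,TPriority> is the modern .NET
min-heap, which our code predates, but any bounded min-heap behaves the
same):

using System.Collections.Generic;

static List<(int FieldId, int Count)> TopFacets(
    List<FacetFieldEntry> entriesByDocFreqDesc, uint[] resultBits, int facetLimit)
{
    var heap = new PriorityQueue<int, int>(); // element: fieldId, priority: count

    foreach (var e in entriesByDocFreqDesc)
    {
        // DocFreq is an upper bound on the intersection count and entries
        // arrive in descending DocFreq order, so once the queue is full
        // and the next DocFreq is below the current minimum, nothing
        // further down can make the top N: break out early.
        if (heap.Count == facetLimit &&
            heap.TryPeek(out _, out int minCount) && e.DocFreq < minCount)
            break;

        int count = e.BitWords != null
            ? FacetCounting.CountBitArray(e.BitWords, resultBits)
            : FacetCounting.CountVIntList(e.VIntData, resultBits);

        if (heap.Count < facetLimit)
        {
            heap.Enqueue(e.FieldId, count);
        }
        else if (heap.TryPeek(out _, out int lowest) && count > lowest)
        {
            heap.Dequeue();                   // evict the current minimum
            heap.Enqueue(e.FieldId, count);
        }
    }

    // Drain ascending, then reverse so the highest counts come first.
    var top = new List<(int, int)>();
    while (heap.TryDequeue(out int fieldId, out int count))
        top.Add((fieldId, count));
    top.Reverse();
    return top;
}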

Also, we compute facet counts in parallel using Parallel.ForEach (in C#,
not Java), so each field in the list of "facet.field" is counted on its
own thread (I believe SOLR is doing something similar now).  We have a
lot of cores on our servers, so it works well.
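
The outer loop is then just (sketch; facetFields, structures, resultBits
and facetLimit are stand-ins for the real inputs):

using System.Collections.Concurrent;
using System.Threading.Tasks;

var countsByField =
    new ConcurrentDictionary<string, List<(int FieldId, int Count)>>();

// One facet field per worker; the workers share only the read-only
// result-set bits, so no locking is needed beyond the dictionary.
Parallel.ForEach(facetFields, field =>
{
    countsByField[field] =
        TopFacets(structures[field].Entries, resultBits, facetLimit);
});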


________________________________________
From: Toke Eskildsen [t...@statsbiblioteket.dk]
Sent: Wednesday, August 07, 2013 7:45 AM
To: solr-user@lucene.apache.org
Subject: Re: poor facet search performance

On Tue, 2013-07-30 at 21:48 +0200, Robert Stewart wrote:

[Custom facet structure]

> Then we sorted those sets of facet fields by total document frequency
> so we enumerate the more frequent facet fields first, and we stop
> looking when we find a facet field which has fewer total document
> matches than the top N facet counts we are looking for.

So the structure was Facet->docIDs? A bit like enum in Solr? Your top-N
cut-off is an interesting optimization for that.

[...]

> The slaves just pre-load that binary structure directly into ram in one
> shot in the background when opening a new snapshot for search.

We used a similar pre-calculation some years ago but abandoned it, as
"Pre-generate_structure + #duplicates * (distribute_structure +
open_structure)" turned out to be just as costly as, and less flexible
than, "#duplicates * generate_structure" for us.

> We have 200 million docs, 10 shards, about 20 facet fields, some of
> which contain about 20,000 unique values.  We show top 10 facets for
> about 10 different fields in results page.   We provide search results
> with lots of facets and date counts in around 200-300ms using this
> technique.
>
> Currently, we are porting this entire system to SOLR.  For a single
> core index of 8 million docs, using similar documents and facet fields
> from our production indexes, I can't get faceted search to perform
> anywhere close to 300ms for general searches.  More like 1.5-3
> seconds.

Solr fc faceting treats each facet independently and in a docID->facet
manner, so what happens is

foreach facet {
  foreach docID in resultSet {
    foreach tagID in document {
      facet.counter[tagID]++
    }
  }
}

With 10 facets, 8M documents and 1 tag/doc/facet, the total loop count
is 80M. That does not normally take 1.5-3 seconds, so something seems
off. Do you have a lot of facet tags (aka terms) for each document?

> Is there anything else that I should look into for getting better facet
> performance?

Could you list the part of the Solr log with the facet structures? Just
grep for "UnInverted". They look something like this:
UnInverted multi-valued field
{field=lma_long,memSize=42711405,tindexSize=42,time=979,phase1=964,nTerms=23,bigTerms=6,termInstances=1958468,uses=0}

> Given these metrics (200m docs, 20 facet fields, some fields with
> 20,000 unique values), what kind of facet search performance should I
> expect?

Due to the independent facet handling in Solr, the facet time will
scale a bit worse than linearly with the number of documents, relative
to your test setup. With a loop count of 200M*10 (or 20? I am a bit
confused on how many facets you show at a time) = 2G, this will take
multiple seconds. Unless you go experimental (SOLR-2412, to bang my own
drum), your facet count needs to go down or you need to shard with Solr.

> Also we need to issue frequent commits since we are constantly
> streaming new content into the system.

You could use a setup with a smaller "live" shard and multiple stale
ones, but depending on corpus your ranking might suffer.

- Toke Eskildsen, State and University Library, Denmark
