Toke Eskildsen [t...@statsbiblioteket.dk] wrote:

[Solr, 11M documents, 5000 facet fields, 12GB RAM, OOM]

> 5000 fields @ 9 MByte is about 45GB for faceting.

> If you are feeling really adventurous, take a look at
> https://issues.apache.org/jira/browse/SOLR-2412

I tried building a test index with 11M documents and 5000 facet fields, each 
field with 200 different values. Each document had 10 fields: 
- 1 field with 1 out of 4 unique values
- 7 fields selected randomly from the 5000, each with 1 out of that field's 200 unique values
- 2 fields containing a random string
Summed up, that is 5000*200 + 2*~11M ~= 23M unique terms and 11M*10 = 110M 
references from documents to terms. The resulting index was 34GB.
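
For reference, a minimal sketch of such a generator using the plain Lucene API.
The field names (even, facet0-facet4999, rand1/rand2), value patterns and seed
are my own invention, and collisions among the 7 randomly picked facet fields
are simply ignored:

import java.nio.file.Paths;
import java.util.Random;
import java.util.UUID;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Builds a test index shaped like the one described above:
// 11M documents with 10 fields each.
public class TestIndexBuilder {
    public static void main(String[] args) throws Exception {
        Random random = new Random(87);
        IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("testindex")), conf)) {
            for (int docID = 0; docID < 11_000_000; docID++) {
                Document doc = new Document();
                // 1 field with 1 out of 4 unique values
                doc.add(new StringField("even", "val" + random.nextInt(4), Field.Store.NO));
                // 7 fields picked from the 5000, each with 1 of its 200 values
                for (int f = 0; f < 7; f++) {
                    String field = "facet" + random.nextInt(5000);
                    doc.add(new StringField(field, "term" + random.nextInt(200), Field.Store.NO));
                }
                // 2 fields with effectively unique random strings
                doc.add(new StringField("rand1", UUID.randomUUID().toString(), Field.Store.NO));
                doc.add(new StringField("rand2", UUID.randomUUID().toString(), Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }
}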

The queries were for search words hitting ~600K randomly distributed documents, 
and faceting was done on all 5000 fields, with the top-3 terms returned for 
each.
First call (startup time): 107 seconds
Second call: 3292 ms
Third call: 3290 ms
Fourth call: 4112 ms
The faceting itself took less than 1 second, while the serialization to XML 
took the other 2-3 seconds. The response XML was about 2MB in size. The process 
required 1500MB of heap to run properly.
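
The measurement loop was essentially four identical faceted searches against
the same searcher. A sketch, assuming a plain Lucene IndexSearcher;
facetAllFields is a hypothetical stand-in for the SOLR-2412 faceting call,
whose API I will not reproduce here:

import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

// Times four identical faceted searches: the first call pays the one-off
// cost of building the faceting structures, the rest show steady state.
public class FacetTiming {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("testindex")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new TermQuery(new Term("even", "val0")); // illustrative query
            for (int call = 1; call <= 4; call++) {
                long start = System.nanoTime();
                facetAllFields(searcher, query);
                System.out.printf("Call %d: %d ms%n",
                        call, (System.nanoTime() - start) / 1_000_000);
            }
        }
    }

    // Hypothetical stand-in for the SOLR-2412 faceting call: runs the
    // query; collecting the top-3 terms for each of the 5000 facet
    // fields is elided, as it depends on the patch's API.
    private static void facetAllFields(IndexSearcher searcher, Query query) throws Exception {
        searcher.search(query, 10);
    }
}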

Due to <long explanation> I used Lucene instead of Solr for the experiment, but 
as SOLR-2412 is just a wrapper, it should work just as well in Solr. The 
machine was a fairly new Xeon server with SSDs as storage. I guess that 
performance will be considerably worse on spinning drives if the index is not 
cached in RAM: Returning 15K unique values (5000 fields * top-3 terms) is quite 
a task when access times are measured in milliseconds. If it is of interest to 
anyone, I'll be happy to move the index to spinning drives and measure again.


While the result looks promising, do keep in mind that SOLR-2412 is both 
experimental and not capable of distributed search. It is really only an option 
if full faceting on 5000 fields with Lucene or Solr is a hard requirement. I 
recommend finding a way to avoid faceting on so many fields instead.

Regards,
Toke Eskildsen
