[Solr, 11M documents, 5000 facet fields, 12GB RAM, OOM]

Toke Eskildsen [t...@statsbiblioteket.dk] wrote:
> 5000 fields @ 9 MByte is about 45GB for faceting.
> If you are feeling really adventurous, take a look at
> https://issues.apache.org/jira/browse/SOLR-2412

I tried building a test-index with 11M documents and 5000 fields, each
with 200 different values. Each document had 10 fields:
- 1 with 1 out of 4 unique values
- 7 selected randomly from the 5000, each with 1 out of the 200 unique values
- 2 with a random string each

Summed up, that is 5000*200 + 2*~11M ~= 23M unique terms and
11M*10 = 110M references from documents to terms. The resulting index
was 34GB. (A sketch of a generator for such an index is appended below.)

The queries were for search-words that hit ~600K randomly distributed
documents, and the faceting was done for all 5000 fields, with the top-3
terms returned for each:

First call (startup time): 107 seconds
Second call: 3292 ms
Third call: 3290 ms
Fourth call: 4112 ms

The faceting itself took less than 1 second, while the serialization to
XML took the other 2-3 seconds. The response XML was about 2MB in size.
The test required 1500MB of heap to run properly.

Due to <long explanation> I used Lucene instead of Solr for the
experiment, but as SOLR-2412 is just a wrapper, it should work just as
well in Solr.

The machine was a quite new Xeon server with SSD storage. I expect
performance to be considerably worse on spinning drives if the index is
not cached in RAM: Returning 15K unique values is quite a task when
access times are measured in milliseconds. If it is of interest to
anyone, I'll be happy to move the index to spinning drives and measure
again.

While the result looks promising, do keep in mind that SOLR-2412 is both
experimental and not capable of distributed search. It is really only an
option if faceting on all 5000 fields with Lucene or Solr is a hard
requirement. I recommend finding a way to avoid faceting on that many
fields instead.

Regards,
Toke Eskildsen
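
PS: For anyone who wants to reproduce the setup, here is a minimal
sketch of a generator for such a test index, written against the plain
Lucene API. The field names, the seed and the DocValues choice are mine
for illustration; it is not the exact code from the experiment:

import java.nio.file.Paths;
import java.util.Random;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class BuildTestIndex {
  public static void main(String[] args) throws Exception {
    Random random = new Random(87); // Fixed seed for reproducibility
    IndexWriterConfig config = new IndexWriterConfig(new KeywordAnalyzer());
    try (FSDirectory dir = FSDirectory.open(Paths.get("facet_test_index"));
         IndexWriter writer = new IndexWriter(dir, config)) {
      for (int docID = 0; docID < 11_000_000; docID++) {
        Document doc = new Document();
        // 1 field with 1 out of 4 unique values
        addFacetValue(doc, "few", "few_" + random.nextInt(4));
        // 7 fields picked randomly from the 5000, each with 1 out of 200 values
        for (int i = 0; i < 7; i++) {
          addFacetValue(doc, "facet_" + random.nextInt(5000),
                        "term_" + random.nextInt(200));
        }
        // 2 fields with random (in practice near-unique) strings
        addFacetValue(doc, "unique_a", Long.toHexString(random.nextLong()));
        addFacetValue(doc, "unique_b", Long.toHexString(random.nextLong()));
        writer.addDocument(doc);
      }
    }
  }

  // Index the value both as a searchable term and as DocValues,
  // so the field can be both queried and facet-counted
  private static void addFacetValue(Document doc, String field, String value) {
    doc.add(new StringField(field, value, Field.Store.NO));
    doc.add(new SortedSetDocValuesField(field, new BytesRef(value)));
  }
}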
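
PPS: To make the counting step concrete, here is a naive single-field
facet counter over the hits of a query, again plain Lucene (recent
versions). This is not how SOLR-2412 works internally, and resolving
every ordinal to a String during collection is far too slow for 5000
fields * 600K hits, but the principle of counting via ordinals is the
same:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

/**
 * Counts the facet values of a single field for all documents matching
 * a query. Illustrative only: a real implementation would count per-segment
 * ordinals and resolve only the final top-3 terms to Strings.
 */
public class NaiveFacetCollector extends SimpleCollector {
  private final String field;
  private final Map<String, Long> counts = new HashMap<>();
  private SortedSetDocValues values;

  public NaiveFacetCollector(String field) {
    this.field = field;
  }

  @Override
  protected void doSetNextReader(LeafReaderContext context) throws IOException {
    // Empty DocValues are returned if the segment has no values for the field
    values = DocValues.getSortedSet(context.reader(), field);
  }

  @Override
  public void collect(int doc) throws IOException {
    if (!values.advanceExact(doc)) {
      return; // This document has no values for the field
    }
    for (int i = 0; i < values.docValueCount(); i++) {
      long ord = values.nextOrd();
      counts.merge(values.lookupOrd(ord).utf8ToString(), 1L, Long::sum);
    }
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES; // Counting does not need scores
  }

  public Map<String, Long> getCounts() {
    return counts;
  }
}

Run it with searcher.search(query, new NaiveFacetCollector("facet_42"))
(or wrapped in a CollectorManager on newer Lucene versions) for each
field and keep the top-3 entries from getCounts().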