On Mon, 2013-03-18 at 08:34 +0100, sivaprasad wrote:
> We have configured Solr for 5000 facet fields as part of the request
> handler. We have 10811177 docs in the index.
>
> The Solr server machine is quad core with 12 GB of RAM.
>
> When we are querying with facets, we are getting an out of memory error.
Solr's faceting treats each field separately. This makes it flexible, but it also means that there is both a speed and a memory penalty when the number of fields rises.

It depends on what you are faceting on, but let's say that you are faceting on Strings and that each field has 200 unique values. For each field, a list with #docs entries of log2(#unique_values) bits each will be maintained. With 11M documents and 200 unique values, this is 11M * 8 bits = 88 Mbit ~= 11 MByte. There is more overhead than this, but it is irrelevant for this back-of-the-envelope calculation. 5000 fields @ 11 MByte is about 55 GB for faceting.

If you instead had a single field with 200 * 5000 unique values, the memory penalty would be 11M * log2(200 * 5000) bits = 11M * 20 bits ~= 30 MB, plus some extra overhead.

It seems that the way forward is to see if you can somehow reduce your requirements from the heavy "facet on 5000 fields" to something more manageable. Do you always facet on all the fields for each call? If not, you could create a single facet field and prefix all values with the name of the logical field:

  field1/value1a
  field1/value1b
  field2/value2a
  field2/value2b
  field2/value2c

and so on. To perform faceting on field2, make a facet prefix query for "field2/" (there is a rough SolrJ sketch of this in the PPS below).

If you do need to facet on all 5000 fields each time, you could just repeat the above 5000 times. It will work and take little memory, but it will likely take far too long.

If you are feeling really adventurous, take a look at
https://issues.apache.org/jira/browse/SOLR-2412
It creates a single structure for a multi-field request, meaning that only a single 11M entry array is created for the 11M documents. The full memory overhead should be about the same as with a single field. I haven't tested SOLR-2412 on anything near your corpus, but it is a very interesting test case.

> What we observed is, if we have a larger number of facets we need to have
> more RAM allocated for the JVM. In this case we need to scale up the
> system as and when we add more facets.
>
> To scale out the system, do we need to go with distributed search?

That would work if you do not need to facet on all fields all the time. If you do need to facet on all fields on each call, you will need to scale out to many machines to get proper performance, and the merging overhead will likely be huge.

Regards,
Toke Eskildsen
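PS: For what it is worth, here is the back-of-the-envelope estimate above as a small Java calculation, so you can plug in your own numbers. The 200 unique values per field is just my assumption, and the real overhead will be somewhat higher than these figures:

  // Rough memory estimate for Solr String faceting: one packed ordinal
  // per document, ceil(log2(#unique_values)) bits per entry.
  public class FacetMemoryEstimate {
    public static void main(String[] args) {
      long docs = 10811177L;   // documents in the index (from your mail)
      int values = 200;        // assumed unique values per field
      int fields = 5000;       // facet fields

      int bitsPerEntry = 32 - Integer.numberOfLeadingZeros(values - 1);           // 8 bits
      long bytesPerField = docs * bitsPerEntry / 8;                               // ~11 MB
      long bytesAllFields = bytesPerField * fields;                               // ~54 GB

      int bitsCombined = 32 - Integer.numberOfLeadingZeros(values * fields - 1);  // 20 bits
      long bytesCombined = docs * bitsCombined / 8;                               // ~27 MB

      System.out.println("One field:            " + bytesPerField + " bytes");
      System.out.println("5000 separate fields: " + bytesAllFields + " bytes");
      System.out.println("One combined field:   " + bytesCombined + " bytes");
    }
  }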
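PPS: And a rough SolrJ sketch of the single prefixed facet field idea. The field name "facet_all" and the sample values are made up for illustration, and I have not run this against a real index:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.FacetField;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrInputDocument;

  public class PrefixFacetSketch {
    public static void main(String[] args) throws Exception {
      HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

      // Indexing: collapse all facet values into one multi-valued String
      // field, each value prefixed with the name of its logical field.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc1");
      doc.addField("facet_all", "field1/value1a");
      doc.addField("facet_all", "field2/value2a");
      server.add(doc);
      server.commit();

      // Querying: facet on the combined field, restricted to one logical
      // field with facet.prefix.
      SolrQuery query = new SolrQuery("*:*");
      query.setRows(0);
      query.setFacet(true);
      query.addFacetField("facet_all");
      query.setFacetPrefix("facet_all", "field2/");
      query.setFacetLimit(-1);

      QueryResponse response = server.query(query);
      FacetField facets = response.getFacetField("facet_all");
      for (FacetField.Count count : facets.getValues()) {
        // count.getName() still carries the "field2/" prefix; strip it
        // before showing the value to users.
        System.out.println(count.getName() + " -> " + count.getCount());
      }
    }
  }

Faceting on all 5000 logical fields would then mean issuing 5000 such prefix requests in sequence, which is exactly the "will work, but far too slow" case above.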