On Mon, 2013-03-18 at 08:34 +0100, sivaprasad wrote:
> We have configured solr for 5000 facet fields as part of the request handler. We
> have 10811177 docs in the index.
> 
> The solr server machine is quad core with 12 gb of RAM.
> 
> When we are querying with facets, we are getting an out of memory error.

Solr's faceting treats each field separately. This makes it flexible,
but it also means that both the speed penalty and the memory penalty
grow as the number of fields rises.

It depends on what you are faceting on, but let's say that you are
faceting on Strings and that each field has 200 unique values. For each
field, a list with #docs entries of ceil(log2(#unique_values)) bits each
will be maintained. With 11M documents and 200 unique values, this is
11M * 8 bits = 88 Mbit ~= 11 MByte. There is more overhead than this, but
it is irrelevant for this back-of-the-envelope calculation.

5000 fields @ 11 MByte is about 55GB for faceting.

If you had a single field with 200 * 5000 unique values, the memory
penalty would be 11M * ceil(log2(200*5000)) bits = 11M * 20 bits ~= 28MB,
plus some extra overhead.
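
Side by side, back-of-the-envelope and ignoring the shared overhead, the
two layouts come out as

  5000 separate fields: 5000 * 11M docs *  8 bits ~= 55 GByte
  1 combined field    :        11M docs * 20 bits ~= 28 MByte

which is roughly a factor 2000 in memory.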

It seems that the way forward is to see if you can somehow reduce your
requirements from the heavy "facet on 5000 fields" to something more
manageable.

Do you always facet on all the fields for each call? If not, you could
create a single combined facet field and prefix each value with the name
of the field it came from:

field1/value1a
field1/value1b
field2/value2a
field2/value2b
field2/value2c

and so on. To perform faceting on field 2, make a facet prefix query for
"field2/".


If you do need to facet on all 5000 fields each time, you could just
repeat the above 5000 times. It will work and use little memory, but it
will most likely be far too slow.
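
Reusing the server from the sketch above, faceting on everything would
amount to one prefix-restricted request per original field, e.g.

  // One request per original field; the 5000 round trips are where the
  // time goes.
  SolrQuery query = new SolrQuery("*:*");
  query.setRows(0);
  query.setFacet(true);
  query.addFacetField("all_facets");
  query.setFacetLimit(-1);
  for (int i = 1; i <= 5000; i++) {
    query.setFacetPrefix("field" + i + "/");
    FacetField ff = server.query(query).getFacetField("all_facets");
    // ... collect the counts for field i from ff
  }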

If you are feeling really adventurous, take a look at SOLR-2412
(https://issues.apache.org/jira/browse/SOLR-2412). It creates a single
structure for a multi-field request, meaning that
only a single 11M entry array will be created for the 11M documents. The
full memory overhead should be around the same as with a single field.

I haven't tested SOLR-2412 on anything near your corpus, but it is a
very interesting test case.

> What we observed is, if we have a larger number of facets we need to
> have more RAM allocated for the JVM. In this case we need to scale up
> the system as and when we add more facets.
> 
> To scale out the system, do we need to go with distributed search?

That would work if you do not need to facet on all fields all the time.
If you do need to facet on all fields on each call, you will need to
scale out to many machines to get proper performance, and the merging
overhead will likely be huge.
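
For reference, plain (non-SolrCloud) distributed faceting is just the
usual request with a shards parameter listing the cores to merge from;
the host names below are placeholders:

  // Same query object as before; Solr fans out the request to the
  // listed shards and merges the facet counts.
  query.set("shards", "solr1:8983/solr/core1,solr2:8983/solr/core1");
  FacetField ff = server.query(query).getFacetField("all_facets");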

Regards,
Toke Eskildsen
