Hello, I'm struggling with a large dataset indexed and searched by Solr.

The schema of the documents consists of a date (YYYY-MM-DD), text (tokenized
with the Natural Language Toolkit before indexing), and several numerical
fields.
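
For concreteness, here is a minimal SolrJ sketch of what one of these
documents looks like at index time. The core URL and the field names (id,
date, text, score_value) are hypothetical placeholders for my actual schema:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexOneDoc {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL and field names.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-000001");
        doc.addField("date", "2014-01-01");                   // date field (YYYY-MM-DD)
        doc.addField("text", "tokens produced by NLTK ...");  // tokenized text field
        doc.addField("score_value", 42L);                     // one of the numerical fields
        solr.add(doc);
        solr.commit();
    }
}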

Each document is small, but the number of documents is very large, around
10 million per date. The server has 32GB of memory, and I allocated around
30GB to the Solr JVM.

My Solr server has to return documents sorted by one of the numerical
fields when it is queried with a specific date and text (e.g.
q=date:YYYY-MM-DD+text:KEYWORD). The problem is that sorting in Lucene
requires a lot of FieldCache, and Solr can't manage the FieldCache well.
The FieldCache grows as more queries are executed and is never evicted.
When the whole heap fills up with FieldCache entries, the Solr server
stalls or throws an OutOfMemoryError.
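
To make the failure mode concrete, here is roughly what such a request
looks like through SolrJ (the core URL and the numerical field name
score_value are hypothetical). The first sorted query un-inverts the sort
field for every document visible to the searcher and keeps that array on
the heap, which is what fills the FieldCache:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SortedSearch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");

        SolrQuery query = new SolrQuery("date:2014-01-01 AND text:KEYWORD");
        // Sorting on the numerical field is what populates the FieldCache.
        query.setSort("score_value", SolrQuery.ORDER.desc);
        query.setRows(10);

        QueryResponse rsp = solr.query(query);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}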

Solr exposes no way to control the Lucene FieldCache, so I'm having a
difficult time solving this problem. I'm considering these three
approaches:

1) Add more memory.
This can relieve the problem, but I don't think it can completely solve
it. Eventually the memory would still fill up with FieldCache entries as
the server handles more search requests.
2) Separate the numerical data from the text data.
I find Solr/Lucene unsuitable for sorting large amounts of numerical data.
Therefore I'm thinking of storing the numerical data in another DB (HBase,
MongoDB, ...) and letting the Solr server handle only the text search (see
the sketch after this list).
3) Switch to Elasticsearch.
According to this page (
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html)
Elasticsearch can bound and evict its field data cache (e.g. via
indices.fielddata.cache.size). I think ES could solve my problem.
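
To illustrate the 2nd option, here is a rough sketch of how the split
could work, assuming a hypothetical lookup function against the external
DB (keyed by document id). Solr only filters by date and text and returns
ids; the sort happens outside Lucene, so no FieldCache entry is built for
the numerical field:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class ExternalSortSearch {
    // Stand-in for a real HBase/MongoDB lookup; not an actual client call.
    static long numericValueFor(String docId) {
        return 0L; // fetch the numerical field for docId from the external DB
    }

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");

        SolrQuery query = new SolrQuery("date:2014-01-01 AND text:KEYWORD");
        query.setFields("id");   // Solr only needs to return ids
        query.setRows(10000);    // page through results in practice

        List<String> ids = new ArrayList<String>();
        for (SolrDocument doc : solr.query(query).getResults()) {
            ids.add((String) doc.getFieldValue("id"));
        }

        // Sort by the value held in the external store instead of in Lucene.
        Collections.sort(ids, new Comparator<String>() {
            public int compare(String a, String b) {
                return Long.compare(numericValueFor(b), numericValueFor(a)); // descending
            }
        });
        System.out.println(ids.subList(0, Math.min(10, ids.size())));
    }
}

This naive version pulls all matching ids into memory and hits the DB once
per comparison; in practice the sort (or at least a top-N) would have to
be pushed down into the external store.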

I'm likely to try the 2nd or 3rd approach. Are these appropriate
solutions? If you have any better ideas, please let me know. I've gone
through too much trouble already, so it's time to make a decision. I'd
like my options reviewed by other experienced Solr users and developers,
and I'm also open to better solutions.
I really appreciate any help you can provide.
