Hi Solr Users,

I'm looking for advice on best practices when indexing large documents
(text files of hundreds of MB, or even 1 to 2 GB). I've been hunting around
on Google and the mailing list, and have found some suggestions to split
the logical document up into multiple Solr documents. However, I haven't
been able to find anything that seems like conclusive advice.

Some background...

We've been using Solr with great success for some time on a project that is
mostly indexing very structured data - i.e. data mainly ingested through
DIH.

I've now started a new project and we're trying to make use of Solr again -
however, in this project we are indexing mostly unstructured data - PDFs,
PowerPoint, Word, etc. I've not done much configuration - my Solr instance
is very close to the example provided in the distribution aside from some
minor schema changes. Our index is relatively small at this point (~3k
documents), and for initial indexing I am pulling documents from an HTTP
data source, running them through Tika, and then pushing to Solr using
SolrJ. For the most part this is working great... until I hit one of these
huge text files and then get an OOM on indexing.

I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at
it, but it seems like maybe there's a more robust solution that would scale
better.

Is splitting the logical document into multiple Solr documents best
practice here? If so, what are the considerations or pitfalls of doing this
that I should be paying attention to? I guess when querying I would always
need to group on a field to prevent multiple hits for the same document. Are
there issues with term frequency, etc. that you need to work around?
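
To make the splitting idea concrete, here is a rough sketch of what I have in
mind. It is not working indexing code: ChunkSplitter is a hypothetical helper
of my own (not part of Solr or SolrJ), the field names id, parent_id,
chunk_seq and content are assumptions about a schema I haven't settled on,
and it uses only the JDK so it can be read on its own. Each chunk is handed
to a callback so it could be sent and discarded rather than held in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper: splits one logical document into fixed-size chunks.
final class ChunkSplitter {

    // Reads from `in` in pieces of at most chunkChars characters and passes each
    // piece to `sink` as a flat field map. Returns the number of chunks emitted.
    // Field names here are assumptions, not anything Solr requires.
    static int split(Reader in, String parentId, int chunkChars,
                     Consumer<Map<String, String>> sink) throws IOException {
        if (chunkChars <= 0) {
            throw new IllegalArgumentException("chunkChars must be positive");
        }
        char[] buf = new char[chunkChars];
        int seq = 0;
        while (true) {
            int filled = 0;
            // Reader.read may return fewer chars than asked for, so keep
            // reading until the buffer is full or the stream ends.
            while (filled < chunkChars) {
                int n = in.read(buf, filled, chunkChars - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
            if (filled == 0) {
                break; // nothing left to emit
            }
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("id", parentId + "_" + seq);        // unique per chunk
            doc.put("parent_id", parentId);             // shared by all chunks
            doc.put("chunk_seq", Integer.toString(seq));
            doc.put("content", new String(buf, 0, filled));
            sink.accept(doc);
            seq++;
            if (filled < chunkChars) {
                break; // short chunk means end of stream
            }
        }
        return seq;
    }
}
```

In the real thing the callback would presumably build a SolrInputDocument
from each map and add it through SolrJ, and at query time I assume I'd use
result grouping (group=true&group.field=parent_id) to collapse the chunks
back to one hit per logical document. Note that this naive version cuts at
arbitrary character offsets, so a word or phrase spanning a chunk boundary
would not match; splitting on paragraph boundaries or overlapping the chunks
is probably needed.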

Really interested to hear how others are dealing with this.

Thanks everyone!
Stephen
