Hi Solr Users,

I'm looking for advice on best practices for indexing large documents (text files of hundreds of MB, or even 1 to 2 GB). I've been hunting around on Google and the mailing list and have found some suggestions to split the logical document into multiple Solr documents. However, I haven't been able to find anything that reads as conclusive advice.
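To make the splitting idea concrete, here is a rough sketch of what I picture, not something I have running. It assumes the extracted text can be consumed as a java.io.Reader, and the field names (id, parent_id, chunk_seq, content) are placeholders I made up rather than anything from the example schema. Each chunk is handed to a callback as soon as it is cut, so only one chunk needs to be in memory at a time; in a real indexer the callback would copy the map into a SolrInputDocument and add it through SolrJ.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

public class DocumentChunker {

    /**
     * Streams text from the reader and passes one field map per chunk to the sink.
     * Chunks hold at most chunkChars characters and are cut after the last
     * whitespace in a full buffer, so words are not split unless a single word
     * is longer than the buffer. Returns the number of chunks emitted.
     */
    public static int chunk(String parentId, Reader text, int chunkChars,
                            Consumer<Map<String, Object>> sink) {
        char[] buf = new char[chunkChars];
        int filled = 0; // number of valid characters currently in buf
        int seq = 0;    // position of the next chunk within the parent document
        try {
            while (true) {
                int n = text.read(buf, filled, chunkChars - filled);
                boolean eof = n < 0;
                if (n > 0) {
                    filled += n;
                }
                if (!eof && filled < chunkChars) {
                    continue; // keep reading until the buffer is full or input ends
                }
                if (filled == 0) {
                    break; // input ended with nothing left to emit
                }
                int cut = filled;
                if (!eof) {
                    // Buffer is full: back up to just after the last whitespace.
                    for (int i = filled - 1; i > 0; i--) {
                        if (Character.isWhitespace(buf[i])) {
                            cut = i + 1;
                            break;
                        }
                    }
                }
                Map<String, Object> doc = new LinkedHashMap<>();
                doc.put("id", parentId + "_" + seq);   // hypothetical unique key
                doc.put("parent_id", parentId);        // hypothetical grouping field
                doc.put("chunk_seq", seq);             // hypothetical ordering field
                doc.put("content", new String(buf, 0, cut));
                sink.accept(doc);
                seq++;
                // Carry the partial word at the end of the buffer into the next chunk.
                System.arraycopy(buf, cut, buf, 0, filled - cut);
                filled -= cut;
                if (eof) {
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seq;
    }
}
```

Calling DocumentChunker.chunk("doc1", reader, 1000000, doc -> ...) would then produce documents doc1_0, doc1_1, and so on, all sharing parent_id "doc1".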
Some background: we've been using Solr with great success for some time on a project that mostly indexes very structured data, mainly ingested through DIH. I've now started a new project where we're trying to use Solr again, but this time we are indexing mostly unstructured data: PDFs, PowerPoint, Word, etc.

I've not done much configuration; my Solr instance is very close to the example provided in the distribution, aside from some minor schema changes. Our index is relatively small at this point (~3k documents). For initial indexing I am pulling documents from an HTTP data source, running them through Tika, and then pushing them to Solr using SolrJ. For the most part this works great, until I hit one of these huge text files and get an OOM during indexing.

I've got a modest JVM with 4 GB allocated. Obviously I can throw more memory at it, but it seems like there may be a more robust solution that would scale better.

Is splitting the logical document into multiple Solr documents the best practice here? If so, what considerations or pitfalls should I be paying attention to? I guess that when querying I would always need to group by a field to prevent multiple hits for the same document. Are there issues with term frequency, etc. that you need to work around?

I'm really interested to hear how others are dealing with this.

Thanks everyone!

Stephen
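P.S. For the grouping part of the question, what I have in mind is Solr's result grouping parameters (group, group.field, group.limit), so that each logical document comes back as a single group. The sketch below only builds the query string; parent_id is a hypothetical field holding the id of the original logical document, and the class and method names are mine.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupedQuery {

    /** Builds a URL-encoded parameter string that groups hits by the given field. */
    public static String build(String userQuery, String groupField) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", userQuery);
        params.put("group", "true");            // turn on result grouping
        params.put("group.field", groupField);  // one group per logical document
        params.put("group.limit", "1");         // return only the top chunk per group
        return params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
    }

    private static String encode(String s) {
        // The Charset overload of URLEncoder.encode needs Java 10 or later.
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

The resulting string would be appended to the select handler URL. Whether grouping like this interacts badly with term frequency and scoring across chunks is exactly the part I'm unsure about.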