Hi,

I think you probably want to split giant documents because you / your users will want to find the smaller sections of those big docs that best match their queries. Imagine querying War and Peace: almost any common word you query for will produce a match.

Yes, you may want to enable field collapsing aka grouping. One caveat: I've seen facet counts get messed up when grouping is turned on, but I haven't confirmed whether that is a (known) bug.
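For illustration, here's a minimal SolrJ sketch of such a grouped query, assuming each chunk was indexed as its own Solr document carrying a parent_id field pointing back at the logical source document (the core name, query, and field names are my assumptions, not from this thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.Group;
import org.apache.solr.client.solrj.response.GroupCommand;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupedQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("content:napoleon");
        q.set("group", true);               // collapse chunk hits into groups
        q.set("group.field", "parent_id");  // one group per logical document
        q.set("group.limit", 3);            // keep the best 3 chunks per document

        QueryResponse rsp = solr.query(q);
        for (GroupCommand cmd : rsp.getGroupResponse().getValues()) {
            for (Group g : cmd.getValues()) {
                System.out.println(g.getGroupValue() + ": "
                        + g.getResult().getNumFound() + " matching chunks");
            }
        }
        solr.shutdown();
    }
}

If facet counts do look off with grouping enabled, group.truncate (compute facet counts over only the most relevant document of each group) is worth experimenting with.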
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Mar 18, 2014 at 10:52 PM, Stephen Kottmann <
stephen_kottm...@h3biomedicine.com> wrote:

> Hi Solr Users,
>
> I'm looking for advice on best practices when indexing large documents
> (100's of MB or even 1 to 2 GB text files). I've been hunting around on
> Google and the mailing list, and have found some suggestions of splitting
> the logical document up into multiple Solr documents. However, I haven't
> been able to find anything that seems like conclusive advice.
>
> Some background...
>
> We've been using Solr with great success for some time on a project that
> mostly indexes very structured data, i.e. mainly ingested through DIH.
>
> I've now started a new project and we're trying to make use of Solr
> again; however, in this project we are indexing mostly unstructured data:
> PDFs, PowerPoint, Word, etc. I've not done much configuration; my Solr
> instance is very close to the example provided in the distribution, aside
> from some minor schema changes. Our index is relatively small at this
> point (~3k documents), and for initial indexing I am pulling documents
> from an HTTP data source, running them through Tika, and then pushing to
> Solr using SolrJ. For the most part this is working great... until I hit
> one of these huge text files and then OOM on indexing.
>
> I've got a modest JVM: 4GB allocated. Obviously I can throw more memory
> at it, but it seems like maybe there's a more robust solution that would
> scale better.
>
> Is splitting the logical document into multiple Solr documents best
> practice here? If so, what are the considerations or pitfalls of doing
> this that I should be paying attention to? I guess when querying I always
> need to group by a field to prevent multiple hits for the same document.
> Are there issues with term frequency, etc. that you need to work around?
>
> Really interested to hear how others are dealing with this.
>
> Thanks everyone!
> Stephen
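To make the split-then-group idea concrete, here is a minimal SolrJ + Tika sketch of chunked indexing that avoids buffering the whole extracted text: Tika streams the text through a SAX handler, and a Solr document is flushed every N characters. The field names (parent_id, chunk_seq, content) and the 100k-character chunk size are illustrative assumptions, not from the thread:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.io.InputStream;

public class ChunkingIndexer {
    private static final int CHUNK_CHARS = 100000; // flush roughly every 100k chars

    public static void index(final SolrServer solr, final String docId,
                             InputStream in) throws Exception {
        final StringBuilder buf = new StringBuilder();
        final int[] seq = {0};

        // SAX handler that receives Tika's extracted text incrementally and
        // flushes one Solr document per chunk, so the full file text never
        // has to sit in memory at once.
        ContentHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                buf.append(ch, start, length);
                if (buf.length() >= CHUNK_CHARS) {
                    try {
                        flush(solr, docId, seq[0]++, buf);
                    } catch (Exception e) {
                        throw new SAXException(e);
                    }
                }
            }
        };

        new AutoDetectParser().parse(in, handler, new Metadata());
        if (buf.length() > 0) {
            flush(solr, docId, seq[0]++, buf); // last partial chunk
        }
    }

    private static void flush(SolrServer solr, String docId, int seq,
                              StringBuilder buf) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", docId + "_" + seq); // unique key per chunk
        doc.addField("parent_id", docId);      // what the grouped query collapses on
        doc.addField("chunk_seq", seq);        // preserves chunk order
        doc.addField("content", buf.toString());
        solr.add(doc);
        buf.setLength(0);
    }
}

You'd call it with something like index(new HttpSolrServer("http://localhost:8983/solr/collection1"), docId, stream), and you'd still need a commit at the end (or rely on autoCommit). Chunk size is a tunable, and some overlap between adjacent chunks can help phrase queries that would otherwise straddle a boundary.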