On 4/19/2012 7:49 AM, Bram Rongen wrote:
Yesterday I started indexing again, but this time on Solr 3.6. Again Solr is
failing around the same time, though not at exactly the same point (now the
largest fdt file is 4.8GB). It happens right after I receive memory errors on
the Drupal side, which makes me suspect it has something to do with a huge
document. Is that possible? I was indexing 1500 documents at once every
minute. Drupal builds them all up in memory before submitting them to Solr.
At some point it runs out of memory and I have to switch to 10-20 documents
per minute for a while; then I can switch back to 1000 documents per minute.

The disk is a software RAID1 across two disks, but I've also run into the same
problem on another server. That one was a VM with only 1GB of RAM and 40GB of
disk, and on that server the repeating merge started at an earlier stage.

In an earlier attempt I also let Solr continue merging for about two days
without submitting any new documents; the merging kept repeating.

Somebody suggested it could be because I'm using Jetty. Could that be right?

I am using Jetty for my Solr installation and it handles very large indexes without a problem. I have created a single index with all my data (nearly 70 million documents, total index size over 100GB). Aside from how long it took to build and the fact that I don't have enough RAM to cache it for good performance, Solr handled it just fine. For production I use a distributed index on multiple servers.

I don't know why you are seeing a merge that continually restarts; that's truly odd. I've never used Drupal and don't know a lot about it. From my small amount of research just now, I assume that it uses Tika, another tool I have no experience with. I am guessing that you store the entire text of your documents in Solr, and that they are indexed up to a maximum of 10,000 tokens (the default value of maxFieldLength in solrconfig.xml). That guess is based purely on speculation about the "body" field in your schema.

A document that's 100MB in size, if the whole thing gets stored, will completely overwhelm a 32MB buffer, and might be enough to overwhelm a 256MB buffer as well, because Solr basically has to build the entire index segment in RAM: term vectors, indexed data, and stored data for all fields.

With such large documents, you may have to increase maxFieldLength, or you won't be able to search the entire document text. Depending on the content of those documents, it may or may not be a problem that only the first 10,000 tokens get indexed. Large documents tend to be repetitive, and there may not be much search value beyond the introduction and initial words. Your documents may be different, so you'll have to make that decision.
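
For reference, here is a minimal sketch of where that limit lives in a Solr
3.x solrconfig.xml, assuming the stock <indexDefaults> section from the
example config. The value shown is only illustrative, not a recommendation
for your data:

    <indexDefaults>
      <!-- default is 10000 tokens per field; raise it if the full text of
           very large documents needs to be searchable (illustrative value) -->
      <maxFieldLength>1000000</maxFieldLength>
    </indexDefaults>

Note that a changed limit only applies to documents indexed after the change,
so anything already truncated would need a reindex to pick up the new value.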

To test whether my current thoughts are right, I recommend trying the following settings during the initial full import: ramBufferSizeMB: 1024 (or maybe higher), autoCommit maxTime: 0, autoCommit maxDocs: 0. This means that unless the indexing process issues its own commits (either in the middle of indexing or at the end), you will have to send one manually. Once the initial index is built and you are only doing updates, you will probably be able to go back to using autoCommit.
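
As a sketch, assuming the stock <indexDefaults> and <updateHandler> sections
of a 3.x solrconfig.xml, those settings might look like this (commenting the
autoCommit block out entirely is one way to turn it off):

    <indexDefaults>
      <!-- default is 32; a much larger buffer gives huge documents room
           during the full import -->
      <ramBufferSizeMB>1024</ramBufferSizeMB>
    </indexDefaults>

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- autoCommit disabled for the initial full import -->
      <!--
      <autoCommit>
        <maxDocs>0</maxDocs>
        <maxTime>0</maxTime>
      </autoCommit>
      -->
    </updateHandler>

When the import finishes, you can issue the single manual commit by posting
<commit/> to the core's /update handler.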

It's possible that I have no understanding of the real problem here, and my recommendation above may result in no improvement. General recommendations, no matter what the current problem might be:

1) Get a lot more RAM. Ideally you want enough free memory to cache your entire index. That may not be possible, but you want to get as close to that goal as you can.

2) If you can, see what you can do to increase your IOPS. Mirrored high-RPM SAS is an easy solution, and might be slightly cheaper than SATA RAID10, which is my solution. SSD is easy and very fast, but expensive and not redundant -- I am currently not aware of any SSD RAID solutions that have OS TRIM support. RAID10 with high-RPM SAS would be best, but very expensive. On the extreme high end, you could go with a high performance SAN.

Thanks,
Shawn
