On 4/19/2012 7:49 AM, Bram Rongen wrote:
Yesterday I started indexing again, but this time on Solr 3.6. Again Solr is
failing around the same time, though not at exactly the same point (now the
largest fdt file is 4.8GB). It happens right after I receive memory errors on
the Drupal side, which makes me suspect it has something to do with a huge
document. Is that possible? I was indexing 1500 documents at once every
minute. Drupal builds them all up in memory before submitting them to Solr.
At some point it runs out of memory and I have to switch to 10-20 documents
per minute for a while; then I can switch back to 1000 documents per minute.

The disk is a software RAID1 across two disks, but I've also run into the same
problem on another server. That one was a VM with only 1GB of RAM and 40GB of
disk, and on that server the repeating merge started at an earlier stage.

In an earlier attempt I also let Solr continue merging for about two days
without submitting any new documents; the merging kept repeating.

Somebody suggested it could be because I'm using Jetty. Could that be right?

I am using Jetty for my Solr installation and it handles very large indexes without a problem. I have created a single index with all my data (nearly 70 million documents, total index size over 100GB). Aside from how long it took to build and the fact that I don't have enough RAM to cache it for good performance, Solr handled it just fine. For production I use a distributed index on multiple servers.

I don't know why you are seeing a merge that continually restarts; that's truly odd. I've never used Drupal and don't know a lot about it. From my small amount of research just now, I assume that it uses Tika, another tool I have no experience with. I am guessing that you store the entire text of your documents in Solr, and that they are indexed up to a maximum of 10,000 tokens (the default value of maxFieldLength in solrconfig.xml). That guess is based purely on speculation about the "body" field in your schema.

A document that's 100MB in size, if the whole thing gets stored, will completely overwhelm a 32MB buffer, and might be enough to overwhelm a 256MB buffer as well, because Solr basically has to build the entire index segment in RAM: term vectors, indexed data, and stored data for all fields.

With such large documents, you may have to increase maxFieldLength, or you won't be able to search the entire document text. Depending on the content of those documents, it may or may not be a problem that only the first 10,000 tokens get indexed. Large documents tend to be repetitive, and there may not be much search value beyond the introduction and initial words. Your documents may be different, so you'll have to make that decision.
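
For reference, here is a minimal sketch of where that limit lives in a Solr
3.x solrconfig.xml, assuming the stock <indexDefaults> section from the
example config. The value shown is only illustrative, not a recommendation
for your data:

    <indexDefaults>
      <!-- default is 10000 tokens per field; raise it if the full text of
           very large documents needs to be searchable (illustrative value) -->
      <maxFieldLength>1000000</maxFieldLength>
    </indexDefaults>

Note that a changed limit only applies to documents indexed after the change,
so anything already truncated would need a reindex to pick up the new value.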

To test whether my current thoughts are right, I recommend trying the following settings during the initial full import: ramBufferSizeMB: 1024 (or maybe higher), autoCommit maxTime: 0, autoCommit maxDocs: 0. This means that unless the indexing process issues its own commits (either in the middle of indexing or at the end), you will have to send one manually. Once the initial index is built and you are only doing updates, you will probably be able to go back to using autoCommit.
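
As a sketch, assuming the stock <indexDefaults> and <updateHandler> sections
of a 3.x solrconfig.xml, those settings might look like this (commenting the
autoCommit block out entirely is one way to turn it off):

    <indexDefaults>
      <!-- default is 32; a much larger buffer gives huge documents room
           during the full import -->
      <ramBufferSizeMB>1024</ramBufferSizeMB>
    </indexDefaults>

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- autoCommit disabled for the initial full import -->
      <!--
      <autoCommit>
        <maxDocs>0</maxDocs>
        <maxTime>0</maxTime>
      </autoCommit>
      -->
    </updateHandler>

When the import finishes, you can issue the single manual commit by posting
<commit/> to the core's /update handler.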

It's possible that I have no understanding of the real problem here, and my recommendation above may result in no improvement. General recommendations, no matter what the current problem might be:

1) Get a lot more RAM. Ideally you want enough free memory to cache your entire index. That may not be possible, but you want to get as close to that goal as you can.

2) If you can, see what you can do to increase your IOPS. Mirrored high-RPM SAS is an easy solution, and might be slightly cheaper than SATA RAID10, which is my solution. SSD is easy and very fast, but expensive and not redundant -- I am currently not aware of any SSD RAID solutions that have OS TRIM support. RAID10 with high-RPM SAS would be best, but very expensive. On the extreme high end, you could go with a high performance SAN.

Thanks,
Shawn
