On 4/19/2012 7:49 AM, Bram Rongen wrote:
Yesterday I started indexing again, this time on Solr 3.6. Again Solr is
failing around the same point, but not exactly the same (this time the
largest fdt file is 4.8GB). It happens right after I receive memory errors
on the Drupal side, which makes me suspect it has something to do with a
huge document. Is that possible? I was indexing 1500 documents at a time,
once a minute. Drupal builds them all up in memory before submitting them to
Solr. At some point it runs out of memory and I have to switch to 10-20
documents per minute for a while; then I can switch back to 1000 documents
per minute.
The disk is a software RAID1 over two disks, but I've also run into the same
problem on another server, a VM with only 1GB of RAM and 40GB of disk. On
that server the merge loop started at an earlier stage.
In an earlier attempt I also let Solr continue merging for about two days
without submitting new documents; the merging kept repeating.
Somebody suggested it could be because I'm using Jetty; could that be right?
I use Jetty for my Solr installation, and it handles very large
indexes without a problem. I have created a single index with all my
data (nearly 70 million documents, total index size over 100GB). Aside
from how long it took to build and the fact that I don't have enough
RAM to cache it for good performance, Solr handled it just fine. For
production I use a distributed index on multiple servers.
I don't know why you are seeing a merge that continually restarts;
that's truly odd. I've never used Drupal and don't know a lot about it.
From my small amount of research just now, I assume that it uses Tika,
another tool I have no experience with. I am guessing that you store
the entire text of your documents in Solr, and that they are indexed up
to a maximum of 10,000 tokens (the default value of maxFieldLength in
solrconfig.xml); this is based purely on speculation about the
"body" field in your schema.
A document that's 100MB in size, if the whole thing gets stored, will
completely overwhelm a 32MB buffer, and might even be enough to
overwhelm a 256MB buffer as well, because Solr basically has to
build the entire index segment in RAM, with term vectors, indexed data,
and stored data for all fields.
With such large documents, you may have to increase the maxFieldLength,
or you won't be able to search on the entire document text. Depending
on the content of those documents, it may or may not be a problem that
only the first 10,000 tokens will get indexed. Large documents tend to
be repetitive and there might not be any search value after the
introduction and initial words. Your documents may be different, so
you'll have to make that decision.
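If you do decide you need the entire text searchable, the usual trick is
to raise the limit to a very large value and reindex. A sketch only; the
value shown is an effectively-unlimited illustration (Integer.MAX_VALUE),
not a recommendation:

  <!-- Effectively no limit. Existing documents must be
       reindexed for the new limit to take effect. -->
  <maxFieldLength>2147483647</maxFieldLength>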
To test whether my current thinking is right, I recommend that you try
the following settings during the initial full import:
ramBufferSizeMB: 1024 (or maybe higher), autoCommit maxTime: 0,
autoCommit maxDocs: 0. This means that unless the indexing process
issues commits itself (either in the middle of indexing or at the end),
you will have to issue one manually. Once you have the initial index built
and it is only doing updates, you will probably be able to go back to
using autoCommit.
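In solrconfig.xml terms, that looks something like the sketch below;
ramBufferSizeMB goes in <indexDefaults> and autoCommit under
<updateHandler>, and the exact numbers are starting points to tune:

  <!-- Large indexing buffer for the initial full import. -->
  <ramBufferSizeMB>1024</ramBufferSizeMB>

  <!-- Zero values per the recommendation above; removing or commenting
       out the autoCommit block entirely also disables autocommit. -->
  <autoCommit>
    <maxDocs>0</maxDocs>
    <maxTime>0</maxTime>
  </autoCommit>

When the import finishes, issue the commit yourself, for example
(adjust the host and core path to your installation):

  curl 'http://localhost:8983/solr/update?commit=true'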
It's possible that I have no understanding of the real problem here, and
my recommendation above may result in no improvement. Some general
recommendations, no matter what the actual problem turns out to be:
1) Get a lot more RAM. Ideally you want to have enough free memory to
cache your entire index. That may not be possible, but you want to get
as close to that goal as you can.
2) See what you can do to increase your IOPS. Mirrored high-RPM SAS
drives are an easy solution, and might be slightly cheaper than SATA
RAID10, which is my solution. SSD is easy and very fast, but
expensive and not redundant -- I am currently not aware of any SSD RAID
solutions that have OS TRIM support. RAID10 with high-RPM SAS would be
best, but very expensive. On the extreme high end, you could go with a
high-performance SAN.
Thanks,
Shawn