Now running the tests on a slightly reduced setup (2 machines, quad-core,
8GB RAM ...), but that doesn't matter.
We see that storing/indexing speed drops when using
IndexWriter.updateDocument in DirectUpdateHandler2.addDoc, but it does
not drop when just using IndexWriter.addDocument (update requests with
overwrite=false).
Using addDocument:
https://dl.dropboxusercontent.com/u/25718039/AddDocument_2Solr8GB_DocCount.png
Using updateDocument:
https://dl.dropboxusercontent.com/u/25718039/UpdateDocument_2Solr8GB_DocCount.png
We are not too happy about having to use addDocument, because that
allows for duplicates, and we would really like to avoid that (at the
Solr/Lucene level).
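
To make the duplicate concern concrete: Lucene documents updateDocument as
deleting the document(s) containing the given term and then adding the new
document, while addDocument only appends. The toy Python model below is not
Lucene code (the class ToyIndex and its methods are hypothetical names); it
is only a minimal sketch of that semantic difference.

```python
class ToyIndex:
    """Toy stand-in for an index keyed by a unique id. NOT Lucene code."""

    def __init__(self):
        self.docs = []  # stored documents as (doc_id, body) tuples

    def add_document(self, doc_id, body):
        # Mirrors addDocument semantics: append, never look for an existing id.
        self.docs.append((doc_id, body))

    def update_document(self, doc_id, body):
        # Mirrors updateDocument semantics: delete by key, then add.
        self.docs = [d for d in self.docs if d[0] != doc_id]
        self.docs.append((doc_id, body))

    def count(self, doc_id):
        # Number of stored documents carrying this id.
        return sum(1 for d in self.docs if d[0] == doc_id)


appended = ToyIndex()
appended.add_document("doc-1", "first version")
appended.add_document("doc-1", "second version")

updated = ToyIndex()
updated.update_document("doc-1", "first version")
updated.update_document("doc-1", "second version")

print(appended.count("doc-1"), updated.count("doc-1"))  # prints "2 1"
```

The update path has to resolve the key against everything already indexed,
which the append path never does.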
We have confirmed that doubling the total amount of RAM doubles the
number of documents in the index at which the indexing speed starts
dropping (when we use updateDocument).
On
https://dl.dropboxusercontent.com/u/25718039/UpdateDocument_2Solr8GB_DocCount.png
you can see that the speed drops at around 120M documents. Running the
same test, but with the Solr machines having 16GB RAM (instead of 8GB),
the speed drops at around 240M documents.
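
The two measurements above work out to the same ratio of documents per GB of
RAM. The small Python sketch below just redoes that arithmetic; the function
name estimated_drop_point is hypothetical, and extrapolating linearly beyond
the two measured points is an assumption, not something that has been tested.

```python
# RAM per Solr machine (GB) -> document count at which speed started dropping,
# taken from the two test runs described above.
observations = {8: 120_000_000, 16: 240_000_000}

# Documents per GB of RAM at the drop point, per run.
docs_per_gb = {ram: docs / ram for ram, docs in observations.items()}


def estimated_drop_point(ram_gb, ratio=15_000_000):
    # ASSUMPTION: the linear relation seen in two runs keeps holding.
    return ram_gb * ratio


print(docs_per_gb)               # both runs give 15 million docs per GB
print(estimated_drop_point(32))  # linear guess for a 32GB machine
```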
Any comments on why indexing speed drops with IndexWriter.updateDocument
but not with IndexWriter.addDocument?
Regards, Per Steffensen
On 9/12/13 10:14 AM, Per Steffensen wrote:
Seems like the attachments didn't make it through to this mailing list:
https://dl.dropboxusercontent.com/u/25718039/doccount.png
https://dl.dropboxusercontent.com/u/25718039/iowait.png
On 9/12/13 8:25 AM, Per Steffensen wrote:
Hi
SolrCloud 4.0: 6 machines, quad-core, 8GB RAM, 1TB disk, one Solr node
on each, one collection across the 6 nodes, 4 shards per node.
Storing/indexing from 100 threads on external machines, each thread
one doc at a time, full speed (they always have a new doc to
store/index).
See attached images
* iowait.png: Measured I/O wait on the Solr machines
* doccount.png: Measured number of docs in the Solr collection
Starting from an empty collection, things are fine wrt
storing/indexing speed for the first two to three hours (100M docs per
hour), then speed goes down dramatically, to a level that is
unacceptable for us (max 10M per hour). At the same time as speed goes
down, we see that I/O wait increases dramatically. I am not 100% sure,
but a quick investigation has shown that this is due to almost constant
merging.
What to do about this problem?
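
For a sense of scale, the hourly figures above can be converted to rates per
second and per client thread. This is only arithmetic on the numbers in this
mail (100M and 10M docs per hour, 100 threads); the function names are
hypothetical.

```python
SECONDS_PER_HOUR = 3600
CLIENT_THREADS = 100  # from the setup described above


def docs_per_second(docs_per_hour):
    # Cluster-wide indexing rate in documents per second.
    return docs_per_hour / SECONDS_PER_HOUR


def docs_per_second_per_thread(docs_per_hour, threads=CLIENT_THREADS):
    # Average rate each client thread sustains.
    return docs_per_second(docs_per_hour) / threads


for label, rate in (("before", 100_000_000), ("after", 10_000_000)):
    print(label, round(docs_per_second(rate)),
          round(docs_per_second_per_thread(rate), 1))
```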
I know that you can play around with mergeFactor and commit rate, but
earlier tests show that this does not really seem to do the job: it
might postpone the time at which the problem occurs, but basically it
is just a matter of time before merging exhausts the system.
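
The effect of mergeFactor on merge work can be sketched with a deliberately
simplified, level-based merge model in Python. This is an assumption-laden
toy, not Solr's or Lucene's actual merge policy: every flush produces one
segment of size 1, and as soon as merge_factor segments exist on a level they
are rewritten into one segment on the next level. The function name simulate
and its parameters are hypothetical.

```python
def simulate(flushes, merge_factor):
    """Return (segments_left, units_rewritten_by_merges) for the toy model."""
    levels = [0]   # levels[i] = segments on level i, each of size merge_factor**i
    rewritten = 0  # total size of all segments written by merges
    for _ in range(flushes):
        levels[0] += 1  # a flush creates one new level-0 segment
        i = 0
        while levels[i] == merge_factor:
            # Merge a full level into a single segment one level up.
            levels[i] = 0
            rewritten += merge_factor ** (i + 1)
            if i + 1 == len(levels):
                levels.append(0)
            levels[i + 1] += 1
            i += 1
    return sum(levels), rewritten


for mf in (2, 10):
    print(mf, simulate(1000, mf))
```

In this toy model a larger merge_factor lowers the total amount of data
rewritten by merges, but the merge work per flushed unit still grows with the
size of the index, so it never goes away entirely.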
Is there a way to totally avoid merging, and keep indexing speed at a
high level, while still making sure that searches will perform fairly
well when data amounts become big? (I guess that without merging you
will end up with lots and lots of "small" files, and I guess this is
not good for search response time.)
Regards, Per Steffensen