On 7-10-2010 5:36, Otis Gospodnetic wrote:
Hi Tom,


Do you use multiple threads for indexing? Large RAM buffer size is
also good, but I think perf peaks out maybe around 512 MB (at least
based on past tests)?

We are using Solr; I'm not sure if Solr uses multiple threads for indexing.
We have 30 "producers" each sending documents to 1 of 12 Solr shards on a
round-robin basis, so each shard will get multiple requests.

Solr itself doesn't use multiple threads for indexing, but you can easily do
that on the client side.  SolrJ's StreamingUpdateSolrServer is the simplest
thing to use for this.
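
For reference, a bare-bones sketch of that approach; the URL, queue size, and
thread count below are just placeholders, not recommended settings:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingIndexer {
    public static void main(String[] args) throws Exception {
        // internal queue of 1000 docs, drained by 4 background threads
        SolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");
        server.add(doc);    // queued and sent asynchronously by the background threads
        server.commit();
    }
}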

As far as I know it is the simplest, but I found that it uses a lot of CPU overhead because it doesn't support the javabin format for requests to Solr, so everything gets marshalled to XML. Building our own queue with multiple CommonsHttpSolrServers with the BinaryRequestWriter set greatly improved our throughput, as it reduced the CPU load both on the machine that gathered the documents and on the machine running the Solr server.
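
Something along these lines, heavily simplified; the class and field names are
made up for illustration, only the SolrJ calls (CommonsHttpSolrServer,
BinaryRequestWriter) are the real ones:

import java.util.List;
import java.util.concurrent.BlockingQueue;

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BinaryIndexWorker implements Runnable {

    private final CommonsHttpSolrServer server;
    private final BlockingQueue<List<SolrInputDocument>> queue;

    public BinaryIndexWorker(String shardUrl,
                             BlockingQueue<List<SolrInputDocument>> queue)
            throws Exception {
        this.server = new CommonsHttpSolrServer(shardUrl);
        // javabin instead of XML for update requests
        this.server.setRequestWriter(new BinaryRequestWriter());
        this.queue = queue;
    }

    public void run() {
        try {
            while (true) {
                List<SolrInputDocument> batch = queue.take(); // block until a batch is queued
                server.add(batch);                            // one update request per batch
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

One worker like this per shard (or a few), all pulling batches from the
producers' queue, is the general idea.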

Thijs



Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
From: "Burton-West, Tom"<tburt...@umich.edu>
To: "solr-user@lucene.apache.org"<solr-user@lucene.apache.org>
Sent: Wed, October 6, 2010 9:57:12 PM
Subject: RE: Experience with large merge factors

Hi Mike,

Do you use multiple threads for indexing? Large RAM buffer size is
also good, but I think perf peaks out maybe around 512 MB (at least
based on past tests)?

We are using Solr; I'm not sure if Solr uses multiple threads for indexing.
We have 30 "producers" each sending documents to 1 of 12 Solr shards on a
round-robin basis, so each shard will get multiple requests.
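
Purely for illustration, round-robin distribution on the producer side can be
as simple as the sketch below (the names are invented; this is not anyone's
actual code):

import java.util.concurrent.atomic.AtomicLong;

public class ShardPicker {

    private final String[] shardUrls;            // e.g. the 12 shard URLs
    private final AtomicLong counter = new AtomicLong();

    public ShardPicker(String[] shardUrls) {
        this.shardUrls = shardUrls;
    }

    // Each producer asks for the next shard; requests cycle through all shards.
    public String next() {
        int i = (int) (counter.getAndIncrement() % shardUrls.length);
        return shardUrls[i];
    }
}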

Believe it or not, merging is typically compute bound.  It's costly to
decode & re-encode all the vInts.

Sounds like we need to do some monitoring during merging to see what the CPU
use is and also the I/O wait during large merges.
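
For anyone curious about the vInt cost mentioned above: the postings use a
variable-length integer encoding along the lines of the stand-alone sketch
below (an illustration, not Lucene's actual code), and merging has to decode
and re-encode every posting through loops like these:

import java.io.ByteArrayOutputStream;

public class VIntDemo {

    // Encode one int using 7 payload bits per byte; the high bit of each
    // byte marks that more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Decode the vInt that starts at offset (offset bookkeeping omitted).
    static int readVInt(byte[] buf, int offset) {
        byte b = buf[offset++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[offset++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, 300);                                  // encodes to two bytes
        System.out.println(readVInt(out.toByteArray(), 0));   // prints 300
    }
}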

A larger merge factor is good because it means the postings are copied
fewer times, but it's bad because you could risk running out of file
descriptors, and, if the OS doesn't have enough RAM, you'll start to
thin out the readahead that the OS can do (which makes the merge less
efficient since the disk heads are seeking more).

Is there a way to estimate the amount of RAM for the readahead?  Once we
start the re-indexing we will be running 12 shards on a 16-processor box with
144 GB of memory.

Do you do any deleting?

Deletes would happen as a byproduct of updating a record.  This shouldn't
happen too frequently during re-indexing, but we update records when a document
gets re-scanned and re-OCR'd.  This would probably amount to a few thousand.


Do you use stored fields and/or term vectors?  If so, try to make
your docs "uniform" if possible, i.e. add the same fields in the same
order.  This enables Lucene to use bulk byte-copy merging under the hood.

We use 4 or 5 stored fields.  They are very small compared to our huge OCR
field.  Since we construct our Solr documents programmatically, I'm fairly
certain that they are always in the same order.  I'll have to look at the code
when I get back to make sure.
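
Purely as an illustration of the "same fields, same order" advice above (the
field names here are invented, not the real schema):

import org.apache.solr.common.SolrInputDocument;

public class PageDocBuilder {

    // Every document adds the same fields in the same order, so Lucene can
    // bulk-copy the stored fields at merge time.
    static SolrInputDocument build(String id, String title, String ocrText) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("title", title);
        doc.addField("ocr", ocrText);
        return doc;
    }
}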

We aren't using term vectors now, but we plan to add them as well as a number
of fields based on MARC (cataloging) metadata in the future.

Tom
