Hi Tom,
> >> Do you use multiple threads for indexing?  Large RAM buffer size is
> >> also good, but I think perf peaks out maybe around 512 MB (at least
> >> based on past tests)?
>
> We are using Solr, I'm not sure if Solr uses multiple threads for indexing.
> We have 30 "producers" each sending documents to 1 of 12 Solr shards on a
> round-robin basis.  So each shard will get multiple requests.

Solr itself doesn't use multiple threads for indexing, but you can easily do
that on the client side.  SolrJ's StreamingUpdateSolrServer is the simplest
thing to use for this (a minimal sketch follows below the quoted message).

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


----- Original Message ----
> From: "Burton-West, Tom" <tburt...@umich.edu>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Wed, October 6, 2010 9:57:12 PM
> Subject: RE: Experience with large merge factors
>
> Hi Mike,
>
> >> Do you use multiple threads for indexing?  Large RAM buffer size is
> >> also good, but I think perf peaks out maybe around 512 MB (at least
> >> based on past tests)?
>
> We are using Solr, I'm not sure if Solr uses multiple threads for indexing.
> We have 30 "producers" each sending documents to 1 of 12 Solr shards on a
> round-robin basis.  So each shard will get multiple requests.
>
> >> Believe it or not, merging is typically compute bound.  It's costly to
> >> decode & re-encode all the vInts.
>
> Sounds like we need to do some monitoring during merging to see what the
> CPU use is, and also the I/O wait, during large merges.
>
> >> Larger merge factor is good because it means the postings are copied
> >> fewer times, but it's bad because you could risk running out of
> >> descriptors, and, if the OS doesn't have enough RAM, you'll start to
> >> thin out the readahead that the OS can do (which makes the merge less
> >> efficient since the disk heads are seeking more).
>
> Is there a way to estimate the amount of RAM for the readahead?  Once we
> start the re-indexing we will be running 12 shards on a 16-processor box
> with 144 GB of memory.
>
> >> Do you do any deleting?
>
> Deletes would happen as a byproduct of updating a record.  This shouldn't
> happen too frequently during re-indexing, but we update records when a
> document gets re-scanned and re-OCR'd.  This would probably amount to a
> few thousand documents.
>
> >> Do you use stored fields and/or term vectors?  If so, try to make
> >> your docs "uniform" if possible, i.e. add the same fields in the same
> >> order.  This enables Lucene to use bulk byte copy merging under the hood.
>
> We use 4 or 5 stored fields.  They are very small compared to our huge OCR
> field.  Since we construct our Solr documents programmatically, I'm fairly
> certain that they are always in the same order.  I'll have to look at the
> code when I get back to make sure.
>
> We aren't using term vectors now, but we plan to add them, as well as a
> number of fields based on MARC (cataloging) metadata, in the future.
>
> Tom
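
P.S. A minimal sketch of the multithreaded client-side indexing mentioned
above, using SolrJ's StreamingUpdateSolrServer.  The shard URL, queue size
(100), thread count (4), and field names are illustrative assumptions, not
values from this thread:

    import java.io.IOException;
    import java.net.MalformedURLException;

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class MultithreadedIndexer {
        public static void main(String[] args)
                throws MalformedURLException, SolrServerException, IOException {
            // Buffers up to 100 docs; 4 background threads drain the queue,
            // so even a single producer keeps several indexing connections
            // busy.  URL, queue size, and thread count are illustrative.
            StreamingUpdateSolrServer server =
                    new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);                    // hypothetical fields
                doc.addField("ocr", "OCR text for document " + i);
                server.add(doc);  // returns quickly; a background thread sends it
            }

            server.commit();      // drains the queue and commits once at the end
        }
    }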
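The RAM buffer and merge factor discussed above are set in solrconfig.xml.
A sketch using the ~512 MB figure from Mike's past tests (placement follows
the indexDefaults section of a Solr 1.4-era solrconfig.xml; the mergeFactor
value is only a placeholder):

    <indexDefaults>
      <!-- Indexing throughput reportedly peaks somewhere around 512 MB. -->
      <ramBufferSizeMB>512</ramBufferSizeMB>
      <!-- Higher values copy postings fewer times, but keep more segments
           (and file descriptors) open at once and spread OS readahead
           across more files.  10 is illustrative, not a recommendation. -->
      <mergeFactor>10</mergeFactor>
    </indexDefaults>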
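On the "same fields in the same order" point: routing every document through
a single builder is an easy way to guarantee a uniform stored-field layout,
which lets Lucene bulk-copy stored-field bytes during merges instead of
decoding and re-encoding them.  Field names here are hypothetical:

    import org.apache.solr.common.SolrInputDocument;

    public final class UniformDocBuilder {
        private UniformDocBuilder() {}

        // One code path for all documents keeps stored fields in a fixed order.
        public static SolrInputDocument build(String id, String title, String ocr) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);        // always first
            doc.addField("title", title);  // always second
            doc.addField("ocr", ocr);      // always last (the huge OCR field)
            return doc;
        }
    }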