RE: indexing best practices

Burton-West, Tom Mon, 19 Jul 2010 12:44:08 -0700

Hi Ken,

This is all very dependent on your documents, your indexing setup and your 
hardware. Just as an extreme data point, I'll describe our experience.


We run 5 clients on each of 6 machines to send documents to Solr using the 
standard HTTP XML update process.  Our documents contain about 10 fields, but 
one field contains the OCR for the full text of a book.  The documents are 
about 700KB in size.

Each client sends Solr documents to one of 10 Solr shards on a round-robin 
basis.  We are running 5 shards on each of two dedicated indexing machines, 
each with 144GB of memory and 2 x Quad Core Intel Xeon E5540 2.53GHz processors 
(Nehalem).  What we generally see is that once the index gets large enough for 
significant merging, our producers can send documents to Solr faster than it 
can index them.
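
For anyone who wants to try a similar layout, here is a minimal sketch of one 
client rotating documents across shards and wrapping each one in an XML <add> 
message. It is not our production code. The host names, shard paths, and field 
names are made up for illustration, and post_update is just one plausible way 
to do the HTTP POST with the Python standard library.

```python
from itertools import cycle
from urllib.request import Request, urlopen
from xml.sax.saxutils import escape, quoteattr

# Hypothetical update URLs: 5 shards on each of two indexing machines.
SHARD_URLS = [
    f"http://indexer{host}.example.org:8983/solr/shard{n}/update"
    for host in (1, 2)
    for n in range(1, 6)
]


def build_add_xml(doc):
    """Render one document (dict of field name -> value) as an <add> message."""
    fields = "".join(
        f"<field name={quoteattr(name)}>{escape(str(value))}</field>"
        for name, value in doc.items()
    )
    return f"<add><doc>{fields}</doc></add>"


def assign_round_robin(docs, shard_urls=SHARD_URLS):
    """Yield (shard_url, xml_body) pairs, rotating through the shards in order."""
    shards = cycle(shard_urls)
    for doc in docs:
        yield next(shards), build_add_xml(doc)


def post_update(url, xml_body, timeout=60):
    """POST one XML update message and return the HTTP status code."""
    request = Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urlopen(request, timeout=timeout) as response:
        return response.status
```

A client would loop over assign_round_robin(docs) and call post_update on each 
pair. Keeping the shard choice separate from the POST makes it easy to check 
the distribution without a running Solr instance.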

We suspect that our bottleneck is simply disk I/O for index merging on the Solr 
build machines.  We are currently experimenting with changing the 
ramBufferSizeMB setting and various merge policies/merge factors to see if we 
can speed up the Solr end of the indexing process.  Since we optimize our 
index down to two segments, we are also planning to experiment with using the 
"nomerge" merge policy.  I hope to have some results to report on our blog 
sometime in the next month or so.

Tom Burton-West
www.hathitrust.org/blogs

-----Original Message-----
From: kenf_nc [mailto:ken.fos...@realestate.com] 
Sent: Sunday, July 18, 2010 8:18 AM
To: solr-user@lucene.apache.org
Subject: Re: indexing best practices


Has no one done a performance analysis, or does anyone have a link to one that
has been done?

Basically, I'm looking for the fastest way to get documents into Solr. There
are so many options available; which is the fastest?
1) file import (xml, csv)  vs  DIH  vs  POSTing
2) number of concurrent clients: 1 vs 10 vs 100 ... is there a point of
diminishing returns?

I have 16 million small (8 to 10 fields, no large text fields) docs that get
updated monthly and 2.5 million largish (20 to 30 fields, a couple of HTML text
fields) docs that also get updated monthly. It currently takes about 20 hours
to do a full import. I would like to cut that down as much as possible.
Thanks,
Ken
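
To put a number on the baseline, the figures above work out to a rough average
throughput. This is only back-of-the-envelope arithmetic; it assumes both
document sets are loaded in the same 20-hour run, and the function name is
made up for illustration.

```python
def docs_per_second(total_docs, hours):
    """Average indexing rate over a full import."""
    return total_docs / (hours * 3600)


small_docs = 16_000_000   # 8 to 10 fields each
large_docs = 2_500_000    # 20 to 30 fields each
import_hours = 20

rate = docs_per_second(small_docs + large_docs, import_hours)
print(f"{rate:.0f} docs/sec")  # prints "257 docs/sec"
```

Any change to the import method or client count can be compared against that
average.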
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-best-practices-tp973274p976313.html
Sent from the Solr - User mailing list archive at Nabble.com.
