Hi Tom,
> >> Do you use multiple threads for indexing?  Large RAM buffer size is
> >> also good, but I think perf peaks out maybe around 512 MB (at least
> >> based on past tests)?
>
> We are using Solr, I'm not sure if Solr uses multiple threads for indexing.
> We have 30 "producers" each sending documents to 1 of 12 Solr shards on a
> round-robin basis.  So each shard will get multiple requests.

Solr itself doesn't use multiple threads for indexing, but you can easily do
that on the client side.  SolrJ's StreamingUpdateSolrServer is the simplest
thing to use for this (a minimal sketch follows below the quoted message).

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


----- Original Message ----
> From: "Burton-West, Tom" <tburt...@umich.edu>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Wed, October 6, 2010 9:57:12 PM
> Subject: RE: Experience with large merge factors
>
> Hi Mike,
>
> >> Do you use multiple threads for indexing?  Large RAM buffer size is
> >> also good, but I think perf peaks out maybe around 512 MB (at least
> >> based on past tests)?
>
> We are using Solr, I'm not sure if Solr uses multiple threads for indexing.
> We have 30 "producers" each sending documents to 1 of 12 Solr shards on a
> round-robin basis.  So each shard will get multiple requests.
>
> >> Believe it or not, merging is typically compute bound.  It's costly to
> >> decode & re-encode all the vInts.
>
> Sounds like we need to do some monitoring during merging to see what the
> CPU use is, and also the I/O wait, during large merges.
>
> >> Larger merge factor is good because it means the postings are copied
> >> fewer times, but it's bad because you could risk running out of
> >> descriptors, and, if the OS doesn't have enough RAM, you'll start to
> >> thin out the readahead that the OS can do (which makes the merge less
> >> efficient since the disk heads are seeking more).
>
> Is there a way to estimate the amount of RAM for the readahead?  Once we
> start the re-indexing we will be running 12 shards on a 16-processor box
> with 144 GB of memory.
>
> >> Do you do any deleting?
>
> Deletes would happen as a byproduct of updating a record.  This shouldn't
> happen too frequently during re-indexing, but we update records when a
> document gets re-scanned and re-OCR'd.  This would probably amount to a
> few thousand documents.
>
> >> Do you use stored fields and/or term vectors?  If so, try to make
> >> your docs "uniform" if possible, i.e. add the same fields in the same
> >> order.  This enables Lucene to use bulk byte copy merging under the hood.
>
> We use 4 or 5 stored fields.  They are very small compared to our huge OCR
> field.  Since we construct our Solr documents programmatically, I'm fairly
> certain that they are always in the same order.  I'll have to look at the
> code when I get back to make sure.
>
> We aren't using term vectors now, but we plan to add them, as well as a
> number of fields based on MARC (cataloging) metadata, in the future.
>
> Tom
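
P.S. A minimal sketch of the multithreaded client-side indexing mentioned
above, using SolrJ's StreamingUpdateSolrServer.  The shard URL, queue size
(100), thread count (4), and field names are illustrative assumptions, not
values from this thread:

    import java.io.IOException;
    import java.net.MalformedURLException;

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class MultithreadedIndexer {
        public static void main(String[] args)
                throws MalformedURLException, SolrServerException, IOException {
            // Buffers up to 100 docs; 4 background threads drain the queue,
            // so even a single producer keeps several indexing connections
            // busy.  URL, queue size, and thread count are illustrative.
            StreamingUpdateSolrServer server =
                    new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);                    // hypothetical fields
                doc.addField("ocr", "OCR text for document " + i);
                server.add(doc);  // returns quickly; a background thread sends it
            }

            server.commit();      // drains the queue and commits once at the end
        }
    }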
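The RAM buffer and merge factor discussed above are set in solrconfig.xml.
A sketch using the ~512 MB figure from Mike's past tests (placement follows
the indexDefaults section of a Solr 1.4-era solrconfig.xml; the mergeFactor
value is only a placeholder):

    <indexDefaults>
      <!-- Indexing throughput reportedly peaks somewhere around 512 MB. -->
      <ramBufferSizeMB>512</ramBufferSizeMB>
      <!-- Higher values copy postings fewer times, but keep more segments
           (and file descriptors) open at once and spread OS readahead
           across more files.  10 is illustrative, not a recommendation. -->
      <mergeFactor>10</mergeFactor>
    </indexDefaults>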
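On the "same fields in the same order" point: routing every document through
a single builder is an easy way to guarantee a uniform stored-field layout,
which lets Lucene bulk-copy stored-field bytes during merges instead of
decoding and re-encoding them.  Field names here are hypothetical:

    import org.apache.solr.common.SolrInputDocument;

    public final class UniformDocBuilder {
        private UniformDocBuilder() {}

        // One code path for all documents keeps stored fields in a fixed order.
        public static SolrInputDocument build(String id, String title, String ocr) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);        // always first
            doc.addField("title", title);  // always second
            doc.addField("ocr", ocr);      // always last (the huge OCR field)
            return doc;
        }
    }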