Thanks for the quick reply. The box has 8 real CPUs, so it's probably a good idea to reduce the number of cores to 8 as well. I'm also testing a different scenario with multiple boxes, where clients persist docs to multiple cores on multiple boxes (which is what multicore was invented for, after all).
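
If I go down to 8 cores, the solr.xml I have in mind would look roughly like this (core names and instanceDirs are just placeholders here, not my actual layout):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="core0" instanceDir="core0" />
      <core name="core1" instanceDir="core1" />
      <core name="core2" instanceDir="core2" />
      <core name="core3" instanceDir="core3" />
      <core name="core4" instanceDir="core4" />
      <core name="core5" instanceDir="core5" />
      <core name="core6" instanceDir="core6" />
      <core name="core7" instanceDir="core7" />
    </cores>
  </solr>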
I set maxBufferedDocs this low (and used it instead of ramBufferSizeMB) because I was worried about the impact on RAM and wanted to get a grip on when docs were persisted to disk. I'm still not sure it matters much for the large amount of RAM consumed; this can't all be coming from buffering docs, can it? On the other hand, maxBufferedDocs (20) is set per core, so in total at most 200 docs are buffered. That's admittedly on the low side, but I've got some draconian docs here.. ;-) I will try ramBufferSizeMB and set it higher (a rough sketch of what I mean is in the P.S. at the very bottom of this mail), but I first have to get a grip on why RAM usage is maxed out all the time before this will make any difference, I guess.

Thanks, and please keep the suggestions coming.

Britske.


Otis Gospodnetic wrote:
>
> Britske,
>
> Here are a few quick ones:
>
> - Does that machine really have 10 CPU cores? If it has significantly
> less, you may be beyond the "indexing sweet spot" in terms of indexer
> threads vs. CPU cores.
>
> - Your maxBufferedDocs is super small. Comment that out anyway; use
> ramBufferedSizeMB and set it as high as you can afford. No need to commit
> very often, certainly no need to flush or optimize until the end.
>
> There is a page about indexing performance on either the Solr or Lucene
> Wiki that will help.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
>> From: Britske <gbr...@gmail.com>
>> To: solr-user@lucene.apache.org
>> Sent: Wednesday, March 25, 2009 10:05:17 AM
>> Subject: speeding up indexing with a LOT of indexed fields
>>
>> hi,
>>
>> I'm having difficulty indexing a collection of documents in a reasonable
>> time. It's now going at 20 docs/sec on a c1.xlarge instance of Amazon EC2,
>> which just isn't enough. This box has 8 GB RAM and the equivalent of 20
>> Xeon processors.
>>
>> These documents have a couple of stored, indexed, multi- and single-valued
>> fields, but the main problem lies in them having about 1500 indexed fields
>> of type sint, range [0,10000]. (Yes, I know this is a lot.)
>>
>> I'm looking for some guidance as to what strategies to try out to improve
>> indexing throughput. I could slam in some more servers (I will), but my
>> feeling tells me I can get more out of this.
>>
>> Some additional info:
>> - I'm indexing to 10 cores in parallel. This is done because:
>>   - at query time, 1 particular index will always fulfill all requests,
>>     so we can prune the search space to 1/10th of its original size.
>>   - each document as represented in a core is actually 1/10th of a
>>     'conceptual' document (which would contain up to 15,000 indexed
>>     fields) if I indexed to 1 core. Indexing as 1 doc containing 15,000
>>     indexed fields proved to give far worse results in searching and
>>     indexing than the solution I'm going with now.
>>   - the alternative of simply putting all docs, with 1500 indexed fields
>>     each, in the same core isn't really possible either, because this
>>     quickly results in OOM errors when sorting on a couple of fields.
>>     (Even though 9/10th of all docs in this case would not have the field
>>     sorted on, they would still end up in a Lucene fieldCache for this
>>     field.)
>>
>> - To be clear: the 20 docs/second means 2 docs/second/core, or 2
>>   'conceptual' docs/second overall.
>>
>> - Each core has maxBufferedDocs ~20 and mergeFactor ~10. (I actually set
>>   them differently for each partition so that merges of different
>>   partitions don't all happen at the same time. This seemed to help a
>>   bit.)
>>
>> - I'm running the JVM with -server -Xmx6000M -Xms6000M -XX:+UseParallelGC
>>   -XX:+CMSPermGenSweepingEnabled -XX:MaxPermSize=128M to leave room for
>>   disk caching.
>>
>> - I'm spreading the 10 indices over 2 physical disks: 5 to /dev/sda1, 5 to
>>   /dev/sdb.
>>
>> Observations:
>> - Within minutes after feeding starts, the server reaches its max RAM.
>> - Until then the processors are running at ~70%.
>> - Although I throw in a commit at random intervals (between 600 and 800
>>   seconds, again so as not to commit all partitions at the same time), the
>>   JVM just keeps eating all the RAM.
>> - Not a lot seems to be happening on disk (using dstat) while the RAM
>>   hasn't maxed out yet. Obviously, afterwards the disk is flooded with
>>   swapping.
>>
>> Questions:
>> - Is there a good reason why all RAM stays occupied even though I commit
>>   regularly? Perhaps fieldCaches get populated when indexing? I guess not,
>>   but I'm not sure what else could explain this.
>>
>> - Would splitting the 'conceptual docs' into even more partitions help at
>>   indexing time? From an application standpoint it's possible; it just
>>   requires some work, and it's hard to compare figures, so I'd like to
>>   know if it's worth it.
>>
>> - How is a flush different from a commit, and would it help in getting the
>>   RAM usage down?
>>
>> - Because all 15,000 indexed fields look very similar in structure (they
>>   are all sints in [0,10000] to start with), I was looking for more
>>   efficient ways to get them into an index using some low-level indexing
>>   operations. For example: for given documents X and Y and indexed fields
>>   1,2,...,i,...,N, if X.a < Y.a then this ordering in a lot of cases also
>>   holds for fields 2,...,N. Because of these special properties I could
>>   possibly create a sorting algorithm that takes advantage of this and
>>   thus would make indexing faster. Would even considering this path be
>>   useful, given that it would obviously involve some work to make it work,
>>   and presumably a lot more work to get it to go faster than out of the
>>   box?
>>
>> - Lastly: should I be able to get more out of this box, or am I just
>>   complaining? ;-)
>>
>> Thanks for making it to here,
>> and hoping to receive some valuable info,
>>
>> Cheers,
>> Britske
>> --
>> View this message in context:
>> http://www.nabble.com/speeding-up-indexing-with-a-LOT-of-indexed-fields-tp22702364p22702364.html
>> Sent from the Solr - User mailing list archive at Nabble.com.

--
View this message in context:
http://www.nabble.com/speeding-up-indexing-with-a-LOT-of-indexed-fields-tp22702364p22704710.html
Sent from the Solr - User mailing list archive at Nabble.com.
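
P.S. For completeness, this is roughly the solrconfig.xml change per core that I mean above, i.e. dropping maxBufferedDocs in favour of the RAM-based buffer. The element is actually called ramBufferSizeMB, and the 64 MB value below is just a starting guess I still have to verify:

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <!-- was: <maxBufferedDocs>20</maxBufferedDocs> -->
    <ramBufferSizeMB>64</ramBufferSizeMB>
  </indexDefaults>

As far as I understand, values set in <mainIndex> override <indexDefaults>, so if <mainIndex> also sets these, the change would have to go there as well.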