Thanks for the quick reply. The box has 8 real CPUs, so it's probably a good idea to reduce the number of cores to 8 as well. I'm also testing a different scenario with multiple boxes, where clients persist docs to multiple cores on multiple boxes (which is what multicore was invented for, after all).
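
If I go down to 8 cores, the solr.xml I have in mind would look roughly like this (core names and instanceDirs are just placeholders here, not my actual layout):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="core0" instanceDir="core0" />
      <core name="core1" instanceDir="core1" />
      <core name="core2" instanceDir="core2" />
      <core name="core3" instanceDir="core3" />
      <core name="core4" instanceDir="core4" />
      <core name="core5" instanceDir="core5" />
      <core name="core6" instanceDir="core6" />
      <core name="core7" instanceDir="core7" />
    </cores>
  </solr>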
I set maxBufferedDocs this low (and used it instead of ramBufferSizeMB) because I was worried about the impact on RAM and wanted to get a grip on when docs were persisted to disk. I'm still not sure it matters much for the large amount of RAM consumed; this can't all be coming from buffering docs, can it? On the other hand, maxBufferedDocs (20) is set per core, so in total at most 200 docs are buffered. That's admittedly on the low side, but I've got some draconian docs here.. ;-) I will try ramBufferSizeMB and set it higher (a rough sketch of what I mean is in the P.S. at the very bottom of this mail), but I first have to get a grip on why RAM usage is maxed out all the time before this will make any difference, I guess.

Thanks, and please keep the suggestions coming.

Britske.


Otis Gospodnetic wrote:
>
> Britske,
>
> Here are a few quick ones:
>
> - Does that machine really have 10 CPU cores? If it has significantly
> less, you may be beyond the "indexing sweet spot" in terms of indexer
> threads vs. CPU cores.
>
> - Your maxBufferedDocs is super small. Comment that out anyway; use
> ramBufferedSizeMB and set it as high as you can afford. No need to commit
> very often, certainly no need to flush or optimize until the end.
>
> There is a page about indexing performance on either the Solr or Lucene
> Wiki that will help.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
>> From: Britske <gbr...@gmail.com>
>> To: solr-user@lucene.apache.org
>> Sent: Wednesday, March 25, 2009 10:05:17 AM
>> Subject: speeding up indexing with a LOT of indexed fields
>>
>> hi,
>>
>> I'm having difficulty indexing a collection of documents in a reasonable
>> time. It's now going at 20 docs/sec on a c1.xlarge instance of Amazon EC2,
>> which just isn't enough. This box has 8 GB RAM and the equivalent of 20
>> Xeon processors.
>>
>> These documents have a couple of stored, indexed, multi- and single-valued
>> fields, but the main problem lies in them having about 1500 indexed fields
>> of type sint, range [0,10000]. (Yes, I know this is a lot.)
>>
>> I'm looking for some guidance as to what strategies to try out to improve
>> indexing throughput. I could slam in some more servers (I will), but my
>> feeling tells me I can get more out of this.
>>
>> Some additional info:
>> - I'm indexing to 10 cores in parallel. This is done because:
>>   - at query time, 1 particular index will always fulfill all requests,
>>     so we can prune the search space to 1/10th of its original size.
>>   - each document as represented in a core is actually 1/10th of a
>>     'conceptual' document (which would contain up to 15,000 indexed
>>     fields) if I indexed to 1 core. Indexing as 1 doc containing 15,000
>>     indexed fields proved to give far worse results in searching and
>>     indexing than the solution I'm going with now.
>>   - the alternative of simply putting all docs, with 1500 indexed fields
>>     each, in the same core isn't really possible either, because this
>>     quickly results in OOM errors when sorting on a couple of fields.
>>     (Even though 9/10th of all docs in this case would not have the field
>>     sorted on, they would still end up in a Lucene fieldCache for this
>>     field.)
>>
>> - To be clear: the 20 docs/second means 2 docs/second/core, or 2
>>   'conceptual' docs/second overall.
>>
>> - Each core has maxBufferedDocs ~20 and mergeFactor ~10. (I actually set
>>   them differently for each partition so that merges of different
>>   partitions don't all happen at the same time. This seemed to help a
>>   bit.)
>>
>> - I'm running the JVM with -server -Xmx6000M -Xms6000M -XX:+UseParallelGC
>>   -XX:+CMSPermGenSweepingEnabled -XX:MaxPermSize=128M to leave room for
>>   disk caching.
>>
>> - I'm spreading the 10 indices over 2 physical disks: 5 to /dev/sda1, 5 to
>>   /dev/sdb.
>>
>> Observations:
>> - Within minutes after feeding starts, the server reaches its max RAM.
>> - Until then the processors are running at ~70%.
>> - Although I throw in a commit at random intervals (between 600 and 800
>>   seconds, again so as not to commit all partitions at the same time), the
>>   JVM just keeps eating all the RAM.
>> - Not a lot seems to be happening on disk (using dstat) while the RAM
>>   hasn't maxed out yet. Obviously, afterwards the disk is flooded with
>>   swapping.
>>
>> Questions:
>> - Is there a good reason why all RAM stays occupied even though I commit
>>   regularly? Perhaps fieldCaches get populated when indexing? I guess not,
>>   but I'm not sure what else could explain this.
>>
>> - Would splitting the 'conceptual docs' into even more partitions help at
>>   indexing time? From an application standpoint it's possible; it just
>>   requires some work, and it's hard to compare figures, so I'd like to
>>   know if it's worth it.
>>
>> - How is a flush different from a commit, and would it help in getting the
>>   RAM usage down?
>>
>> - Because all 15,000 indexed fields look very similar in structure (they
>>   are all sints in [0,10000] to start with), I was looking for more
>>   efficient ways to get them into an index using some low-level indexing
>>   operations. For example: for given documents X and Y and indexed fields
>>   1,2,...,i,...,N, if X.a < Y.a then this ordering in a lot of cases also
>>   holds for fields 2,...,N. Because of these special properties I could
>>   possibly create a sorting algorithm that takes advantage of this and
>>   thus would make indexing faster. Would even considering this path be
>>   useful, given that it would obviously involve some work to make it work,
>>   and presumably a lot more work to get it to go faster than out of the
>>   box?
>>
>> - Lastly: should I be able to get more out of this box, or am I just
>>   complaining? ;-)
>>
>> Thanks for making it to here,
>> and hoping to receive some valuable info,
>>
>> Cheers,
>> Britske
>> --
>> View this message in context:
>> http://www.nabble.com/speeding-up-indexing-with-a-LOT-of-indexed-fields-tp22702364p22702364.html
>> Sent from the Solr - User mailing list archive at Nabble.com.

--
View this message in context:
http://www.nabble.com/speeding-up-indexing-with-a-LOT-of-indexed-fields-tp22702364p22704710.html
Sent from the Solr - User mailing list archive at Nabble.com.
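
P.S. For completeness, this is roughly the solrconfig.xml change per core that I mean above, i.e. dropping maxBufferedDocs in favour of the RAM-based buffer. The element is actually called ramBufferSizeMB, and the 64 MB value below is just a starting guess I still have to verify:

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <!-- was: <maxBufferedDocs>20</maxBufferedDocs> -->
    <ramBufferSizeMB>64</ramBufferSizeMB>
  </indexDefaults>

As far as I understand, values set in <mainIndex> override <indexDefaults>, so if <mainIndex> also sets these, the change would have to go there as well.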