Britske,

Here are a few quick ones:

- Does that machine really have 10 CPU cores?  If it has significantly fewer, 
you may be beyond the "indexing sweet spot" in terms of indexer threads vs. CPU 
cores.

- Your maxBufferedDocs is super small.  Comment that out anyway and use 
ramBufferSizeMB instead, set as high as you can afford.  There's no need to 
commit very often, and certainly no need to flush or optimize until the end.
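For example, in solrconfig.xml -- a minimal sketch, where the 256 is just a 
placeholder, size it to whatever your heap can spare:

  <indexDefaults>
    <!-- flush segments by RAM usage instead of by doc count -->
    <!-- <maxBufferedDocs>20</maxBufferedDocs> -->
    <ramBufferSizeMB>256</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
  </indexDefaults>

With a RAM-based flush you get fewer, larger segments per flush, and therefore 
fewer merges during the run.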

There is a page about indexing performance on either the Solr or the Lucene 
wiki that will help.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Britske <gbr...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, March 25, 2009 10:05:17 AM
> Subject: speeding up indexing with a LOT of indexed fields
> 
> 
> Hi, 
> 
> I'm having difficulty indexing a collection of documents in a reasonable
> time. 
> It's now going at 20 docs/sec on a c1.xlarge instance of Amazon EC2, which
> just isn't enough. 
> This box has 8 GB of RAM and the equivalent of 20 Xeon processors. 
> 
> These documents have a couple of stored, indexed, multi- and single-valued
> fields, but the main problem lies in their having about 1500 indexed fields
> of type sint, range [0,10000].  (Yes, I know this is a lot.) 
> 
> I'm looking for some guidance as to what strategies to try out to improve
> indexing throughput. I could slam in some more servers (and I will), but my
> feeling is that I can get more out of this box.
> 
> Some additional info: 
> - I'm indexing to 10 cores in parallel.  This is done because: 
>       - at query time, one particular index will always fulfill all
> requests, so we can prune the search space to 1/10th of its original size. 
>       - each document as represented in a core is actually 1/10th of a
> 'conceptual' document (which would contain up to 15,000 indexed fields if I
> indexed to 1 core). Indexing as 1 doc containing 15,000 indexed fields
> proved to give far worse results in searching and indexing than the
> solution I'm going with now. 
>       - the alternative of simply putting all docs, with 1500 indexed
> fields each, in the same core isn't really possible either, because this
> quickly results in OOM errors when sorting on a couple of fields. (Even
> though 9/10ths of all docs in this case would not have the field being
> sorted on, they would still end up in the Lucene FieldCache for that
> field.) 
> 
> - To be clear: the 20 docs/second means 2 docs/second/core, or 2
> 'conceptual' docs/second overall. 
> 
> - Each core has maxBufferedDocs ~20 and mergeFactor ~10.  (I actually set
> them differently for each partition so that merges of different partitions
> don't all happen at the same time. This seemed to help a bit.) 
> 
> - I'm running the JVM with -server -Xmx6000M -Xms6000M -XX:+UseParallelGC
> -XX:+CMSPermGenSweepingEnabled -XX:MaxPermSize=128M to leave room for disk
> caching. 
> 
> - I'm spreading the 10 indices over 2 physical disks: 5 on /dev/sda1 and 5
> on /dev/sdb. 
> 
> 
> Observations: 
> - Within minutes after feeding starts, the server reaches its max RAM. 
> - Until then, the processors are running at ~70%. 
> - Although I throw in a commit at random intervals (between 600 and 800
> secs, again so as not to commit all partitions at the same time), the JVM
> just keeps eating all the RAM. 
> - Not a lot seems to be happening on disk (observed using dstat) while the
> RAM hasn't maxed out yet. Obviously, afterwards the disk is flooded with
> swapping. 
> 
> Questions: 
> - Is there a good reason why all the RAM stays occupied even though I
> commit regularly? Perhaps field caches get populated during indexing? I
> guess not, but I'm not sure what else could explain this. 
> 
> - Would splitting the 'conceptual docs' into even more partitions help at
> indexing time? From an application standpoint it's possible; it just
> requires some work, and it's hard to compare figures, so I'd like to know
> if it's worth it. 
> 
> - How is a flush different from a commit, and would it help in getting the
> RAM usage down? 
> 
> - Because all 15,000 indexed fields look very similar in structure (they
> are all sints in [0,10000] to start with), I was looking for more
> efficient ways to get them into an index using some low-level indexing
> operations. For example: for given documents X and Y and indexed fields
> 1, 2, ..., i, ..., N, if X.1 < Y.1 then in a lot of cases this ordering
> also holds for fields 2, ..., N. Because of these special properties I
> could possibly create a sorting algorithm that takes advantage of this and
> thus make indexing faster. 
> Would even considering this path be useful? Obviously it would involve
> some work to make it work, and presumably a lot more work to get it to go
> faster than out of the box. 
> 
> - Lastly: should I be able to get more out of this box, or am I just
> complaining? ;-) 
> 
> Thanks for making it all the way down here, 
> and hoping to receive some valuable info, 
> 
> Cheers, 
> Britske
> -- 
> View this message in context: 
> http://www.nabble.com/speeding-up-indexing-with-a-LOT-of-indexed-fields-tp22702364p22702364.html
> Sent from the Solr - User mailing list archive at Nabble.com.
