Hi, I'm having difficulty indexing a collection of documents in a reasonable time. It's currently going at 20 docs/sec on a c1.xlarge instance of Amazon EC2, which just isn't enough. This box has 8 GB of RAM and the equivalent of 20 Xeon processors.
These documents have a couple of stored, indexed, multi- and single-valued fields, but the main problem is that each has about 1500 indexed fields of type sint, with values in the range [0,10000]. (Yes, I know this is a lot.) I'm looking for guidance on which strategies to try to improve indexing throughput. I could throw in some more servers (and I will), but my feeling is that I can get more out of this box.

Some additional info:
- I'm indexing to 10 cores in parallel. This is done because:
  - at query time, one particular index will always fulfill all requests, so we can prune the search space to 1/10th of its original size;
  - each document as represented in a core is actually 1/10th of a 'conceptual' document (which would contain up to 15,000 indexed fields if I indexed to one core). Indexing as one doc containing 15,000 indexed fields gave far worse results in searching and indexing than the solution I'm going with now;
  - the alternative of simply putting all docs, with 1500 indexed fields each, in the same core isn't really possible either, because this quickly results in OOM errors when sorting on a couple of fields. (Even though 9/10ths of all docs in this case would not have the field being sorted on, they would still end up in a Lucene FieldCache for that field.)
- To be clear: 20 docs/second means 2 docs/second/core, or 2 'conceptual' docs/second overall.
- Each core has maxBufferedDocs ~20 and mergeFactor ~10. (I actually set them differently for each partition so that merges of different partitions don't all happen at the same time. This seemed to help a bit.)
- I'm running the JVM with -server -Xmx6000M -Xms6000M -XX:+UseParallelGC -XX:+CMSPermGenSweepingEnabled -XX:MaxPermSize=128M, to leave room for disk caching.
- I'm spreading the 10 indices over 2 physical disks: 5 on /dev/sda1, 5 on /dev/sdb.

Observations:
- Within minutes after I start feeding, the server reaches its maximum RAM.
- Until then, the processors run at ~70%.
- Although I throw in a commit at random intervals (between 600 and 800 seconds, again so as not to commit all partitions at the same time), the JVM just keeps eating all the RAM.
- Not a lot seems to be happening on disk (per dstat) until the RAM has maxed out. Obviously, afterwards the disk is flooded with swapping.

Questions:
- Is there a good reason why all the RAM stays occupied even though I commit regularly? Perhaps field caches get populated during indexing? I guess not, but I'm not sure what else could explain this.
- Would splitting the 'conceptual' docs into even more partitions help at indexing time? From an application standpoint it's possible; it just requires some work, and it's hard to compare figures, so I'd like to know whether it's worth it.
- How is a flush different from a commit, and would flushing help in getting the RAM usage down?
- Because all 15,000 indexed fields look very similar in structure (they are all sints in [0,10000] to start with), I was looking for more efficient ways to get them into an index using some low-level indexing operations. For example: for given documents X and Y and indexed fields 1, 2, ..., i, ..., N: if X.a < Y.a for some field a, then in a lot of cases this ordering also holds for fields 2, ..., N. Because of these special properties I could possibly create a sorting algorithm that takes advantage of this and thus would make indexing faster. Is this path even worth considering? Obviously it would involve some work to make it work, and presumably a lot more work to get it to go faster than out of the box.
- Lastly: should I be able to get more out of this box, or am I just complaining? ;-)

Thanks for making it to here, and hoping to receive some valuable info.

Cheers,
Britske

--
View this message in context: http://www.nabble.com/speeding-up-indexing-with-a-LOT-of-indexed-fields-tp22702364p22702364.html
Sent from the Solr - User mailing list archive at Nabble.com.
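PS: to make the FieldCache/OOM point above concrete, here's a back-of-envelope sketch. The doc and field counts are made-up numbers, and it assumes (as I understand Lucene's behavior) that the FieldCache for an int/sint sort field holds one 4-byte entry per document in the index, whether or not the document actually has a value for that field:

```java
public class FieldCacheEstimate {
    // Rough estimate: one 4-byte entry per document, per field sorted on,
    // regardless of how many documents actually carry the field.
    static long fieldCacheBytes(long numDocs, int numSortFields) {
        return numDocs * 4L * numSortFields;
    }

    public static void main(String[] args) {
        // Hypothetical: 10 million docs in one core, sorting on 20 fields.
        long bytes = fieldCacheBytes(10000000L, 20);
        System.out.println(bytes / (1024 * 1024) + " MB"); // prints 762 MB
    }
}
```

So even a modest number of sort fields eats hundreds of MB once all docs share a core, which is why I split them across cores in the first place.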