Hi, I'm having difficulty indexing a collection of documents in a reasonable time. It's currently going at 20 docs/sec on a c1.xlarge instance of Amazon EC2, which just isn't enough. This box has 8 GB of RAM and the equivalent of 20 Xeon processors.
These documents have a couple of stored, indexed, multi- and single-valued fields, but the main problem is that each has about 1500 indexed fields of type sint, with values in the range [0,10000]. (Yes, I know this is a lot.) I'm looking for guidance on which strategies to try to improve indexing throughput. I could throw in some more servers (and I will), but my feeling is that I can get more out of this box.

Some additional info:
- I'm indexing to 10 cores in parallel. This is done because:
  - at query time, one particular index will always fulfill all requests, so we can prune the search space to 1/10th of its original size;
  - each document as represented in a core is actually 1/10th of a 'conceptual' document (which would contain up to 15,000 indexed fields if I indexed to one core). Indexing as one doc containing 15,000 indexed fields gave far worse results in searching and indexing than the solution I'm going with now;
  - the alternative of simply putting all docs, with 1500 indexed fields each, in the same core isn't really possible either, because this quickly results in OOM errors when sorting on a couple of fields. (Even though 9/10ths of all docs in this case would not have the field being sorted on, they would still end up in a Lucene FieldCache for that field.)
- To be clear: 20 docs/second means 2 docs/second/core, or 2 'conceptual' docs/second overall.
- Each core has maxBufferedDocs ~20 and mergeFactor ~10. (I actually set them differently for each partition so that merges of different partitions don't all happen at the same time. This seemed to help a bit.)
- I'm running the JVM with -server -Xmx6000M -Xms6000M -XX:+UseParallelGC -XX:+CMSPermGenSweepingEnabled -XX:MaxPermSize=128M, to leave room for disk caching.
- I'm spreading the 10 indices over 2 physical disks: 5 on /dev/sda1, 5 on /dev/sdb.

Observations:
- Within minutes after I start feeding, the server reaches its maximum RAM.
- Until then, the processors run at ~70%.
- Although I throw in a commit at random intervals (between 600 and 800 seconds, again so as not to commit all partitions at the same time), the JVM just keeps eating all the RAM.
- Not a lot seems to be happening on disk (per dstat) until the RAM has maxed out. Obviously, afterwards the disk is flooded with swapping.

Questions:
- Is there a good reason why all the RAM stays occupied even though I commit regularly? Perhaps field caches get populated during indexing? I guess not, but I'm not sure what else could explain this.
- Would splitting the 'conceptual' docs into even more partitions help at indexing time? From an application standpoint it's possible; it just requires some work, and it's hard to compare figures, so I'd like to know whether it's worth it.
- How is a flush different from a commit, and would flushing help in getting the RAM usage down?
- Because all 15,000 indexed fields look very similar in structure (they are all sints in [0,10000] to start with), I was looking for more efficient ways to get them into an index using some low-level indexing operations. For example: for given documents X and Y and indexed fields 1, 2, ..., i, ..., N: if X.a < Y.a for some field a, then in a lot of cases this ordering also holds for fields 2, ..., N. Because of these special properties I could possibly create a sorting algorithm that takes advantage of this and thus would make indexing faster. Is this path even worth considering? Obviously it would involve some work to make it work, and presumably a lot more work to get it to go faster than out of the box.
- Lastly: should I be able to get more out of this box, or am I just complaining? ;-)

Thanks for making it to here, and hoping to receive some valuable info.

Cheers,
Britske

--
View this message in context: http://www.nabble.com/speeding-up-indexing-with-a-LOT-of-indexed-fields-tp22702364p22702364.html
Sent from the Solr - User mailing list archive at Nabble.com.
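PS: to make the FieldCache/OOM point above concrete, here's a back-of-envelope sketch. The doc and field counts are made-up numbers, and it assumes (as I understand Lucene's behavior) that the FieldCache for an int/sint sort field holds one 4-byte entry per document in the index, whether or not the document actually has a value for that field:

```java
public class FieldCacheEstimate {
    // Rough estimate: one 4-byte entry per document, per field sorted on,
    // regardless of how many documents actually carry the field.
    static long fieldCacheBytes(long numDocs, int numSortFields) {
        return numDocs * 4L * numSortFields;
    }

    public static void main(String[] args) {
        // Hypothetical: 10 million docs in one core, sorting on 20 fields.
        long bytes = fieldCacheBytes(10000000L, 20);
        System.out.println(bytes / (1024 * 1024) + " MB"); // prints 762 MB
    }
}
```

So even a modest number of sort fields eats hundreds of MB once all docs share a core, which is why I split them across cores in the first place.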