These are the reasons why we are thinking of splitting an index via multi-core:

First of all, we have an index of news whose size is about 9G. Since we will keep aggregating news forever and let users do free text search on our system, we think it will be easier for the IT crowd to manage fixed-size (read-only) indexes, which gives the platform more flexibility (I'm wondering how much performance we would lose if the read-only indexes lived on NFS).

Secondly, we plan to store date ranges per core; then, when a federated search is made, it filters down the cores to query (we plan to install multiple Solr servers as the data grows). A rough sketch of that core-selection idea is at the end of this message.

2009/8/26 Chris Hostetter <hossman_luc...@fucit.org>:
>
> : 1) We found the indexing speed starts dipping once the index grows to a
> : certain size - in our case around 50G. We don't optimize, but we have
> : to maintain a consistent index speed. The only way we could do that
> : was keep creating new cores (on the same box, though we do use
>
> Hmmm... it seems like ConcurrentMergeScheduler should make it possible to
> maintain semi-constant indexing speed by doing merges in background
> threads ... the only other issue would be making sure that an individual
> segment never got too big ... but that seems like it should be manageable
> with the config options
>
> (I'm just hypothesizing, I don't normally worry about indexes of this
> size, and when I do I'm not incrementally adding to them as time goes on
> ... I guess what I'm asking is whether you guys ever looked into these
> ideas and dismissed them for some reason)
>
> : 2) Be able to drop the whole core for pruning purposes. We didn't want
>
> that makes a lot of sense ... removing older cores is one of the only
> reasons I could think of for this model to really make sense for
> performance.
>
> : > One problem is the IT logistics of handling the file set. At 200 million
> : > records you have at least 20G of data in one Lucene index. It takes hours
> : > to optimize this, and 10s of minutes to copy the optimized index around
> : > to query servers.
>
> I get that full optimizes become ridiculous at that point, but you could
> still do partial optimizes ... and isn't the total disk space with this
> strategy still the same? Aren't you still ultimately copying the same
> amount of data around?
>
>
>
> -Hoss
>

--
Lici
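
Here is the core-selection sketch mentioned above: a minimal example in plain Java (not SolrJ) that picks the cores whose stored date range overlaps the requested range and builds a Solr shards parameter from them. The core names, host, and date ranges are made up for illustration.

import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class CoreRouter {

    // Hypothetical mapping of core name -> [first date, last date] of the news it holds.
    static final Map<String, LocalDate[]> CORE_RANGES = new LinkedHashMap<>();
    static {
        CORE_RANGES.put("news-2009-q1", new LocalDate[]{LocalDate.of(2009, 1, 1), LocalDate.of(2009, 3, 31)});
        CORE_RANGES.put("news-2009-q2", new LocalDate[]{LocalDate.of(2009, 4, 1), LocalDate.of(2009, 6, 30)});
        CORE_RANGES.put("news-2009-q3", new LocalDate[]{LocalDate.of(2009, 7, 1), LocalDate.of(2009, 9, 30)});
    }

    // Build a Solr "shards" parameter naming only the cores whose date range
    // overlaps the range the user is searching.
    static String shardsParam(String host, LocalDate from, LocalDate to) {
        StringJoiner shards = new StringJoiner(",");
        for (Map.Entry<String, LocalDate[]> e : CORE_RANGES.entrySet()) {
            LocalDate[] range = e.getValue();
            boolean overlaps = !range[0].isAfter(to) && !range[1].isBefore(from);
            if (overlaps) {
                shards.add(host + "/solr/" + e.getKey());
            }
        }
        return shards.toString();
    }

    public static void main(String[] args) {
        // Only news-2009-q2 and news-2009-q3 overlap May..August, so only they are queried.
        System.out.println(shardsParam("searchbox:8983",
                LocalDate.of(2009, 5, 1), LocalDate.of(2009, 8, 15)));
    }
}

In practice this selection would sit in front of the distributed search, so cores whose date range cannot match are never hit at all.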
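And for the ConcurrentMergeScheduler idea Hoss raises above, this is roughly what the knobs look like at the Lucene level. It is only a sketch: it uses the current IndexWriterConfig/TieredMergePolicy API rather than the one that existed in 2009, and the thread counts, segment cap, and index path are example values, not tuned recommendations.

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class MergeTuning {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());

        // Run merges in background threads so indexing speed stays roughly constant.
        ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
        scheduler.setMaxMergesAndThreads(4, 2); // example limits, not tuned values
        cfg.setMergeScheduler(scheduler);

        // Keep any single segment from growing without bound.
        TieredMergePolicy policy = new TieredMergePolicy();
        policy.setMaxMergedSegmentMB(5 * 1024); // ~5GB cap, example only
        cfg.setMergePolicy(policy);

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/var/data/news-index")), cfg)) {
            // addDocument(...) calls go here; merges happen concurrently.
        }
    }
}

In Solr the same merge scheduler and merge policy settings would normally be set through the indexConfig section of solrconfig.xml rather than in code.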
First of all all, we have an index of news which size is about 9G. As we will keep aggregating news forever and ever and let users do free text search on our system, we think that it will be easier for IT crowd to manage fixed size indexes (read-only indexes) giving flexibility to the plattform (i'm wondering how much performance will we lose if read-only indexes live in NFS). Secondly, we plan to store date ranges per core, then, when a federated search is made it filter the cores to query on (we plan to install multiple solr servers as the info growth) 2009/8/26 Chris Hostetter <hossman_luc...@fucit.org>: > > : 1) We found the indexing speed starts dipping once the index grow to a > : certain size - in our case around 50G. We don't optimize, but we have > : to maintain a consistent index speed. The only way we could do that > : was keep creating new cores (on the same box, though we do use > > Hmmm... it seems like ConcurrentMergeScheduler should make it possible to > maintain semi-constant indexing speed by doing merges in background > threads ... the only other issue would be making sure that an individual > segment never got too big ... but that seems like it should be managable > with the config options > > (i'm just hypothisizing, i don't normally worry about indexes of this > size, and when i do i'm not incrementally adding to them as time goes one > ... i guess what i'm asking is if you guys ever looked into these ideas > and dissmissed them for some reason) > > : 2) Be able to drop the whole core for pruning purposes. We didn't want > > that makes a lot of sense ... removing older cores is on of the only > reaosns i could think of for this model to really make a lot of sense for > performance reasons. > > : > One problem is the IT logistics of handling the file set. At 200 million > : > records you have at least 20G of data in one Lucene index. It takes hours > to > : > optimize this, and 10s of minutes to copy the optimized index around to > : > query servers. > > i get that full optimizes become ridiculous at that point, but you could > still do partial optimizes ... and isn't the total disk space with this > strategy still the same? Aren't you still ultimately copying the same > amout of data arround? > > > > -Hoss > > -- Lici