I think managing 100 cores will be too much of a headache. Also, query performance across 100 cores will not be good (each request needs the top page_number*page_size docs from every one of the 100 cores, which then have to be merged).
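For example, with the shards parameter a distributed request has to fan out to every core, and each core has to return its own top page_number*page_size (plus page_size) entries for the coordinator to merge. Roughly, in SolrJ (host names, core names and the 10-cores-per-host layout below are made up, just to illustrate the cost):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HundredCoreQuery {
    public static void main(String[] args) throws Exception {
        // Any core can coordinate a distributed request.
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://solr1:8983/solr/core0");

        // Hypothetical shard list: 100 cores spread over 10 hosts.
        StringBuilder shards = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            if (i > 0) shards.append(",");
            shards.append("solr").append((i % 10) + 1)
                  .append(":8983/solr/core").append(i);
        }

        int pageNumber = 50;  // a deep page makes the merge cost obvious
        int pageSize = 20;

        SolrQuery q = new SolrQuery("*:*");
        q.set("shards", shards.toString());
        // Each of the 100 cores must return its top
        // pageNumber * pageSize + pageSize entries so the
        // coordinator can merge them and pick the final page.
        q.setStart(pageNumber * pageSize);
        q.setRows(pageSize);

        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}

So even one page deep in the results turns into roughly 100 * (start + rows) doc ids pulled and sorted on the coordinating node.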
I think having around 10 SOLR instances, each holding about 10M docs, would be more manageable. Always search all 10 nodes, and use some hash(doc) at index time to distribute new docs among the nodes. Run a nightly or weekly job to delete old docs and force merge (optimize) down to some min/max number of segments (rough sketch of both at the bottom of this message, below the quoted thread). I think that will work OK, but I am not sure how to handle replication/failover so that each node is redundant. If we use SOLR replication, it will have problems replicating after an optimize on an index this large; it seems to take a long time to move a 10M-doc index (around 100GB in our case) from master to slave. Doing it once per week is probably OK.

2011/12/15 Avni, Itamar <itamar.a...@verint.com>:
> What about managing a core for each day?
>
> That way the deletion/archiving is very simple, with no "holes" in the index (which often happen when deleting document by document).
> Indexing is done against core [today-0].
> Querying is done against cores [today-0],[today-1]...[today-99]. Quite a headache.
>
> Itamar
>
> -----Original Message-----
> From: Robert Stewart [mailto:bstewart...@gmail.com]
> Sent: Thursday, December 15, 2011 16:54
> To: solr-user@lucene.apache.org
> Subject: how to setup to archive expired documents?
>
> We have a large (100M doc) index to which we add about 1M new docs per day.
> We want to keep the index at a constant size, so the oldest docs are removed
> and/or archived each day (the index then contains around 100 days of data).
> What is the best way to do this? We still want to keep older data in some
> archive index, not just delete it (so is it possible to export older
> segments, etc. into some other index?). If we have a daily job to delete old
> data, I assume we would need to optimize the index to actually remove the
> docs and free the space, but that requires a very large (and slow)
> replication after the optimize, which will probably not work out well for an
> index this large. Is there some way to shard the data, or some other best
> practice?
>
> Thanks
> Bob
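P.S. Here is roughly what I mean by the hash routing and the cleanup job, in SolrJ. The node URLs, the indexed_date field, and the segment count are placeholders, not something we actually have set up:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class HashRoutingAndCleanup {

    // Hypothetical node list: one URL per SOLR instance.
    private static final String[] NODES = new String[10];
    static {
        for (int i = 0; i < NODES.length; i++) {
            NODES[i] = "http://solr" + (i + 1) + ":8983/solr/core0";
        }
    }

    // Pick a node from hash(doc id) so new docs spread evenly.
    static SolrServer nodeFor(String docId) throws Exception {
        int slot = (docId.hashCode() & 0x7fffffff) % NODES.length; // mask sign bit
        return new CommonsHttpSolrServer(NODES[slot]);
    }

    static void index(String docId, String body) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", docId);
        doc.addField("body", body);
        doc.addField("indexed_date", new java.util.Date());
        nodeFor(docId).add(doc);   // commit in batches in real use
    }

    // Nightly/weekly job: drop docs older than 100 days on every node,
    // then force merge down to a handful of segments.
    static void cleanup() throws Exception {
        for (String url : NODES) {
            SolrServer node = new CommonsHttpSolrServer(url);
            node.deleteByQuery("indexed_date:[* TO NOW-100DAYS]");
            node.commit();
            node.optimize(true, true, 4);  // waitFlush, waitSearcher, maxSegments
        }
    }
}

Queries would always go to all 10 nodes via the shards parameter, same as in the earlier example.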