Hi,

Some quick notes, since it's late here.

- You'll need to wait for SOLR-303 - there is no way even a big machine will be able to search such a large index in a reasonable amount of time, plus you may simply not have enough RAM for such a large index.
- I'd suggest you wait for Solr 1.3 (or some -dev version that uses the about-to-be-released Lucene 2.3)...performance reasons.
- As for avoiding index duplication - how about having a SAN with a single copy of the index that all searchers (and the master) point to?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Phillip Farber <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, January 18, 2008 5:26:21 PM
Subject: Solr feasibility with terabyte-scale data

Hello everyone,

We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id, stored and indexed, and a large text field, indexed not stored, containing the OCR, typically ~1.4 MB. Some limited faceting or additional metadata fields may be added later.

The data in question currently amounts to about 1.1 TB of OCR (about 1M docs), which we expect to increase to 10 TB over time. Pilot tests on the desktop w/ 2.6 GHz P4 with 2.5 GB memory, Java 1 GB heap, on ~180 MB of data via HTTP suggest we can index at a rate sufficient to keep up with the inputs (after getting over the 1.1 TB hump). We envision nightly commits/optimizes.

We expect to have a low QPS (<10) rate and probably will not need millisecond query response.

Our environment makes available Apache on blade servers (Dell 1955 dual dual-core 3.x GHz Xeons w/ 8 GB RAM) connected to a *large*, high-performance NAS system over a dedicated (out-of-band) GbE switch (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting with 2 blades and will add as demands require.

While we have a lot of storage, the idea of master/slave Solr Collection Distribution to add more Solr instances clearly means duplicating an immense index.
Is it possible to use one instance to update the index on NAS while other instances only read the index and commit to keep their caches warm instead?

Should we expect Solr indexing time to slow significantly as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale?

Given these parameters, is it realistic to think that Solr could handle the task?

Any advice/wisdom greatly appreciated,

Phil
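
For a rough sense of scale, the figures quoted in the message can be turned into a few back-of-envelope numbers. This is only a sketch: the function names are made up for illustration, decimal units (1 TB = 1,000,000 MB) are an assumption, and the inputs are the approximate values from the message (about 1.1 TB, about 1M docs, a 10 TB target, a ~180 MB pilot). It says nothing about index size, RAM needs, or query speed, none of which can be derived from the message.

```python
# Back-of-envelope arithmetic on the corpus figures quoted in the message.
# Assumption: decimal units, i.e. 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def average_doc_mb(total_tb, doc_count):
    """Mean document size in MB for a corpus of total_tb terabytes."""
    return total_tb * MB_PER_TB / doc_count


def projected_docs(target_tb, avg_doc_mb):
    """Document count at target_tb, assuming the mean size stays the same."""
    return round(target_tb * MB_PER_TB / avg_doc_mb)


def pilot_fraction(pilot_mb, total_tb):
    """Share of the current corpus that the pilot test covered."""
    return pilot_mb / (total_tb * MB_PER_TB)


# Inputs taken from the message; all of them are stated there as approximate.
avg_mb = average_doc_mb(1.1, 1_000_000)
docs_at_10tb = projected_docs(10, avg_mb)
pilot_share = pilot_fraction(180, 1.1)

print(f"average document size: {avg_mb:.2f} MB")
print(f"documents at 10 TB:    {docs_at_10tb:,}")
print(f"pilot share of corpus: {pilot_share:.4%}")
```

Under these assumptions the mean works out to about 1.1 MB per document, somewhat below the "typically ~1.4 MB" quoted, and the 180 MB pilot covers well under a tenth of a percent of the current corpus, so the pilot indexing rate may not carry over to the full data set.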