Hi! I am already using Solr 1.2 and am happy with it.
In a new project with a very tight deadline (10 development days from today) I need to set up a more ambitious system in terms of scale. Here is the spec:

* I need to index about 60,000,000 documents
* Each document has 11 textual fields to be indexed & stored, and 4 more fields to be stored only
* Most fields are short (2-14 characters); however, 2 indexed fields can be up to 1KB, and another stored field is up to 1KB
* On average, every document is about 0.5KB to be stored and 0.4KB to be indexed
* The SLA for data freshness is a full nightly re-index (I cannot obtain incremental update/delete lists of the modified documents)
* The SLA for query time is 5 seconds
* The number of expected queries is 2-3 queries per second
* The queries are simple: a combination of Boolean operations and name searches (no fancy fuzzy searches or Levenshtein distances, no faceting, etc.)
* I have a 64-bit Dell 2950 machine with 4 CPUs (2 dual cores), RAID 10, 200GB of HD space, and 8GB of RAM
* The documents are not given to me explicitly - I am given raw documents in RAM, one by one, from which I create my document in RAM. Then I can either http-post it to index it directly, or append it to a TSV file for later indexing
* Each document has a unique ID

I have a few directions I am thinking about.

The simple approach
* Have one Solr instance index the entire document set (from files). I am afraid this will take too much time.

Direction 1
* Create TSV files from all the documents - this will take around 3-4 hours
* Partition all the documents into several subsets (how many should I choose?)
* Run multiple Solr instances on the same machine
* Let each Solr instance concurrently index the appropriate subset
* At the end, merge all the indices using the IndexMergeTool (how much time will that take?)
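To make the partitioning step of Direction 1 concrete, here is a minimal sketch (Python) of splitting the document stream into one TSV file per shard by hashing the unique ID; the shard count, file-name prefix, and field layout are my assumptions, not anything prescribed by Solr:

```python
import csv
import hashlib

NUM_SHARDS = 4  # assumption: one TSV file (and one Solr indexer) per shard

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Stable shard assignment derived from the document's unique ID."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def write_sharded_tsv(docs, prefix="docs-shard"):
    """docs: iterable of (doc_id, field_values) pairs, where field_values
    is a list of already-extracted field strings. Appends each document
    to the TSV file of its shard, so the indexers can later run on
    disjoint subsets concurrently."""
    files = [open("%s-%d.tsv" % (prefix, s), "a", newline="")
             for s in range(NUM_SHARDS)]
    writers = [csv.writer(f, delimiter="\t") for f in files]
    try:
        for doc_id, field_values in docs:
            writers[shard_for(doc_id)].writerow([doc_id] + list(field_values))
    finally:
        for f in files:
            f.close()
```

Hashing the ID (rather than round-robin) keeps the assignment stable, so re-running the nightly export sends each document to the same shard.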
Direction 2
* Like the previous, but instead of using the IndexMergeTool, use distributed search with shards (upgrading to Solr 1.3)

Directions 3, 4
* Like the previous directions, only avoid using TSV files at all and directly index the documents from RAM

Questions:
* Which direction do you recommend in order to meet the SLAs in the fastest way?
* Since I have RAID on the machine, can I gain performance by using multiple Solr instances on the same machine, or will only multiple machines help me?
* What is the minimal number of machines I should require? (I might get more, weaker machines)
* How many concurrent indexers are recommended?
* Do you agree that the bottleneck is the indexing time?

Any help is appreciated. Thanks in advance,
yatir
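For Directions 3 and 4 (indexing straight from RAM), each document can be wrapped in Solr's XML update format and http-posted in batches to the update handler. A sketch, assuming a Solr instance at the default localhost:8983 URL and placeholder field names that would have to match the actual schema.xml:

```python
from urllib.request import Request, urlopen
from xml.sax.saxutils import escape

# Assumption: default single-core Solr 1.2/1.3 update URL.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

def to_solr_add_xml(docs):
    """Build a Solr <add> update message for a batch of documents.
    docs: list of dicts mapping schema field name -> value; the field
    names used by the caller are placeholders for the real schema."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append('<field name="%s">%s</field>'
                         % (escape(name), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)

def post_batch(docs, url=SOLR_UPDATE_URL):
    """POST one batch of documents. Batching (e.g. a few hundred docs
    per request) amortizes the per-request overhead, which matters for
    a 60M-document nightly re-index."""
    body = to_solr_add_xml(docs).encode("utf-8")
    req = Request(url, data=body,
                  headers={"Content-Type": "text/xml; charset=utf-8"})
    with urlopen(req) as resp:
        return resp.status
```

A `<commit/>` would still have to be posted at the end of the run before the new index becomes searchable.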