We are currently using a commercial indexing product based on Lucene for our indexing needs, and would like to replace it with SOLR. The source database for this system has 40 million records and is growing by about 30,000 items per day. It is a repository for all the metadata relating to an archive of photos/images, text articles, and recently video content. The metadata also exists in the filesystem alongside the actual content, and that copy is used when a single item is selected on the website.

Our existing deployment consists of 20 "static" index processes running on 10 servers, each with up to 2.1 million rows of the index. In addition, there is an 11th server that acts as a search broker and houses another index, called the incremental, where all the new data goes as it comes in. Each of these servers is a 64-bit CentOS virtual machine with 5.5GB of RAM allocated. They run on three hosts, each with 32GB of RAM, 8 CPU cores at either 2.5 or 2.66GHz, and four SATA disks in a RAID10 array; a fourth machine is available for redundancy and development. The data is divided among these indexes by an autoincrement database field called DID (document identifier), the primary key on the table. The table (75GB of data, 9GB of index) has 80+ fields for each document, but only about 60 of them are used in the index, and about 50 of those are actually stored in it.

The current SOLR plan is to divide the index into shards and distribute the documents among the shards by using a modulus function on the DID. We haven't yet determined how many documents to put in each shard for the best performance. As part of the testing for this, I am building the full index right now with six shards, so each one will have between 6 and 7 million documents. For performance reasons, we will continue the incremental concept and use a separate shard for new data.
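To make the modulus idea concrete, here is the kind of DataImportHandler data-config.xml I have in mind for building one shard. This is only a sketch - I'm assuming DIH with a JDBC source and a MySQL-style MOD() function, and the table/column names and connection details below are placeholders, not our real ones.

<dataConfig>
  <!-- Hypothetical JDBC connection; driver/url/credentials are placeholders. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://dbhost/archive"
              user="solr"
              password="secret"/>
  <document>
    <!-- Shard #2 of 6: only pull rows whose DID modulo 6 equals 2.
         Each shard's data-config uses a different remainder (0-5). -->
    <entity name="item"
            query="SELECT * FROM metadata WHERE MOD(did, 6) = 2"/>
  </document>
</dataConfig>

The incremental shard would presumably use a similar entity whose query only selects DIDs added since the last full build.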

I actually did build a massive single index with the entire database in it. It performed a lot better than I expected - returning results within 1-2 seconds with one query happening at a time - but it ground to a halt when subjected to a very mild load test. At the time, my caches had not been tuned, but I still don't think it will scale.
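For what it's worth, when I say the caches had not been tuned, I mean the cache entries in the <query> section of solrconfig.xml. Something along these lines is what I'd be adjusting; the sizes below are illustrative defaults, not values we've settled on:

<query>
  <!-- Illustrative starting points only; real sizes depend on load testing. -->
  <filterCache      class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="32"/>
  <documentCache    class="solr.LRUCache"     size="512" initialSize="512"/>
</query>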

We are currently testing with the pre-packaged version 1.4 running under Jetty because we want maximum stability, but we have already determined that this isn't going to cut it. At the very least, we will need SOLR-1143 to keep the system working through problems and maintenance, but today I was pointed in the direction of SolrCloud.

I have duplicated the concept of a search broker from the current solution by giving each of my VMs four cores - broker, live, build, and test. The broker core talks to a directory named broker, while the other three use core0, core1, and core2, so that we can swap cores without any name confusion. We'll just use the server's XML output to keep track of which core points at which directory.
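In case it helps to see it, the solr.xml on each VM is essentially the following. The core and directory names are the real ones described above; the other attributes are just the usual defaults and shouldn't be taken as gospel.

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- Logical core names stay fixed; each maps to a physical instance dir. -->
    <core name="broker" instanceDir="broker"/>
    <core name="live"   instanceDir="core0"/>
    <core name="build"  instanceDir="core1"/>
    <core name="test"   instanceDir="core2"/>
  </cores>
</solr>

A CoreAdmin SWAP then exchanges which instanceDir a given name points to, and the CoreAdmin STATUS response (the XML output I mentioned) shows the current mapping.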

The broker core has a shards parameter in its solrconfig.xml pointing at all of the 'live' cores, so you can issue search queries to that core and get results from the entire system - as long as all the shards are up. The broker core itself will always have an empty index.
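Concretely, it's just a default on the standard request handler, something like the snippet below. The host names and port are placeholders, and the exact list depends on how many shards we end up with (six static plus the incremental, in the current test).

<!-- Broker core's solrconfig.xml: make every search distributed by default. -->
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- Placeholder hosts idx1-idx7; the last entry is the incremental shard. -->
    <str name="shards">idx1:8983/solr/live,idx2:8983/solr/live,idx3:8983/solr/live,idx4:8983/solr/live,idx5:8983/solr/live,idx6:8983/solr/live,idx7:8983/solr/live</str>
  </lst>
</requestHandler>

Distributed search in 1.4 fails the whole request if any listed shard is unreachable, which is a big part of why I mentioned SOLR-1143 and replicas.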

Does anyone have any recommendations about the best way to build in fault tolerance, considering long-term viability? I don't think I want to use the full SolrCloud feature set, but shard replicas are very intriguing. The drawback there is that I would need twice as much hardware so I can run two copies of each shard on separate physical machines, but it has a plus side - if we lost a whole VM host, recovery would be much easier than it is now. I found SOLR-1537, which, if I read it right, would let us have shard replicas with the regular codebase.

How stable can I expect things to be if I get a nightly build or grab 1.5 and/or SolrCloud from SVN? Are there any particular nightly builds that are known to be more stable than others? I'm an admin who flirts with programming from time to time, but we do have actual Java developers on staff.

Thanks,
Shawn
