We are currently using a commercial indexing product based on Lucene for our indexing needs, and would like to replace it with SOLR. The source database for this system has 40 million records and is growing by about 30,000 items per day. It is a repository for all the metadata relating to an archive of photos/images, text articles, and recently video content. The metadata also exists in the filesystem alongside the actual content, and that copy is used when a single item is selected on the website.

Our existing deployment consists of 20 "static" index processes running on 10 servers, each with up to 2.1 million rows of the index. In addition, there is an 11th server that acts as a search broker and houses another index, called the incremental, where all the new data goes as it comes in. Each of these servers is a 64-bit CentOS virtual machine with 5.5GB of RAM allocated. They run on three hosts, each with 32GB of RAM, 8 CPU cores at either 2.5 or 2.66GHz, and four SATA disks in a RAID10 array; a fourth machine is available for redundancy and development. The data is divided among these indexes by an autoincrement database field called DID (document identifier), the primary key on the table. The table (75GB of data, 9GB of index) has 80+ fields for each document, but only about 60 of them are used in the index, and about 50 of those are actually stored in it.

The current SOLR plan is to divide the index into shards and distribute the documents among the shards by using a modulus function on the DID. We haven't yet determined how many documents to put in each shard for the best performance. As part of the testing for this, I am building the full index right now with six shards, so each one will have between 6 and 7 million documents. For performance reasons, we will continue the incremental concept and use a separate shard for new data.
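To make the modulus idea concrete, here is the kind of DataImportHandler data-config.xml I have in mind for building one shard. This is only a sketch - I'm assuming DIH with a JDBC source and a MySQL-style MOD() function, and the table/column names and connection details below are placeholders, not our real ones.

<dataConfig>
  <!-- Hypothetical JDBC connection; driver/url/credentials are placeholders. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://dbhost/archive"
              user="solr"
              password="secret"/>
  <document>
    <!-- Shard #2 of 6: only pull rows whose DID modulo 6 equals 2.
         Each shard's data-config uses a different remainder (0-5). -->
    <entity name="item"
            query="SELECT * FROM metadata WHERE MOD(did, 6) = 2"/>
  </document>
</dataConfig>

The incremental shard would presumably use a similar entity whose query only selects DIDs added since the last full build.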

I actually did build a massive single index with the entire database in it. It performed a lot better than I expected - returning results within 1-2 seconds with one query happening at a time - but it ground to a halt when subjected to a very mild load test. At the time, my caches had not been tuned, but I still don't think it will scale.
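For what it's worth, when I say the caches had not been tuned, I mean the cache entries in the <query> section of solrconfig.xml. Something along these lines is what I'd be adjusting; the sizes below are illustrative defaults, not values we've settled on:

<query>
  <!-- Illustrative starting points only; real sizes depend on load testing. -->
  <filterCache      class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="32"/>
  <documentCache    class="solr.LRUCache"     size="512" initialSize="512"/>
</query>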

We are currently testing with the pre-packaged version 1.4 running under Jetty because we want maximum stability, but we have already determined that this isn't going to cut it. At the very least, we will need SOLR-1143 to keep the system working through problems and maintenance, but today I was pointed in the direction of SolrCloud.

I have duplicated the concept of a search broker from the current solution by giving each of my VMs four cores - broker, live, build, and test. The broker core talks to a directory named broker, while the other three use core0, core1, and core2, so that we can swap cores without any name confusion. We'll just use the server's XML output to keep track of which core points at which directory.
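In case it helps to see it, the solr.xml on each VM is essentially the following. The core and directory names are the real ones described above; the other attributes are just the usual defaults and shouldn't be taken as gospel.

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- Logical core names stay fixed; each maps to a physical instance dir. -->
    <core name="broker" instanceDir="broker"/>
    <core name="live"   instanceDir="core0"/>
    <core name="build"  instanceDir="core1"/>
    <core name="test"   instanceDir="core2"/>
  </cores>
</solr>

A CoreAdmin SWAP then exchanges which instanceDir a given name points to, and the CoreAdmin STATUS response (the XML output I mentioned) shows the current mapping.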

The broker core has a shards parameter in its solrconfig.xml pointing at all of the 'live' cores, so you can issue search queries to that core and get results from the entire system - as long as all the shards are up. The broker core itself will always have an empty index.
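Concretely, it's just a default on the standard request handler, something like the snippet below. The host names and port are placeholders, and the exact list depends on how many shards we end up with (six static plus the incremental, in the current test).

<!-- Broker core's solrconfig.xml: make every search distributed by default. -->
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- Placeholder hosts idx1-idx7; the last entry is the incremental shard. -->
    <str name="shards">idx1:8983/solr/live,idx2:8983/solr/live,idx3:8983/solr/live,idx4:8983/solr/live,idx5:8983/solr/live,idx6:8983/solr/live,idx7:8983/solr/live</str>
  </lst>
</requestHandler>

Distributed search in 1.4 fails the whole request if any listed shard is unreachable, which is a big part of why I mentioned SOLR-1143 and replicas.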

Does anyone have any recommendations about the best way to build in fault tolerance, considering long-term viability? I don't think I want to use the full SolrCloud feature set, but shard replicas are very intriguing. The drawback there is that I would need twice as much hardware so I can run two copies of each shard on separate physical machines, but it has a plus side - if we lost a whole VM host, recovery would be much easier than it is now. I found SOLR-1537, which, if I read it right, would let us have shard replicas with the regular codebase.

How stable can I expect things to be if I get a nightly build or grab 1.5 and/or SolrCloud from SVN? Are there any particular nightly builds that are known to be more stable than others? I'm an admin who flirts with programming from time to time, but we do have actual Java developers on staff.

Thanks,
Shawn
