On 7/29/2013 6:00 AM, Santanu8939967892 wrote:
> Hi,
> I have a huge volume of DB records, which is close to 250 million.
> I am going to use DIH to index the data into Solr.
> I need the best architecture to index and query the data in an
> efficient manner.
> I am using Windows Server 2008 with 16 GB RAM, a Xeon processor and
> Solr 4.4.
Gora and Jack have given you great information. I would add that when you are dealing with an index of this size, you need to be prepared to spend some real money on hardware if you want maximum performance.

With 20-30 fields, I would imagine that each document is probably a few KB in size. Even if they turn out much smaller than that, with 250 million of them your index will be pretty large. I'd be VERY surprised if the index is less than 100GB, and something larger than 500GB is more likely. For illustration purposes, let's be conservative and say it's 200GB.

16GB of RAM isn't enough for an index that size. An ideal "round" memory size for a 200GB index would be 256GB - 200GB for the OS disk cache plus enough memory for whatever size Java heap you might need. In truth, you probably don't need to cache the ENTIRE index: most searches will involve only certain parts of the index and won't touch the whole thing. A "good enough" memory size might be 128GB, which would keep the most relevant parts of the index in RAM at all times. If you were to put a 200GB index on SSD storage, you could probably get away with 64GB of RAM - 50GB or so for the OS disk cache and the rest for the Java heap. If your index ends up larger than 200GB, all of these numbers go up. They also assume that the entire index lives on one server, which is probably not a good idea.

http://wiki.apache.org/solr/SolrPerformanceProblems

SolrCloud would likely be the best architecture. It would spread your system requirements and query load across multiple machines. If you had 20 machines, each with 16-32GB of RAM, you could do a SolrCloud installation with 10 shards and a replicationFactor of 2, and there wouldn't be any memory problems. Each machine would hold 25 million records, and you'd have two complete copies of your index, so you'd be able to keep running if a machine completely failed -- which DOES happen.
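The sizing tiers above are rules of thumb, not official Solr guidance, but they can be written down as a small sketch. The heap size and the "good enough" / SSD fractions below are my assumptions chosen to reproduce the 256/128/64GB figures for a 200GB index:

```python
import math

def round_up_pow2(gb):
    """Round up to the next 'round' power-of-two memory size."""
    return 2 ** math.ceil(math.log2(gb))

def ram_tiers_gb(index_gb, heap_gb=8):
    """Rough RAM sizing tiers for a given on-disk index size.

    Heuristics from this thread (assumed fractions, not official
    guidance): ideal caches the whole index; 'good enough' caches
    the hot ~60% of it; SSD storage lets you get away with ~25%.
    """
    ideal = round_up_pow2(index_gb + heap_gb)
    good_enough = round_up_pow2(0.6 * index_gb + heap_gb)
    with_ssd = round_up_pow2(0.25 * index_gb + heap_gb)
    return ideal, good_enough, with_ssd

print(ram_tiers_gb(200))  # -> (256, 128, 64)
```

Run it with a larger index size (say 500) to see how quickly the requirements grow on a single box.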
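For the 20-machine layout above, the collection would be created once through the Collections API with numShards=10 and replicationFactor=2, and maxShardsPerNode=1 so each machine hosts exactly one shard replica. A sketch that builds the request URL (the host, collection name, and config name are made up for illustration):

```python
from urllib.parse import urlencode

# Hypothetical names for illustration only.
params = {
    "action": "CREATE",
    "name": "records",                    # collection name (assumed)
    "numShards": 10,                      # 250M docs / 10 = 25M per shard
    "replicationFactor": 2,               # two full copies of the index
    "maxShardsPerNode": 1,                # one shard replica per machine
    "collection.configName": "records_conf",  # config set in ZooKeeper (assumed)
}
url = "http://solr-node1:8983/solr/admin/collections?" + urlencode(params)
print(url)
```

Sending an HTTP GET to that URL against any node in the cluster creates the collection; 10 shards times 2 replicas gives 20 cores, one per machine.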
The information I've given you is for an ideal setup. You can go smaller, and budget constraints might indeed force you to. If you don't need extremely good performance from Solr, then you don't need to spend the money required for the architecture I've described.

Thanks,
Shawn