Yonik, I already tried with around 200M docs on a desktop-class box with 2GB of memory. Simple queries (like getting data for a date range, queries without wildcards, etc.) work fine with response times of 10-20 secs, provided the number of records hit is low (within a couple of thousand docs). However, sorting does not work there due to the memory limitation. I'm also sure any complex query (involving processing like group by, unique, etc.) would be hard to handle with acceptable performance.
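A rough back-of-the-envelope shows why sorting hits the wall (assuming the sort key is a single numeric long, e.g. a timestamp; Lucene's FieldCache loads one value per document into memory, and string sort keys cost considerably more):

    200,000,000 docs x 8 bytes per sort value = 1.6 GB

That is nearly the entire 2GB of RAM before the index's own data structures, the JVM, and the OS get any share.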
So given all this, I thought exploiting HDFS and MapReduce capability may be worthwhile, where I use Solr/Lucene's indexing power and Hadoop's parallel processing capability.

Regards,
Sourav

-----Original Message-----
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Friday, November 28, 2008 7:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Using Solr with Hadoop

Ah sorry, I had misread your original post. 3-6M docs per hour can be
challenging. Using the CSV loader, I've indexed 4000 docs per second (14M per
hour) on a 2.6GHz Athlon, but they were relatively simple and small docs.

On Fri, Nov 28, 2008 at 9:54 PM, souravm <[EMAIL PROTECTED]> wrote:
> There is a case where I'm expecting, at peak season, around 36M docs per
> day, with hourly peaks of 2-3M. Now I need to do some processing of those
> docs before I index them. Based on the indexing performance figures in
> http://wiki.apache.org/solr/SolrPerformanceFactors (the embedded vs. HTTP
> post section), it looks like it would take more than 2 hours to index 3M
> records using 4 machines. So I thought it would be difficult to achieve my
> goal through Solr alone; I need something else to further increase the
> parallelism.
>
> Altogether the target corpus would average around 3B docs (around 300 GB).

You definitely need distributed search. Don't try to search this on a single
box.

> The docs would be constantly added and deleted on a daily basis, at an
> average rate of 8M per day, peaking at 36M. Now, considering around 10
> boxes, every box would need to store around 250M docs.

250M docs per box is probably too high, even for distributed search, unless
your query throughput and latency requirements are very low.

-Yonik
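As a minimal sketch of the distributed search Yonik recommends (hostnames, ports, and the timestamp field are hypothetical; this uses the SolrJ client and the shards request parameter available from Solr 1.3):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ShardedQuery {
        public static void main(String[] args) throws Exception {
            // Any one node can coordinate the distributed request.
            CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://host1:8983/solr");

            // A simple date-range query, like the ones that already
            // perform acceptably on a single box.
            SolrQuery q = new SolrQuery(
                "timestamp:[2008-11-01T00:00:00Z TO 2008-11-30T23:59:59Z]");

            // List every shard; the coordinating node fans the query
            // out to all of them and merges the per-shard results.
            q.set("shards", "host1:8983/solr,host2:8983/solr,host3:8983/solr");
            q.setRows(100);

            QueryResponse rsp = solr.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }

Note that sorting a sharded result set still builds the FieldCache on every shard, so the per-box document count (and per-box heap) remains the limiting factor Yonik describes.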