Yonik, I already tried with around 200M docs on a desktop-class box with 2GB of memory. Simple queries (like getting data for a date range, queries without wildcards, etc.) work fine with response times of 10-20 secs, provided the number of records hit is low (within a couple of thousand docs). However, sorting does not work there due to the memory limitation. I'm also sure any complex query (involving processing like group by, unique, etc.) would be hard to handle with acceptable performance.
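A rough back-of-the-envelope shows why sorting hits the wall (assuming the sort key is a single numeric long, e.g. a timestamp; Lucene's FieldCache loads one value per document into memory, and string sort keys cost considerably more):

    200,000,000 docs x 8 bytes per sort value = 1.6 GB

That is nearly the entire 2GB of RAM before the index's own data structures, the JVM, and the OS get any share.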
So given all this, I thought exploiting HDFS and MapReduce capability may be worthwhile, where I use Solr/Lucene's indexing power and Hadoop's parallel processing capability.

Regards,
Sourav

-----Original Message-----
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Friday, November 28, 2008 7:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Using Solr with Hadoop

Ah sorry, I had misread your original post. 3-6M docs per hour can be
challenging. Using the CSV loader, I've indexed 4000 docs per second (14M per
hour) on a 2.6GHz Athlon, but they were relatively simple and small docs.

On Fri, Nov 28, 2008 at 9:54 PM, souravm <[EMAIL PROTECTED]> wrote:
> There is a case where I'm expecting, at peak season, around 36M docs per
> day, with hourly peaks of 2-3M. Now I need to do some processing of those
> docs before I index them. Based on the indexing performance figures in
> http://wiki.apache.org/solr/SolrPerformanceFactors (the embedded vs. HTTP
> post section), it looks like it would take more than 2 hours to index 3M
> records using 4 machines. So I thought it would be difficult to achieve my
> goal through Solr alone; I need something else to further increase the
> parallelism.
>
> Altogether the target corpus would average around 3B docs (around 300 GB).

You definitely need distributed search. Don't try to search this on a single
box.

> The docs would be constantly added and deleted on a daily basis, at an
> average rate of 8M per day, peaking at 36M. Now, considering around 10
> boxes, every box would need to store around 250M docs.

250M docs per box is probably too high, even for distributed search, unless
your query throughput and latency requirements are very low.

-Yonik
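As a minimal sketch of the distributed search Yonik recommends (hostnames, ports, and the timestamp field are hypothetical; this uses the SolrJ client and the shards request parameter available from Solr 1.3):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ShardedQuery {
        public static void main(String[] args) throws Exception {
            // Any one node can coordinate the distributed request.
            CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://host1:8983/solr");

            // A simple date-range query, like the ones that already
            // perform acceptably on a single box.
            SolrQuery q = new SolrQuery(
                "timestamp:[2008-11-01T00:00:00Z TO 2008-11-30T23:59:59Z]");

            // List every shard; the coordinating node fans the query
            // out to all of them and merges the per-shard results.
            q.set("shards", "host1:8983/solr,host2:8983/solr,host3:8983/solr");
            q.setRows(100);

            QueryResponse rsp = solr.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }

Note that sorting a sharded result set still builds the FieldCache on every shard, so the per-box document count (and per-box heap) remains the limiting factor Yonik describes.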