The indexing rate you need to achieve should equal the rate at which new documents are produced. It shouldn't matter much how long it takes to index 3-6M documents the first time (within reason), given that you only need to do it once or occasionally. What is that rate (i.e. why do you think you can't do it on a single box)?
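As a rough sanity check, the figures quoted further down this thread (3-6M documents, around 300 to 600 MB, per hour) can be turned into a sustained per-second target. This is only a back-of-the-envelope sketch: it assumes the load is spread evenly across the hour, and the helper name `required_rate` is made up for illustration.

```python
def required_rate(docs_per_hour, mb_per_hour):
    """Return (docs/sec, MB/sec) needed to keep up with the incoming stream.

    Assumes documents arrive evenly over the hour; bursts would need headroom.
    """
    seconds_per_hour = 3600
    return docs_per_hour / seconds_per_hour, mb_per_hour / seconds_per_hour


# Low and high ends of the range quoted in the thread.
for docs, mb in [(3_000_000, 300), (6_000_000, 600)]:
    docs_per_sec, mb_per_sec = required_rate(docs, mb)
    print(f"{docs:,} docs/hr -> {docs_per_sec:.1f} docs/s, {mb_per_sec:.3f} MB/s")
```

Whether a single box can sustain that depends on the schema and analysis chain, so it is worth measuring against a sample of the real data before adding machines.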
For the scale of documents you are talking about, hadoop sounds like it would complicate things more than simplify them. There is a pending Solr patch for using custom IndexReader factories that could easily open multiple indexes to search across (no optimize needed). Or, it would be relatively trivial to write a Lucene program to merge the indexes. You could also leave the indexes on multiple boxes and use Solr's distributed search to search across them (assuming you really didn't need everything on a single box).

-Yonik

On Fri, Nov 28, 2008 at 7:01 PM, souravm <[EMAIL PROTECTED]> wrote:
> Hi Yonik,
>
> Let me explain why I thought using hadoop would help in achieving the
> parallel indexing better.
>
> Here is the set of requirements and constraints -
>
> 1. The 3-6M documents (around 300 to 600 MB size) would belong to the same
> schema
> 2. The resulting index of those 3-6M documents has to reside in a single box
> (the target box).
> 3. I have to use desktop grade servers with limited RAM (say maximum 2 GB)
> and single CPU but large enough disk space above 100 GB.
>
> Now if I try to achieve indexing for 3-6M records by running a single thread
> in each of those servers then the steps are -
>
> 1. Create index in all N boxes
> 2. Merge those indexes in the target box
> 3. Optimize the resulting index in the target box.
>
> In the Hadoop way what I need to do -
>
> 1. Use those 'N' servers to create the HDFS.
> 2. Copy the raw data (3-6M records) to the HDFS.
> 3. Then use Map/Reduce for indexing those documents and optimize.
>
> I think in this way the index merging and optimization time would be less, as
> those would not be limited by my single server's CPU and memory; instead,
> through Map/Reduce the same would be happening in multiple boxes utilizing
> their CPUs and memory in parallel. As far as I know, this is the way Rackspace
> implemented Solr's integration with Hadoop and benefited. But I realize that
> this integration is not available as open source.
>
> Also please let me know if there is any other option to reduce indexing time
> in my case within Solr given the limited capabilities of the servers.
>
> Regards,
> Sourav
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
> Sent: Friday, November 28, 2008 1:58 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Using Solr with Hadoop ....
>
> While future Solr-hadoop integration is a definite possibility (and
> will enable other cool stuff), it doesn't necessarily seem needed for
> the problem you are trying to solve.
>
>> indexing them in parallel is not an option as my target doc size per hr
>> itself can be very huge (3-6M)
>
> I'm not sure I understand... the bigger the indexing job, the more it
> makes sense to do in parallel. If you're not doing any link inversion
> for web search, it doesn't seem like hadoop is needed for parallelism.
> If you are doing web crawling, perhaps look to nutch, not hadoop.
>
> -Yonik
>
>
> On Fri, Nov 28, 2008 at 1:31 PM, souravm <[EMAIL PROTECTED]> wrote:
>> Hi All,
>>
>> I have a huge number of documents to index (say per hr) and within an hr I
>> cannot complete it using a single machine. Having them distributed in
>> multiple boxes and indexing them in parallel is not an option as my target
>> doc size per hr itself can be very huge (3-6M). So I am considering using
>> HDFS and MapReduce to do the indexing job within time.
>>
>> In that regard I have the following queries regarding using Solr with Hadoop.
>>
>> 1. After creating the index using Hadoop, would storing it for query
>> purposes again in HDFS mean additional performance overhead (compared
>> to storing it on an actual disk in one machine)?
>>
>> 2. What type of change is needed to make a Solr query read from an index
>> which is stored in HDFS?
>>
>> Regards,
>> Sourav