Marc is referring to the very informative thread by Ted Dunning from maybe a month or so ago.
For what it's worth, we just used Hadoop Streaming, JRuby, and EmbeddedSolr to speed up indexing by parallelizing it.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Marc Sturlese <marc.sturl...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, June 22, 2010 12:43:27 PM
> Subject: Re: anyone use hadoop+solr?
>
> Well, the patch consumes the data from a CSV. You have to modify the input
> to use TableInputFormat (I don't remember if it's called exactly that) and
> it will work. Once you've done that, you have to specify as many reducers
> as shards you want.
>
> I know two ways to index using Hadoop:
>
> Method 1 (SOLR-1301 & Nutch):
> -Map: just gets the data from the source and creates key-value pairs
> -Reduce: does the analysis and indexes the data
> So, the index is built on the reduce side.
>
> Method 2 (Hadoop Lucene index contrib):
> -Map: does the analysis and opens an IndexWriter to add docs
> -Reduce: merges the small indexes built in the map
> So, indexes are built on the map side.
>
> Method 2 has no good integration with Solr at the moment. In the JIRA
> issue (SOLR-1301) there's a good explanation of the advantages and
> disadvantages of indexing on the map or reduce side. I recommend you read
> all the comments on the issue in detail to know exactly how it works.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p914625.html
> Sent from the Solr - User mailing list archive at Nabble.com.
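To make the two approaches Marc describes concrete, here is a minimal, self-contained Python sketch of the data flow only: plain dicts stand in for Hadoop's shuffle, and lists of tokens stand in for Lucene indexes. All names here (`analyze`, `method1`, `method2`, the toy docs) are illustrative assumptions, not code from SOLR-1301 or the Hadoop contrib.

```python
from collections import defaultdict

# Toy documents; in SOLR-1301 these would come from a CSV
# (or TableInputFormat, if you swap the input format as described above).
DOCS = [
    {"id": 1, "text": "hadoop solr indexing"},
    {"id": 2, "text": "lucene index merge"},
    {"id": 3, "text": "solr shards"},
    {"id": 4, "text": "nutch crawler"},
]

NUM_SHARDS = 2  # one reducer per shard, as the post describes


def analyze(text):
    # Stand-in for Lucene/Solr analysis (tokenization, filtering, ...).
    return text.split()


# --- Method 1: reduce-side indexing (SOLR-1301 / Nutch style) ---
def method1(docs, num_shards):
    # Map: only routes raw docs to a shard key; no analysis here.
    shuffled = defaultdict(list)
    for doc in docs:
        shuffled[doc["id"] % num_shards].append(doc)
    # Reduce: each reducer analyzes and builds one shard's index.
    return {shard: {d["id"]: analyze(d["text"]) for d in ds}
            for shard, ds in shuffled.items()}


# --- Method 2: map-side indexing (Hadoop Lucene contrib style) ---
def method2(docs, num_mappers):
    # Map: each mapper analyzes its own split and writes a small index.
    splits = [docs[i::num_mappers] for i in range(num_mappers)]
    small_indexes = [{d["id"]: analyze(d["text"]) for d in split}
                     for split in splits]
    # Reduce: merge the small per-mapper indexes into one.
    merged = {}
    for idx in small_indexes:
        merged.update(idx)
    return merged


shards = method1(DOCS, NUM_SHARDS)
merged = method2(DOCS, num_mappers=3)
print(len(shards), sum(len(s) for s in shards.values()), len(merged))
# → 2 4 4
```

In the real jobs, the per-shard dicts would be IndexWriters (or an EmbeddedSolr instance) and the method-2 merge would be a Lucene index merge; the point of the sketch is only where the expensive analysis step runs, which is the trade-off the SOLR-1301 comments discuss.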