Hi Erick,

Thank you for your response and the pointer. What I mean by running Lucene/SOLR on Hadoop is having a Lucene/SOLR index available to be queried from MapReduce jobs, or whatever best practice is recommended for that.
I need this mechanism to do large-scale row deduplication. Let me elaborate on why I need it:

1. I have two data sources with 35 and 40 million customer profile records; the data comes from two systems (SAP and MS CRM).
2. I need to index and compare the two data sources row by row using the name, address, birth date, phone and email fields. Birth date and email use exact comparison, while the other fields use probabilistic comparison. The data has been normalized before it is indexed.
3. Each finding is categorized under the same person and is deduplicated either automatically or with user intervention, depending on the score.

I usually do this with a Lucene index on the local filesystem, using term vectors. But since this will be a repeated task, and management has challenged me to do it on top of a Hadoop cluster, I need a framework or best practice for it.

I understand that putting a Lucene index on HDFS is not very appropriate, since HDFS is designed for large block operations. With that understanding, I use SOLR and query it with an HTTP call from the MapReduce job. The snippet is below (solrQueryUrl stands for the actual Solr query URL):

URL url = new URL(solrQueryUrl);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");

The latter approach turns out to perform very badly. The simple MapReduce job that only reads the data sources and writes to HDFS takes 15 minutes, but once I add the HTTP request it has already been running for three hours and is still going. A fuller sketch of how this sits inside my mapper is at the end of this message.

What went wrong? And what would be the solution to my problem?

Thanks,
Dino

On Mon, Dec 14, 2015 at 12:30 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> First, what do you mean "run Lucene/Solr on Hadoop"?
>
> You can use the HdfsDirectoryFactory to store Solr/Lucene
> indexes on Hadoop, at that point the actual filesystem
> that holds the index is transparent to the end user, you just
> use Solr as you would if it was using indexes on the local
> file system. See:
> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
>
> If you want to use Map-Reduce to _build_ indexes, see the
> MapReduceIndexerTool in the Solr contrib area.
>
> Best,
> Erick

--
Regards,
Dino
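
P.S. To make the setup concrete, here is a simplified sketch of the mapper, assuming one Solr lookup per input record. The class name, input field layout, Solr host/collection and query field are illustrative placeholders, not my actual code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DedupMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Illustrative Solr endpoint; the real query URL is built per record in my job.
    private static final String SOLR_SELECT = "http://solrhost:8983/solr/customers/select";

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // One normalized customer profile row per input line (tab separated).
        String[] fields = value.toString().split("\t");
        String name = fields[0]; // illustrative: name is the first column here

        // One HTTP GET against Solr for every input record.
        String queryUrl = SOLR_SELECT + "?q=name:"
                + URLEncoder.encode(name, "UTF-8") + "&wt=json";
        URL url = new URL(queryUrl);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");

        // Read the whole response body before scoring the candidates.
        StringBuilder response = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                response.append(line);
            }
        }
        connection.disconnect();

        // Candidate matches are written out for the dedup scoring/decision step.
        context.write(new Text(name), new Text(response.toString()));
    }
}

So every call to map() opens a new connection and does one HTTP round trip per record; that is the part I suspect is killing the throughput, but I would appreciate confirmation and the recommended pattern for this.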