Apache Blur (Incubating) has several approaches (Hive, Spark, M/R), ranging from very experimental to stable, that could probably help with this. If you're interested, you can ask over on blur-u...@incubator.apache.org ...
Thanks,
--tim

On Fri, Dec 25, 2015 at 4:28 AM, Dino Chopins <dino.chop...@gmail.com> wrote:
> Hi Erick,
>
> Thank you for your response and pointer. What I mean by running
> Lucene/Solr on Hadoop is having the Lucene/Solr index available to be
> queried from MapReduce, or whatever best practice is recommended.
>
> I need this mechanism to do large-scale row deduplication. Let me
> elaborate on why:
>
> 1. I have two data sources with 35 and 40 million records of customer
> profiles; the data come from two systems (SAP and MS CRM).
> 2. I need to index and compare the two data sources row by row using the
> name, address, birth date, phone, and email fields. Birth date and email
> use exact comparison; the other fields use probabilistic comparison. The
> data are normalized before they are indexed.
> 3. Each match is grouped under the same person and deduplicated either
> automatically or with user intervention, depending on the score.
>
> I usually do this with a Lucene index on the local filesystem, using
> term vectors, but since this will be a recurring task, and management
> has challenged me to run it on top of the Hadoop cluster, I need a
> framework or best practice for doing so.
>
> I understand that putting a Lucene index on HDFS is not very
> appropriate, since HDFS is designed for large block operations. With
> that understanding, I use Solr and query it with an HTTP call from the
> MapReduce job. The snippet is below:
>
>     url = new URL(SOLR-Query-URL);
>
>     HttpURLConnection connection =
>             (HttpURLConnection) url.openConnection();
>     connection.setRequestMethod("GET");
>
> The latter approach turns out to perform very badly. A simple MapReduce
> job that only reads the data sources and writes to HDFS takes 15
> minutes, but once I add the HTTP request it has been running for three
> hours and is still going.
>
> What went wrong? And what would be the solution to my problem?
>
> Thanks,
>
> Dino
>
> On Mon, Dec 14, 2015 at 12:30 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> First, what do you mean by "run Lucene/Solr on Hadoop"?
>>
>> You can use the HdfsDirectoryFactory to store Solr/Lucene indexes on
>> Hadoop. At that point the actual filesystem that holds the index is
>> transparent to the end user; you just use Solr as you would if it were
>> using indexes on the local file system. See:
>> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
>>
>> If you want to use MapReduce to _build_ indexes, see the
>> MapReduceIndexerTool in the Solr contrib area.
>>
>> Best,
>> Erick
>
>
> --
> Regards,
>
> Dino
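
For the exact-plus-probabilistic comparison in Dino's step 2, here is a
minimal SolrJ sketch of one way to phrase the per-record candidate query:
the exact fields become filter queries, the fuzzy fields become scored
clauses, and the returned score can then drive the automatic-vs-manual
decision in step 3. The field names, edit distances, and single-token
assumption are illustrative only, not Dino's actual schema:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.util.ClientUtils;

    public class DedupQueryBuilder {
        // Hypothetical schema: single-token, normalized field values.
        public static SolrQuery build(String name, String address, String phone,
                                      String birthDate, String email) {
            SolrQuery q = new SolrQuery();
            // Probabilistic part: fuzzy (edit-distance) clauses are scored,
            // so closer matches rank higher.
            q.setQuery("name:" + ClientUtils.escapeQueryChars(name) + "~2"
                    + " OR address:" + ClientUtils.escapeQueryChars(address) + "~2"
                    + " OR phone:" + ClientUtils.escapeQueryChars(phone) + "~1");
            // Exact part: filter queries restrict candidates without
            // affecting the score, and Solr caches them.
            q.addFilterQuery("birthdate:\"" + birthDate + "\"");
            q.addFilterQuery("email:\"" + email + "\"");
            q.setRows(20);              // only the top candidate duplicates
            q.setFields("id", "score"); // return the match score as well
            return q;
        }
    }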
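On the 15-minutes-versus-three-hours slowdown: opening a fresh URL and
HttpURLConnection for every one of tens of millions of records means
per-record connection setup plus a network round trip each time, which
likely dominates the job. A sketch of a mapper that at least amortizes the
client setup by reusing one SolrJ client per mapper JVM, assuming SolrJ on
the job classpath (the host, collection, and input layout are placeholders;
the constructor shown is the SolrJ 5.x style):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.util.ClientUtils;
    import org.apache.solr.common.SolrDocument;

    public class DedupMapper extends Mapper<LongWritable, Text, Text, Text> {

        private HttpSolrClient solr;

        @Override
        protected void setup(Context context) {
            // One client per mapper, reused for every record, instead of a
            // fresh URL + HttpURLConnection per record.
            solr = new HttpSolrClient("http://solr-host:8983/solr/customers");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                QueryResponse rsp = solr.query(buildQueryFrom(value.toString()));
                for (SolrDocument doc : rsp.getResults()) {
                    // Emit (candidate id, source record) for downstream scoring.
                    context.write(new Text((String) doc.getFieldValue("id")), value);
                }
            } catch (Exception e) {
                throw new IOException(e);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            solr.close();
        }

        private SolrQuery buildQueryFrom(String record) {
            // Toy parsing: assume tab-separated, normalized input with the
            // name in the first column; real code would build the combined
            // exact/fuzzy query sketched above.
            String[] fields = record.split("\t");
            SolrQuery q = new SolrQuery(
                    "name:" + ClientUtils.escapeQueryChars(fields[0]) + "~2");
            q.setRows(20);
            return q;
        }
    }

Even with a reused client this is still one round trip per record, so the
larger win is usually to batch several records into one wider query, raise
the mapper count, or avoid the per-record lookup entirely.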
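For reference, the HdfsDirectoryFactory Erick mentions is switched on in
solrconfig.xml along roughly these lines (the namenode URI is a
placeholder; the linked cwiki page has the full option list):

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
    </directoryFactory>

    <indexConfig>
      <lockType>hdfs</lockType>
    </indexConfig>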
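And for the index-building side, a sketch of the MapReduceIndexerTool route
Erick points to, assuming the Solr map-reduce contrib job jar is available.
The paths, ZooKeeper address, and collection name are placeholders, and the
exact flags should be checked against the tool's --help output:

    hadoop jar solr-map-reduce-*.jar org.apache.solr.hadoop.MapReduceIndexerTool \
      --morphline-file morphline.conf \
      --zk-host zk1:2181/solr \
      --collection customers \
      --output-dir hdfs://namenode:8020/tmp/indexer-output \
      --go-live \
      hdfs://namenode:8020/input/customers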