Re: Running Lucene/SOLR on Hadoop

2016-01-09 Thread Dino Chopins
Hi Steve, I cannot remove duplicates at index time; instead, I need to find the duplicates of each document and report the duplicate data back to the user. Yes, I need to query each document across all 40 million rows. It will be about 10 mapper tasks at most. I will try SolrJ for this purpose. Thanks Steve. Be

Re: Running Lucene/SOLR on Hadoop

2016-01-09 Thread Dino Chopins
Hi Tim, Thank you for the great pointer. Will join the group. Thanks, Dino On Tue, Jan 5, 2016 at 2:10 AM, Tim Williams wrote: > Apache Blur (Incubating) has several approaches (hive, spark, m/r) > that could probably help with this ranging from very experimental to > stable. If you're inter

Re: Running Lucene/SOLR on Hadoop

2015-12-24 Thread Dino Chopins
Hi Erick, Thank you for your response and pointer. What I mean by running Lucene/SOLR on Hadoop is having a Lucene/SOLR index available to be queried from MapReduce, or by whatever approach is recommended as best practice. I need this mechanism to do large-scale row deduplication. Let me elaborate why I need thi

Running Lucene/SOLR on Hadoop

2015-12-13 Thread Dino Chopins
Hi, I've tried to figure out how we can run Lucene/SOLR on Hadoop, and found several sources. The latest pointer I found is the Apache Blur project, which is still incubating. Is there any straightforward implementation of Lucene/SOLR on Hadoop? Or best practice of how to incorporate Lucene/SOLR on Hadoop