Hi Steve,
I can't remove duplicates at index time; instead, I need to find the duplicates
of a document and then report the duplicate data back to the user.
Yes, I need to run a query for each of the 40 million rows; it will be about
10 mapper tasks at most. I will try SolrJ for this purpose. Thanks, Steve.
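For reference, here is a minimal SolrJ sketch of the per-document lookup I
have in mind (the URL, the "rows" collection, and the "signature" field are
just placeholders on my side, nothing settled yet):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class DuplicateLookup {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and collection name.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/rows").build()) {
            // Assumes each row was indexed with a content hash in a
            // "signature" field, so duplicates share the same value.
            SolrQuery query = new SolrQuery("signature:abc123");
            query.setRows(100);
            for (SolrDocument doc : solr.query(query).getResults()) {
                // Every hit sharing the signature is a candidate
                // duplicate to report back to the user.
                System.out.println("Possible duplicate: "
                        + doc.getFieldValue("id"));
            }
        }
    }
}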
Best,
Dino
Hi Tim,
Thank you for the great pointer. Will join the group.
Thanks,
Dino
On Tue, Jan 5, 2016 at 2:10 AM, Tim Williams wrote:
> Apache Blur (Incubating) has several approaches (Hive, Spark, M/R)
> that could probably help with this, ranging from very experimental to
> stable. If you're interested ...
Hi Erick,
Thank you for your response and the pointer. What I mean by running Lucene/SOLR
on Hadoop is having the Lucene/SOLR index available to be queried from
MapReduce, or via whatever best practice is recommended.
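For concreteness, here is a rough sketch of the kind of mapper I mean, where
each map task queries the Solr index for every input row (again, the host,
collection, and field names are placeholders):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Sketch only: one SolrJ client per map task, one lookup per input row.
public class DedupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private SolrClient solr;

    @Override
    protected void setup(Context context) {
        // Placeholder endpoint; in a real job this would come from
        // the job configuration.
        solr = new HttpSolrClient.Builder(
                "http://solr-host:8983/solr/rows").build();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumes each input line is a precomputed content signature.
        String signature = value.toString().trim();
        try {
            SolrQuery query = new SolrQuery("signature:" + signature);
            query.setRows(0); // only the hit count is needed here
            long hits = solr.query(query).getResults().getNumFound();
            if (hits > 1) {
                // More than one indexed row shares this signature:
                // emit it so the duplicates can be reported to the user.
                context.write(new Text(signature),
                        new Text("duplicates=" + hits));
            }
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        solr.close();
    }
}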
I need this mechanism to do large-scale row deduplication. Let me
elaborate on why I need this ...
Hi,
I've tried to figure out how we can run Lucene/SOLR on Hadoop and have found
several sources. The latest pointer is the Apache Blur project, which is
still incubating.
Is there any straightforward implementation of Lucene/SOLR on Hadoop, or a
best practice for how to incorporate Lucene/SOLR with Hadoop?