Hi Steve,
I can't remove duplicates at index time; rather, I need to find the
duplicates of each document and then report the duplicate data back to the user.
Yes, I need to run a query for each of the 40 million rows, using about
10 mapper tasks at most. I will try SolrJ for this purpose. Thanks Steve.
Be
Hi Tim,
Thank you for the great pointer. Will join the group.
Thanks,
Dino
On Tue, Jan 5, 2016 at 2:10 AM, Tim Williams wrote:
> Apache Blur (Incubating) has several approaches (hive, spark, m/r)
> that could probably help with this, ranging from very experimental to
> stable. If you're inter
You might consider getting the de-duplication done at index time (see
https://cwiki.apache.org/confluence/display/solr/De-Duplication). That way
the map reduce job wouldn't even be necessary.
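To make the idea concrete, here is a minimal sketch of signature-based duplicate detection in plain Java. It is not Solr's implementation (the page above describes Solr's own SignatureUpdateProcessorFactory); it only illustrates the general technique of hashing a chosen set of fields and treating rows with the same hash as duplicates. The class name DedupSketch, the methods signature and findDuplicates, and the field names "id", "name" and "city" are all made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {

    // Hex-encoded MD5 over the chosen fields of one row; missing fields count as empty.
    static String signature(Map<String, String> row, List<String> fields) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                String value = row.getOrDefault(field, "");
                md5.update(value.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator so "ab"+"c" differs from "a"+"bc"
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Groups row ids (hypothetical "id" field) by signature and keeps only
    // the groups that contain more than one row, i.e. the duplicates.
    static Map<String, List<String>> findDuplicates(List<Map<String, String>> rows,
                                                    List<String> fields) {
        Map<String, List<String>> bySignature = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            bySignature.computeIfAbsent(signature(row, fields), k -> new ArrayList<>())
                       .add(row.get("id"));
        }
        bySignature.values().removeIf(ids -> ids.size() < 2);
        return bySignature;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
            Map.of("id", "1", "name", "Ann", "city", "Oslo"),
            Map.of("id", "2", "name", "Bob", "city", "Rome"),
            Map.of("id", "3", "name", "Ann", "city", "Oslo"));
        // Rows 1 and 3 share the same name and city, so they form one duplicate group.
        System.out.println(findDuplicates(rows, List.of("name", "city")).values());
    }
}
```

An exact hash like this only catches rows whose chosen fields match byte for byte; near-duplicates would need a fuzzier signature.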
When it comes to the map reduce job, you would need to be more specific
with *what* you are doing for peop
Apache Blur (Incubating) has several approaches (hive, spark, m/r)
that could probably help with this, ranging from very experimental to
stable. If you're interested, you can ask over on
blur-u...@incubator.apache.org ...
Thanks,
--tim
On Fri, Dec 25, 2015 at 4:28 AM, Dino Chopins wrote:
> Hi Er
Hi Erick,
Thank you for your response and pointer. What I mean by running Lucene/SOLR
on Hadoop is having a Lucene/SOLR index available to be queried from
MapReduce, or by whatever approach is recommended as best practice.
I need this mechanism to do large-scale row deduplication. Let me
elaborate why I need thi
First, what do you mean by "run Lucene/Solr on Hadoop"?
You can use the HdfsDirectoryFactory to store Solr/Lucene
indexes on Hadoop. At that point the actual filesystem
that holds the index is transparent to the end user; you just
use Solr as you would if it were using indexes on the local
file system
Hi,
I've tried to figure out how we can run Lucene/SOLR on Hadoop, and found
several sources. The latest pointer I found is the Apache Blur project,
which is an incubating project.
Is there any straightforward implementation of Lucene/SOLR on Hadoop? Or
a best practice for how to incorporate Lucene/SOLR on Hadoop