Hi Steve,

I can't do the de-duplication at index time; instead I need to find
duplicates of each document and report the duplicate data back to the user.
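
That said, the signature feature might still work for flagging rather than
removing: if I read the docs right, setting overwriteDupes to false keeps
every document and just stores the computed hash in a field, which can then
be faceted on to surface the groups. A rough solrconfig.xml sketch (the
field names are placeholders for my schema):

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <!-- store the hash in its own field instead of the uniqueKey -->
        <str name="signatureField">sig</str>
        <!-- false = keep the duplicates, only record the signature -->
        <bool name="overwriteDupes">false</bool>
        <!-- placeholder list of fields to hash -->
        <str name="fields">title,body</str>
        <str name="signatureClass">solr.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

Faceting on sig with facet.mincount=2 should then list every signature
shared by more than one document.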

Yes, I need to issue a query for each of the 40 million documents, using
about 10 mapper tasks at most. I will try SolrJ for this. Thanks, Steve.
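
For the record, here is a minimal SolrJ sketch of the per-record lookup I
have in mind (assuming SolrJ 5.x; the core URL and the title field are
placeholders for my setup):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.util.ClientUtils;
    import org.apache.solr.common.SolrDocument;

    public class DuplicateLookup {
        // One client per mapper, reused for every query; it wraps
        // Apache HttpClient and pools connections internally.
        private final HttpSolrClient client =
                new HttpSolrClient("http://localhost:8983/solr/mycore");

        // Hypothetical lookup: find other documents sharing this title.
        public void findDuplicates(String title) throws Exception {
            SolrQuery q = new SolrQuery(
                    "title:" + ClientUtils.escapeQueryChars(title));
            q.setRows(10);
            QueryResponse rsp = client.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                // each hit is a candidate duplicate to report back
                System.out.println(doc.getFieldValue("id"));
            }
        }

        public void close() throws Exception {
            client.close();
        }
    }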

Best,

Dino

On Sun, Jan 10, 2016 at 11:31 AM, Steve Davids <sdav...@gmail.com> wrote:

> You might consider trying to get the de-duplication done at index time
> (https://cwiki.apache.org/confluence/display/solr/De-Duplication); that
> way the map reduce job wouldn't even be necessary.
>
> When it comes to the map reduce job, you would need to be more specific
> about *what* you are doing for people to try and help: are you attempting
> to query for every record of all 40 million rows, and how many mapper
> tasks? Right off the bat, though, I see you are using Java's
> HttpURLConnection; you should really use SolrJ for querying purposes
> (https://cwiki.apache.org/confluence/display/solr/Using+SolrJ). You won't
> need to deal with XML parsing, and it uses Apache's HttpClient with much
> more reasonable defaults.
>
> -Steve
>

-- 
Regards,

Dino
