Hi Erick,

Thank you for your response and the pointer. What I mean by running
Lucene/SOLR on Hadoop is having the Lucene/SOLR index available to be
queried from MapReduce jobs, or through whatever best practice is
recommended for that.

I need this mechanism to do large-scale row deduplication. Let me
elaborate on why I need it:

   1. I have two data sources with 35 and 40 million customer profile
   records; the data comes from two systems (SAP and MS CRM).
   2. I need to index the data and compare the two sources row by row using
   the name, address, birth date, phone and email fields. Birth date and
   email use exact comparison, while the other fields use probabilistic
   comparison (a rough sketch of the kind of per-row query I have in mind
   follows this list). The data has been normalized before it is indexed.
   3. Each match will be grouped under the same person and deduplicated
   either automatically or with user intervention, depending on the score.

I usually do this with a Lucene index on the local filesystem, using term
vectors (roughly along the lines of the sketch below). But since this will
be a recurring task, and management has challenged me to do it on top of
the Hadoop cluster, I need a framework or best practice for it.
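
A hedged sketch of that single-machine setup, assuming Lucene 5.x-style
APIs; the index path, field name and candidate query are placeholders, and
it only hints at the term-vector part of the comparison.

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class LocalDedupSearch {
        public static void main(String[] args) throws Exception {
            // open the local-filesystem index (path is a placeholder)
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/data/dedup-index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                QueryParser parser =
                        new QueryParser("name", new StandardAnalyzer());
                // find candidate rows with a fuzzy query on the name field
                TopDocs hits = searcher.search(parser.parse("smith~2"), 10);
                for (ScoreDoc hit : hits.scoreDocs) {
                    // the candidate's term vector feeds the probabilistic
                    // field-by-field comparison
                    Terms nameVector = reader.getTermVector(hit.doc, "name");
                    System.out.println(hit.doc + " score=" + hit.score
                            + " hasVector=" + (nameVector != null));
                }
            }
        }
    }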

I understand that putting a Lucene index on HDFS is not very appropriate,
since HDFS is designed for large block operations. With that
understanding, I use SOLR instead and query it with an HTTP call from the
MapReduce job. The code snippet is below.

            // SOLR_QUERY_URL is a placeholder for the actual Solr select URL
            URL url = new URL(SOLR_QUERY_URL);

            HttpURLConnection connection =
                    (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("GET");
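
For context, this is roughly how that snippet sits inside the mapper,
heavily simplified: one HTTP GET against Solr per input record. The host,
core name, query building and response handling below are placeholders,
not my actual code.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Simplified sketch: each input record triggers one HTTP GET against
    // Solr and the raw JSON response is written out. "solr-host", the
    // "customers" core and the query string are placeholders.
    public class DedupMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String queryUrl = "http://solr-host:8983/solr/customers/select?q="
                    + URLEncoder.encode(value.toString(), "UTF-8") + "&wt=json";
            HttpURLConnection connection =
                    (HttpURLConnection) new URL(queryUrl).openConnection();
            connection.setRequestMethod("GET");
            StringBuilder response = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    response.append(line);
                }
            }
            context.write(value, new Text(response.toString()));
        }
    }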

The latter approach turns out to perform very badly. A simple MapReduce
job that only reads the data sources and writes to HDFS takes 15 minutes,
but once I add the HTTP request the job has already been running for three
hours and is still going.

What went wrong? And what would be a good solution to my problem?

Thanks,

Dino

On Mon, Dec 14, 2015 at 12:30 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> First, what do you mean "run Lucene/Solr on Hadoop"?
>
> You can use the HdfsDirectoryFactory to store Solr/Lucene
> indexes on Hadoop, at that point the actual filesystem
> that holds the index is transparent to the end user, you just
> use Solr as you would if it was using indexes on the local
> file system. See:
> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
>
> If you want to use Map-Reduce to _build_ indexes, see the
> MapReduceIndexerTool in the Solr contrib area.
>
> Best,
> Erick
>




-- 
Regards,

Dino
