Apache Blur (Incubating) has several approaches (Hive, Spark, M/R)
that could probably help with this, ranging from very experimental to
stable.  If you're interested, you can ask over on
blur-u...@incubator.apache.org ...

Thanks,
--tim

On Fri, Dec 25, 2015 at 4:28 AM, Dino Chopins <dino.chop...@gmail.com> wrote:
> Hi Erick,
>
> Thank you for your response and the pointer. What I mean by running
> Lucene/SOLR on Hadoop is having a Lucene/SOLR index that can be queried from
> MapReduce, or via whatever best practice is recommended for that.
>
> I need this mechanism to do large-scale row deduplication. Let me elaborate
> on why I need it:
>
>    1. I have two data sources with 35 and 40 million records of customer
>    profiles - the data come from two systems (SAP and MS CRM).
>    2. I need to index the two data sources and compare them row by row using
>    the name, address, birth date, phone and email fields. Birth date and
>    email use exact comparison, while the other fields use probabilistic
>    comparison (see the sketch after this list). By the way, the data are
>    normalized before they are indexed.
>    3. Each match is grouped under the same person and is deduplicated either
>    automatically or with user intervention, depending on the score.
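>
> To make item 2 concrete, the per-row query I have in mind is roughly the
> following sketch (Lucene 5.x API; the field names, helper name and edit
> distance are just illustrative, not my actual code):
>
>     import org.apache.lucene.index.Term;
>     import org.apache.lucene.search.BooleanClause.Occur;
>     import org.apache.lucene.search.BooleanQuery;
>     import org.apache.lucene.search.FuzzyQuery;
>     import org.apache.lucene.search.Query;
>     import org.apache.lucene.search.TermQuery;
>
>     // Simplified per-row query: birth_date and email must match exactly,
>     // while name contributes a fuzzy (edit-distance) score as a stand-in
>     // for the probabilistic comparison. Field names are placeholders.
>     static Query buildDedupQuery(String birthDate, String email, String name) {
>         BooleanQuery.Builder q = new BooleanQuery.Builder();
>         q.add(new TermQuery(new Term("birth_date", birthDate)), Occur.MUST);
>         q.add(new TermQuery(new Term("email", email)), Occur.MUST);
>         q.add(new FuzzyQuery(new Term("name", name), 2), Occur.SHOULD);
>         return q.build();
>     }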
>
> I usually do this with a Lucene index on the local filesystem and term
> vectors, but since this will be a repeated task, and management has
> challenged me to do it on top of a Hadoop cluster, I need a framework or
> best practice to follow.
>
> I understand that a Lucene index on HDFS is not a great fit, since HDFS is
> designed for large block operations. With that in mind, I use SOLR and query
> it with HTTP calls from the MapReduce job. The code snippet is below.
>
>             url = new URL(SOLR-Query-URL);
>
>             HttpURLConnection connection = (HttpURLConnection) url.openConnection();
>             connection.setRequestMethod("GET");
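>
> In full, the per-row call amounts to a small helper like this (the method
> name and error handling are simplified; SOLR-Query-URL above is the query
> URL passed in here):
>
>     import java.io.BufferedReader;
>     import java.io.IOException;
>     import java.io.InputStreamReader;
>     import java.net.HttpURLConnection;
>     import java.net.URL;
>
>     // Illustrative helper used from the mapper: issues one GET per query
>     // URL and returns the raw Solr response body as a String.
>     static String querySolr(String solrQueryUrl) throws IOException {
>         URL url = new URL(solrQueryUrl);
>         HttpURLConnection connection = (HttpURLConnection) url.openConnection();
>         connection.setRequestMethod("GET");
>         StringBuilder body = new StringBuilder();
>         try (BufferedReader reader = new BufferedReader(
>                 new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
>             String line;
>             while ((line = reader.readLine()) != null) {
>                 body.append(line);
>             }
>         } finally {
>             connection.disconnect();
>         }
>         return body.toString();
>     }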
>
> The latter approach turns out to perform very badly. A simple MapReduce job
> that only reads the data sources and writes to HDFS takes 15 minutes, but
> once I add the HTTP requests it has been running for three hours and counting.
>
> What went wrong? And what would be the solution to my problem?
>
> Thanks,
>
> Dino
>
> On Mon, Dec 14, 2015 at 12:30 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> First, what do you mean "run Lucene/Solr on Hadoop"?
>>
>> You can use the HdfsDirectoryFactory to store Solr/Lucene
>> indexes on Hadoop. At that point the actual filesystem
>> that holds the index is transparent to the end user; you just
>> use Solr as you would if it were using indexes on the local
>> file system. See:
>> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
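>>
>> Roughly, the relevant bits of solrconfig.xml look like this (host, port
>> and path are placeholders; see the page above for the full set of options):
>>
>>     <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
>>       <str name="solr.hdfs.home">hdfs://host:port/path/solr</str>
>>     </directoryFactory>
>>
>>     <!-- and in <indexConfig>, use an HDFS-friendly lock type -->
>>     <lockType>${solr.lock.type:hdfs}</lockType>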
>>
>> If you want to use Map-Reduce to _build_ indexes, see the
>> MapReduceIndexerTool in the Solr contrib area.
>>
>> Best,
>> Erick
>>
>
>
>
>
> --
> Regards,
>
> Dino
