Re: Running Lucene/SOLR on Hadoop

2016-01-04 Thread Tim Williams
Apache Blur (Incubating) has several approaches (Hive, Spark, M/R)
that could probably help with this, ranging from very experimental to
stable.  If you're interested, you can ask over on
blur-u...@incubator.apache.org ...

Thanks,
--tim

On Fri, Dec 25, 2015 at 4:28 AM, Dino Chopins wrote:
> Hi Erick,
>
> Thank you for your response and pointer. What I mean by running Lucene/SOLR
> on Hadoop is having a Lucene/SOLR index available to be queried from
> mapreduce, or by whatever approach is recommended as best practice.
>
> I need this mechanism to do large-scale row deduplication. Let me
> elaborate on why I need this:
>
>    1. I have two data sources with 35 and 40 million customer profile
>       records - the data come from two systems (SAP and MS CRM).
>    2. I need to index the two data sources and compare them row by row
>       using the name, address, birth date, phone and email fields. Birth
>       date and email will use exact comparison; the other fields will use
>       probabilistic comparison. Btw, the data is normalized before it is
>       indexed.
>    3. Each finding will be categorized under the same person, and will be
>       deduplicated automatically or with user intervention, depending on
>       the score.
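
The three steps above can be sketched as a scoring function. This is only
an illustration under stated assumptions: the Customer layout, the equal
field weights and the 0.9 / 0.7 thresholds are hypothetical, and plain edit
distance stands in for whatever probabilistic comparison is actually used.

```java
/** Hypothetical record layout; the field names follow the post above. */
class Customer {
    final String name, address, birthDate, phone, email;

    Customer(String name, String address, String birthDate, String phone, String email) {
        this.name = name;
        this.address = address;
        this.birthDate = birthDate;
        this.phone = phone;
        this.email = email;
    }
}

class DedupScore {
    /** Levenshtein edit distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp; // newest row is now in prev
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]; 1.0 means the strings are identical. */
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / max;
    }

    /** Exact match on birth date and email, fuzzy match on the rest (equal weights assumed). */
    static double score(Customer a, Customer b) {
        double exact = (a.birthDate.equals(b.birthDate) ? 1.0 : 0.0)
                     + (a.email.equals(b.email) ? 1.0 : 0.0);
        double fuzzy = similarity(a.name, b.name)
                     + similarity(a.address, b.address)
                     + similarity(a.phone, b.phone);
        return (exact + fuzzy) / 5.0;
    }

    /** Thresholds are placeholders: merge automatically, ask a user, or keep apart. */
    static String decide(double score) {
        if (score >= 0.9) return "MERGE";
        if (score >= 0.7) return "REVIEW";
        return "DISTINCT";
    }
}
```

With 35 x 40 million rows, comparing every pair is not feasible, so a
function like score() would only be applied to candidate pairs that an
index lookup has already narrowed down.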
>
> I usually do this with a Lucene index on the local filesystem, using term
> vectors. But since this will be a recurring task, and management has
> challenged me to run it on top of a Hadoop cluster, I need a framework
> or best practice for doing so.
>
> I understand that keeping a Lucene index on HDFS is not very appropriate,
> since HDFS is designed for large block operations. With that understanding,
> I use SOLR and hope to query it with an HTTP call from the mapreduce job.
> The code snippet is below.
>
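> import java.net.HttpURLConnection;
> import java.net.URL;
>
> // solrQueryUrl: placeholder for the SOLR query URL
> URL url = new URL(solrQueryUrl);
> HttpURLConnection connection = (HttpURLConnection) url.openConnection();
> connection.setRequestMethod("GET");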
>
> The latter approach turns out to perform very badly. A simple mapreduce job
> that only reads the data sources and writes to HDFS takes 15 minutes, but
> once I add the HTTP request it has been running for three hours and is
> still going.
>
> What went wrong? And what would be the solution to my problem?
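
One plausible contributor (an assumption, not a diagnosis) is paying a full
blocking HTTP round trip for every input record. A common mitigation is to
buffer records and send one request per batch. The sketch below shows only
the buffering and the query construction; BatchingLookup, the batch size
and the "email" field are hypothetical, and OR-ing values like this suits
the exact-match fields rather than the probabilistic ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Buffers lookup keys and hands them to a sink in fixed-size batches. */
class BatchingLookup {
    private final int batchSize;
    private final Consumer<List<String>> sink; // e.g. issues ONE query per batch
    private final List<String> buffer = new ArrayList<>();

    BatchingLookup(int batchSize, Consumer<List<String>> sink) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Adds a key and flushes automatically once the batch is full. */
    void add(String key) {
        buffer.add(key);
        if (buffer.size() >= batchSize) flush();
    }

    /** Sends whatever is buffered; call once more when the input is exhausted. */
    void flush() {
        if (buffer.isEmpty()) return;
        sink.accept(new ArrayList<>(buffer)); // copy, so the sink may keep the list
        buffer.clear();
    }

    /** Builds one query matching any of the values, e.g. email:("a" OR "b"). */
    static String orQuery(String field, List<String> values) {
        StringBuilder sb = new StringBuilder(field).append(":(");
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) sb.append(" OR ");
            String escaped = values.get(i).replace("\\", "\\\\").replace("\"", "\\\"");
            sb.append('"').append(escaped).append('"');
        }
        return sb.append(')').toString();
    }
}
```

In a Hadoop Mapper, add() would be called from map() and the final flush()
from cleanup(), so the last partial batch is not lost.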
>
> Thanks,
>
> Dino
>
> On Mon, Dec 14, 2015 at 12:30 AM, Erick Erickson wrote:
>
>> First, what do you mean "run Lucene/Solr on Hadoop"?
>>
>> You can use the HdfsDirectoryFactory to store Solr/Lucene
>> indexes on Hadoop; at that point the actual filesystem
>> that holds the index is transparent to the end user, and you just
>> use Solr as you would if it were using indexes on the local
>> file system. See:
>> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
>>
>> If you want to use Map-Reduce to _build_ indexes, see the
>> MapReduceIndexerTool in the Solr contrib area.
>>
>> Best,
>> Erick
>>
>
>
>
>
> --
> Regards,
>
> Dino


child doc filter

2016-11-03 Thread Tim Williams
I'm using the BlockJoinQuery to query child docs and return the
parent.  I'd like to have the equivalent of a filter that applies to
child docs, and I don't see a way to do that with the BlockJoin stuff.
It looks like I could modify it to accept some childFilter param and
add a QueryWrapperFilter right after the child query is created[1].
Before I do that, I wanted to see if there's a built-in way to
achieve the same behavior?

Thanks,
--tim

[1] - https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/search/join/BlockJoinParentQParser.java#L69
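
One workaround that needs no parser change is to fold the restriction into
the child query itself as a second required clause, so only children
matching both clauses can select a parent. The sketch below just builds
such a query string for the {!parent} parser; the class name and the field
names in the usage note are made up, and it assumes parentsFilter contains
no double quotes. Unlike a true filter, the extra clause also contributes
to the child score.

```java
/** Hypothetical helper that builds a block-join query with a restricted child side. */
class ChildFilteredBlockJoin {
    /**
     * @param parentsFilter query matching ALL parent documents
     * @param childQuery    the child query proper
     * @param childFilter   extra restriction every matching child must satisfy
     */
    static String parentQuery(String parentsFilter, String childQuery, String childFilter) {
        // Both clauses are marked required (+), so the child sets are intersected.
        return "{!parent which=\"" + parentsFilter + "\"}"
                + "+(" + childQuery + ") +(" + childFilter + ")";
    }
}
```

For example, parentQuery("doc_type:parent", "comment:lucene", "lang:en")
yields {!parent which="doc_type:parent"}+(comment:lucene) +(lang:en).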


Re: REST calls

2010-06-30 Thread Tim Williams
On Wed, Jun 30, 2010 at 12:39 AM, Don Werve wrote:
> 2010/6/27 Jason Chaffee 
>
>> The solr docs say it is RESTful, yet it seems that it doesn't use http
>> headers in a RESTful way.  For example, it doesn't seem to use the Accept:
>> request header to determine the media-type to be returned.  Instead, it
>> requires a query parameter to be used in the URL.  Also, it doesn't seem to
>> return 304 Not Modified if the request header "if-modified-since" is
>> used.
>>
>
> The summary:
>
> Solr is restful, and does a very good job of it.

I'm not so sure...

> The long version:
>
> There is no official 'REST' standard that dictates the behavior of the
> implementation; rather, REST is a set of guidelines on building APIs that
> are both discoverable and easily usable without having to resort to
> third-party libraries.
>
> Generally speaking, an application is RESTful if it provides an API that
> accepts arguments passed as HTTP form variables, returns results in an open
> format (XML, JSON, YAML, etc.), and respects certain semantics relating to
> HTTP verbs; e.g., GET/HEAD return the resource without modification, DELETEs
> are destructive, POST creates a resource, PUT alters it.
>
> Solr meets all of these requirements.

With fairly limited knowledge of Solr (I'm a lucene user), I'd like to
offer an alternate view.

- Solr seems to violate the hypermedia-driven constraint.  (e.g. it
seems not to be hypertext driven at all) [1]

- Solr seems to violate the uniform interface constraint and the
identification of resources constraint (e.g. by having "commands" in
the entity body instead of exposing resources with state that is
manipulated through the standard methods and, I gather, by overloading
methods instead of using standard ones, as with deletes).

I'd conclude Solr is not RESTful.

The representation argument is a bit of a red herring, btw.  Not using
Accept for conneg isn't the problem, using agent-driven negotiation
without being hypertext driven is [from a REST pov].

--tim

[1] - http://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven
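
To make the two header behaviours from the original question concrete,
here is a minimal sketch (not Solr code) of what a server honoring
If-Modified-Since and Accept might compute. The class name, method names
and supported media types are assumptions for illustration, and q-values
in Accept are deliberately ignored.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: header-driven decisions a server could make. */
class HeaderDrivenResponse {
    // Media types this imaginary server can produce (an assumption).
    static final List<String> SUPPORTED = Arrays.asList("application/json", "application/xml");

    /** Returns 304 if the resource is unchanged since the given HTTP date, else 200. */
    static int statusFor(String ifModifiedSince, Instant lastModified) {
        if (ifModifiedSince == null) return 200;
        try {
            Instant since = ZonedDateTime
                    .parse(ifModifiedSince, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            // HTTP dates have one-second resolution, so compare at that precision.
            Instant modified = lastModified.truncatedTo(ChronoUnit.SECONDS);
            return modified.isAfter(since) ? 200 : 304;
        } catch (DateTimeParseException e) {
            return 200; // an unparseable date is ignored
        }
    }

    /** Picks the first supported type listed in Accept; null means 406 Not Acceptable. */
    static String negotiate(String accept) {
        if (accept == null) return SUPPORTED.get(0); // no preference stated
        for (String part : accept.split(",")) {
            String type = part.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            if (type.equals("*/*")) return SUPPORTED.get(0);
            if (SUPPORTED.contains(type)) return type;
        }
        return null;
    }
}
```

Note that this only covers the representation side; it does nothing about
the hypertext-driven constraint discussed above.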


Re: REST calls

2010-06-30 Thread Tim Williams
On Wed, Jun 30, 2010 at 9:17 AM, Jak Akdemir wrote:
> On Wed, Jun 30, 2010 at 7:39 AM, Don Werve wrote:
>
>> 2010/6/27 Jason Chaffee 
>>
>> > The solr docs say it is RESTful, yet it seems that it doesn't use http
>> > headers in a RESTful way.  For example, it doesn't seem to use the Accept:
>> > request header to determine the media-type to be returned.  Instead, it
>> > requires a query parameter to be used in the URL.  Also, it doesn't seem to
>> > return 304 Not Modified if the request header "if-modified-since" is
>> > used.
>> >
>>
>> The summary:
>>
>> Solr is restful, and does a very good job of it.
>>
>> The long version:
>>
>> There is no official 'REST' standard that dictates the behavior of the
>> implementation; rather, REST is a set of guidelines on building APIs that
>> are both discoverable and easily usable without having to resort to
>> third-party libraries.
>>
>> Generally speaking, an application is RESTful if it provides an API that
>> accepts arguments passed as HTTP form variables, returns results in an open
>> format (XML, JSON, YAML, etc.), and respects certain semantics relating to
>> HTTP verbs; e.g., GET/HEAD return the resource without modification, DELETEs
>> are destructive, POST creates a resource, PUT alters it.
>>
>>
> Actually, it is not a constraint to use all four of *GET*, *PUT*, *POST* and
> *DELETE*. To be RESTful, using GET and POST requests is enough, as Roy
> Fielding suggested.
> http://roy.gbiv.com/untangled/2009/it-is-okay-to-use-post

In Roy's post, I'd point out: "POST only becomes an issue when it is
used in a situation for which some other method is ideally suited"
(e.g. DELETE to delete).

Also, GET and POST *could* be enough if and only if you took care to
design your resources properly[1].

--tim

[1] - http://www.amundsen.com/blog/archives/1063
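
To illustrate the difference, here is a small sketch (Java 11+, requests
are only built, never sent) contrasting a delete command tunnelled through
POST with a DELETE on a document resource. The XML body follows Solr's
delete-by-id update message; the per-document URL in the second method is
purely hypothetical, and the id is assumed to need no XML escaping.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Contrasts a command in the entity body with a resource-oriented DELETE. */
class DeleteStyles {
    /** Command in the body: the method says POST, only the payload says "delete". */
    static HttpRequest commandInBody(String updateUrl, String id) {
        String body = "<delete><id>" + id + "</id></delete>";
        return HttpRequest.newBuilder(URI.create(updateUrl))
                .header("Content-Type", "text/xml")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    /** Resource plus standard method: the document is addressed by its own URL. */
    static HttpRequest resourceDelete(String documentUrl) {
        return HttpRequest.newBuilder(URI.create(documentUrl)).DELETE().build();
    }
}
```

In the first form an intermediary sees only an opaque POST; in the second,
the method itself states that the addressed resource is being removed.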


Re: job ads ok?

2009-05-28 Thread Tim Williams
You might send it to j...@apache.org too.

--tim

On Thu, May 28, 2009 at 1:44 PM, Jodi Showers wrote:
> I'll take that as a positive.
>
> Homestars is looking for a contract developer to aid us with our present
> solr (localsolr) install.
>
> The problems relate to boosting the resulting sort order. In addition to
> this specific challenge, there are features we'd like to add to the current
> localsolr api.
>
> If you think you could help us please email me : jodi at homestars.com
>
> thanks.
> Jodi
>
> On 28-May-09, at 1:24 PM, Bill Fowler wrote:
>
>> No, that would be a violation of netiquette.  Please just send them
>> directly to me.
>>
>> On Thu, May 28, 2009 at 10:06 AM, Jodi Showers  wrote:
>>
>>> Greetings,
>>>
>>> Is it ok to post solr job ads here?
>>>
>>> thanks.
>>> Jodi
>>>
>
>