RE: Solr 6 Distributed Join

Akiel Ahmed Tue, 22 Dec 2015 02:57:01 -0800

Hi,

I tried a straightforward join against a document that is connected to 
many others but didn't get the results I expected. I wanted to check 
whether my expectations are off, and whether I can do anything in Solr to 
get what I want. Given the data:


id,type,e1,e2,text
1,ABC,,,John Smith
2,ABC,,,Jane Doe
3,DEF,1,2,1
4,DEF,1,2,2
5,DEF,1,2,4
6,DEF,1,2,8

and the query

http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(
  search(gettingstarted, fl="id", q=text:John,
         sort="id asc", zkHost="localhost:9983", qt="/export"),
  search(gettingstarted, fl="id,e1", q=type:DEF,
         sort="id asc", zkHost="localhost:9983", qt="/export"),
  on="id=e1")

I expected

{"result-set":{"docs":[
{"e1":"1","id":"3"},
{"e1":"1","id":"4"},
{"e1":"1","id":"5"},
{"e1":"1","id":"6"},
{"EOF":true,"RESPONSE_TIME":56}]}}

but instead I got 

{"result-set":{"docs":[
{"e1":"1","id":"3"},
{"EOF":true,"RESPONSE_TIME":58}]}}

Deleting the document with id 3, and rerunning the query (see above) 
returned 

{"result-set":{"docs":[
{"e1":"1","id":"4"},
{"EOF":true,"RESPONSE_TIME":56}]}}

So it looks like the join returns only the first document that matches 
each join key. Is this expected behaviour? If so, is there anything I can 
do to make Solr return all the documents it is connected to?
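
For reference, the expected result above is what a textbook sort-merge inner
join produces when one left-hand row matches several right-hand rows. The
Python sketch below illustrates that expectation only; it is not Solr's
implementation. The function name merge_join and the row dictionaries are
made up for illustration, and the left-hand input assumes that q=text:John
matches only document 1.

```python
def merge_join(left, right, left_key, right_key):
    """Textbook sort-merge inner join.

    Both inputs must already be sorted on their respective join keys.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][right_key] == lk:
                j_end += 1
            # Pair every left row with this key with every right row in the run.
            while i < len(left) and left[i][left_key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


# Rows mirroring the sample data: text:John matches id 1, type:DEF matches ids 3-6.
left = [{"id": "1"}]
right = [
    {"id": "3", "e1": "1"},
    {"id": "4", "e1": "1"},
    {"id": "5", "e1": "1"},
    {"id": "6", "e1": "1"},
]

rows = merge_join(left, right, "id", "e1")
for row in rows:
    print(row)
```

Because the right-hand fields are merged last, each joined row carries the id
of the DEF document, which is the shape of the result I expected.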

Cheers

Akiel
----- Forwarded by Akiel Ahmed/UK/IBM on 22/12/2015 10:47 -----

From:   Akiel Ahmed/UK/IBM
To:     [email protected]
Date:   21/12/2015 11:16
Subject:        Re: Solr 6 Distributed Join


Thank you for the help. 

I am working through what I want to do with the join - will let you know 
if I hit any issues.



From:   Joel Bernstein <[email protected]>
To:     [email protected]
Date:   17/12/2015 15:40
Subject:        Re: Solr 6 Distributed Join



One thing to note about the hashJoin is that it requires the search results
from the hashed query to fit entirely in memory.

The innerJoin does not have this requirement as it performs a streaming
merge join.
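
To make the memory point concrete, here is a minimal Python sketch of the
general hash-join technique. It is an illustration only, not Solr's code: the
function name hash_join and the sample restaurant and review rows are
hypothetical.

```python
from collections import defaultdict


def hash_join(stream, hashed, key):
    """Generic hash join: load the hashed side into memory, stream the other."""
    table = defaultdict(list)
    for row in hashed:
        # The entire hashed side is held in this dictionary at once.
        table[row[key]].append(row)
    for row in stream:
        # The streamed side is consumed one row at a time and needs no sorting.
        for match in table.get(row[key], []):
            yield {**match, **row}


# Hypothetical sample rows; the streamed side is deliberately unsorted.
restaurants = [{"restaurantId": "r1", "restaurantName": "Example Diner"}]
reviews = [
    {"restaurantId": "r1", "score": 5},
    {"restaurantId": "r2", "score": 3},
    {"restaurantId": "r1", "score": 4},
]

joined = list(hash_join(reviews, restaurants, "restaurantId"))
for row in joined:
    print(row)
```

The dictionary is the reason the hashed side has to fit in memory, while a
merge join only ever needs the current rows of two sorted streams.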

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Dec 17, 2015 at 10:33 AM, Joel Bernstein <[email protected]> 
wrote:

> Below is an example of nested joins where the innerJoin is done in
> parallel using the parallel function. The partitionKeys parameter needs to
> be added to the searches when the parallel function is used to partition
> the results across worker nodes.
>
> hashJoin(
>     parallel(workerCollection,
>         innerJoin(
>             search(users, q="*:*", fl="userId, full_name, hometown",
>                    sort="userId asc", zkHost="zk2:2345", qt="/export",
>                    partitionKeys="userId"),
>             search(reviews, q="*:*", fl="userId, restaurantId, review, score",
>                    sort="userId asc", zkHost="zk1:2345", qt="/export",
>                    partitionKeys="userId"),
>             on="userId"
>         ),
>         workers="20",
>         zkHost="zk1:2345",
>         sort="userId asc"
>     ),
>     hashed=search(restaurants, q="city:nyc", fl="restaurantId, restaurantName",
>                   sort="restaurantId asc", zkHost="zk1:2345", qt="/export"),
>     on="restaurantId"
> )
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Dec 17, 2015 at 10:29 AM, Joel Bernstein <[email protected]>
> wrote:
>
>> The innerJoin joins two streams sorted by the same join keys (merge
>> join). If a third stream has the same join keys you can nest innerJoins.
>> But all three streams need to be sorted by the same join keys to nest
>> innerJoins (merge joins).
>>
>> innerJoin(innerJoin(...),
>>                 search(...),
>>                 on...)
>>
>> If the third stream is joined on a different key you can nest inside a
>> hashJoin, which doesn't require streams to be sorted on the join key.
>> For example:
>>
>> hashJoin(innerJoin(...),
>>                 hashed=search(...),
>>                 on..)
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Thu, Dec 17, 2015 at 9:28 AM, Akiel Ahmed <[email protected]> wrote:
>>
>>> Hi again,
>>>
>>> I got the join to work. A team mate pointed out that one of the search
>>> functions in the innerJoin query was missing the field used in the join.
>>> Adding the e1 field to the fl parameter of the second search function gave
>>> the result I expected:
>>>
>>>
>>> 
>>> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(
>>>   search(gettingstarted, fl="id", q=text:John,
>>>          sort="id asc", zkHost="localhost:9983", qt="/export"),
>>>   search(gettingstarted, fl="id,e1", q=text:Friends,
>>>          sort="id asc", zkHost="localhost:9983", qt="/export"),
>>>   on="id=e1")
>>>
>>> I am still interested in whether we can specify a join using an
>>> arbitrary number of searches.
>>>
>>> Cheers
>>>
>>> Akiel
>>>
>>>
>>>
>>> From:   Akiel Ahmed/UK/IBM@IBMGB
>>> To:     [email protected]
>>> Date:   16/12/2015 17:05
>>> Subject:        Re: Solr 6 Distributed Join
>>>
>>>
>>>
>>> Hi Dennis,
>>>
>>> Thank you for your help. I used your explanation to construct an innerJoin
>>> query; I think I am getting further but didn't get the results I expected.
>>>
>>> The following describes what I did - is there any chance you can tell
>>> where I am going wrong?
>>>
>>> Solr 6 Developer Builds: #2738 and #2743
>>>
>>> 1. Modified server/solr/configsets/basic_configs/conf/managed-schema so it
>>> reads:
>>>
>>> <?xml version="1.0" encoding="UTF-8" ?>
>>> <schema name="search" version="1.5">
>>>   <uniqueKey>id</uniqueKey>
>>>   <field name="id" type="id" indexed="true" stored="true"
>>>          required="true" multiValued="false" docValues="true"/>
>>>   <field name="_version_" type="solr_version" indexed="true" stored="true"
>>>          required="false" multiValued="false" docValues="true"/>
>>>   <field name="type" type="id" indexed="true" stored="true"
>>>          required="false" multiValued="false" docValues="true"/>
>>>   <field name="e1" type="id" indexed="true" stored="true"
>>>          required="false" multiValued="false" docValues="true"/>
>>>   <field name="e2" type="id" indexed="true" stored="true"
>>>          required="false" multiValued="false" docValues="true"/>
>>>   <field name="text" type="free_text" indexed="true" stored="true"
>>>          required="false" multiValued="false"/>
>>>   <fieldType name="id" class="solr.StrField" sortMissingLast="true"/>
>>>   <fieldType name="solr_version" class="solr.TrieLongField"
>>>              precisionStep="0" positionIncrementGap="0"/>
>>>   <fieldType name="free_text" class="solr.TextField"
>>>              positionIncrementGap="100">
>>>     <analyzer>
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>               generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>               catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
>>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>               words="lang/stopwords_en.txt"/>
>>>     </analyzer>
>>>   </fieldType>
>>> </schema>
>>>
>>> 2. Modified server/solr/configsets/basic_configs/conf/solrconfig.xml,
>>> adding the following near the bottom of the file so it is the last request
>>> handler:
>>>
>>>   <requestHandler name="/stream" class="solr.StreamHandler">
>>>         <lst name="invariants">
>>>                 <str name="wt">json</str>
>>>                 <str name="distrib">false</str>
>>>         </lst>
>>>   </requestHandler>
>>>
>>> 3. Used solr -e cloud to set up a SolrCloud instance, picking all the
>>> defaults except that I chose basic_configs
>>>
>>> 4. After Solr was running I ingested the following data via the Solr Web
>>> UI (/update handler, Document Type = CSV):
>>> id,type,e1,e2,text
>>> 1,ABC,,,John Smith
>>> 2,ABC,,,Jane Smith
>>> 3,ABC,,,MiKe Smith
>>> 4,ABC,,,John Doe
>>> 5,ABC,,,Jane Doe
>>> 6,ABC,,,MiKe Doe
>>> 7,ABC,,,John Smith
>>> 8,DEF,,,Chicken Burger
>>> 9,DEF,,,Veggie Burger
>>> 10,DEF,,,Beef Burger
>>> 11,DEF,,,Chicken Donar
>>> 12,DEF,,,Chips
>>> 13,DEF,,,Drink
>>> 20,GHI,1,2,Friends
>>> 21,GHI,3,4,Friends
>>> 22,GHI,5,6,Friends
>>> 23,GHI,7,6,Friends
>>> 24,GHI,6,4,Friends
>>> 25,JKL,1,8,Order
>>> 26,JKL,2,9,Order
>>> 27,JKL,3,10,Order
>>> 28,JKL,4,11,Order
>>> 29,JKL,5,12,Order
>>> 30,JKL,6,13,Order
>>>
>>> 5. Navigating to the following URL in a browser returned an expected
>>> result:
>>> http://localhost:8983/solr/gettingstarted/select?q={!join from=id
>>> to=e1}text:John&fl="id"
>>>
>>> <response>
>>> ...
>>>   <result>
>>>     <doc>
>>>       <str name="id">20</str>
>>>       <str name="e1">1</str>
>>>       <str name="e2">2</str>
>>>       ...
>>>     </doc>
>>>     <doc>
>>>       <str name="id">28</str>
>>>       <str name="e1">4</str>
>>>       <str name="e2">11</str>
>>>       ...
>>>     </doc>
>>>     <doc>
>>>       <str name="id">23</str>
>>>       <str name="e1">7</str>
>>>       <str name="e2">6</str>
>>>       ...
>>>     </doc>
>>>   </result>
>>> </response>
>>>
>>> 6. Navigating to the following URL in a browser does NOT return what I
>>> expected:
>>>
>>> 
>>> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(
>>>   search(gettingstarted, fl="id", q=text:John,
>>>          sort="id asc", zkHost="localhost:9983", qt="/export"),
>>>   search(gettingstarted, fl="id", q=text:Friends,
>>>          sort="id asc", zkHost="localhost:9983", qt="/export"),
>>>   on="id=e1")
>>>
>>> {"result-set":{"docs":[
>>> {"EOF":true,"RESPONSE_TIME":124}]}}
>>>
>>>
>>> I also have a join-related question. Is there any chance I can specify a
>>> query and join for more than two things? For example:
>>>
>>> innerJoin(search(gettingstarted, fl="id", q=text:John, ...) as s1,
>>>           search(gettingstarted, fl="id", q=text:Chicken, ...) as s2,
>>>           search(gettingstarted, fl="id", q=text:Friends, ...) as s3,
>>>           on="s1.id=s3.e1",
>>>           on="s2.id=s3.e2")
>>>
>>> Sorry if the query does not make sense, but given the data above my
>>> intention is to find a single result made up of 3 documents:
>>> s1.id=1,s2.id=8,s3.id=25
>>> Is that possible? If yes, will Solr 6 support an arbitrary number of
>>> queries and associated joins?
>>>
>>> Cheers
>>>
>>> Akiel
>>>
>>>
>>>
>>> From:   Dennis Gove <[email protected]>
>>> To:     Akiel Ahmed/UK/IBM@IBMGB, [email protected]
>>> Date:   11/12/2015 15:34
>>> Subject:        Re: Solr 6 Distributed Join
>>>
>>>
>>>
>>> Akiel,
>>>
>>> Without seeing your full URL I assume that you're missing the
>>> stream=innerJoin(.....) part of it. A full sample URL would look like this:
>>>
>>> http://localhost:8983/solr/careers/stream?stream=innerJoin(
>>>   search(careers, fl="personId,companyId,title", q=companyId:*,
>>>          sort="companyId asc", zkHost="localhost:2181", qt="/export"),
>>>   search(companies, fl="id,companyName", q=*:*,
>>>          sort="id asc", zkHost="localhost:2181", qt="/export"),
>>>   on="companyId=id")
>>>
>>> This example will return a join of career records with the company name
>>> for all career records with a non-null companyId.
>>>
>>> And the pieces have the following meaning:
>>>
>>> http://localhost:8983/solr/careers/stream?  - you have a collection called
>>> careers available on localhost:8983 and you're hitting its stream handler.
>>>
>>> ?stream=  - you are passing the stream parameter to the stream handler.
>>>
>>> zkHost="localhost:2181"  - there is a ZooKeeper instance running on
>>> localhost:2181 where Solr can get clusterstate information. Note that since
>>> you're sending the request to the careers collection, this param is not
>>> required in the search(careers....) part but is required in the
>>> search(companies....) part. For simplicity I usually just provide it for all.
>>>
>>> qt="/export"  - tells Solr to use the export handler. This assumes all your
>>> fields are in docValues. If you'd rather not use the export handler then you
>>> probably want to provide the rows=##### param to tell Solr to return a large
>>> number of rows for each underlying search. Without it Solr will default
>>> to, I believe, 10 rows.
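
The pieces described above can also be assembled programmatically rather than
pasted into a browser. The following Python sketch uses only the standard
library to percent-encode the expression before sending it; the helper name
stream_url is hypothetical, and the host, collection, and expression are
copied from the sample URL in this message.

```python
from urllib.parse import urlencode


def stream_url(base, collection, expression):
    """Build a /stream request URL with the expression percent-encoded."""
    return f"{base}/{collection}/stream?" + urlencode({"stream": expression})


# The streaming expression from the sample URL, split across lines for readability.
expr = (
    'innerJoin('
    'search(careers, fl="personId,companyId,title", q=companyId:*, '
    'sort="companyId asc", zkHost="localhost:2181", qt="/export"),'
    'search(companies, fl="id,companyName", q=*:*, '
    'sort="id asc", zkHost="localhost:2181", qt="/export"),'
    'on="companyId=id")'
)

url = stream_url("http://localhost:8983/solr", "careers", expr)
print(url)
```

Encoding the expression avoids relying on the browser to escape the spaces
and quotes in the sort and fl parameters.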
>>>
>>> CCing the user list so others can see this as well.
>>>
>>> We're working on additional documentation for Streaming Aggregation and
>>> Expressions. The page can be found at
>>> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>>> but it's missing a lot of things we've added recently.
>>>
>>> - Dennis
>>>
>>> On Fri, Dec 11, 2015 at 9:51 AM, Akiel Ahmed <[email protected]>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > Sorry, this is out of the blue - I have joined the Solr mailing list, but
>>> > I don't know if that is the correct place to ask my question. If you are
>>> > not the best person to talk to, can you please point me in the right
>>> > direction?
>>> >
>>> > I want to try using the Solr 6 distributed joins but can't find enough
>>> > material on the web to make it work. I have added the stream handler to my
>>> > solrconfig.xml (see below) and when issuing an inner join query (see below)
>>> > I get an error - the localparam named stream is missing so I get a
>>> > NullPointerException. Is there a way to play with the join via the Solr web
>>> > UI, or if not, do you have a code snippet via a SolrJ client that performs
>>> > a join?
>>> >
>>> > solrconfig.xml
>>> >
>>> > <requestHandler name="/stream" class="solr.StreamHandler">
>>> >         <lst name="invariants">
>>> >                 <str name="wt">json</str>
>>> >                 <str name="distrib">false</str>
>>> >         </lst>
>>> > </requestHandler>
>>> >
>>> > query
>>> > innerJoin(
>>> >         search(getting_started, _search_field:john),
>>> >         search(getting_started, _search_field:friends),
>>> >         on="id=_link_from_id")
>>> >
>>> > Cheers
>>> >
>>> > Akiel
>>> > Unless stated otherwise above:
>>> > IBM United Kingdom Limited - Registered in England and Wales with number
>>> > 741598.
>>> > Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>>> >
>>>
>>>
>>
>

