Re: Solr 6 Distributed Join

Akiel Ahmed Thu, 24 Dec 2015 00:28:01 -0800

Hi

Did you get a chance to check whether one-to-many joins were covered in 
your tests? If yes, can you make any suggestions for what I could be doing 
wrong?


Cheers

Akiel



From:   Joel Bernstein <joels...@gmail.com>
To:     solr-user@lucene.apache.org
Date:   22/12/2015 13:03
Subject:        Re: Solr 6 Distributed Join



Just did a quick review of the InnerJoinStream and it appears that it
should handle one-to-one, one-to-many, many-to-one and many-to-many joins.
It will take a closer review of the tests to see if all these cases are
covered. So the innerJoin is designed to handle the case you describe. If
it doesn't work properly it makes sense to file a bug report.

Joel Bernstein
http://joelsolr.blogspot.com/
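
For readers following the thread, here is a minimal Python sketch of how a streaming merge join can produce every match in a one-to-many case. It is not the InnerJoinStream source; the function name merge_join and the choice to let right-hand fields overwrite left-hand fields on a name clash are assumptions, chosen only so the output resembles the expected result quoted below. Both inputs are assumed to be sorted by their join key already.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, str]


def merge_join(left: Iterable[Row], right: Iterable[Row],
               left_key: str, right_key: str) -> Iterator[Row]:
    """Inner merge join of two inputs that are each sorted by their join key."""
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None and r is not None:
        if l[left_key] < r[right_key]:
            l = next(left_it, None)      # left is behind, advance it
        elif l[left_key] > r[right_key]:
            r = next(right_it, None)     # right is behind, advance it
        else:
            key = l[left_key]
            # Buffer every right row sharing this key (the "many" side).
            group: List[Row] = []
            while r is not None and r[right_key] == key:
                group.append(r)
                r = next(right_it, None)
            # Pair each left row with this key against the whole group.
            while l is not None and l[left_key] == key:
                for match in group:
                    yield {**l, **match}  # assumption: right fields win
                l = next(left_it, None)


# Data from the message quoted below: one person, four DEF rows with e1=1.
people = [{"id": "1"}]
links = [{"id": "3", "e1": "1"}, {"id": "4", "e1": "1"},
         {"id": "5", "e1": "1"}, {"id": "6", "e1": "1"}]
joined = list(merge_join(people, links, "id", "e1"))
```

With this sketch the join yields four rows (ids 3, 4, 5 and 6), which is the one-to-many result the original poster expected. A join that stops after the first row with a matching key would return only id 3.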

On Tue, Dec 22, 2015 at 5:55 AM, Akiel Ahmed <ahmed...@uk.ibm.com> wrote:

> Hi,
>
> I tried a straightforward join against something that is connected to
> many things but didn't get the results I expected. I wanted to check
> whether my expectations are off, and whether I can do anything in Solr
> to do what I want. So given the data:
>
> id,type,e1,e2,text
> 1,ABC,,,John Smith
> 2,ABC,,,Jane Doe
> 3,DEF,1,2,1
> 4,DEF,1,2,2
> 5,DEF,1,2,4
> 6,DEF,1,2,8
>
> and the query
>
> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted,
> fl="id", q=text:John, sort="id asc",zkHost="localhost:9983",qt="/export"),
> search(gettingstarted, fl="id,e1", q=type:DEF, sort="id asc",
> zkHost="localhost:9983",qt="/export"), on="id=e1")
>
> I expected
>
> {"result-set":{"docs":[
> {"e1":"1","id":"3"},
> {"e1":"1","id":"4"},
> {"e1":"1","id":"5"},
> {"e1":"1","id":"6"},
> {"EOF":true,"RESPONSE_TIME":56}]}}
>
> but instead I got
>
> {"result-set":{"docs":[
> {"e1":"1","id":"3"},
> {"EOF":true,"RESPONSE_TIME":58}]}}
>
> Deleting the document with id 3, and rerunning the query (see above)
> returned
>
> {"result-set":{"docs":[
> {"e1":"1","id":"4"},
> {"EOF":true,"RESPONSE_TIME":56}]}}
>
> So it looks like the join returns only the first thing it can join on.
> Is this expected behaviour? If so, is there anything I can do to convince
> Solr to return all the things it is connected to?
>
> Cheers
>
> Akiel
> ----- Forwarded by Akiel Ahmed/UK/IBM on 22/12/2015 10:47 -----
>
> From:   Akiel Ahmed/UK/IBM
> To:     solr-user@lucene.apache.org
> Date:   21/12/2015 11:16
> Subject:        Re: Solr 6 Distributed Join
>
>
> Thank you for the help.
>
> I am working through what I want to do with the join - will let you know
> if I hit any issues.
>
>
>
> From:   Joel Bernstein <joels...@gmail.com>
> To:     solr-user@lucene.apache.org
> Date:   17/12/2015 15:40
> Subject:        Re: Solr 6 Distributed Join
>
>
>
> One thing to note about the hashJoin is that it requires the search
> results from the hashed query to fit entirely in memory.
>
> The innerJoin does not have this requirement as it performs a streaming
> merge join.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
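
To illustrate the memory trade-off described above, here is a minimal Python sketch of a hash join. It is not Solr code; hash_join, the sample reviews and the restaurant name are all hypothetical. The "hashed" side is read completely into an in-memory dictionary before the other side is streamed past it, which is why that side has to fit in memory while the streamed side needs no particular sort order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, str]


def hash_join(stream: Iterable[Row], hashed: Iterable[Row],
              key: str) -> Iterator[Row]:
    """Inner hash join: load `hashed` fully into memory, then stream `stream`."""
    table: Dict[str, List[Row]] = defaultdict(list)
    for row in hashed:                 # whole hashed side is held in memory
        table[row[key]].append(row)
    for row in stream:                 # streamed side can arrive in any order
        for match in table.get(row[key], []):
            yield {**row, **match}


# Hypothetical data: the streamed side is deliberately unsorted on the key.
reviews = [
    {"userId": "u2", "restaurantId": "r9", "score": "4"},
    {"userId": "u1", "restaurantId": "r1", "score": "5"},
    {"userId": "u3", "restaurantId": "r1", "score": "2"},
]
restaurants = [{"restaurantId": "r1", "restaurantName": "Example Diner"}]
joined = list(hash_join(reviews, restaurants, "restaurantId"))
```

The merge join avoids the in-memory table, but only because both inputs arrive sorted on the join key.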
>
> On Thu, Dec 17, 2015 at 10:33 AM, Joel Bernstein <joels...@gmail.com>
> wrote:
>
> > Below is an example of nested joins where the innerJoin is done in
> > parallel using the parallel function. The partitionKeys parameter needs
> > to be added to the searches when the parallel function is used to
> > partition the results across worker nodes.
> >
> > hashJoin(
> >   parallel(workerCollection,
> >     innerJoin(
> >       search(users, q="*:*", fl="userId, full_name, hometown",
> >              sort="userId asc", zkHost="zk2:2345", qt="/export",
> >              partitionKeys="userId"),
> >       search(reviews, q="*:*", fl="userId, restaurantId, review, score",
> >              sort="userId asc", zkHost="zk1:2345", qt="/export",
> >              partitionKeys="userId"),
> >       on="userId"
> >     ),
> >     workers="20",
> >     zkHost="zk1:2345",
> >     sort="userId asc"
> >   ),
> >   hashed=search(restaurants, q="city:nyc", fl="restaurantId, restaurantName",
> >                 sort="restaurantId asc", zkHost="zk1:2345", qt="/export"),
> >   on="restaurantId"
> > )
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Dec 17, 2015 at 10:29 AM, Joel Bernstein <joels...@gmail.com>
> > wrote:
> >
> >> The innerJoin joins two streams sorted by the same join keys (merge
> >> join). If a third stream has the same join keys you can nest innerJoins.
> >> But all three tables need to be sorted by the same join keys to nest
> >> innerJoins (merge joins).
> >>
> >> innerJoin(innerJoin(...),
> >>                 search(...),
> >>                 on...)
> >>
> >> If the third stream is joined on a different key you can nest inside a
> >> hashJoin, which doesn't require streams to be sorted on the join key.
> >> For example:
> >>
> >> hashJoin(innerJoin(...),
> >>                 hashed=search(...),
> >>                 on...)
> >>
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >> On Thu, Dec 17, 2015 at 9:28 AM, Akiel Ahmed <ahmed...@uk.ibm.com>
> >> wrote:
> >>
> >>> Hi again,
> >>>
> >>> I got the join to work. A teammate pointed out that one of the search
> >>> functions in the innerJoin query was missing a field used in the join;
> >>> adding the e1 field to the fl parameter of the second search function
> >>> gave the result I expected:
> >>>
> >>>
> >>> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted,
> >>> fl="id", q=text:John, sort="id asc",zkHost="localhost:9983",qt="/export"),
> >>> search(gettingstarted, fl="id,e1", q=text:Friends, sort="id asc",
> >>> zkHost="localhost:9983",qt="/export"), on="id=e1")
> >>>
> >>> I am still interested in whether we can specify a join using an
> >>> arbitrary number of searches.
> >>>
> >>> Cheers
> >>>
> >>> Akiel
> >>>
> >>>
> >>>
> >>> From:   Akiel Ahmed/UK/IBM@IBMGB
> >>> To:     solr-user@lucene.apache.org
> >>> Date:   16/12/2015 17:05
> >>> Subject:        Re: Solr 6 Distributed Join
> >>>
> >>>
> >>>
> >>> Hi Dennis,
> >>>
> >>> Thank you for your help. I used your explanation to construct an
> >>> innerJoin query; I think I am getting further but didn't get the
> >>> results I expected.
> >>>
> >>> The following describes what I did. Is there any chance you can tell
> >>> where I am going wrong?
> >>>
> >>> Solr 6 Developer Builds: #2738 and #2743
> >>>
> >>> 1. Modified server/solr/configsets/basic_configs/conf/managed-schema so
> >>> it reads:
> >>>
> >>> <?xml version="1.0" encoding="UTF-8" ?>
> >>> <schema name="search" version="1.5">
> >>>   <uniqueKey>id</uniqueKey>
> >>>   <field name="id" type="id" indexed="true" stored="true"
> >>>          required="true" multiValued="false" docValues="true"/>
> >>>   <field name="_version_" type="solr_version" indexed="true"
> >>>          stored="true" required="false" multiValued="false"
> >>>          docValues="true"/>
> >>>   <field name="type" type="id" indexed="true" stored="true"
> >>>          required="false" multiValued="false" docValues="true"/>
> >>>   <field name="e1" type="id" indexed="true" stored="true"
> >>>          required="false" multiValued="false" docValues="true"/>
> >>>   <field name="e2" type="id" indexed="true" stored="true"
> >>>          required="false" multiValued="false" docValues="true"/>
> >>>   <field name="text" type="free_text" indexed="true" stored="true"
> >>>          required="false" multiValued="false"/>
> >>>   <fieldType name="id" class="solr.StrField" sortMissingLast="true"/>
> >>>   <fieldType name="solr_version" class="solr.TrieLongField"
> >>>              precisionStep="0" positionIncrementGap="0"/>
> >>>   <fieldType name="free_text" class="solr.TextField"
> >>>              positionIncrementGap="100">
> >>>     <analyzer>
> >>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>       <filter class="solr.LowerCaseFilterFactory"/>
> >>>       <filter class="solr.WordDelimiterFilterFactory"
> >>>               generateWordParts="1" generateNumberParts="1"
> >>>               catenateWords="1" catenateNumbers="1" catenateAll="1"
> >>>               splitOnCaseChange="1"/>
> >>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>               words="lang/stopwords_en.txt"/>
> >>>     </analyzer>
> >>>   </fieldType>
> >>> </schema>
> >>>
> >>> 2. Modified server/solr/configsets/basic_configs/conf/solrconfig.xml,
> >>> adding the following near the bottom of the file so it is the last
> >>> request handler:
> >>>
> >>>   <requestHandler name="/stream" class="solr.StreamHandler">
> >>>         <lst name="invariants">
> >>>                 <str name="wt">json</str>
> >>>                 <str name="distrib">false</str>
> >>>         </lst>
> >>>   </requestHandler>
> >>>
> >>> 3. Used solr -e cloud to set up a SolrCloud instance, picking all the
> >>> defaults except that I chose basic_configs.
> >>>
> >>> 4. After Solr was running I ingested the following data via the Solr
> >>> Web UI (/update handler, Document Type = CSV):
> >>> id,type,e1,e2,text
> >>> 1,ABC,,,John Smith
> >>> 2,ABC,,,Jane Smith
> >>> 3,ABC,,,MiKe Smith
> >>> 4,ABC,,,John Doe
> >>> 5,ABC,,,Jane Doe
> >>> 6,ABC,,,MiKe Doe
> >>> 7,ABC,,,John Smith
> >>> 8,DEF,,,Chicken Burger
> >>> 9,DEF,,,Veggie Burger
> >>> 10,DEF,,,Beef Burger
> >>> 11,DEF,,,Chicken Donar
> >>> 12,DEF,,,Chips
> >>> 13,DEF,,,Drink
> >>> 20,GHI,1,2,Friends
> >>> 21,GHI,3,4,Friends
> >>> 22,GHI,5,6,Friends
> >>> 23,GHI,7,6,Friends
> >>> 24,GHI,6,4,Friends
> >>> 25,JKL,1,8,Order
> >>> 26,JKL,2,9,Order
> >>> 27,JKL,3,10,Order
> >>> 28,JKL,4,11,Order
> >>> 29,JKL,5,12,Order
> >>> 30,JKL,6,13,Order
> >>>
> >>> 5. Navigating to the following URL in a browser returned an expected
> >>> result:
> >>> http://localhost:8983/solr/gettingstarted/select?q={!join from=id
> >>> to=e1}text:John&fl="id"
> >>>
> >>> <response>
> >>> ...
> >>>   <result>
> >>>     <doc>
> >>>       <str name="id">20</str>
> >>>       <str name="e1">1</str>
> >>>       <str name="e2">2</str>
> >>>       ...
> >>>     </doc>
> >>>     <doc>
> >>>       <str name="id">28</str>
> >>>       <str name="e1">4</str>
> >>>       <str name="e2">11</str>
> >>>       ...
> >>>     </doc>
> >>>     <doc>
> >>>       <str name="id">23</str>
> >>>       <str name="e1">7</str>
> >>>       <str name="e2">6</str>
> >>>       ...
> >>>     </doc>
> >>>   </result>
> >>> </response>
> >>>
> >>> 6. Navigating to the following URL in a browser does NOT return what I
> >>> expected:
> >>>
> >>> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted,
> >>> fl="id", q=text:John, sort="id asc",zkHost="localhost:9983",qt="/export"),
> >>> search(gettingstarted, fl="id", q=text:Friends, sort="id asc",
> >>> zkHost="localhost:9983",qt="/export"), on="id=e1")
> >>>
> >>> {"result-set":{"docs":[
> >>> {"EOF":true,"RESPONSE_TIME":124}]}}
> >>>
> >>>
> >>> I also have a join-related question. Is there any chance I can specify
> >>> a query and join for more than 2 things? For example:
> >>>
> >>> innerJoin(search(gettingstarted, fl="id", q=text:John, ...) as s1,
> >>>           search(gettingstarted, fl="id", q=text:Chicken, ...) as s2,
> >>>           search(gettingstarted, fl="id", q=text:Friends, ...) as s3,
> >>>           on="s1.id=s3.e1",
> >>>           on="s2.id=s3.e2")
> >>>
> >>> Sorry if the query does not make sense, but given the data above my
> >>> intention is to find a single result made up of 3 documents:
> >>> s1.id=1,s2.id=8,s3.id=25
> >>> Is that possible? If yes, will Solr 6 support an arbitrary number of
> >>> queries and associated joins?
> >>>
> >>> Cheers
> >>>
> >>> Akiel
> >>>
> >>>
> >>>
> >>> From:   Dennis Gove <dpg...@gmail.com>
> >>> To:     Akiel Ahmed/UK/IBM@IBMGB, solr-user@lucene.apache.org
> >>> Date:   11/12/2015 15:34
> >>> Subject:        Re: Solr 6 Distributed Join
> >>>
> >>>
> >>>
> >>> Akiel,
> >>>
> >>> Without seeing your full url I assume that you're missing the
> >>> stream=innerJoin(.....) part of it. A full sample url would look like
> >>> this:
> >>>
> >>> http://localhost:8983/solr/careers/stream?stream=innerJoin(search(careers,
> >>> fl="personId,companyId,title", q=companyId:*, sort="companyId asc",
> >>> zkHost="localhost:2181",qt="/export"),search(companies,
> >>> fl="id,companyName", q=*:*, sort="id asc",
> >>> zkHost="localhost:2181",qt="/export"),on="companyId=id")
> >>>
> >>> This example will return a join of career records with the company
> >>> name for all career records with a non-null companyId.
> >>>
> >>> And the pieces have the following meaning:
> >>>
> >>> http://localhost:8983/solr/careers/stream?  - you have a collection
> >>> called careers available on localhost:8983 and you're hitting its
> >>> stream handler.
> >>>
> >>> ?stream=  - you are passing the stream parameter to the stream handler.
> >>>
> >>> zkHost="localhost:2181"  - there is a zk instance running on
> >>> localhost:2181 where solr can get clusterstate information. Note that
> >>> since you're sending the request to the careers collection this param
> >>> is not required in the search(careers....) part but is required in the
> >>> search(companies....) part. For simplicity I usually just provide it
> >>> for all.
> >>>
> >>> qt="/export"  - tells solr to use the export handler. This assumes all
> >>> your fields are in docValues. If you'd rather not use the export
> >>> handler then you probably want to provide the rows=##### param to tell
> >>> solr to return a large # of rows for each underlying search. Without
> >>> it solr will default to, I believe, 10 rows.
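
As a sketch of how the URL pieces described above fit together, the following Python snippet assembles a /stream request URL and percent-encodes the expression, which is useful when the request is sent from a script rather than pasted into a browser. The helper name stream_url is hypothetical, the host, collection and expression are taken from the sample url above, and only the Python standard library is used. Nothing here contacts Solr.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def stream_url(solr_base: str, collection: str, expression: str) -> str:
    """Build <solr_base>/<collection>/stream?stream=<encoded expression>."""
    # quote_via=quote encodes spaces as %20 instead of '+'.
    query = urlencode({"stream": expression}, quote_via=quote)
    return f"{solr_base}/{collection}/stream?{query}"


# The sample expression from the message above, written as one string.
expression = (
    'innerJoin('
    'search(careers, fl="personId,companyId,title", q=companyId:*, '
    'sort="companyId asc", zkHost="localhost:2181", qt="/export"),'
    'search(companies, fl="id,companyName", q=*:*, '
    'sort="id asc", zkHost="localhost:2181", qt="/export"),'
    'on="companyId=id")'
)

url = stream_url("http://localhost:8983/solr", "careers", expression)

# Round-trip check: decoding the query string gives back the expression.
decoded = parse_qs(urlsplit(url).query)["stream"][0]
```

Encoding the whole expression as a single stream parameter keeps characters such as quotes, spaces and = from being misread as part of the URL itself.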
> >>>
> >>> CCing the user list so others can see this as well.
> >>>
> >>> We're working on additional documentation for Streaming Aggregation
> >>> and Expressions. The page can be found at
> >>> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> >>> but it's missing a lot of things we've added recently.
> >>>
> >>> - Dennis
> >>>
> >>> On Fri, Dec 11, 2015 at 9:51 AM, Akiel Ahmed <ahmed...@uk.ibm.com>
> >>> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > Sorry, this is out of the blue - I have joined the Solr mailing
> >>> > list, but I don't know if that is the correct place to ask my
> >>> > question. If you are not the best person to talk to, can you please
> >>> > point me in the right direction?
> >>> >
> >>> > I want to try using the Solr 6 distributed joins but can't find
> >>> > enough material on the web to make it work. I have added the stream
> >>> > handler to my solrconfig.xml (see below) and when issuing an inner
> >>> > join query (see below) I get an error - the local param named stream
> >>> > is missing so I get a NullPointerException. Is there a way to play
> >>> > with the join via the Solr web UI, or if not do you have a code
> >>> > snippet via a SolrJ client that performs a join?
> >>> >
> >>> > solrconfig.xml
> >>> >
> >>> > <requestHandler name="/stream" class="solr.StreamHandler">
> >>> >         <lst name="invariants">
> >>> >                 <str name="wt">json</str>
> >>> >                 <str name="distrib">false</str>
> >>> >         </lst>
> >>> > </requestHandler>
> >>> >
> >>> > query
> >>> > innerJoin(
> >>> >         search(getting_started, _search_field:john),
> >>> >         search(getting_started, _search_field:friends),
> >>> >         on="id=_link_from_id")
> >>> >
> >>> > Cheers
> >>> >
> >>> > Akiel
> >>> > Unless stated otherwise above:
> >>> > IBM United Kingdom Limited - Registered in England and Wales with
> >>> > number 741598.
> >>> > Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire
> >>> > PO6 3AU
> >>> >
> >>>
> >>>
> >>>
> >>
> >
>
>
>


