Hi Joel,

Sorry, there was an error between my chair and keyboard; there isn't a bug. The right-hand stream was not sorted by the joined-on field. The following query does what I expected:

http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted,fl="id",q=text:John,sort="id asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted,fl="id,e1",q=type:DEF,sort="e1 asc",zkHost="localhost:9983",qt="/export"), on="id=e1")

Do you know whether, in the Solr 6 release, the stream handler will contain validation code that does a syntax check and also checks that the appropriate fields have been used in the fl and sort properties? For example, in the above query I am joining the id field on the e1 field, so the id field needs to be in the fl and sort properties of the left-hand stream, and e1 needs to be in the fl and sort properties of the right-hand stream for the join to work.

Cheers

Akiel

From: Joel Bernstein <joels...@gmail.com>
To: solr-user@lucene.apache.org
Date: 24/12/2015 15:51
Subject: Re: Solr 6 Distributed Join

I haven't had a chance to review. If you have a reproducible failure on a one-to-many join go ahead and create a jira ticket.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Dec 24, 2015 at 3:25 AM, Akiel Ahmed <ahmed...@uk.ibm.com> wrote:

> Hi
>
> Did you get a chance to check whether one-to-many joins were covered in
> your tests? If yes, can you make any suggestions for what I could be doing
> wrong?
>
> Cheers
>
> Akiel
>
> From: Joel Bernstein <joels...@gmail.com>
> To: solr-user@lucene.apache.org
> Date: 22/12/2015 13:03
> Subject: Re: Solr 6 Distributed Join
>
> Just did a quick review of the InnerJoinStream and it appears that it
> should handle one-to-one, one-to-many, many-to-one and many-to-many joins.
> It will take a closer review of the tests to see if all these cases are
> covered. So the innerJoin is designed to handle the case you describe. If
> it doesn't work properly it makes sense to file a bug report.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Dec 22, 2015 at 5:55 AM, Akiel Ahmed <ahmed...@uk.ibm.com> wrote:
>
> > Hi,
> >
> > I tried a straightforward join against something that is connected to
> > many things but didn't get the results I expected - I wanted to check
> > whether my expectations are off, and whether I can do anything in Solr to
> > do what I want. So given the data:
> >
> > id,type,e1,e2,text
> > 1,ABC,,,John Smith
> > 2,ABC,,,Jane Doe
> > 3,DEF,1,2,1
> > 4,DEF,1,2,2
> > 5,DEF,1,2,4
> > 6,DEF,1,2,8
> >
> > and the query
> >
> > http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted, fl="id", q=text:John, sort="id asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted, fl="id,e1", q=type:DEF, sort="id asc",zkHost="localhost:9983",qt="/export"), on="id=e1")
> >
> > I expected
> >
> > {"result-set":{"docs":[
> > {"e1":"1","id":"3"},
> > {"e1":"1","id":"4"},
> > {"e1":"1","id":"5"},
> > {"e1":"1","id":"6"},
> > {"EOF":true,"RESPONSE_TIME":56}]}}
> >
> > but instead I got
> >
> > {"result-set":{"docs":[
> > {"e1":"1","id":"3"},
> > {"EOF":true,"RESPONSE_TIME":58}]}}
> >
> > Deleting the document with id 3, and rerunning the query (see above)
> > returned
> >
> > {"result-set":{"docs":[
> > {"e1":"1","id":"4"},
> > {"EOF":true,"RESPONSE_TIME":56}]}}
> >
> > So it looks like the join finds the first thing to join on. Is this
> > expected behaviour? If so, is there anything I can do to convince Solr to
> > return all the things it is connected to?
> >
> > Cheers
> >
> > Akiel
> >
> > ----- Forwarded by Akiel Ahmed/UK/IBM on 22/12/2015 10:47 -----
> >
> > From: Akiel Ahmed/UK/IBM
> > To: solr-user@lucene.apache.org
> > Date: 21/12/2015 11:16
> > Subject: Re: Solr 6 Distributed Join
> >
> > Thank you for the help.
> >
> > I am working through what I want to do with the join - will let you know
> > if I hit any issues.
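
For reference, the one-to-many result expected in the 22/12 message above is what a textbook streaming merge join produces when both inputs are sorted on their own join key. The following is only a generic Python sketch of that technique, not Solr's InnerJoinStream: the function name merge_join and the in-memory lists are assumptions made for illustration, and right-hand fields overwrite left-hand fields of the same name only because that matches the expected output quoted above.

```python
from itertools import groupby
from operator import itemgetter

_DONE = (None, None)  # sentinel returned when a stream is exhausted


def merge_join(left, right, left_key, right_key):
    """Inner-join two iterables of dicts.

    Assumption: `left` is sorted ascending on `left_key` and `right` is
    sorted ascending on `right_key`. Rows with equal keys are grouped, so
    one-to-many and many-to-many matches are all emitted.
    """
    left_groups = groupby(left, key=itemgetter(left_key))
    right_groups = groupby(right, key=itemgetter(right_key))
    lkey, lrows = next(left_groups, _DONE)
    rkey, rrows = next(right_groups, _DONE)
    while lrows is not None and rrows is not None:
        if lkey < rkey:
            # Left key has no partner yet: advance the left stream.
            lkey, lrows = next(left_groups, _DONE)
        elif lkey > rkey:
            # Right key has no partner yet: advance the right stream.
            rkey, rrows = next(right_groups, _DONE)
        else:
            # Keys match: pair every left row with every right row.
            right_block = list(rrows)
            for lrow in lrows:
                for rrow in right_block:
                    yield {**lrow, **rrow}
            lkey, lrows = next(left_groups, _DONE)
            rkey, rrows = next(right_groups, _DONE)


# Data from the 22/12 message: one left row, four right rows with e1=1.
left = [{"id": "1"}]
right = [{"id": i, "e1": "1"} for i in ("3", "4", "5", "6")]
for row in merge_join(left, right, "id", "e1"):
    print(row)
```

Only the block of right-hand rows sharing the current key is buffered, which is consistent with the later remark in this thread that the innerJoin, unlike the hashJoin, does not need a whole result set to fit in memory.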
> >
> > From: Joel Bernstein <joels...@gmail.com>
> > To: solr-user@lucene.apache.org
> > Date: 17/12/2015 15:40
> > Subject: Re: Solr 6 Distributed Join
> >
> > One thing to note about the hashJoin is that it requires the search results
> > from the hashed query to fit entirely in memory.
> >
> > The innerJoin does not have this requirement as it performs a streaming
> > merge join.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Dec 17, 2015 at 10:33 AM, Joel Bernstein <joels...@gmail.com> wrote:
> >
> > > Below is an example of nested joins where the innerJoin is done in
> > > parallel using the parallel function. The partitionKeys parameter needs to
> > > be added to the searches when the parallel function is used to partition
> > > the results across worker nodes.
> > >
> > > hashJoin(
> > >   parallel(workerCollection,
> > >     innerJoin(
> > >       search(users, q="*:*", fl="userId, full_name, hometown", sort="userId asc", zkHost="zk2:2345", qt="/export", partitionKeys="userId"),
> > >       search(reviews, q="*:*", fl="userId, review, score", sort="userId asc", zkHost="zk1:2345", qt="/export", partitionKeys="userId"),
> > >       on="userId"
> > >     ),
> > >     workers="20",
> > >     zkHost="zk1:2345",
> > >     sort="userId asc"
> > >   ),
> > >   hashed=search(restaurants, q="city:nyc", fl="restaurantId, restaurantName", sort="restaurantId asc", zkHost="zk1:2345", qt="/export"),
> > >   on="restaurantId"
> > > )
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Thu, Dec 17, 2015 at 10:29 AM, Joel Bernstein <joels...@gmail.com> wrote:
> > >
> > >> The innerJoin joins two streams sorted by the same join keys (merge
> > >> join). If a third stream has the same join keys you can nest innerJoins. But
> > >> all three tables need to be sorted by the same join keys to nest innerJoins
> > >> (merge joins).
> > >>
> > >> innerJoin(innerJoin(...),
> > >>           search(...),
> > >>           on...)
> > >>
> > >> If the third stream is joined on a different key you can nest inside a
> > >> hashJoin which doesn't require streams to be sorted on the join key. For
> > >> example:
> > >>
> > >> hashJoin(innerJoin(...),
> > >>          hashed=search(...),
> > >>          on..)
> > >>
> > >> Joel Bernstein
> > >> http://joelsolr.blogspot.com/
> > >>
> > >> On Thu, Dec 17, 2015 at 9:28 AM, Akiel Ahmed <ahmed...@uk.ibm.com> wrote:
> > >>
> > >>> Hi again,
> > >>>
> > >>> I got the join to work. A team mate pointed out that one of the search
> > >>> functions in the innerJoin query was missing a field in the join - adding
> > >>> the e1 field to the fl parameter of the second search function gave the
> > >>> result I expected:
> > >>>
> > >>> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted, fl="id", q=text:John, sort="id asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted, fl="id,e1", q=text:Friends, sort="id asc",zkHost="localhost:9983",qt="/export"), on="id=e1")
> > >>>
> > >>> I am still interested in whether we can specify a join using an arbitrary
> > >>> number of searches.
> > >>>
> > >>> Cheers
> > >>>
> > >>> Akiel
> > >>>
> > >>> From: Akiel Ahmed/UK/IBM@IBMGB
> > >>> To: solr-user@lucene.apache.org
> > >>> Date: 16/12/2015 17:05
> > >>> Subject: Re: Solr 6 Distributed Join
> > >>>
> > >>> Hi Dennis,
> > >>>
> > >>> Thank you for your help. I used your explanation to construct an innerJoin
> > >>> query; I think I am getting further but didn't get the results I expected.
> > >>>
> > >>> The following describes what I did – is there any chance you can tell
> > >>> where I am going wrong:
> > >>>
> > >>> Solr 6 Developer Builds: #2738 and #2743
> > >>>
> > >>> 1. Modified server/solr/configsets/basic_configs/conf/managed-schema so it
> > >>> reads:
> > >>>
> > >>> <?xml version="1.0" encoding="UTF-8" ?>
> > >>> <schema name="search" version="1.5">
> > >>>   <uniqueKey>id</uniqueKey>
> > >>>   <field name="id" type="id" indexed="true" stored="true" required="true" multiValued="false" docValues="true"/>
> > >>>   <field name="_version_" type="solr_version" indexed="true" stored="true" required="false" multiValued="false" docValues="true"/>
> > >>>   <field name="type" type="id" indexed="true" stored="true" required="false" multiValued="false" docValues="true"/>
> > >>>   <field name="e1" type="id" indexed="true" stored="true" required="false" multiValued="false" docValues="true"/>
> > >>>   <field name="e2" type="id" indexed="true" stored="true" required="false" multiValued="false" docValues="true"/>
> > >>>   <field name="text" type="free_text" indexed="true" stored="true" required="false" multiValued="false"/>
> > >>>   <fieldType name="id" class="solr.StrField" sortMissingLast="true"/>
> > >>>   <fieldType name="solr_version" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
> > >>>   <fieldType name="free_text" class="solr.TextField" positionIncrementGap="100">
> > >>>     <analyzer>
> > >>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>       <filter class="solr.LowerCaseFilterFactory"/>
> > >>>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
> > >>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
> > >>>     </analyzer>
> > >>>   </fieldType>
> > >>> </schema>
> > >>>
> > >>> 2. Modified server/solr/configsets/basic_configs/conf/solrconfig.xml,
> > >>> adding the following near the bottom of the file so it is the last request
> > >>> handler
> > >>>
> > >>> <requestHandler name="/stream" class="solr.StreamHandler">
> > >>>   <lst name="invariants">
> > >>>     <str name="wt">json</str>
> > >>>     <str name="distrib">false</str>
> > >>>   </lst>
> > >>> </requestHandler>
> > >>>
> > >>> 3. Used solr -e cloud to set up a Solr cloud instance, picking all the
> > >>> defaults except I chose basic_configs
> > >>>
> > >>> 4. After Solr is running I ingested the following data via the Solr Web UI
> > >>> (/update handler, Document Type = CSV)
> > >>> id,type,e1,e2,text
> > >>> 1,ABC,,,John Smith
> > >>> 2,ABC,,,Jane Smith
> > >>> 3,ABC,,,MiKe Smith
> > >>> 4,ABC,,,John Doe
> > >>> 5,ABC,,,Jane Doe
> > >>> 6,ABC,,,MiKe Doe
> > >>> 7,ABC,,,John Smith
> > >>> 8,DEF,,,Chicken Burger
> > >>> 9,DEF,,,Veggie Burger
> > >>> 10,DEF,,,Beef Burger
> > >>> 11,DEF,,,Chicken Donar
> > >>> 12,DEF,,,Chips
> > >>> 13,DEF,,,Drink
> > >>> 20,GHI,1,2,Friends
> > >>> 21,GHI,3,4,Friends
> > >>> 22,GHI,5,6,Friends
> > >>> 23,GHI,7,6,Friends
> > >>> 24,GHI,6,4,Friends
> > >>> 25,JKL,1,8,Order
> > >>> 26,JKL,2,9,Order
> > >>> 27,JKL,3,10,Order
> > >>> 28,JKL,4,11,Order
> > >>> 29,JKL,5,12,Order
> > >>> 30,JKL,6,13,Order
> > >>>
> > >>> 5. Navigating to the following URL in a browser returned an expected
> > >>> result:
> > >>> http://localhost:8983/solr/gettingstarted/select?q={!join from=id to=e1}text:John&fl="id"
> > >>>
> > >>> <response>
> > >>> ...
> > >>> <result>
> > >>> <doc>
> > >>> <str name="id">20</str>
> > >>> <str name="e1">1</str>
> > >>> <str name="e2">2</str>
> > >>> ...
> > >>> </doc>
> > >>> <doc>
> > >>> <str name="id">28</str>
> > >>> <str name="e1">4</str>
> > >>> <str name="e2">11</str>
> > >>> ...
> > >>> </doc>
> > >>> <doc>
> > >>> <str name="id">23</str>
> > >>> <str name="e1">7</str>
> > >>> <str name="e2">6</str>
> > >>> ...
> > >>> </doc>
> > >>> </result>
> > >>> </response>
> > >>>
> > >>> 6. Navigating to the following URL in a browser does NOT return what I
> > >>> expected:
> > >>>
> > >>> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted, fl="id", q=text:John, sort="id asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted, fl="id", q=text:Friends, sort="id asc",zkHost="localhost:9983",qt="/export"), on="id=e1")
> > >>>
> > >>> {"result-set":{"docs":[
> > >>> {"EOF":true,"RESPONSE_TIME":124}]}}
> > >>>
> > >>> I also have a join related question. Is there any chance I can specify a
> > >>> query and join for more than 2 things. For example:
> > >>>
> > >>> innerJoin(search(gettingstarted, fl="id", q=text:John, ...) as s1,
> > >>>           search(gettingstarted, fl="id", q=text:Chicken, ...) as s2
> > >>>           search(gettingstarted, fl="id", q=text:Friends, ...) as s3)
> > >>>           on="s1.id=s3.e1",
> > >>>           on="s2.id=s3.e2")
> > >>>
> > >>> Sorry if the query does not make sense, but given the data above my
> > >>> intention is to find a single result made up of 3 documents:
> > >>> s1.id=1,s2.id=8,s3.id=25
> > >>> Is that possible? If yes, will Solr 6 support an arbitrary number of
> > >>> queries and associated joins?
> > >>>
> > >>> Cheers
> > >>>
> > >>> Akiel
> > >>>
> > >>> From: Dennis Gove <dpg...@gmail.com>
> > >>> To: Akiel Ahmed/UK/IBM@IBMGB, solr-user@lucene.apache.org
> > >>> Date: 11/12/2015 15:34
> > >>> Subject: Re: Solr 6 Distributed Join
> > >>>
> > >>> Akiel,
> > >>>
> > >>> Without seeing your full url I assume that you're missing the
> > >>> stream=innerJoin(.....) part of it. A full sample url would look like this
> > >>>
> > >>> http://localhost:8983/solr/careers/stream?stream=innerJoin(search(careers, fl="personId,companyId,title", q=companyId:*, sort="companyId asc",zkHost="localhost:2181",qt="/export"),search(companies, fl="id,companyName", q=*:*, sort="id asc",zkHost="localhost:2181",qt="/export"),on="companyId=id")
> > >>>
> > >>> This example will return a join of career records with the company name for
> > >>> all career records with a non-null companyId.
> > >>>
> > >>> And the pieces have the following meaning:
> > >>> http://localhost:8983/solr/careers/stream? - you have a collection called
> > >>> careers available on localhost:8983 and you're hitting its stream handler
> > >>> ?stream= - you are passing the stream parameter to the stream handler
> > >>> zkHost="localhost:2181" - there is a zk instance running on localhost:2181
> > >>> where solr can get clusterstate information. Note that since you're
> > >>> sending the request to the careers collection this param is not required in
> > >>> the search(careers....) part but is required in the search(companies....)
> > >>> part. For simplicity I usually just provide it for all.
> > >>> qt="/export" - tells solr to use the export handler. This assumes all your
> > >>> fields are in docValues. If you'd rather not use the export handler then
> > >>> you probably want to provide the rows=##### param to tell solr to return a
> > >>> large # of rows for each underlying search. Without it solr will default
> > >>> to, I believe, 10 rows.
> > >>>
> > >>> CCing the user list so others can see this as well.
> > >>>
> > >>> We're working on additional documentation for Streaming Aggregation and
> > >>> Expressions. The page can be found at
> > >>> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > >>> but it's missing a lot of things we've added recently.
> > >>>
> > >>> - Dennis
> > >>>
> > >>> On Fri, Dec 11, 2015 at 9:51 AM, Akiel Ahmed <ahmed...@uk.ibm.com> wrote:
> > >>>
> > >>> > Hi,
> > >>> >
> > >>> > Sorry, this is out of the blue - I have joined the Solr mailing list, but
> > >>> > I don't know if that is the correct place to ask my question. If you are
> > >>> > not the best person to talk to can you please point me in the right
> > >>> > direction.
> > >>> >
> > >>> > I want to try using the Solr 6 distributed joins but can't find enough
> > >>> > material on the web to make it work. I have added the stream handler to my
> > >>> > solrconfig.xml (see below) and when issuing an inner join query (see below)
> > >>> > I get an error - the localparm named stream is missing so I get a
> > >>> > NullPointerException. Is there a way to play with the join via the Solr web
> > >>> > UI, or if not do you have a code snippet via a SolrJ client that
> > >>> > performs a join?
> > >>> >
> > >>> > solrconfig.xml
> > >>> >
> > >>> > <requestHandler name="/stream" class="solr.StreamHandler">
> > >>> >   <lst name="invariants">
> > >>> >     <str name="wt">json</str>
> > >>> >     <str name="distrib">false</str>
> > >>> >   </lst>
> > >>> > </requestHandler>
> > >>> >
> > >>> > query
> > >>> > innerJoin(
> > >>> >   search(getting_started, _search_field:john),
> > >>> >   search(getting_started, _search_field:friends),
> > >>> >   on="id=_link_from_id")
> > >>> >
> > >>> > Cheers
> > >>> >
> > >>> > Akiel
> > >>> > Unless stated otherwise above:
> > >>> > IBM United Kingdom Limited - Registered in England and Wales with number
> > >>> > 741598.
> > >>> > Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
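
On the validation question in the first message of this thread: the field requirement described there (each stream's fl and sort must cover its own join field) can be checked on the client before the expression is sent. The following is a minimal Python sketch under stated assumptions. The helpers check_stream and check_inner_join are hypothetical and are not part of Solr or SolrJ; the sketch assumes a single-field on clause in the form "left=right" or "field", and it assumes the join field should be the leading sort field.

```python
def check_stream(side, fl, sort, join_field):
    """Return a list of problems for one search(...) stream (hypothetical helper)."""
    problems = []
    fields = [f.strip() for f in fl.split(",") if f.strip()]
    if join_field not in fields:
        problems.append(f"{side}: fl does not include join field '{join_field}'")
    clauses = [c.strip() for c in sort.split(",") if c.strip()]
    # Assumption: the stream must be sorted primarily on its join field.
    leading = clauses[0].split()[0] if clauses else None
    if leading != join_field:
        problems.append(f"{side}: sort does not start with join field '{join_field}'")
    return problems


def check_inner_join(on, left, right):
    """Check both sides of an innerJoin; `left` and `right` are dicts with 'fl' and 'sort'."""
    left_field, _, right_field = (part.strip() for part in on.partition("="))
    if not right_field:
        # on="userId" means the same field name on both sides.
        right_field = left_field
    return (check_stream("left", left["fl"], left["sort"], left_field)
            + check_stream("right", right["fl"], right["sort"], right_field))


left = {"fl": "id", "sort": "id asc"}

# Right-hand stream sorted on id instead of e1, as in the 22/12 query.
print(check_inner_join("id=e1", left, {"fl": "id,e1", "sort": "id asc"}))

# Corrected query from the first message: no problems reported.
print(check_inner_join("id=e1", left, {"fl": "id,e1", "sort": "e1 asc"}))
```

The same check also flags the 16/12 query, where the second search had neither e1 in fl nor a sort on e1.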