Re: Solr 6 Distributed Join

Akiel Ahmed Thu, 17 Dec 2015 06:29:35 -0800

Hi again,

I got the join to work. A teammate pointed out that the second search
function in the innerJoin query did not return the join field. Adding
the e1 field to the fl parameter of the second search function gave the
result I expected:

http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted, fl="id", q=text:John, sort="id asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted, fl="id,e1", q=text:Friends, sort="id asc",zkHost="localhost:9983",qt="/export"), on="id=e1")
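
Why the extra field matters can be illustrated outside Solr. The following is a minimal Python sketch, not Solr code: search and inner_join are hypothetical helpers that mimic projecting documents onto the fl fields and then joining on id=e1, using a subset of the sample data from the earlier message quoted below. It uses a simple hash join for brevity and makes no claim about how Solr implements innerJoin.

```python
from collections import defaultdict

# Subset of the sample data from the thread (only the fields used here).
DOCS = [
    {"id": "1", "e1": None, "text": "John Smith"},
    {"id": "4", "e1": None, "text": "John Doe"},
    {"id": "7", "e1": None, "text": "John Smith"},
    {"id": "20", "e1": "1", "text": "Friends"},
    {"id": "21", "e1": "3", "text": "Friends"},
    {"id": "22", "e1": "5", "text": "Friends"},
    {"id": "23", "e1": "7", "text": "Friends"},
    {"id": "24", "e1": "6", "text": "Friends"},
]


def search(term, fl):
    """Return docs whose text contains term, keeping only the fields in fl."""
    fields = fl.split(",")
    return [
        {f: d[f] for f in fields if d.get(f) is not None}
        for d in DOCS
        if term.lower() in d["text"].lower().split()
    ]


def inner_join(left, right, left_key, right_key):
    """Hash join: pair each left tuple with right tuples whose key matches."""
    index = defaultdict(list)
    for r in right:
        if right_key in r:  # a tuple without the join field can never match
            index[r[right_key]].append(r)
    return [
        (l, r)
        for l in left
        if left_key in l
        for r in index.get(l[left_key], [])
    ]


johns = search("John", "id")

# fl="id": the right-hand tuples carry no e1, so nothing joins.
print(inner_join(johns, search("Friends", "id"), "id", "e1"))
# prints []

# fl="id,e1": the join field is present, so matches appear.
for l, r in inner_join(johns, search("Friends", "id,e1"), "id", "e1"):
    print(l["id"], r["id"])
# prints "1 20" then "7 23"
```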

I am still interested in whether a join can be specified over an arbitrary
number of searches.

Cheers

Akiel

From:   Akiel Ahmed/UK/IBM@IBMGB
To:     solr-user@lucene.apache.org
Date:   16/12/2015 17:05
Subject:        Re: Solr 6 Distributed Join

Hi Dennis,

Thank you for your help. I used your explanation to construct an innerJoin
query; I think I am getting further but didn't get the results I expected.
The following describes what I did. Is there any chance you can tell
where I am going wrong?

Solr 6 Developer Builds: #2738 and #2743

1. Modified server/solr/configsets/basic_configs/conf/managed-schema so it
reads:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="search" version="1.5">
  <uniqueKey>id</uniqueKey>
  <field name="id" type="id" indexed="true" stored="true" required="true" multiValued="false" docValues="true"/>
  <field name="_version_" type="solr_version" indexed="true" stored="true" required="false" multiValued="false" docValues="true"/>
  <field name="type" type="id" indexed="true" stored="true" required="false" multiValued="false" docValues="true"/>
  <field name="e1" type="id" indexed="true" stored="true" required="false" multiValued="false" docValues="true"/>
  <field name="e2" type="id" indexed="true" stored="true" required="false" multiValued="false" docValues="true"/>
  <field name="text" type="free_text" indexed="true" stored="true" required="false" multiValued="false"/>
  <fieldType name="id" class="solr.StrField" sortMissingLast="true"/>
  <fieldType name="solr_version" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="free_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    </analyzer>
  </fieldType>
</schema>

2. Modified server/solr/configsets/basic_configs/conf/solrconfig.xml,
adding the following near the bottom of the file so it is the last request
handler:

  <requestHandler name="/stream" class="solr.StreamHandler"> 
        <lst name="invariants"> 
                <str name="wt">json</str> 
                <str name="distrib">false</str> 
        </lst> 
  </requestHandler>

3. Used solr -e cloud to set up a SolrCloud instance, picking all the
defaults except that I chose basic_configs

4. Once Solr was running, I ingested the following data via the Solr Web UI
(/update handler, Document Type = CSV):
id,type,e1,e2,text
1,ABC,,,John Smith
2,ABC,,,Jane Smith
3,ABC,,,MiKe Smith
4,ABC,,,John Doe
5,ABC,,,Jane Doe
6,ABC,,,MiKe Doe
7,ABC,,,John Smith
8,DEF,,,Chicken Burger
9,DEF,,,Veggie Burger
10,DEF,,,Beef Burger
11,DEF,,,Chicken Donar
12,DEF,,,Chips
13,DEF,,,Drink
20,GHI,1,2,Friends
21,GHI,3,4,Friends
22,GHI,5,6,Friends
23,GHI,7,6,Friends
24,GHI,6,4,Friends
25,JKL,1,8,Order
26,JKL,2,9,Order
27,JKL,3,10,Order
28,JKL,4,11,Order
29,JKL,5,12,Order
30,JKL,6,13,Order

5. Navigating to the following URL in a browser returned the expected
result:
http://localhost:8983/solr/gettingstarted/select?q={!join from=id to=e1}text:John&fl="id"

<response>
...
  <result>
    <doc>
      <str name="id">20</str>
      <str name="e1">1</str>
      <str name="e2">2</str>
      ...
    </doc>
    <doc>
      <str name="id">28</str>
      <str name="e1">4</str>
      <str name="e2">11</str>
      ...
    </doc>
    <doc>
      <str name="id">23</str>
      <str name="e1">7</str>
      <str name="e2">6</str>
      ...
    </doc>
  </result>
</response>

6. Navigating to the following URL in a browser does NOT return what I
expected:
http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted, fl="id", q=text:John, sort="id asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted, fl="id", q=text:Friends, sort="id asc",zkHost="localhost:9983",qt="/export"), on="id=e1")

{"result-set":{"docs":[
{"EOF":true,"RESPONSE_TIME":124}]}}

I also have a join-related question. Is there any chance I can specify a
query and join for more than 2 things? For example:

innerJoin(search(gettingstarted, fl="id", q=text:John, ...) as s1,
          search(gettingstarted, fl="id", q=text:Chicken, ...) as s2,
          search(gettingstarted, fl="id", q=text:Friends, ...) as s3,
          on="s1.id=s3.e1",
          on="s2.id=s3.e2")

Sorry if the query does not make sense, but given the data above my
intention is to find a single result made up of 3 documents:
s1.id=1,s2.id=8,s3.id=25
Is that possible? If yes, will Solr 6 support an arbitrary number of 
queries and associated joins?
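
The intent of the three-way join can be illustrated outside Solr. The following is a plain Python sketch over the sample data from step 4, not a streaming expression, and it makes no claim about what Solr 6 supports. The helpers search and three_way are hypothetical names. The third search is run once with text:Friends, as written in the example above, and once with text:Order, which is an assumption made here because document 25 is an Order record in the sample data.

```python
import csv
import io

# The sample data from step 4, verbatim.
CSV = """id,type,e1,e2,text
1,ABC,,,John Smith
2,ABC,,,Jane Smith
3,ABC,,,MiKe Smith
4,ABC,,,John Doe
5,ABC,,,Jane Doe
6,ABC,,,MiKe Doe
7,ABC,,,John Smith
8,DEF,,,Chicken Burger
9,DEF,,,Veggie Burger
10,DEF,,,Beef Burger
11,DEF,,,Chicken Donar
12,DEF,,,Chips
13,DEF,,,Drink
20,GHI,1,2,Friends
21,GHI,3,4,Friends
22,GHI,5,6,Friends
23,GHI,7,6,Friends
24,GHI,6,4,Friends
25,JKL,1,8,Order
26,JKL,2,9,Order
27,JKL,3,10,Order
28,JKL,4,11,Order
29,JKL,5,12,Order
30,JKL,6,13,Order
"""

docs = list(csv.DictReader(io.StringIO(CSV)))


def search(term):
    """Return docs whose text field contains term (case-insensitive)."""
    return [d for d in docs if term.lower() in d["text"].lower().split()]


def three_way(s1, s2, s3):
    """Keep s3 docs with e1 in s1's ids and e2 in s2's ids.

    Returns (s1.id, s2.id, s3.id) triples.
    """
    s1_ids = {d["id"] for d in s1}
    s2_ids = {d["id"] for d in s2}
    return [
        (d["e1"], d["e2"], d["id"])
        for d in s3
        if d["e1"] in s1_ids and d["e2"] in s2_ids
    ]


print(three_way(search("John"), search("Chicken"), search("Friends")))
# prints []

print(three_way(search("John"), search("Chicken"), search("Order")))
# prints [('1', '8', '25'), ('4', '11', '28')]
```

With this data, the Order variant yields the combination named above (1, 8, 25) plus a second one (4, 11, 28).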

Cheers

Akiel

From:   Dennis Gove <dpg...@gmail.com>
To:     Akiel Ahmed/UK/IBM@IBMGB, solr-user@lucene.apache.org
Date:   11/12/2015 15:34
Subject:        Re: Solr 6 Distributed Join

Akiel,

Without seeing your full URL, I assume that you're missing the
stream=innerJoin(.....) part of it. A full sample URL would look like this:
http://localhost:8983/solr/careers/stream?stream=innerJoin(search(careers, fl="personId,companyId,title", q=companyId:*, sort="companyId asc",zkHost="localhost:2181",qt="/export"),search(companies, fl="id,companyName", q=*:*, sort="id asc",zkHost="localhost:2181",qt="/export"),on="companyId=id")

This example will return a join of career records with the company name for
all career records with a non-null companyId.

And the pieces have the following meaning:
http://localhost:8983/solr/careers/stream?  - you have a collection called
careers available on localhost:8983 and you're hitting its stream handler.
?stream=  - you are passing the stream parameter to the stream handler.
zkHost="localhost:2181"  - there is a zk instance running on localhost:2181
where solr can get clusterstate information. Note that since you're sending
the request to the careers collection, this param is not required in the
search(careers....) part but is required in the search(companies....) part.
For simplicity I usually just provide it for all.
qt="/export"  - tells solr to use the export handler. This assumes all your
fields are in docValues. If you'd rather not use the export handler then you
probably want to provide the rows=##### param to tell solr to return a large
number of rows for each underlying search. Without it solr will default to,
I believe, 10 rows.
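
The pieces above can also be assembled programmatically, which takes care of URL-encoding the spaces and quotes in the expression. The following is a minimal Python sketch using only the standard library. The helper names (search_expr, inner_join_expr, stream_url) are hypothetical, and the collection names, fields, and hosts are taken from the sample URL above. It only builds the URL and does not contact Solr.

```python
from urllib.parse import urlencode


def search_expr(collection, fl, q, sort, zk_host, qt="/export"):
    """Build a search(...) expression string (hypothetical helper)."""
    return (
        f'search({collection}, fl="{fl}", q={q}, sort="{sort}", '
        f'zkHost="{zk_host}", qt="{qt}")'
    )


def inner_join_expr(left, right, on):
    """Wrap two stream expressions in innerJoin(...)."""
    return f'innerJoin({left}, {right}, on="{on}")'


def stream_url(base, collection, expr):
    """Attach the expression as the URL-encoded stream parameter."""
    return f"{base}/{collection}/stream?" + urlencode({"stream": expr})


expr = inner_join_expr(
    search_expr("careers", "personId,companyId,title", "companyId:*",
                "companyId asc", "localhost:2181"),
    search_expr("companies", "id,companyName", "*:*",
                "id asc", "localhost:2181"),
    on="companyId=id",
)
url = stream_url("http://localhost:8983/solr", "careers", expr)
print(url)
```

Against a running instance, the resulting URL could then be fetched with urllib.request.urlopen(url) or pasted into a browser.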

CCing the user list so others can see this as well.

We're working on additional documentation for Streaming Aggregation and
Expressions. The page can be found at
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions but
it's missing a lot of things we've added recently.

- Dennis

On Fri, Dec 11, 2015 at 9:51 AM, Akiel Ahmed <ahmed...@uk.ibm.com> wrote:

> Hi,
>
> Sorry, this is out of the blue - I have joined the Solr mailing list, but
> I don't know if it is the correct place to ask my question. If you are
> not the best person to talk to, can you please point me in the right
> direction?
>
> I want to try using the Solr 6 distributed joins but can't find enough
> material on the web to make it work. I have added the stream handler to my
> solrconfig.xml (see below) and when issuing an inner join query (see below)
> I get an error - the local param named stream is missing so I get a
> NullPointerException. Is there a way to play with the join via the Solr web
> UI, or if not do you have a code snippet via a SolrJ client that performs a
> join?
>
> solrconfig.xml
>
> <requestHandler name="/stream" class="solr.StreamHandler">
>         <lst name="invariants">
>                 <str name="wt">json</str>
>                 <str name="distrib">false</str>
>         </lst>
> </requestHandler>
>
> query
> innerJoin(
>         search(getting_started, _search_field:john),
>         search(getting_started, _search_field:friends),
>         on="id=_link_from_id")
>
> Cheers
>
> Akiel
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>

