A couple of other things:

1) Your innerJoin can be parallelized across workers to improve performance.
Take a look at the docs on the parallel function for the details.

2) It looks like you might be doing graph operations with joins. You might
want to take a look at the gatherNodes function coming in 6.1:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62693238
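
For (1), here is a rough sketch of what the parallelized join might look
like. The worker collection (triple), the worker count and the zkHost
address are placeholders I made up, not values from your setup. It also
assumes type_id and triple_type_id are the same field type with docValues
enabled. Both searches use qt="/export" and are partitioned on the join key
with partitionKeys, so tuples with matching keys should land on the same
worker:

parallel(triple,
    innerJoin(
        search(triple, q=subject_id:1656521,
               fl="triple_id,subject_id,type_id",
               sort="type_id asc",
               qt="/export", partitionKeys="type_id"),
        search(triple_type, q=*:*,
               fl="triple_type_id,triple_type_label",
               sort="triple_type_id asc",
               qt="/export", partitionKeys="triple_type_id"),
        on="type_id=triple_type_id"
    ),
    workers="4",
    zkHost="localhost:9983",
    sort="type_id asc"
)

The outer sort tells parallel how to merge the tuples coming back from the
workers, so it should match the sort order of the joined stream.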

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, May 13, 2016 at 5:57 PM, Joel Bernstein <joels...@gmail.com> wrote:

> When doing things that require all the results (like joins) you need to
> specify the /export handler in the search function.
>
> qt="/export"
>
> The search function defaults to the /select handler which is designed to
> return the top N results. The /export handler always returns all results
> that match the query. Also keep in mind that the /export handler requires
> that sort fields and fl fields have docValues set.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, May 13, 2016 at 5:36 PM, Ryan Cutter <ryancut...@gmail.com> wrote:
>
>> Question #1:
>>
>> triple_type collection has a few hundred docs and triple has 25M docs.
>>
>> When I search for a particular subject_id in triple which I know has 14
>> results and do not pass in 'rows' params, it returns 0 results:
>>
>> innerJoin(
>>     search(triple, q=subject_id:1656521,
>> fl="triple_id,subject_id,type_id",
>> sort="type_id asc"),
>>     search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
>> sort="triple_type_id asc"),
>>     on="type_id=triple_type_id"
>> )
>>
>> When I do the same search with rows=10000, it returns 14 results:
>>
>> innerJoin(
>>     search(triple, q=subject_id:1656521,
>> fl="triple_id,subject_id,type_id",
>> sort="type_id asc", rows=10000),
>>     search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
>> sort="triple_type_id asc", rows=10000),
>>     on="type_id=triple_type_id"
>> )
>>
>> Am I doing this right?  Is there a magic number to pass into rows which
>> says "give me all the results which match this query"?
>>
>>
>> Question #2:
>>
>> Perhaps related to the first question but I want to run the innerJoin()
>> without the subject_id - rather have it use the results of another query.
>> But this does not return any results.  I'm saying "search for this entity
>> based on id, then use that result's entity_id as the subject_id to look
>> through the triple/triple_type collections":
>>
>> hashJoin(
>>     innerJoin(
>>         search(triple, q=*:*, fl="triple_id,subject_id,type_id",
>> sort="type_id asc"),
>>         search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
>> sort="triple_type_id asc"),
>>         on="type_id=triple_type_id"
>>     ),
>>     hashed=search(entity,
>> q=id:"urn:sid:entity:455dfa1aa27eedad21ac2115797c1580bb3b3b4e",
>> fl="entity_id,entity_label", sort="entity_id asc"),
>>     on="subject_id=entity_id"
>> )
>>
>> Am I doing this hashJoin right?
>>
>> Thanks very much, Ryan
>>
>
>
