Re: joining across sharded collection

chris Sun, 10 Dec 2017 06:55:10 -0800

Hi Erick,
No, we have not yet looked at the streaming functionality. But we've started
to explore it, so we'll look at that.
I briefly considered denormalizing the data, but the source documents have ~200
fields, so it seems to me that the index size would explode. (The
item documents have 65 fields.)
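
A rough back-of-envelope sketch of that concern, using the document counts from the quoted message below. It assumes every field is populated on every document (unlikely in practice) and counts field values rather than bytes, so it is only an upper-bound illustration. The variable names are made up for this example.

```python
# Document counts and field counts as stated in this thread.
ITEM_DOCS = 38_034_895_527
SOURCE_DOCS = 417_618_443
ITEM_FIELDS = 65
SOURCE_FIELDS = 200  # "~200" in the thread, so this is approximate

# Current layout: items and sources stored separately.
current_values = ITEM_DOCS * ITEM_FIELDS + SOURCE_DOCS * SOURCE_FIELDS

# Denormalized layout (assumption): every item carries a full copy of its source's fields.
denormalized_values = ITEM_DOCS * (ITEM_FIELDS + SOURCE_FIELDS)

growth = denormalized_values / current_values
print(f"current:      {current_values:,} field values")
print(f"denormalized: {denormalized_values:,} field values")
print(f"growth:       {growth:.1f}x")
```

Copying only the fields that are actually faceted on (for example source_factory) onto each item would add far fewer values than copying all ~200.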
Thank you for your help.

Chris

---------------------------- Original Message ----------------------------
Subject: Re: joining across sharded collection

From: "Erick Erickson" <erickerick...@gmail.com>

Date: Sat, December 9, 2017 10:16 pm

To: "solr-user" <solr-user@lucene.apache.org>

--------------------------------------------------------------------------



> Have you looked at the streaming functionality (StreamingExpressions

> and ParallelSQL in particular)? While it has some restrictions, it

> easily handles cross-collection joins. It's generally intended for

> analytic-type queries, but at your scale that may be what you need.

>

> At that scale denormalizing the data doesn't seem feasible....

>

> Best,

> Erick

>

> On Sat, Dec 9, 2017 at 6:02 PM, <ch...@yeeplusplus.com> wrote:

>>

>>

>> I'm trying to figure out how to structure this query.

>>

>> I have two types of documents: items and sources. Previously, they were all 
>> in the same collection. I'm now testing a cluster with separate collections.

>>

>> The items collection has 38,034,895,527 documents, and the sources 
>> collection has 417,618,443 documents.

>>

>> I have all of the documents in the same collection in a Solr cluster running 
>> version 6.0.1, with 100 shards and replication factor 1.

>>

>> The following query works as expected:

>>

>> q=type:source&fq={!join from=source_id 
>> to=source_id}item_category:abc&rows=0&stats=true&stats.field={!tag=pv1 
>> count=true}source_id&facet=true&facet.pivot={!stats=pv1}source_factory&facet.sort=index&facet.limit=-1

>>

>> In the source documents, the source_id identifies the source. In the items 
>> documents, the source_id identifies the unique source document related to 
>> it. There is a 1:many relationship between sources and items.

>>

>> The above query gets the sources that are associated with items that have 
>> item_category "abc", and then facets on the sources' source_factory field.

>>

>>

>> Now, I'm testing a separate cluster that has the same data, but organized 
>> into two collections: items and sources.

>>

>> In order to do the same query, I have to use a cross-collection join, which 
>> requires the FROM collection to be unsharded. However, in this case, the 
>> FROM collection is the items collection, which due to its size cannot be 
>> unsharded.

>>

>> I'm hoping there's an easy way to restructure my data / query to accomplish 
>> the faceting I need.

>>

>> The data set is static so can be re-indexed and reconfigured as needed. It's 
>> also not under any load yet.

>>

>
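
For readers following the thread, here is a minimal sketch of what the join-plus-facet query quoted above computes, written as plain Python over in-memory toy data. It is not Solr code and says nothing about how Solr executes the join. The field names come from the thread, while the sample values and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical toy data; field names follow the thread, values are invented.
sources = [
    {"source_id": "s1", "source_factory": "f1"},
    {"source_id": "s2", "source_factory": "f1"},
    {"source_id": "s3", "source_factory": "f2"},
]
items = [
    {"source_id": "s1", "item_category": "abc"},
    {"source_id": "s1", "item_category": "abc"},
    {"source_id": "s2", "item_category": "xyz"},
    {"source_id": "s3", "item_category": "abc"},
]

def facet_sources_by_factory(sources, items, category):
    # Step 1 (the join filter): distinct source_id values of items in the category.
    matching_ids = {i["source_id"] for i in items if i["item_category"] == category}
    # Step 2 (the facet): count each matching source once under its source_factory.
    return Counter(
        s["source_factory"] for s in sources if s["source_id"] in matching_ids
    )

counts = facet_sources_by_factory(sources, items, "abc")
print(dict(counts))
```

Step 1 is the part that is hard to distribute here: the set of matching source_id values is derived from the items side, which is the collection that is too large to keep unsharded.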