Hi Erick,
No, we have not yet looked at the streaming functionality. But we've started
to explore it, so we'll look at that.
I briefly considered denormalizing the data, but the sources documents have ~200
fields, so it seems to me that the index size would explode. (The
items documents have 65 fields.)
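
A rough back-of-the-envelope check of that concern, as a minimal sketch. It uses only the document and field counts quoted in this thread and assumes, purely for illustration, that every field is populated in every document and that denormalizing means copying all ~200 source fields onto each item. Real index size depends on field types, stored/indexed/docValues settings and compression, so this only indicates the order of magnitude.

```python
# Figures quoted in this thread.
ITEM_DOCS = 38_034_895_527    # documents in the items collection
SOURCE_DOCS = 417_618_443     # documents in the sources collection
ITEM_FIELDS = 65              # fields per items document
SOURCE_FIELDS = 200           # approximate fields per sources document


def field_values(docs: int, fields_per_doc: int) -> int:
    """Upper bound on field values, assuming every field is populated."""
    return docs * fields_per_doc


# Two separate collections, as in the test cluster.
separate = (field_values(ITEM_DOCS, ITEM_FIELDS)
            + field_values(SOURCE_DOCS, SOURCE_FIELDS))

# Denormalized: every item carries a copy of its source's fields.
denormalized = field_values(ITEM_DOCS, ITEM_FIELDS + SOURCE_FIELDS)

growth = denormalized / separate
items_per_source = ITEM_DOCS / SOURCE_DOCS

print(f"separate:         {separate:,} field values")
print(f"denormalized:     {denormalized:,} field values")
print(f"growth factor:    {growth:.2f}x")
print(f"items per source: {items_per_source:.1f}")
```

Under those assumptions the denormalized layout holds roughly four times as many field values as the two-collection layout, because each source's fields are repeated once for every item that points at it.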
Thank you for your help.
Chris
---------------------------- Original Message ----------------------------
Subject: Re: joining across sharded collection
From: "Erick Erickson" <erickerick...@gmail.com>
Date: Sat, December 9, 2017 10:16 pm
To: "solr-user" <solr-user@lucene.apache.org>
--------------------------------------------------------------------------
> Have you looked at the streaming functionality (StreamingExpressions
> and ParallelSQL in particular)? While it has some restrictions, it
> easily handles cross-collection joins. It's generally intended for
> analytic-type queries, but at your scale that may be what you need.
>
> At that scale denormalizing the data doesn't seem feasible....
>
> Best,
> Erick
>
> On Sat, Dec 9, 2017 at 6:02 PM, <ch...@yeeplusplus.com> wrote:
>>
>>
>> I'm trying to figure out how to structure this query.
>>
>> I have two types of documents: items and sources. Previously, they were all
>> in the same collection. I'm now testing a cluster with separate collections.
>>
>> The items collection has 38,034,895,527 documents, and the sources
>> collection has 417,618,443 documents.
>>
>> I have all of the documents in the same collection in a solr cluster running
>> version 6.0.1, with 100 shards and replication factor 1.
>>
>> The following query works as expected:
>>
>> q=type:source&fq={!join from=source_id
>> to=source_id}item_category:abc&rows=0&stats=true&stats.field={!tag=pv1
>> count=true}source_id&facet=true&facet.pivot={!stats=pv1}source_factory&facet.sort=index&facet.limit=-1
>>
>> In the source documents, the source_id identifies the source. In the items
>> documents, the source_id identifies the unique source document related to
>> it. There is a 1:many relationship between sources and items.
>>
>> The above query gets the sources that are associated with items that have
>> item_category "abc", and then facets on the sources' source_factory field.
>>
>>
>> Now, I'm testing a separate cluster that has the same data, but organized
>> into two collections: items and sources.
>>
>> In order to do the same query, I have to use a cross-collection join, which
>> requires the FROM collection to be unsharded. However, in this case, the
>> FROM collection is the items collection, which due to its size cannot be
>> unsharded.
>>
>> I'm hoping there's an easy way to restructure my data / query to accomplish
>> the faceting I need.
>>
>> The data set is static so can be re-indexed and reconfigured as needed. It's
>> also not under any load yet.
>>
>
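
For readers following the thread, the result the original query asks for (sources that have at least one item with item_category "abc", counted per source_factory) can be illustrated on toy data. This is only an in-memory sketch of the intended semantics, not a description of how Solr executes a join. The field names come from the thread; the sample values and the function name `facet_sources_by_factory` are invented for illustration.

```python
from collections import Counter

# Hypothetical toy data; field names follow the thread, values are invented.
sources = [
    {"source_id": "s1", "source_factory": "f1"},
    {"source_id": "s2", "source_factory": "f1"},
    {"source_id": "s3", "source_factory": "f2"},
]
items = [
    {"source_id": "s1", "item_category": "abc"},
    {"source_id": "s1", "item_category": "abc"},  # second item, same source
    {"source_id": "s2", "item_category": "xyz"},
    {"source_id": "s3", "item_category": "abc"},
]


def facet_sources_by_factory(sources, items, category):
    """Count sources per source_factory, restricted to sources that have
    at least one item in the given category (1:many sources -> items)."""
    # "From" side of the join: source_ids of the matching items.
    matching_ids = {it["source_id"] for it in items
                    if it["item_category"] == category}
    # "To" side: keep only the sources whose source_id was collected above.
    joined = [s for s in sources if s["source_id"] in matching_ids]
    # Facet on source_factory, sorted by key (like facet.sort=index).
    counts = Counter(s["source_factory"] for s in joined)
    return dict(sorted(counts.items()))


result = facet_sources_by_factory(sources, items, "abc")
print(result)
```

Note that a source is counted once no matter how many of its items match, and that the only data crossing from the items side to the sources side is the set of matching source_id values.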