Have you looked at the streaming functionality (StreamingExpressions and ParallelSQL in particular)? While it has some restrictions, it easily handles cross-collection joins. It's generally intended for analytic-type queries, but at your scale that may be what you need.
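As a rough illustration only (not something tested against this cluster), a cross-collection join with streaming expressions might look like the sketch below. It assumes the collections are named items and sources as in the quoted message, that source_id and source_factory have docValues so the /export handler can stream them, and that innerJoin, unique and search are available in the Solr version in use. The helper names and the localhost URL are hypothetical. Both inner streams are sorted on source_id because innerJoin expects its inputs ordered by the join key.

```python
from urllib.parse import urlencode


def build_join_expr(category: str) -> str:
    """Build a streaming expression joining sources to items on source_id."""
    # De-duplicate the item side so each matching source_id appears once.
    items = (
        "unique("
        f'search(items, q="item_category:{category}", fl="source_id", '
        'sort="source_id asc", qt="/export"), '
        'over="source_id")'
    )
    # The sources collection only holds source documents, so match everything.
    sources = (
        'search(sources, q="*:*", fl="source_id,source_factory", '
        'sort="source_id asc", qt="/export")'
    )
    return f'innerJoin({sources}, {items}, on="source_id")'


def build_stream_url(base_url: str, expr: str) -> str:
    """URL-encode the expression as the expr parameter of the /stream handler."""
    return f"{base_url}/sources/stream?{urlencode({'expr': expr})}"


if __name__ == "__main__":
    expr = build_join_expr("abc")
    print(expr)
    # Hypothetical host; replace with a real Solr node.
    print(build_stream_url("http://localhost:8983/solr", expr))
```

The result is a stream of source tuples that have at least one item in the requested category. Counting them per source_factory would still need an aggregation step on top, for example a rollup over a stream ordered by that field.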
At that scale denormalizing the data doesn't seem feasible....

Best,
Erick

On Sat, Dec 9, 2017 at 6:02 PM, <ch...@yeeplusplus.com> wrote:
>
> I'm trying to figure out how to structure this query.
>
> I have two types of documents: items and sources. Previously, they were all
> in the same collection. I'm now testing a cluster with separate collections.
>
> The items collection has 38,034,895,527 documents, and the sources collection
> has 417,618,443 documents.
>
> I have all of the documents in the same collection in a solr cluster running
> version 6.0.1, with 100 shards and replication factor 1.
>
> The following query works as expected:
>
> q=type:source&fq={!join from=source_id to=source_id}item_category:abc&rows=0&stats=true&stats.field={!tag=pv1 count=true}source_id&facet=true&facet.pivot={!stats=pv1}source_factory&facet.sort=index&facet.limit=-1
>
> In the source documents, the source_id identifies the source. In the items
> documents, the source_id identifies the unique source document related to it.
> There is a 1:many relationship between sources and items.
>
> The above query gets the sources that are associated with items that have
> item_category "abc", and then facets on the sources' source_factory field.
>
> Now, I'm testing a separate cluster that has the same data, but organized
> into two collections: items and sources.
>
> In order to do the same query, I have to use a cross-collection join, which
> requires the FROM collection to be unsharded. However, in this case, the
> FROM collection is the items collection, which due to its size cannot be
> unsharded.
>
> I'm hoping there's an easy way to restructure my data / query to accomplish
> the faceting I need.
>
> The data set is static so can be re-indexed and reconfigured as needed. It's
> also not under any load yet.
>