I have a large Solr 7.2 index with nested documents and many shards.
For each result (parent doc) in a query, I want to gather a relevance-ranked 
subset of the child documents. It seemed like the subquery transformer would be 
ideal: 
https://lucene.apache.org/solr/guide/7_2/transforming-result-documents.html#TransformingResultDocuments-_subquery_
(the [child] transformer allows for a filter, but the results have an 
effectively random sort)

So maybe something like this:
q=<something>
fl=id,subquery:[subquery]
subquery.q=<something>
subquery.fq={!cache=false} +{!terms f=_root_ v=$row.id}

This actually works fine, but there’s a lot more work going on than necessary. 
Say we have X shards and get N documents back:

Query http requests = 1 top-level query + X distributed shard-requests
Subquery http requests = N rows + N * X distributed shard-requests
So with N=10 results and X=50 shards, that is: 1+50+10+500 = 561 http requests 
through the cluster.

Some of that is unavoidable, of course, but it occurs to me that all the child 
docs are indexed in the same shard (segment) as the parent doc. That means 
that if you know the parent doc id (and I do), you can use the document routing 
to know exactly which shard to send the subquery request to. This would save 
490 of the http requests in the scenario above.
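
To make the arithmetic above concrete, here is a rough Python sketch of the request counts. It assumes exactly the model described above (one top-level request plus one request per shard, and one request per row for each subquery plus its shard fan-out). The function name and its "routed" flag are only for illustration and are not anything Solr provides.

```python
def request_counts(n_rows, n_shards, routed=False):
    """Count HTTP requests for one query plus its per-row subqueries.

    routed=False: every subquery fans out to all shards (current behaviour).
    routed=True:  every subquery goes only to the shard holding the parent doc
                  (the hypothetical behaviour asked about here).
    """
    top_level = 1 + n_shards                      # 1 query + X shard requests
    fanout = 1 if routed else n_shards            # shards hit per subquery
    subqueries = n_rows + n_rows * fanout         # N rows + N * fan-out
    return top_level + subqueries


broadcast = request_counts(10, 50)
routed = request_counts(10, 50, routed=True)
print(broadcast, routed, broadcast - routed)      # 561 71 490
```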

Is there any form of query that allows for explicitly following the document 
routing rules for a given document ID?

I’m aware of the “distrib=false” and “shards=foo” parameters, but using those 
would require me to recreate the document routing in the client.
There’s also the “fl=[shard]” thing, but that would still require me to handle 
the subqueries in the client.

