gsmiller commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1148846237

   OK, I've (somewhat) caught up on the conversation here and will follow up on 
my original questions/comments (but am not going to jump in right now on the 
latest API discussion).
   
   1. I like this "facet set" naming approach along with providing specific 
implementations for "exact match" cases and "range" cases. I think we should 
stick to these two for now. If a user wants to "mix and match" (some dims are 
exact matches and some are ranges), they can use the more general "range" 
implementation (using the same value for min and max on the exact-match dims). Or 
they could of course implement their own. I don't think we need the complexity 
of an OOTB "mix and match" solution (for now at least).
   2. As far as solving for use-cases where users want to "fix" the n-1 dims 
and then get top values for the nth dim, I don't think we need to solve for 
that (yet). The existing "range" facet counting doesn't solve for this, and 
requires users to fully describe the ranges they care about. So for the sake of 
"progress not perfection", I see no issue with following a similar pattern here.
   3. If users _do_ need to implement the above use-case (no. 2 above), there's 
actually a different way to go about it. Because `LongValueFacetCounts` allows 
users to provide a `LongValuesSource`, users can implement their own 
`LongValuesSource` that provides values for the dimension they want to count, 
but pre-filters to only the points that match the n-1 filtering dims. So in the 
above example, if users wanted the top year values for movies that received the 
"Oscar+Drama" award, they can implement a `LongValuesSource` on top of the 
binary doc value field (the packed points) that "emits" the year value for each 
point, but only if the other dims meet the "Oscar+Drama" criteria. I've 
actually done this in practice. We could certainly make this easier for users 
to do, but they have all the primitives to do this on their own (especially 
with the addition of the proposed `LongPointDocValuesField`).
   4. I think there's actually a nice future optimization that's a bit easier 
with modeling the "exact match" and "range" cases separately. If the user has 
many points or "hyperrectangles" specified, we might want to use some sort of 
space-partitioning data structure to make determining the matching 
points/hyperrectangles more efficient as we iterate the doc points (instead of 
doing an exhaustive search every time). These data structures will be different 
for these two cases (one is probably some sort of KD-tree for the "exact match" 
case and the other might be some sort of R-tree for the "hyperrectangle" case). 
So having these separate implementations might actually set us up for a nice 
performance improvement too, where if we modeled everything as 
"hyperrectangles", we could end up just stuffing a bunch of points into an 
R-tree which is a little weird.
   5. I look forward to seeing the "range" implementation sketched out :)
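
To make the "mix and match" idea in no. 1 concrete, here is a minimal sketch of a per-dimension range check where an exact-match dim is just a range with min == max. The names (`RangeSetSketch`, `DimRange`, `matches`) are hypothetical and are not the API proposed in this PR; points are modeled as plain `long[]` arrays rather than packed doc values.

```java
/** Hypothetical sketch only; none of these types exist in Lucene. */
final class RangeSetSketch {

    /** Inclusive range for a single dimension. */
    static final class DimRange {
        final long min;
        final long max;

        DimRange(long min, long max) {
            if (min > max) {
                throw new IllegalArgumentException("min must be <= max");
            }
            this.min = min;
            this.max = max;
        }

        /** An exact-match dim is a degenerate range with min == max. */
        static DimRange exact(long value) {
            return new DimRange(value, value);
        }

        boolean contains(long value) {
            return value >= min && value <= max;
        }
    }

    /** Returns true only if every dim of the point falls inside its range. */
    static boolean matches(DimRange[] ranges, long[] point) {
        if (ranges.length != point.length) {
            throw new IllegalArgumentException("dimension count mismatch");
        }
        for (int d = 0; d < ranges.length; d++) {
            if (!ranges[d].contains(point[d])) {
                return false;
            }
        }
        return true;
    }
}
```

With this shape, a set such as `{ DimRange.exact(awardOrd), new DimRange(2000, 2009) }` mixes an exact dim with a ranged dim without needing a third implementation. Note that `matches` is the exhaustive per-point check that no. 4 suggests could later be replaced by a space-partitioning structure.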

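The pre-filtering idea in no. 3 can be sketched without any Lucene types. The code below is an illustration under assumptions, not the `LongValuesSource` implementation itself: each point is a plain `long[]`, the class and method names are made up, and a real version would decode the points from the binary doc value field and emit the surviving values through a `LongValuesSource`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; a real version would sit behind a LongValuesSource. */
final class FilteredDimCountSketch {

    /**
     * Counts the values of dimension countDim, but only for points whose other
     * dimensions all equal the corresponding entries in fixed. The entry
     * fixed[countDim] is ignored.
     */
    static Map<Long, Integer> countDim(List<long[]> points, long[] fixed, int countDim) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long[] point : points) {
            boolean accept = true;
            for (int d = 0; d < point.length; d++) {
                if (d != countDim && point[d] != fixed[d]) {
                    accept = false; // one of the n-1 filtering dims does not match
                    break;
                }
            }
            if (accept) {
                // "Emit" the value of the counted dim for this point.
                counts.merge(point[countDim], 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For the "Oscar+Drama" example, the points would be `(award, genre, year)` triples, `fixed` would hold the Oscar and Drama ordinals, and `countDim` would be the index of the year. Sorting the resulting map by count gives the top year values.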

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org