gsmiller commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1148846237
OK, I've (somewhat) caught up on the conversation here and will follow up on my original questions/comments (but I'm not going to jump in on the latest API discussion right now).

1. I like this "facet set" naming approach, along with providing specific implementations for the "exact match" and "range" cases. I think we should stick to these two for now. If a user wants to "mix and match" (some dims are exact matches and some are ranges), they can use the more general "range" implementation, giving each exact-match dim a range whose min and max are the same value. Or they could of course implement their own. I don't think we need the complexity of an OOTB "mix and match" solution (for now at least).
2. As far as solving for use-cases where users want to "fix" n-1 dims and then get the top values for the nth dim, I don't think we need to solve for that (yet). The existing "range" facet counting doesn't solve for this either; it requires users to fully describe the ranges they care about. So for the sake of "progress not perfection", I see no issue with following a similar pattern here.
3. If users _do_ need to implement the above use-case (no. 2 above), there's actually a different way to go about it. Because `LongValueFacetCounts` allows users to provide a `LongValuesSource`, users can implement their own `LongValuesSource` that provides values for the dimension they want to count, but pre-filters to only the points that match the n-1 filtering dims. So in the above example, if users wanted the top year values for movies that received the "Oscar+Drama" award, they can implement a `LongValuesSource` on top of the binary doc value field (the packed points) that "emits" the year value for each point, but only if the other dims of that point meet the "Oscar+Drama" criteria. I've actually done this in practice. We could certainly make this easier for users to do, but they have all the primitives to do this on their own (especially with the addition of the proposed `LongPointDocValuesField`).
4. I think there's actually a nice future optimization that's a bit easier with modeling the "exact match" and "range" cases separately. If the user has many points or "hyperrectangles" specified, we might want to use some sort of space-partitioning data structure to make determining the matching points/hyperrectangles more efficient as we iterate the doc points (instead of doing an exhaustive search every time). These data structures will be different for the two cases (one is probably some sort of KD-tree for the "exact match" case and the other might be some sort of R-tree for the "hyperrectangle" case). So having these separate implementations might actually set us up for a nice performance improvement too, whereas if we modeled everything as "hyperrectangles", we could end up just stuffing a bunch of points into an R-tree, which is a little weird.
5. I look forward to seeing the "range" implementation sketched out :)
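To make point 3 a bit more concrete, here is a rough, Lucene-free sketch of the pre-filtering idea. Everything in it is hypothetical: the class and method names, the `{award, genre, year}` point layout, and the numeric codes are made up for illustration, and plain `long[]` arrays stand in for the packed points that a real `LongValuesSource` would decode from the binary doc value field. It only shows the filtering logic (emit the target dim of a point when all other dims match), not how it would be wired into `LongValueFacetCounts`.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; none of these names are Lucene APIs. */
class FixedDimsCounter {
  // Made-up codes for the movie-awards example; assumed point layout is {award, genre, year}.
  static final long OSCAR = 1, GLOBE = 2, DRAMA = 10, COMEDY = 11;

  /** True if every dim except targetDim equals the corresponding value in fixed. */
  static boolean matchesFixedDims(long[] point, long[] fixed, int targetDim) {
    for (int d = 0; d < point.length; d++) {
      if (d != targetDim && point[d] != fixed[d]) {
        return false;
      }
    }
    return true;
  }

  /**
   * Counts the values of targetDim over all points whose other dims match fixed.
   * docs[i] holds the points of document i; fixed[targetDim] is ignored.
   * Note this counts once per matching point, not once per document.
   */
  static Map<Long, Integer> countTargetDim(long[][][] docs, long[] fixed, int targetDim) {
    Map<Long, Integer> counts = new TreeMap<>();
    for (long[][] docPoints : docs) {
      for (long[] point : docPoints) {
        if (matchesFixedDims(point, fixed, targetDim)) {
          // This is the value the custom values source would "emit" for the point.
          counts.merge(point[targetDim], 1, Integer::sum);
        }
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    long[][][] docs = {
      {{OSCAR, DRAMA, 1994}, {GLOBE, DRAMA, 1994}}, // one movie with two awards
      {{OSCAR, DRAMA, 1994}},
      {{OSCAR, COMEDY, 1998}},
      {{OSCAR, DRAMA, 2001}}
    };
    // Fix award=OSCAR and genre=DRAMA, count the year dim (index 2).
    Map<Long, Integer> years = countTargetDim(docs, new long[] {OSCAR, DRAMA, 0}, 2);
    System.out.println(years); // prints {1994=2, 2001=1}
  }
}
```

In a real implementation the inner loop would live inside the custom values source, so the existing facet counting code only ever sees the already-filtered year values.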
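For the "mix and match" case in point 1, the sketch below shows how an exact-match dim can be expressed as a degenerate range with min equal to max, so a single "range" style matcher covers both kinds of dims. Again, this is only an assumption-laden illustration: `DimRange` and `MixedFacetSetSketch` are hypothetical names, not the classes proposed in this PR, and bounds are assumed to be inclusive.

```java
/** Hypothetical inclusive range over a single dim; not a Lucene class. */
final class DimRange {
  final long min;
  final long max;

  private DimRange(long min, long max) {
    if (min > max) {
      throw new IllegalArgumentException("min must be <= max");
    }
    this.min = min;
    this.max = max;
  }

  static DimRange between(long min, long max) {
    return new DimRange(min, max);
  }

  /** An exact match is just a range whose min and max are the same value. */
  static DimRange exactly(long value) {
    return new DimRange(value, value);
  }

  boolean contains(long value) {
    return value >= min && value <= max;
  }
}

/** Hypothetical matcher: a point matches if every dim falls inside its range. */
final class MixedFacetSetSketch {
  static boolean matches(long[] point, DimRange... dims) {
    if (point.length != dims.length) {
      throw new IllegalArgumentException("point and ranges must have the same number of dims");
    }
    for (int d = 0; d < point.length; d++) {
      if (!dims[d].contains(point[d])) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Made-up codes: award 1 = Oscar, genre 10 = Drama; year is a true range.
    DimRange[] oscarDramaNineties = {
      DimRange.exactly(1), DimRange.exactly(10), DimRange.between(1990, 1999)
    };
    System.out.println(matches(new long[] {1, 10, 1994}, oscarDramaNineties)); // prints true
    System.out.println(matches(new long[] {1, 10, 2001}, oscarDramaNineties)); // prints false
  }
}
```

The trade-off is that exact-match dims pay for two comparisons instead of one, which seems acceptable given that it avoids a dedicated OOTB "mix and match" implementation.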