shaie commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1142206707
> Longer-term, I'd be more in favor of keeping the new field more generic (e.g., a generic "points" doc values field that can be used for faceting but also "slow" queries) There are two sides to FacetSets: indexing and aggregation. Well actually three: drill-down too. For indexing I think a generic `long... values` field make sense for FacetSets and perhaps other use cases unrelated to faceting. As long as one can use its encoding as `LongPoint.pack()` or any other future scheme that will be _generic_, then I don't see why not. However if that's all what this field will currently do, then I prefer that we start without it in the `facet` package. We can always extend this field in the future, rather than BDV, right? (as long as it keeps the same order of the values) For aggregation though, as I thought about the API more, I realized that a generic "aggregator" makes no sense. On the file-system it's just a `long[]` but how you aggregate this series is totally use-case dependent: * For the "Automotive Parts Store" example, you'd want a "Matching Facet Counts" which are given a list of `FacetSet` objects and count documents which have an exact matching set. * For the HyperRectangle faceting (what's currently proposed in this PR) you'd want a different Facet Counts object which ensures that a given point falls within a pair of `long` values. So here the underlying `long[]` is actually pair of longs, really both belong to the same "dimension" maybe, and however a "FacetSet" is matched is a different function than for the one above. * For N-dimensional FacetSets, a'la the "Movie Awards" example, one might index 5 dimensions for a Movie award, but choose to aggregate only on 3 of them. In that case you'll want a Facet matcher which knows in advance which dimensions it picks from the set to determine whether there's a match. * And lastly for the "Movie Actor DB" example, if you'd want to actually compute a matrix, then our `Facets` API is completely not suitable, since it's a flat single-dimension result type. I don't propose to extend it, only saying that if we wanted to compute N-dimensional results, we'd need another API. * Actually, this feature by itself is very cool, since it's an interesting challenge to compute a matrix of "interesting" values. You can't just pick the top "cells" since that will produce fragmented rows, so you have to actually pick "interesting rows/columns". But that's for a different PR. Therefore, while it looks like a generic `long[]` DV (unsorted!) field might be enough for indexing facet sets, for aggregations I think we'll have few APIs. > If we end up proposing this as a "sandbox" module feature for now, then I'd +1 this idea of a "facet set" API. I am personally for developing it directly in the facet module. If we add `@lucene.experimental` to the classes, along with a "NOTE: this feature is still under heavy development. If you use it you should expect changes to the API as well the on-disk structures until it finalized and optimized". WDYT? > If we propose adding this to the core functionality though, I'd like to further discuss the pros/cons of how specific we make this new doc values field. At this point I really think we can continue with BDV. Only if someone has prior knowledge about the data and can encode each column separately, would it (perhaps) make sense to consider a different encoding? We can also consider including a `VERSION` identifier on-disk which we'll use to tell how to read the data during aggregation. It can just be another "long" we add to the beginning of the list. Yes, it's less optimal than a dedicated DV type, since it's repeating the same number for every field in every document, but it's something we can consider if we worry about it. Frankly though, I feel like we can relax the requirement of this feature at the start and not worry about it until after we're ready to remove that `NOTE:` from the javadocs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org