[GitHub] [lucene] shaie commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

GitBox Tue, 31 May 2022 07:25:07 -0700


shaie commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1142206707


   > Longer-term, I'd be more in favor of keeping the new field more generic 
(e.g., a generic "points" doc values field that can be used for faceting but 
also "slow" queries)
   
   There are two sides to FacetSets: indexing and aggregation. Well actually 
three: drill-down too. For indexing I think a generic `long... values` field 
make sense for FacetSets and perhaps other use cases unrelated to faceting. As 
long as one can use its encoding as `LongPoint.pack()` or any other future 
scheme that will be _generic_, then I don't see why not. However if that's all 
what this field will currently do, then I prefer that we start without it in 
the `facet` package. We can always extend this field in the future, rather than 
BDV, right? (as long as it keeps the same order of the values)
   
   For aggregation though, as I thought about the API more, I realized that a 
generic "aggregator" makes no sense. On the file-system it's just a `long[]` 
but how you aggregate this series is totally use-case dependent:
   
   * For the "Automotive Parts Store" example, you'd want a "Matching Facet 
Counts" which are given a list of `FacetSet` objects and count documents which 
have an exact matching set.
   * For the HyperRectangle faceting (what's currently proposed in this PR) 
you'd want a different Facet Counts object which ensures that a given point 
falls within a pair of `long` values. So here the underlying `long[]` is 
actually pair of longs, really both belong to the same "dimension" maybe, and 
however a "FacetSet" is matched is a different function than for the one above.
   * For N-dimensional FacetSets, a'la the "Movie Awards" example, one might 
index 5 dimensions for a Movie award, but choose to aggregate only on 3 of 
them. In that case you'll want a Facet matcher which knows in advance which 
dimensions it picks from the set to determine whether there's a match.
   * And lastly for the "Movie Actor DB" example, if you'd want to actually 
compute a matrix, then our `Facets` API is completely not suitable, since it's 
a flat single-dimension result type. I don't propose to extend it, only saying 
that if we wanted to compute N-dimensional results, we'd need another API.
     * Actually, this feature by itself is very cool, since it's an interesting 
challenge to compute a matrix of "interesting" values. You can't just pick the 
top "cells" since that will produce fragmented rows, so you have to actually 
pick "interesting rows/columns". But that's for a different PR.
   
   Therefore, while it looks like a generic `long[]` DV (unsorted!) field might 
be enough for indexing facet sets, for aggregations I think we'll have few APIs.
   
   > If we end up proposing this as a "sandbox" module feature for now, then 
I'd +1 this idea of a "facet set" API.
   
   I am personally for developing it directly in the facet module. If we add 
`@lucene.experimental` to the classes, along with a "NOTE: this feature is 
still under heavy development. If you use it you should expect changes to the 
API as well the on-disk structures until it finalized and optimized". WDYT?
   
   > If we propose adding this to the core functionality though, I'd like to 
further discuss the pros/cons of how specific we make this new doc values field.
   
   At this point I really think we can continue with BDV. Only if someone has 
prior knowledge about the data and can encode each column separately, would it 
(perhaps) make sense to consider a different encoding?
   
   We can also consider including a `VERSION` identifier on-disk which we'll 
use to tell how to read the data during aggregation. It can just be another 
"long" we add to the beginning of the list. Yes, it's less optimal than a 
dedicated DV type, since it's repeating the same number for every field in 
every document, but it's something we can consider if we worry about it. 
Frankly though, I feel like we can relax the requirement of this feature at the 
start and not worry about it until after we're ready to remove that `NOTE:` 
from the javadocs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] shaie commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

Reply via email to