[ https://issues.apache.org/jira/browse/LUCENE-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565944#comment-17565944 ]
Marc D'Mello commented on LUCENE-10628: --------------------------------------- Thanks for taking a look! As for the answer to your question - I'm not sure, it's really up to the user. The facet set matchers can theoretically be an unlimited amount of dimensions. Just to make sure we are on the same page, I'll just define the relevant parts of the {{facetset}} package API (apologize if I'm just repeating information that you already know here): Essentially we store multi-dim points into a {{BinaryDocValues}} field, so for example a list like {{(1, 2, 3), (2, 3, 4), (3, 4, 5)...}}. We have an {{ExactFacetSetMatcher}} that represents a single point of the same dimension as the field, and we will count how many points in the BDV match that point represented by the {{ExactFacetSetMatcher}}. We can put multiple of these {{ExactFacetSetMatcher}}'s into a group in {{MatchingFacetSetCounts}} and count how points in the BDV matched each {{ExactFacetSetMatcher}}. Currently, we linearly scan each point through each {{ExactFacetMatcher}} to get the counts, which is the part I want to optimize by putting the {{ExactFacetSetMatcher}}'s into a space partitioning data structure (either putting these into a KD tree, or as you suggested an interval tree). We also have {{RangeFacetSetMatcher}} which is similar to {{ExactFacetSetMatcher}}, except you can define ranges per dimension, for example something like {{(1 - 3, 3 - 4, 4 - 6)}} which would match if all of a point's dimensions lie within the range. I imagine you can put a group of these {{RangeFacetSetMatcher}}'s into an R tree to avoid linear scanning. So I'd imagine most use cases would be with a low dimensionality, but there might be some use cases that require higher dimensions. In the higher dimension case, would it just be best to resort to linear scanning then rather than building a KD tree? For the {{RangeFacetSetMatcher}} case, would bulk adding these into an R tree also be too complex? In the common use case, there would be many more points in the index than {{FacetSetMatcher}}'s, so another approach could also be to index the points in a KD tree. Sorry for all the questions! But I would be really interested in any suggestions you have here as I am inexperienced with these kind of data structures. > Enable MatchingFacetSetCounts to use space partitioning data structures > ----------------------------------------------------------------------- > > Key: LUCENE-10628 > URL: https://issues.apache.org/jira/browse/LUCENE-10628 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Marc D'Mello > Priority: Minor > > Currently, {{MatchingFacetSetCounts}} iterates over {{FacetSetMatcher}} > instances passed into it linearly. While this is fine in some cases, if we > have a large amount of {{FacetSetMatcher}}'s, this can be inefficient. We > should provide the option to users to enable the use of space partitioning > data structures (namely R trees and KD trees) so we can potentially scan over > these {{FacetSetMatcher}}'s in sub-linear time. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org