[jira] [Commented] (LUCENE-10628) Enable MatchingFacetSetCounts to use space partitioning data structures

Marc D'Mello (Jira) Tue, 12 Jul 2022 13:00:07 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565944#comment-17565944
 ]


Marc D'Mello commented on LUCENE-10628:
---------------------------------------

Thanks for taking a look! As for the answer to your question - I'm not sure, 
it's really up to the user. The facet set matchers can theoretically be an 
unlimited amount of dimensions. Just to make sure we are on the same page, I'll 
just define the relevant parts of the {{facetset}} package API (apologize if 
I'm just repeating information that you already know here):

Essentially we store multi-dim points into a {{BinaryDocValues}} field, so for 
example a list like {{(1, 2, 3), (2, 3, 4), (3, 4, 5)...}}. We have an 
{{ExactFacetSetMatcher}} that represents a single point of the same dimension 
as the field, and we will count how many points in the BDV match that point 
represented by the {{ExactFacetSetMatcher}}. We can put multiple of these 
{{ExactFacetSetMatcher}}'s into a group in {{MatchingFacetSetCounts}} and count 
how points in the BDV matched each {{ExactFacetSetMatcher}}. Currently, we 
linearly scan each point through each {{ExactFacetMatcher}} to get the counts, 
which is the part I want to optimize by putting the {{ExactFacetSetMatcher}}'s 
into a space partitioning data structure (either putting these into a KD tree, 
or as you suggested an interval tree). We also have {{RangeFacetSetMatcher}} 
which is similar to {{ExactFacetSetMatcher}}, except you can define ranges per 
dimension, for example something like {{(1 - 3, 3 - 4, 4 - 6)}} which would 
match if all of a point's dimensions lie within the range. I imagine you can 
put a group of these {{RangeFacetSetMatcher}}'s into an R tree to avoid linear 
scanning.

So I'd imagine most use cases would be with a low dimensionality, but there 
might be some use cases that require higher dimensions. In the higher dimension 
case, would it just be best to resort to linear scanning then rather than 
building a KD tree? For the {{RangeFacetSetMatcher}} case, would bulk adding 
these into an R tree also be too complex? In the common use case, there would 
be many more points in the index than {{FacetSetMatcher}}'s, so another 
approach could also be to index the points in a KD tree. Sorry for all the 
questions! But I would be really interested in any suggestions you have here as 
I am inexperienced with these kind of data structures. 

> Enable MatchingFacetSetCounts to use space partitioning data structures
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-10628
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10628
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Marc D'Mello
>            Priority: Minor
>
> Currently, {{MatchingFacetSetCounts}} iterates over {{FacetSetMatcher}} 
> instances passed into it linearly. While this is fine in some cases, if we 
> have a large amount of {{FacetSetMatcher}}'s, this can be inefficient. We 
> should provide the option to users to enable the use of space partitioning 
> data structures (namely R trees and KD trees) so we can potentially scan over 
> these {{FacetSetMatcher}}'s in sub-linear time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-10628) Enable MatchingFacetSetCounts to use space partitioning data structures

Reply via email to