[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

GitBox Tue, 21 Jun 2022 21:49:44 -0700


shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r903278541



##########
lucene/facet/docs/FacetSets.adoc:
##########
@@ -0,0 +1,130 @@
+= FacetSets Overview
+:toc:
+
+This document describes the `FacetSets` capability, which lets you aggregate on multidimensional values. It starts by
+outlining a few example use cases that motivate the capability, then walks through the API.
+
+== Motivation
+
+[#movie-actors]
+=== Movie Actors DB
+
+Suppose that you want to build a search engine for movie actors which allows you to search for actors by name and see
+the movie titles they appeared in. You might want to index standard fields such as `actorName`, `genre` and
+`releaseYear`, which will let you search by the actor's name or see all actors who appeared in movies during 2021.
+Similarly, you can index facet fields that will let you aggregate by “Genre” and “Year” so that you can show how many
+actors appeared in each year or genre. A few example documents:
+
+[source]
+----
+{ "name": "Tom Hanks", "genre": ["Comedy", "Drama", …], "year": [1988, 2000, …] }
+{ "name": "Harrison Ford", "genre": ["Action", "Adventure", …], "year": [1977, 1981, …] }
+----
+
+However, these facet fields do not allow you to show the following aggregation:
+
+.Number of Actors performing in movies by Genre and Year
+[cols="4*"]
+|===
+|           | 2020 | 2021 | 2022
+| Thriller  | 121  | 43   | 97
+| Action    | 145  | 52   | 130
+| Adventure | 87   | 21   | 32
+|===
+
+The reason is that each “genre” or “releaseYear” facet field is indexed in its own data structure, and therefore if an
+actor appeared in a "Thriller" movie in "2020" and an "Action" movie in "2021", there's no way for you to tell that they
+didn't appear in an "Action" movie in "2020".
+
+[#automotive-parts]
+=== Automotive Parts Store
+
+Say you're building a search engine for an automotive parts store where customers can search for different car parts.
+For simplicity, let's assume that each item in the catalog contains a searchable “type” field and the “car model” it
+fits, which consists of two separate fields: “manufacturer” and “year”. This lets you search for parts by their type as
+well as filter parts that fit only a certain manufacturer or year. A few example documents:
+
+[source]
+----
+{
+  "type": "Wiper Blades V1",
+  "models": [
+    { "manufacturer": "Ford", "year": 2010 },
+    { "manufacturer": "Chevy", "year": 2011 }
+  ]
+}
+{
+  "type": "Wiper Blades V2",
+  "models": [
+    { "manufacturer": "Ford", "year": 2011 },
+    { "manufacturer": "Chevy", "year": 2010 }
+  ]
+}
+----
+
+By breaking up the "models" field into its sub-fields "manufacturer" and "year", you can easily aggregate on parts that
+fit a certain manufacturer or year. However, if a user would like to aggregate on parts that can fit either a
+"Ford 2010" or "Chevy 2011", then aggregating on the sub-fields will lead to a wrong count of 2 (in the above example)
+instead of 1.
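
The miscount described above can be reproduced with a few lines of plain Java. This is only an illustrative sketch: `PartCountSketch`, `Model`, `Part` and both counting methods are hypothetical names invented here, and none of them are part of the Lucene API. It compares matching on the two sub-fields independently against matching on whole (manufacturer, year) pairs.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; these types are not part of Lucene.
class PartCountSketch {
  static final class Model {
    final String manufacturer;
    final int year;

    Model(String manufacturer, int year) {
      this.manufacturer = manufacturer;
      this.year = year;
    }

    boolean sameAs(Model other) {
      return manufacturer.equals(other.manufacturer) && year == other.year;
    }
  }

  static final class Part {
    final String type;
    final List<Model> models;

    Part(String type, Model... models) {
      this.type = type;
      this.models = Arrays.asList(models);
    }
  }

  // The two example documents from the text.
  static final List<Part> CATALOG = Arrays.asList(
      new Part("Wiper Blades V1", new Model("Ford", 2010), new Model("Chevy", 2011)),
      new Part("Wiper Blades V2", new Model("Ford", 2011), new Model("Chevy", 2010)));

  // The car models the user is interested in: "Ford 2010" or "Chevy 2011".
  static final List<Model> WANTED =
      Arrays.asList(new Model("Ford", 2010), new Model("Chevy", 2011));

  // Checks manufacturer and year independently, as separate sub-fields would.
  static int countBySubFields(List<Part> parts, List<Model> wanted) {
    int count = 0;
    for (Part part : parts) {
      boolean manufacturerHit = false;
      boolean yearHit = false;
      for (Model m : part.models) {
        for (Model w : wanted) {
          manufacturerHit |= m.manufacturer.equals(w.manufacturer);
          yearHit |= m.year == w.year;
        }
      }
      if (manufacturerHit && yearHit) {
        count++;
      }
    }
    return count;
  }

  // Checks each (manufacturer, year) pair as a single unit.
  static int countByTuples(List<Part> parts, List<Model> wanted) {
    int count = 0;
    for (Part part : parts) {
      boolean hit = false;
      for (Model m : part.models) {
        for (Model w : wanted) {
          hit |= m.sameAs(w);
        }
      }
      if (hit) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println("sub-fields: " + countBySubFields(CATALOG, WANTED));
    System.out.println("tuples:     " + countByTuples(CATALOG, WANTED));
  }
}
```

With the two example documents, the sub-field check accepts both parts, while the pair check accepts only "Wiper Blades V1". Keeping the dimension values together as one unit is what avoids the false match.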
+
+[#movie-awards]
+=== Movie Awards
+
+To showcase an aggregation over more than two dimensions, let's expand the <<movie-actors>> example with awards an
+actor has received over the years. For this aggregation we will use four dimensions: Award Type ("Oscar", "Grammy",
+"Emmy"), Award Category ("Best Actor", "Best Supporting Actress"), Year and Genre. One interesting aggregation is to
+show how many "Best Actor" vs "Best Supporting Actor" awards one has received in the "Oscar" or "Emmy" for each year.
+Another aggregation is slicing the number of these awards by Genre over all the years.
+
+Building on these examples, one might be able to come up with an interesting use case for an N-dimensional aggregation
+(where `N > 3`). The higher `N` is, the harder it is to aggregate all the dimensions correctly and efficiently without
+`FacetSets`.
+
+== FacetSets API
+
+The `facetset` package consists of a few components which allow you to index and aggregate multidimensional facet sets:
+
+=== FacetSet
+
+Holds a set of facet dimension values. Implementations are required to convert the dimensions into a comparable long
+representation, and can also customize how the values are packed (encoded). The package offers four implementations:
+`Int/Float/Long/DoubleFacetSet` for `int`, `float`, `long` and `double` values respectively. You can also look at
+`org.apache.lucene.demo.facet.CustomFacetSetExample` in the `lucene/demo` package for a custom implementation of a
+`FacetSet`.
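
To make "comparable long representation" concrete, here is a minimal standalone sketch of one common order-preserving mapping from floating-point values to `long`. The class and method names are hypothetical, and the sketch does not claim to be how the package's own `FacetSet` implementations encode values. Lucene's `NumericUtils` offers helpers in the same spirit, such as `doubleToSortableLong`.

```java
// Hypothetical illustration only; not taken from the facetset package.
class SortableEncodingSketch {
  // Maps a double to a long such that comparing the longs as signed values
  // gives the same order as comparing the original doubles.
  static long doubleToComparableLong(double value) {
    long bits = Double.doubleToLongBits(value);
    // For negative values (sign bit set) flip the lower 63 bits so that
    // larger magnitudes sort lower; non-negative values are left unchanged.
    return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
  }

  // Same idea for float; the int result is widened to long.
  static long floatToComparableLong(float value) {
    int bits = Float.floatToIntBits(value);
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  public static void main(String[] args) {
    double[] samples = {-1.5, -0.5, 0.0, 2.25};
    for (double d : samples) {
      System.out.println(d + " -> " + doubleToComparableLong(d));
    }
  }
}
```

Once every dimension is mapped to a `long` this way, range checks on the encoded values agree with range checks on the original numbers.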
+
+=== FacetSetsField
+
+A `BinaryDocValues` field which lets you index a list of `FacetSet` instances. This field can be added to a document
+only once, so you will need to construct all the facet sets in advance.
+
+=== FacetSetMatcher
+
+Responsible for matching an encoded `FacetSet` against given criteria. For example, `ExactFacetSetMatcher` only
+considers an encoded facet set as a match if all dimension values are equal to a given one. `RangeFacetSetMatcher`
+considers an encoded facet set as a match if all dimension values fall within predefined ranges. You can also look at
+`org.apache.lucene.demo.facet.CustomFacetSetExample` in the `lucene/demo` package for a custom implementation of a
+`FacetSetMatcher`.
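
The two matching rules can be sketched in plain Java as follows. This is an assumption-laden illustration and not the Lucene implementation: `MatcherSketch` and its methods are hypothetical names, the facet set is represented as an already decoded `long[]` instead of encoded bytes, and the range bounds are assumed to be inclusive.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the ExactFacetSetMatcher or
// RangeFacetSetMatcher code from the facetset package.
class MatcherSketch {
  // Exact rule: every dimension value must equal the expected one.
  static boolean matchesExactly(long[] candidate, long[] expected) {
    return Arrays.equals(candidate, expected);
  }

  // Range rule: every dimension value must fall inside its own range.
  // Bounds are treated as inclusive here, which is an assumption.
  static boolean matchesRange(long[] candidate, long[] min, long[] max) {
    if (candidate.length != min.length || candidate.length != max.length) {
      return false;
    }
    for (int i = 0; i < candidate.length; i++) {
      if (candidate[i] < min[i] || candidate[i] > max[i]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    long[] set = {2010, 5}; // for example (year, rating)
    System.out.println(matchesExactly(set, new long[] {2010, 5}));
    System.out.println(matchesRange(set, new long[] {2000, 1}, new long[] {2015, 4}));
  }
}
```

Both rules require all dimensions to pass together, so a set such as (2010, 5) does not match the range (2000 to 2015, 1 to 4) even though its first dimension is in range.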
+
+=== FacetSetDecoder

Review Comment:
   Good idea, done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

