[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

GitBox Tue, 14 Jun 2022 21:54:26 -0700


shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r897533011



##########
lucene/facet/docs/FacetSets.adoc:
##########
@@ -0,0 +1,90 @@
+= FacetSets Overview
+:toc:
+
+This document describes the `FacetSets` capability, which allows to aggregate 
on multi dimensional values. It starts
+with outlining a few example use cases to showcase the motivation for this 
capability and follows with an API
+walk through.
+
+== Motivation
+
+[#movie-actors]
+=== Movie Actors DB
+
+Suppose that you want to build a search engine for movie actors which allows 
you to search for actors by name and see
+movie titles they appeared in. You might want to index standard fields such as 
`actorName`, `genre` and `releaseYear`
+which will let you search by the actor's name or see all actors who appeared 
in movies during 2021. Similarly, you can
+index facet fields that will let you aggregate by “Genre” and “Year” so that 
you can show how many actors appeared in
+each year or genre. Few example documents:
+
+[source]
+----
+{ "name": "Tom Hanks", "genre": ["Comedy", "Drama", …], "year": [1988, 2000,…] 
}
+{ "name": "Harrison Ford", "genre": ["Action", "Adventure", …], "year": [1977, 
1981, …] }
+----
+
+However, these facet fields do not allow you to show the following aggregation:
+
+.Number of Actors performing in movies by Genre and Year
+[cols="4*"]
+|===
+| | 2020 | 2021 | 2022
+| Thriller | 121 | 43 | 97
+| Action | 145 | 52 | 130
+| Adventure | 87 | 21 | 32
+|===
+
+The reason is that each “genre” or “releaseYear” facet field is indexed in its 
own data structure, and therefore if an
+actor appeared in a "Thriller" movie in "2020" and "Action" movie in "2021", 
there's no way for you to tell that they
+didn't appear in an "Action" movie in "2020".
+
+[#automotive-parts]
+=== Automotive Parts Store
+
+Say you're building a search engine for an automotive parts store where 
customers can search for different car parts.
+For simplicity let's assume that each item in the catalog contains a 
searchable “type” field and “car model” it fits
+which consists of two separate fields: “manufacturer” and “year”. This lets 
you search for parts by their type as well
+as filter parts that fit only a certain manufacturer or year. Few example 
documents:
+
+[source]
+----
+{
+  "type": "Wiper Blades V1",
+  "models": [
+    { "manufaturer": "Ford", "year": 2010 },
+    { "manufacturer": "Chevy", "year": 2011 }
+  ]
+}
+{
+  "type": "Wiper Blades V2",
+  "models": [
+    { "manufaturer": "Ford", "year": 2011 },
+    { "manufacturer": "Chevy", "year": 2010 }
+  ]
+}
+----
+
+By breaking up the "models" field into its sub-fields "manufacturer" and 
"year", you can easily aggregate on parts that
+fit a certain manufacturer or year. However, if a user would like to aggregate 
on parts that can fit either a "Ford
+2010" or "Chevy 2011", then aggregating on the sub-fields will lead to a wrong 
count of 2 (in the above example) instead
+of 1.
+
+[#movie-awards]
+=== Movie Awards
+
+To showcase a 3-D multi-dimensional aggregation, lets expand the 
<<movie-actors>> example with awards an actor has
+received over the years. For this aggregation we will use four dimensions: 
Award Type ("Oscar", "Grammy", "Emmy"),
+Award Category ("Best Actor", "Best Supporting Actress"), Year and Genre. One 
interesting aggregation is to show how
+many "Best Actor" vs "Best Supporting Actor" awards one has received in the 
"Oscar" or "Emmy" for each year. Another
+aggregation is slicing the number of these awards by Genre over all the years.
+
+Building on these examples, one might be able to come up with an interesting 
use case for an N-dimensional aggregation
+(where `N > 3`). The higher `N` is, the harder it is to aggregate all the 
dimensions correctly and efficiently without
+`FacetSets`.
+
+== FacetSets API
+
+TBD
+
+== FacetSets Under the Hood
+
+TBD

Review Comment:
   I intended to do that, just wanted us to finalize the API first.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

Reply via email to