On Thu, Dec 16, 2021 at 4:56 PM Robert Muir <[email protected]> wrote:
>
> On Thu, Dec 16, 2021 at 5:57 PM Greg Miller <[email protected]> wrote:
> >
> > This is separate from adding hierarchical support. I'm probably not
> > communicating the current state well, but here's where SSDV faceting
> > does ordinal lookups:
> > https://github.com/apache/lucene/blob/c64e5fe84c4990968844193e3a62f4ebbba638ea/lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java#L148
> >
> > So this is done for every returned value, which as you describe,
> > scales with the requested top-n. For getAllDims, this logic is
> > executed for every dimension.
> >
> > I don't think these lookups are avoidable since we provide the path
> > for each returned value, and in order to get the path, we need to
> > dereference the ordinal.
> >
>
> OK I get it. I think the strangeness (compared to e.g. solr faceting)
> is that we're mixing ordinals from different fields ("dims") all into
> one DV field? And then we have a trappy method to do top-N for all
> possible dims in this single packed field (what if there are
> thousands???).

Right. The "get all dims" functionality does have a bit of a "trappy" feel to it for this reason. I think there are situations where "dim mixing" can be beneficial: if you actually do need facet counts for most (or all) of your dims, I can see the benefit of iterating the FacetsCollector once, counting everything in the SSDV field in the same pass, then getting the results you need. But this is very suboptimal if you have a large number of dims and only need faceting on a small number of them (you burn a bunch of up-front cost counting dims you don't care about).

I tried to address this by providing StringValueFacetCounts (LUCENE-9950), which essentially chucks the concept of "dim" altogether and assumes the field itself is the dim (sounds like what Solr does, but I need to get more familiar with that impl). When I introduced StringValueFacetCounts, I was hesitant to suggest deprecating what SSDV faceting does, since I think there are valid applications for wanting to pack many dims into a single field. For what it's worth, taxonomy-based faceting operates in the same way, defaulting to packing all the dims into one doc value field.

Anyway, not really sure where I'm going with all this except to say +1 to getAllDims being potentially trappy. I can see users thinking, "well, I need to grab counts for a few different dims, so I'll just call getAllDims and then pull out what I want instead of calling getTopChildren for each dim." Hopefully they're not doing this, but it would be an easy trap to fall into.

I suppose the last thing I'd say is that there are valid use cases for wanting the "top" dims along with their "top" children, and getAllDims provides a reasonable way to do this. For example, in Amazon's product search, we have a large number of different dims but only want to show a small subset to customers on a search page.
One way to go about this would be to determine the "top" dims for the match set along with the "top n" values under each. getAllDims is helpful for this, but it has the unpleasant side effect of unnecessarily resolving the paths for all children for all dims.

As I think about this, I wonder if a getTopDims method that lets the user specify the number of dims they want back, along with the number of children for each, would be more useful. I'll open a Jira for that.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
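
To make the getTopDims idea a bit more concrete, here is a rough sketch in plain Java. It is not written against the Lucene API: TopDimsSketch, DimRange, DimResult, lookupLabel and the sample data are all made up for illustration, and the getTopDims signature is only a guess at what such a method might take. The point it tries to show is the ordering of work: pick the top dims and their top children using ordinals and counts only, then dereference ordinals to labels for just the entries that are returned. It also assumes each dim's total is the sum of its child counts, which would overcount documents carrying several values in one dim.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model of a packed facet field. Everything here is hypothetical, not Lucene code. */
class TopDimsSketch {

    /** One dim's contiguous ordinal range [startOrd, endOrd] within the packed field. */
    static final class DimRange {
        final String dim; final int startOrd; final int endOrd;
        DimRange(String dim, int startOrd, int endOrd) {
            this.dim = dim; this.startOrd = startOrd; this.endOrd = endOrd;
        }
    }

    /** A selected dim, its summed child count, and its top children as "label=count". */
    static final class DimResult {
        final String dim; final int total; final List<String> children = new ArrayList<>();
        DimResult(String dim, int total) { this.dim = dim; this.total = total; }
    }

    /** Ordinal-only candidate: no labels have been resolved for it yet. */
    private static final class Candidate {
        final DimRange range; final int total; final int[] topOrds;
        Candidate(DimRange range, int total, int[] topOrds) {
            this.range = range; this.total = total; this.topOrds = topOrds;
        }
    }

    // Made-up sample data: labels sorted by dim, one count per ordinal.
    static final String[] LABELS = {
        "brand/acme", "brand/globex", "color/blue", "color/green", "color/red", "size/large", "size/small"
    };
    static final int[] COUNTS = {3, 1, 5, 0, 7, 2, 2};
    static final List<DimRange> DIMS = List.of(
        new DimRange("brand", 0, 1), new DimRange("color", 2, 4), new DimRange("size", 5, 6));

    /** Number of ordinal-to-label dereferences performed so far. */
    static int lookups = 0;

    /** Stand-in for the per-value ordinal lookup discussed in the thread. */
    static String lookupLabel(String[] labels, int ord) {
        lookups++;
        return labels[ord];
    }

    static List<DimResult> getTopDims(String[] labels, int[] counts, List<DimRange> dims,
                                      int topNDims, int topNChildren) {
        // Pass 1: work on ordinals and counts only.
        List<Candidate> candidates = new ArrayList<>();
        for (DimRange r : dims) {
            int total = 0;
            // Min-heap whose head is the weakest kept child (lowest count, then highest ordinal).
            PriorityQueue<Integer> heap = new PriorityQueue<>((a, b) ->
                counts[a] != counts[b] ? Integer.compare(counts[a], counts[b]) : Integer.compare(b, a));
            for (int ord = r.startOrd; ord <= r.endOrd; ord++) {
                if (counts[ord] == 0) continue;
                total += counts[ord]; // simplification: sum of child counts as the dim total
                heap.add(ord);
                if (heap.size() > topNChildren) heap.poll();
            }
            if (total == 0) continue;
            int[] topOrds = new int[heap.size()];
            for (int i = topOrds.length - 1; i >= 0; i--) topOrds[i] = heap.poll(); // best first
            candidates.add(new Candidate(r, total, topOrds));
        }
        // Highest total first; ties broken by dim name.
        candidates.sort((a, b) -> a.total != b.total
            ? Integer.compare(b.total, a.total) : a.range.dim.compareTo(b.range.dim));

        // Pass 2: resolve labels only for the dims and children actually returned.
        List<DimResult> results = new ArrayList<>();
        int limit = Math.max(0, Math.min(topNDims, candidates.size()));
        for (Candidate c : candidates.subList(0, limit)) {
            DimResult res = new DimResult(c.range.dim, c.total);
            for (int ord : c.topOrds) res.children.add(lookupLabel(labels, ord) + "=" + counts[ord]);
            results.add(res);
        }
        return results;
    }

    public static void main(String[] args) {
        for (DimResult r : getTopDims(LABELS, COUNTS, DIMS, 2, 2)) {
            System.out.println(r.dim + " (" + r.total + "): " + r.children);
        }
        System.out.println("label lookups: " + lookups);
    }
}
```

With the sample data, asking for 2 dims and 2 children each returns color and brand and performs 4 label lookups, whereas resolving the top 2 children of every dim would take 6. With thousands of dims and only a handful requested, that gap is where the savings would come from.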
