Re: Any potential benefits to a SSDV#bulkLookupOrd(long ord) impl?

Greg Miller Fri, 17 Dec 2021 07:03:20 -0800

On Thu, Dec 16, 2021 at 4:56 PM Robert Muir <[email protected]> wrote:
>
> On Thu, Dec 16, 2021 at 5:57 PM Greg Miller <[email protected]> wrote:
> >
> > This is separate from adding hierarchical support. I'm probably not
> > communicating the current state well, but here's where SSDV faceting
> > does ordinal lookups:
> > https://github.com/apache/lucene/blob/c64e5fe84c4990968844193e3a62f4ebbba638ea/lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java#L148
> >
> > So this is done for every returned value, which as you describe,
> > scales with the requested top-n. For getAllDims, this logic is
> > executed for every dimension.
> >
> > I don't think these lookups are avoidable since we provide the path
> > for each returned value, and in order to get the path, we need to
> > dereference the ordinal.
> >
>
> OK I get it. I think the strangeness (compared to e.g. solr faceting)
> is that we're mixing ordinals from different fields ("dims") all into
> one DV field? And then we have a trappy method to do top-N for all
> possible dims in this single packed field (what if there are
> thousands???).


Right. The "get all dims" functionality does have a bit of a "trappy"
feel to it for this reason. I think there are situations where "dim
mixing" can be beneficial; if you actually do need facet counts for
most (or all) of your dims, I can see the benefit of iterating the
FacetsCollector once, counting everything in the SSDV field in the
same pass, then getting the results you need. But this is very
suboptimal if you have a large number of dims and only need faceting
on a small number of them (you burn a bunch of up-front cost counting
dims you don't care about). I tried to address this by providing
StringValueFacetCounts (LUCENE-9950), which essentially chucks the
concept of "dim" altogether and assumes the field itself is the dim
(sounds like what solr does, but I need to get more familiar with that
impl). When I introduced StringValueFacetCounts, I was hesitant to
suggest deprecating what SSDV faceting does since I think there are
valid applications for wanting to pack many dims into a single field.
For what it's worth, taxonomy-based faceting operates in the same way,
defaulting to packing all the dims into one doc value field.
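
To make the "dim mixing" trade-off concrete, here is a minimal,
self-contained sketch of several dims packed into one sorted ordinal
space. It is a toy model, not Lucene code: the class PackedDimsSketch,
its methods, and the use of '\u001F' as the dim/value separator are all
assumptions made for illustration. It shows that a single pass counts
every ordinal regardless of dim, and that each dim ends up owning a
contiguous ordinal range, so per-dim work can be limited to that range.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy model of many dims packed into one sorted ordinal space (not Lucene code). */
public class PackedDimsSketch {
    // Assumed separator between dim and value inside a packed term.
    static final char DELIM = '\u001F';

    // Sorted packed terms; a term's ordinal is its index in this array.
    final String[] terms;
    // dim -> {firstOrd, lastOrd}, in sorted dim order.
    final Map<String, int[]> ordRange = new LinkedHashMap<>();

    public PackedDimsSketch(List<String> packedTerms) {
        terms = new TreeSet<String>(packedTerms).toArray(new String[0]);
        for (int ord = 0; ord < terms.length; ord++) {
            String dim = terms[ord].substring(0, terms[ord].indexOf(DELIM));
            int[] range = ordRange.get(dim);
            if (range == null) {
                ordRange.put(dim, new int[] {ord, ord});
            } else {
                range[1] = ord; // sorted order keeps a dim's ordinals contiguous
            }
        }
    }

    public static String pack(String dim, String value) {
        return dim + DELIM + value;
    }

    /** One pass over the matching docs counts every ordinal, whatever its dim. */
    public int[] countAll(int[][] docOrds) {
        int[] counts = new int[terms.length];
        for (int[] ords : docOrds) {
            for (int ord : ords) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /** Sum of value counts for one dim: a range scan, with no term lookups. */
    public int dimCount(String dim, int[] counts) {
        int[] range = ordRange.get(dim);
        int sum = 0;
        for (int ord = range[0]; ord <= range[1]; ord++) {
            sum += counts[ord];
        }
        return sum;
    }

    public static void main(String[] args) {
        PackedDimsSketch sketch = new PackedDimsSketch(Arrays.asList(
            pack("color", "red"), pack("color", "blue"),
            pack("brand", "acme"), pack("brand", "zen")));
        // Each inner array holds the ordinals attached to one matching doc.
        int[] counts = sketch.countAll(new int[][] {{0, 3}, {0, 2}, {1, 3}});
        System.out.println(Arrays.toString(counts));
        System.out.println(sketch.dimCount("brand", counts));
    }
}
```

The cost profile in this model matches the one described above:
countAll pays for every dim present in the field, even when the caller
only asks for one of them afterwards.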

Anyway, not really sure where I'm going with all this except to say +1
to getAllDims being potentially trappy. I can see users thinking,
"well, I need to grab counts for a few different dims so I'll just
call getAllDims then pull out what I want instead of calling
getTopChildren for each dim." Hopefully they're not doing this, but it
would be an easy trap to fall into.

I suppose the last thing I'd say is that there are valid use-cases for
wanting the "top" dims along with their "top" children, and getAllDims
provides a reasonable way to do this. For example, in Amazon's product
search, we have a large number of different dims but only want to show
a small subset to customers on a search page. One way to go about
this would be to determine the "top" dims for the match set along with
the "top n" values under each; getAllDims is helpful for this but has
the unpleasant side effect of unnecessarily resolving the paths for
all children of all dims. As I think about this, I wonder whether a
getTopDims method, which lets the user specify the number of dims they
want back along with the number of children for each, would be more
useful. I'll open a Jira for that.
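
As a rough illustration of the getTopDims idea, the sketch below ranks
dims by count first and only resolves child labels for the dims that
make the cut. Everything in it is hypothetical: TopDimsSketch, the
Labels class (a stand-in for an ordinal lookup that counts its calls),
and the choice to rank a dim by the sum of its child counts are
assumptions, not an existing Lucene API. Passing the total number of
dims as numDims approximates the "resolve everything" behaviour for
comparison.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical getTopDims-style selection over per-dim child counts (not Lucene code). */
public class TopDimsSketch {

    /** Stand-in for an ordinal -> label lookup; counts how often it is called. */
    public static class Labels {
        public int lookups = 0;

        public String lookup(String dim, int ord) {
            lookups++;
            return dim + "/value" + ord;
        }
    }

    /** Assumed ranking value for a dim: the sum of its child counts. */
    static int total(int[] counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Labels of the numChildren highest-count children; ties go to the lower ordinal. */
    static List<String> resolve(String dim, int[] counts, int numChildren, Labels labels) {
        List<Integer> ords = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                ords.add(ord);
            }
        }
        ords.sort((a, b) -> counts[a] != counts[b]
            ? Integer.compare(counts[b], counts[a])
            : Integer.compare(a, b));
        List<String> out = new ArrayList<>();
        for (int ord : ords.subList(0, Math.min(numChildren, ords.size()))) {
            out.add(labels.lookup(dim, ord)); // the only place a lookup is paid for
        }
        return out;
    }

    /** Rank dims by count, keep numDims of them, then resolve labels for those only. */
    public static Map<String, List<String>> topDims(
            Map<String, int[]> countsByDim, int numDims, int numChildren, Labels labels) {
        List<String> dims = new ArrayList<>(countsByDim.keySet());
        dims.sort((a, b) -> Integer.compare(
            total(countsByDim.get(b)), total(countsByDim.get(a))));
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String dim : dims.subList(0, Math.min(numDims, dims.size()))) {
            out.put(dim, resolve(dim, countsByDim.get(dim), numChildren, labels));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, int[]> counts = new LinkedHashMap<>();
        counts.put("brand", new int[] {5, 3, 1});
        counts.put("color", new int[] {4, 4});
        counts.put("size", new int[] {1, 0, 1});
        counts.put("material", new int[] {2});

        Labels topOnly = new Labels();
        System.out.println(topDims(counts, 2, 2, topOnly));
        Labels everything = new Labels();
        topDims(counts, counts.size(), 2, everything);
        System.out.println(topOnly.lookups + " lookups vs " + everything.lookups);
    }
}
```

The design point is simply ordering: dim selection needs only counts,
so the label lookups can be deferred until the set of dims to return
is known.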


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
