Re: [PR] Compute facets while collecting [lucene]

via GitHub Thu, 25 Jul 2024 07:11:52 -0700


epotyom commented on PR #13568:
URL: https://github.com/apache/lucene/pull/13568#issuecomment-2250425707


   Thank you for the feedback @gsmiller ! I've discussed it with @Shradha26, 
our thoughts are below:
   
   > 1. In general, it seems like we expose a lot of abstractions to the 
end-user to deal with. It would be nice if this could be simplified by either 
simplifying the need for some of the abstractions, or by adding some 
helper/sugar implementation to make doing common tasks easier. Just starting 
from the demo code (which is where I started), it feels like we push users 
right in the deep end trying to interact with all these concepts (recorders, 
cutters, bi-maps, chained iterators).
   
   I agree that the new recorder, cutter, bi-map, chained ordinal iterator 
abstractions require some extra work on user's side, but I believe that they 
are necessary to provide the flexibility that we want. We will think more about 
adding helper methods as a follow up task though, if we agree on the low level 
API of course. I do believe that we want to have demo for the low level API 
usage even if we have helper methods. But of course we can provide demos for 
the helpers when we have them.
   
   > 1. The java package organization "smells" a little funny to be at first. I 
wonder if it's organized well to limit API surface area? I will look at this in 
more detail in another pass, but I wonder if all the java sub-packages might be 
forcing some class definitions to have broader visibility than necessary 
(where-as a flatter package structure might allow a tighter API surface for 
users).
   
   Hmm, good point. We've recently improved package organization (thanks to 
@stefanvodita 's feedback!), but we will think more about it. I do like that we 
are grouping some classes within the module, it makes it easier for developers 
to have the core concepts.
   
   > 1. I'm struggling with the "ordinal abstraction" that's being forced on 
all the non-taxonomy faceting. It doesn't feel right. Maybe it's the best thing 
to do here (I don't have any other suggestions at the moment), but it certainly 
feels like we're forcing these faceting implementations that naturally interact 
with distinct long values or user-defined ranges into a paradigm where they 
have to "speak ordinal" so they can work with these protocol abstractions 
everywhere. My gut feel on this is that we're trying to take the protocol 
abstractions too far. I'll keep thinking about this but I'd like to hear more 
from you on how you arrived at this design since I'm sure you've thought a lot 
more about it.
   
   Can you please explain what doesn't feel right? The reason we want to have 
ordinals abstraction is to decouple the “faceting” and the “aggregation” logic. 
It allows any faceting logic (taxonomy, ranges, etc) to be compatible with any 
aggregation logic (count, single or multiple long aggregations, single or 
multiple double aggregations and so on). It also allows decoupling top-n and 
sorting mechanisms from particular facets implementations. If not int ordinals, 
we need some other public abstraction that can be uniquely identified 
(implements equals and hashCode?), is comparable (for top-n and sort) and at 
the same time is performant.
   
   The ordinal abstraction allows as many distinct long values and ranges as 
existing facets. Both of them rely on arrays to keep counts (ranges use arrays 
directly, `LongValueFacetCounts` uses `LongIntHashMap` which relies on arrays), 
so they are already limited by int max value as max array size. To be fair, 
`LongValueFacetCounts` uses extra array for the first 1k long values, but I 
don't think it's a deal breaker.
   
   > 1. The iterator logic (the "read" side after you do the accumulations) 
feels a little clunky to me at the moment. A couple thoughts: (a) Some of this 
seems very similar to the interaction between DocIdSetIterator and Scorer and I 
wonder if we share some idea from that. Specifically, what if the "recorders" 
expose ordinal iterators that act as shared state? And instead of having to 
call back to something like "getCount" with a specific ordinal, you just call 
"getCount" with no argument and get the count of the currently-positioned 
shared ordinal? The iteration logic likely has to iterate keys and values at 
the same time, so then you wouldn't be doing another map lookup on the 
"getValue" callbacks.
   
   Good point. I think we still need to be able get count (or other data, from 
other recorders) by ordinal, e.g. for these two use cases:
   1. We sometimes need data from multiple recorders for top-n, e.g. sort by 
long aggregation with tie break by count. If both recorders can only return 
iterators with the payload (count, long aggregations, etc), we would need to 
merge these two ordinals streams somehow, it might be tricky.
   2. We don't always need counts (or other recorders) for ordinals 
sorting/filtering, for example `LongValueFacetCounts#getAllChildrenSortByValue` 
only need the long value itself. Adding count to the initial iterator can be 
wasteful in this case.
   
   At the same time I see how it can make top-n by count faster, maybe we can 
add `FacetRecorder#getIterator` method that gives ordinals with payload? But it 
might make `OrdToComparable` even more awkward as it will have to handle both 
the payload and in some cases like [1] above map lookup? I would suggest that 
we try implementing it as a follow up task. In our internal performance tests 
getting top-N and doing map lookups doesn't appear to be a bottleneck. And to 
compare the existing facets module with the new one with luceneutil benchmarks, 
I think we need to change existing tasks to take query execution into the 
equation to make a fair comparison? It might make top-n relatively too cheap to 
require further performance improvements?
   
   > (b) The comparable factory stuff feels a little awkward (the 
`OrdToComparable` stuff). Is there a way to have a more functional interface 
here than the need to "create" new comparable objects constantly (yes, I saw 
the reuse thing, but I wonder if we can simplify further)? More specifically, 
instead of having `T #getComparable(ord, ...)` could you not just have a `int 
#compareTo(ord)` method?
   
   IIUC you suggest that we don't reuse objects that spill over PriorityQueue, 
and extract counts/aggregations in `#compareTo(ord)` method? With the current 
design it might be too expensive, as it requires map lookup every time we want 
to compare two PQ items. With your suggestion above about iterator that has 
both ordinal and count, there are other challenges such as how do we handle 
payload from two recorders, e.g. count and long aggregation? It would require 
somehow merging the two iterators, which might be more expensive than map 
lookup with reuse.
   
   With that being said, I do agree that the interface might not be very 
intuitive for users, but this is the best we could do so far for our goal, 
which is to have single performant TopN implementation for all facet types and 
any sort orders, including custom order provided by user. I'll give it another 
go, will think about the count iterator suggestion, and more ideas are welcome!
   
   > 1. This is a little lower-level feedback, but everything that implements 
`GetOrd` also extends `SkeletalGetOrd`. Can we collapse these to simplify? 
Could we just have an abstract class like `OrdinalProvider` or something?
   
   We can do that, it is just that it's best practice to depend on interfaces 
not abstract classes. One difference that comes to my mind is that classes in 
Java can only extend one other class, but can implement multiple interfaces. 
Not sure if it's very useful in this case, but we are making the interface 
public to let users implement their own sort order, so I think it's more 
preferable to make the interface public than an abstract class.
   
   ---------
   
   Thanks again for the feedback, looking forward to the further iterations and 
comments!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Compute facets while collecting [lucene]

Reply via email to