epotyom commented on PR #13568: URL: https://github.com/apache/lucene/pull/13568#issuecomment-2250425707
Thank you for the feedback @gsmiller ! I've discussed it with @Shradha26, our thoughts are below: > 1. In general, it seems like we expose a lot of abstractions to the end-user to deal with. It would be nice if this could be simplified by either simplifying the need for some of the abstractions, or by adding some helper/sugar implementation to make doing common tasks easier. Just starting from the demo code (which is where I started), it feels like we push users right in the deep end trying to interact with all these concepts (recorders, cutters, bi-maps, chained iterators). I agree that the new recorder, cutter, bi-map, chained ordinal iterator abstractions require some extra work on user's side, but I believe that they are necessary to provide the flexibility that we want. We will think more about adding helper methods as a follow up task though, if we agree on the low level API of course. I do believe that we want to have demo for the low level API usage even if we have helper methods. But of course we can provide demos for the helpers when we have them. > 1. The java package organization "smells" a little funny to be at first. I wonder if it's organized well to limit API surface area? I will look at this in more detail in another pass, but I wonder if all the java sub-packages might be forcing some class definitions to have broader visibility than necessary (where-as a flatter package structure might allow a tighter API surface for users). Hmm, good point. We've recently improved package organization (thanks to @stefanvodita 's feedback!), but we will think more about it. I do like that we are grouping some classes within the module, it makes it easier for developers to have the core concepts. > 1. I'm struggling with the "ordinal abstraction" that's being forced on all the non-taxonomy faceting. It doesn't feel right. Maybe it's the best thing to do here (I don't have any other suggestions at the moment), but it certainly feels like we're forcing these faceting implementations that naturally interact with distinct long values or user-defined ranges into a paradigm where they have to "speak ordinal" so they can work with these protocol abstractions everywhere. My gut feel on this is that we're trying to take the protocol abstractions too far. I'll keep thinking about this but I'd like to hear more from you on how you arrived at this design since I'm sure you've thought a lot more about it. Can you please explain what doesn't feel right? The reason we want to have ordinals abstraction is to decouple the “faceting” and the “aggregation” logic. It allows any faceting logic (taxonomy, ranges, etc) to be compatible with any aggregation logic (count, single or multiple long aggregations, single or multiple double aggregations and so on). It also allows decoupling top-n and sorting mechanisms from particular facets implementations. If not int ordinals, we need some other public abstraction that can be uniquely identified (implements equals and hashCode?), is comparable (for top-n and sort) and at the same time is performant. The ordinal abstraction allows as many distinct long values and ranges as existing facets. Both of them rely on arrays to keep counts (ranges use arrays directly, `LongValueFacetCounts` uses `LongIntHashMap` which relies on arrays), so they are already limited by int max value as max array size. To be fair, `LongValueFacetCounts` uses extra array for the first 1k long values, but I don't think it's a deal breaker. > 1. The iterator logic (the "read" side after you do the accumulations) feels a little clunky to me at the moment. A couple thoughts: (a) Some of this seems very similar to the interaction between DocIdSetIterator and Scorer and I wonder if we share some idea from that. Specifically, what if the "recorders" expose ordinal iterators that act as shared state? And instead of having to call back to something like "getCount" with a specific ordinal, you just call "getCount" with no argument and get the count of the currently-positioned shared ordinal? The iteration logic likely has to iterate keys and values at the same time, so then you wouldn't be doing another map lookup on the "getValue" callbacks. Good point. I think we still need to be able get count (or other data, from other recorders) by ordinal, e.g. for these two use cases: 1. We sometimes need data from multiple recorders for top-n, e.g. sort by long aggregation with tie break by count. If both recorders can only return iterators with the payload (count, long aggregations, etc), we would need to merge these two ordinals streams somehow, it might be tricky. 2. We don't always need counts (or other recorders) for ordinals sorting/filtering, for example `LongValueFacetCounts#getAllChildrenSortByValue` only need the long value itself. Adding count to the initial iterator can be wasteful in this case. At the same time I see how it can make top-n by count faster, maybe we can add `FacetRecorder#getIterator` method that gives ordinals with payload? But it might make `OrdToComparable` even more awkward as it will have to handle both the payload and in some cases like [1] above map lookup? I would suggest that we try implementing it as a follow up task. In our internal performance tests getting top-N and doing map lookups doesn't appear to be a bottleneck. And to compare the existing facets module with the new one with luceneutil benchmarks, I think we need to change existing tasks to take query execution into the equation to make a fair comparison? It might make top-n relatively too cheap to require further performance improvements? > (b) The comparable factory stuff feels a little awkward (the `OrdToComparable` stuff). Is there a way to have a more functional interface here than the need to "create" new comparable objects constantly (yes, I saw the reuse thing, but I wonder if we can simplify further)? More specifically, instead of having `T #getComparable(ord, ...)` could you not just have a `int #compareTo(ord)` method? IIUC you suggest that we don't reuse objects that spill over PriorityQueue, and extract counts/aggregations in `#compareTo(ord)` method? With the current design it might be too expensive, as it requires map lookup every time we want to compare two PQ items. With your suggestion above about iterator that has both ordinal and count, there are other challenges such as how do we handle payload from two recorders, e.g. count and long aggregation? It would require somehow merging the two iterators, which might be more expensive than map lookup with reuse. With that being said, I do agree that the interface might not be very intuitive for users, but this is the best we could do so far for our goal, which is to have single performant TopN implementation for all facet types and any sort orders, including custom order provided by user. I'll give it another go, will think about the count iterator suggestion, and more ideas are welcome! > 1. This is a little lower-level feedback, but everything that implements `GetOrd` also extends `SkeletalGetOrd`. Can we collapse these to simplify? Could we just have an abstract class like `OrdinalProvider` or something? We can do that, it is just that it's best practice to depend on interfaces not abstract classes. One difference that comes to my mind is that classes in Java can only extend one other class, but can implement multiple interfaces. Not sure if it's very useful in this case, but we are making the interface public to let users implement their own sort order, so I think it's more preferable to make the interface public than an abstract class. --------- Thanks again for the feedback, looking forward to the further iterations and comments! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org