[GitHub] [lucene] jpountz commented on issue #12358: Optimize `count()` for BooleanQuery disjunction
jpountz commented on issue #12358: URL: https://github.com/apache/lucene/issues/12358#issuecomment-1596907955 I was wondering about a possible alternative approach that would consist of handling this optimization more at the collector level. E.g. we could add a new method to LeafCollector to take a set of doc IDs (pseudo code): ``` void collect(DocIdSet set) { // default implementation for (int docID : set) { collect(docID); } } ``` Then `BooleanScorer` could pass a `DocIdSet` to the collector for each window, and TotalHitCountCollector could override the above method to do more something like this: ``` void collect(DocIdSet set) { // total hit count implementation count += set.size(); // internally uses Long#bitCount rather than iterating bits one by one } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on issue #11020: CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit [LUCENE-9981]
msokolov commented on issue #11020: URL: https://github.com/apache/lucene/issues/11020#issuecomment-1597066615 @adioss why do you say this is "a security issue"? I don't see any security concerns being discussed here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] adioss commented on issue #11020: CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit [LUCENE-9981]
adioss commented on issue #11020: URL: https://github.com/apache/lucene/issues/11020#issuecomment-1597126024 Hi @msokolov thanks for the feedback. In fact, * somebody created https://forum.opensearch.org/t/is-opendistro-opensearch-exposed-to-redos-attack/5898/3 (opensearch sensitive to redos attack and in this case, opensearch is not available anymore) * then https://github.com/opensearch-project/OpenSearch/issues/687 * then https://issues.apache.org/jira/browse/LUCENE-9981 (migrated here) * then patches applied to 9.0 and 8.10.0 (and https://github.com/opensearch-project/OpenSearch/issues/687 closed) Do you think my analysis is not relevant or incorrect? Adrien -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #12374: Provide constructor to accept the LeafSlice computed by extensions
jpountz commented on PR #12374: URL: https://github.com/apache/lucene/pull/12374#issuecomment-1597335385 I'm not entirely happy with this new constructor either, because now there are two ways to customize slicing: either with this new constructor or by overriding the `slices()` method. I'd like to have a single one, and the approach to override a method to compute slices given a list of segments looks more user-friendly to me than having to compute slices up-front? For reference, this [20-years-old comment](https://github.com/apache/lucene/commit/360d91dde75244b3b73793ac3a2e6c8737c45f9f) on `IndexSearcher` about making sure to reuse `IndexSearcher` instances is outdated, all IndexSearcher constructors are cheap compared to the work that running any query would need to do. It's actually required to create a new `IndexSearcher` for every search if you wish to leverage `IndexSearcher`'s timeout support. If there is reluctance to computing slices on every call to `#search`, could we compute slices lazily and cache them in `getSlices()`? This way, it wouldn't be needed to increase the visibility of `slices` to `protected`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on issue #12230: TotalHitCountCollector to stop counting when threshold is reached
jpountz commented on issue #12230: URL: https://github.com/apache/lucene/issues/12230#issuecomment-1597398770 Would it help if we had something like a `PartialHitCountCollector` in `lucene/misc` that does what you're describing? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on issue #12230: TotalHitCountCollector to stop counting when threshold is reached
javanna commented on issue #12230: URL: https://github.com/apache/lucene/issues/12230#issuecomment-1597471217 > Would it help if we had something like a PartialHitCountCollector in lucene/misc that does what you're describing? I believe so, that sounds like a good trade-off. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] andrross commented on pull request #12374: Provide constructor to accept the LeafSlice computed by extensions
andrross commented on PR #12374: URL: https://github.com/apache/lucene/pull/12374#issuecomment-1597890665 +1 to not having two ways to customize slicing. > the approach to override a method to compute slices given a list of segments looks more user-friendly to me than having to compute slices up-front? @jpountz I agree. I think the basic problem here is that the protected method is called from the constructor of the parent class, which really limits the flexibility and can lead to unexpected behavior. The simplest approach is to just compute the slices on every call to `#search`. If we want to be more defensive about possible performance regressions, then lazy computation + caching like you said is the way to go (i.e. memoization). Is there any utility in Lucene like this [memoizing supplier](https://github.com/google/guava/blob/master/guava/src/com/google/common/base/Suppliers.java#L157-L201) in Guava? (I couldn't find one but I'm not super familiar with the code base). Hand rolling the memoization logic inside IndexSearch is an option too, we'll just need to take some care to ensure that it is thread safe. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] sohami commented on pull request #12374: Provide constructor to accept the LeafSlice computed by extensions
sohami commented on PR #12374: URL: https://github.com/apache/lucene/pull/12374#issuecomment-1597927888 > +1 to not having two ways to customize slicing. >> the approach to override a method to compute slices given a list of segments looks more user-friendly to me than having to compute slices up-front? Agreed to this however the `slices` method doesn't provide got way to overwrite the computation for all the cases as it is called from base class constructor. > For reference, this [20-years-old comment](https://github.com/apache/lucene/commit/360d91dde75244b3b73793ac3a2e6c8737c45f9f) on IndexSearcher about making sure to reuse IndexSearcher instances is outdated, all IndexSearcher constructors are cheap compared to the work that running any query would need to do. It's actually required to create a new IndexSearcher for every search if you wish to leverage IndexSearcher's timeout support. Thanks for sharing the context > If there is reluctance to computing slices on every call to #search, could we compute slices lazily and cache them in getSlices()? This way, it wouldn't be needed to increase the visibility of slices to protected? With `getSlices()` based approach and using member variable `leafSlices` to cache the computed value, my only concern is that it will be hard to enforce any usage to fetch the slices within `IndexSearcher` to use `getSlices` always (and not the cached member directly). With that in mind I guess its cleaner to compute `slices` always and remove the member `leafSlices`. Thoughts ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] easyice opened a new pull request, #12377: Avoid redundant loop for compute min value in DirectMonotonicWriter
easyice opened a new pull request, #12377: URL: https://github.com/apache/lucene/pull/12377 ### Description This small change will reduce an unnecessary loop in DirectMonotonicWriter#flush -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org