[GitHub] [lucene] jpountz commented on issue #12358: Optimize `count()` for BooleanQuery disjunction

2023-06-19 Thread via GitHub


jpountz commented on issue #12358:
URL: https://github.com/apache/lucene/issues/12358#issuecomment-1596907955

   I was wondering about a possible alternative approach that would consist of 
handling this optimization more at the collector level. E.g. we could add a new 
method to LeafCollector to take a set of doc IDs (pseudo code):
   
   ```
   void collect(DocIdSet set) {
 // default implementation
 for (int docID : set) {
   collect(docID);
 }
   }
   ```
   
   Then `BooleanScorer` could pass a `DocIdSet` to the collector for each 
window, and TotalHitCountCollector could override the above method to do more 
something like this:
   
   ```
   void collect(DocIdSet set) {
 // total hit count implementation
 count += set.size(); // internally uses Long#bitCount rather than 
iterating bits one by one
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on issue #11020: CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit [LUCENE-9981]

2023-06-19 Thread via GitHub


msokolov commented on issue #11020:
URL: https://github.com/apache/lucene/issues/11020#issuecomment-1597066615

   @adioss why do you say this is "a security issue"? I don't see any security 
concerns being discussed here. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] adioss commented on issue #11020: CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit [LUCENE-9981]

2023-06-19 Thread via GitHub


adioss commented on issue #11020:
URL: https://github.com/apache/lucene/issues/11020#issuecomment-1597126024

   Hi @msokolov  thanks for the feedback.
   In fact, 
   * somebody created 
https://forum.opensearch.org/t/is-opendistro-opensearch-exposed-to-redos-attack/5898/3
 (opensearch sensitive to redos attack and in this case, opensearch is not 
available anymore)
   * then https://github.com/opensearch-project/OpenSearch/issues/687 
   * then https://issues.apache.org/jira/browse/LUCENE-9981 (migrated here)
   * then patches applied to 9.0 and 8.10.0 (and 
https://github.com/opensearch-project/OpenSearch/issues/687 closed)
   
   Do you think my analysis is not relevant or incorrect?
   
   Adrien


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12374: Provide constructor to accept the LeafSlice computed by extensions

2023-06-19 Thread via GitHub


jpountz commented on PR #12374:
URL: https://github.com/apache/lucene/pull/12374#issuecomment-1597335385

   I'm not entirely happy with this new constructor either, because now there 
are two ways to customize slicing: either with this new constructor or by 
overriding the `slices()` method. I'd like to have a single one, and the 
approach to override a method to compute slices given a list of segments looks 
more user-friendly to me than having to compute slices up-front?
   
   For reference, this [20-years-old 
comment](https://github.com/apache/lucene/commit/360d91dde75244b3b73793ac3a2e6c8737c45f9f)
 on `IndexSearcher` about making sure to reuse `IndexSearcher` instances is 
outdated, all IndexSearcher constructors are cheap compared to the work that 
running any query would need to do. It's actually required to create a new 
`IndexSearcher` for every search if you wish to leverage `IndexSearcher`'s 
timeout support.
   
   If there is reluctance to computing slices on every call to `#search`, could 
we compute slices lazily and cache them in `getSlices()`? This way, it wouldn't 
be needed to increase the visibility of `slices` to `protected`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on issue #12230: TotalHitCountCollector to stop counting when threshold is reached

2023-06-19 Thread via GitHub


jpountz commented on issue #12230:
URL: https://github.com/apache/lucene/issues/12230#issuecomment-1597398770

   Would it help if we had something like a `PartialHitCountCollector` in 
`lucene/misc` that does what you're describing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on issue #12230: TotalHitCountCollector to stop counting when threshold is reached

2023-06-19 Thread via GitHub


javanna commented on issue #12230:
URL: https://github.com/apache/lucene/issues/12230#issuecomment-1597471217

   > Would it help if we had something like a PartialHitCountCollector in 
lucene/misc that does what you're describing?
   
   I believe so, that sounds like a good trade-off.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] andrross commented on pull request #12374: Provide constructor to accept the LeafSlice computed by extensions

2023-06-19 Thread via GitHub


andrross commented on PR #12374:
URL: https://github.com/apache/lucene/pull/12374#issuecomment-1597890665

   +1 to not having two ways to customize slicing.
   
   > the approach to override a method to compute slices given a list of 
segments looks more user-friendly to me than having to compute slices up-front?
   
   @jpountz I agree. I think the basic problem here is that the protected 
method is called from the constructor of the parent class, which really limits 
the flexibility and can lead to unexpected behavior. The simplest approach is 
to just compute the slices on every call to `#search`. If we want to be more 
defensive about possible performance regressions, then lazy computation + 
caching like you said is the way to go (i.e. memoization). Is there any utility 
in Lucene like this [memoizing 
supplier](https://github.com/google/guava/blob/master/guava/src/com/google/common/base/Suppliers.java#L157-L201)
 in Guava? (I couldn't find one but I'm not super familiar with the code base). 
Hand rolling the memoization logic inside IndexSearch is an option too, we'll 
just need to take some care to ensure that it is thread safe.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] sohami commented on pull request #12374: Provide constructor to accept the LeafSlice computed by extensions

2023-06-19 Thread via GitHub


sohami commented on PR #12374:
URL: https://github.com/apache/lucene/pull/12374#issuecomment-1597927888

   > +1 to not having two ways to customize slicing.
   
   >> the approach to override a method to compute slices given a list of 
segments looks more user-friendly to me than having to compute slices up-front?
   
   Agreed to this however the `slices` method doesn't provide got way to 
overwrite the computation for all the cases as it is called from base class 
constructor.
   
   > For reference, this [20-years-old 
comment](https://github.com/apache/lucene/commit/360d91dde75244b3b73793ac3a2e6c8737c45f9f)
 on IndexSearcher about making sure to reuse IndexSearcher instances is 
outdated, all IndexSearcher constructors are cheap compared to the work that 
running any query would need to do. It's actually required to create a new 
IndexSearcher for every search if you wish to leverage IndexSearcher's timeout 
support.
   
   Thanks for sharing the context
   
   > If there is reluctance to computing slices on every call to #search, could 
we compute slices lazily and cache them in getSlices()? This way, it wouldn't 
be needed to increase the visibility of slices to protected?
   
   With `getSlices()` based approach and using member variable `leafSlices` to 
cache the computed value, my only concern is that it will be hard to enforce 
any usage to fetch the slices within `IndexSearcher` to use `getSlices` always 
(and not the cached member directly). With that in mind I guess its cleaner to 
compute `slices` always and remove the member `leafSlices`. Thoughts ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] easyice opened a new pull request, #12377: Avoid redundant loop for compute min value in DirectMonotonicWriter

2023-06-19 Thread via GitHub


easyice opened a new pull request, #12377:
URL: https://github.com/apache/lucene/pull/12377

   ### Description
   
   This small change will reduce an unnecessary loop in 
DirectMonotonicWriter#flush


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org