Re: [PR] Remove CollectorOwner class (#13671) [lucene]

2024-09-02 Thread via GitHub
epotyom commented on code in PR #13702: URL: https://github.com/apache/lucene/pull/13702#discussion_r1740595511 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java: ## @@ -58,15 +61,33 @@ class DrillSidewaysQuery extends Query { */ DrillSidewaysQuery(

Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-02 Thread via GitHub
javanna commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1740601737 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -890,11 +945,70 @@ public static class LeafSlice { * * @lucene.experimental

Re: [PR] Remove CollectorOwner class (#13671) [lucene]

2024-09-02 Thread via GitHub
epotyom commented on code in PR #13702: URL: https://github.com/apache/lucene/pull/13702#discussion_r1740604447 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java: ## @@ -344,31 +331,30 @@ public ConcurrentDrillSidewaysResult search( // Main query Fac

Re: [PR] Remove CollectorOwner class (#13671) [lucene]

2024-09-02 Thread via GitHub
epotyom commented on code in PR #13702: URL: https://github.com/apache/lucene/pull/13702#discussion_r1740606822 ## lucene/facet/src/test/org/apache/lucene/facet/TestDrillSideways.java: ## @@ -284,7 +284,7 @@ public void testCollectionTerminated() throws Exception { Weig

Re: [PR] Remove CollectorOwner class (#13671) [lucene]

2024-09-02 Thread via GitHub
epotyom commented on code in PR #13702: URL: https://github.com/apache/lucene/pull/13702#discussion_r1740638891 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java: ## @@ -480,59 +462,61 @@ private void searchSequentially( } Query[] drillDownQueries = q

Re: [PR] Remove CollectorOwner class (#13671) [lucene]

2024-09-02 Thread via GitHub
epotyom commented on code in PR #13702: URL: https://github.com/apache/lucene/pull/13702#discussion_r1740642814 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java: ## @@ -414,62 +399,59 @@ public ConcurrentDrillSidewaysResult search( /** * Search using

Re: [PR] Fix integer overflow in GeoEncodingUtils#Grid implementations [lucene]

2024-09-02 Thread via GitHub
iverase merged PR #13704: URL: https://github.com/apache/lucene/pull/13704 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa

Re: [I] Narrow polygons close to latitude 90 do not match any points [lucene]

2024-09-02 Thread via GitHub
iverase closed issue #13703: Narrow polygons close to latitude 90 do not match any points URL: https://github.com/apache/lucene/issues/13703 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the speci

Re: [PR] Remove CollectorOwner class (#13671) [lucene]

2024-09-02 Thread via GitHub
epotyom commented on code in PR #13702: URL: https://github.com/apache/lucene/pull/13702#discussion_r1740698503 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java: ## @@ -414,62 +399,59 @@ public ConcurrentDrillSidewaysResult search( /** * Search using

[I] RegExp::automaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
ChrisHegarty opened a new issue, #13706: URL: https://github.com/apache/lucene/issues/13706 There are a number of optimization in Elasticsearch that depend upon the automaton from a `RegExp` being total - accepts all strings - [1] [2]. Changes in the upcoming Lucene 10, to not minimize aut

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
ChrisHegarty commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324446714 @mikemccand @rmuir - your thoughts here would be helpful, since I'm less familiar with this area of code. -- This is an automated message from the Apache Git Service. To res

Re: [PR] Speed up advancing within a block. [lucene]

2024-09-02 Thread via GitHub
jpountz commented on PR #13692: URL: https://github.com/apache/lucene/pull/13692#issuecomment-2324658146 I plotted the number of docs that queries need to skip within a block when advancing to get a better idea of the distribution, for our queries that bottleneck on advancing. The way you s

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
dweiss commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324660150 Minimization is a sure way to prove an automaton accepts all input strings because then the isTotal check is trivial [1]. You could try to trace all possible transitions, starting fr

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
jpountz commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324748057 In a similar vein, I wonder if `RegExp` could create more efficient automata out of the box. For instance, it looks like `Operations#repeat` could be optimized in the case when ther

Re: [I] Allow MultiLeafKnnCollector.greediness to be configurable [lucene]

2024-09-02 Thread via GitHub
jpountz commented on issue #13699: URL: https://github.com/apache/lucene/issues/13699#issuecomment-2324768570 > Do we have other internal API parameters via system property? I'm wondering why you think that's preferable to adding Java functions? I don't think we do indeed, I'm trying

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324827442 @ChrisHegarty implementation of `isTotal()` method requires a minimal DFA. If the automaton is not minimal, it may return false but it should not create a problem. This is th

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324856962 This is just what i'm mulling over now, relaxing `isTotal` to no longer require a minimal DFA: ``` diff --git a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
ChrisHegarty commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324861119 > In a similar vein, I wonder if `RegExp` could create more efficient automata out of the box. For instance, it looks like `Operations#repeat` could be optimized in the case wh

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324880374 Here's a round two, to prevent any error on NFA or having transitions to dead states: ``` diff --git a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java b/l

Re: [PR] Speed up advancing within a block. [lucene]

2024-09-02 Thread via GitHub
jpountz commented on PR #13692: URL: https://github.com/apache/lucene/pull/13692#issuecomment-2324897665 Another data point: here is what luceneutil gives me when using branchless binary search: ``` TaskQPS baseline StdDevQPS my_modified_version

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
ChrisHegarty commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324899200 @rmuir Just skimming your update to isTotal, it looks good. I think that it will be more generally useful, given that we minimize less now. Separately, I might also make

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324907262 > I had a similar thought. Looking at the code it kinda looks a little tacky, but also kinda makes sone sense, e.g. > > ```diff > case REGEXP_REPEAT: > +if (

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324948858 > @rmuir Just skimming your update to isTotal, it looks good. I think that it will be more generally useful, given that we minimize less now. need to stare at it some more. I do

[PR] Relax Operations.isTotal() to work with a deterministic automaton [lucene]

2024-09-02 Thread via GitHub
rmuir opened a new pull request, #13707: URL: https://github.com/apache/lucene/pull/13707 Operations.isTotal currently returns `false` unless the DFA is minimal. This makes the method almost useless and we definitely don't want to encourage minimization just to make such a check. C

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325027450 @ChrisHegarty I created draft PR, but I am still not happy with it yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to G

Re: [PR] Relax Operations.isTotal() to work with a deterministic automaton [lucene]

2024-09-02 Thread via GitHub
rmuir commented on PR #13707: URL: https://github.com/apache/lucene/pull/13707#issuecomment-2325046919 I feel like with this code there are only two options: * keep `isTotal` O(1) , to prevent traps and problems. it does document that the input must be minimized. Maybe it is enough to fix

Re: [PR] Relax Operations.isTotal() to work with a deterministic automaton [lucene]

2024-09-02 Thread via GitHub
rmuir commented on PR #13707: URL: https://github.com/apache/lucene/pull/13707#issuecomment-2325097210 OK I took a third option here in the latest commit. Javadoc is unchanged, we just detect the kind of "total" automaton being created by RegExp, too, without involving any scary algor

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325100044 See PR: #13707, I took a different approach which solves the practical problem without doing scary stuff. ``` public static boolean isTotal(Automaton a, int minAlphabet, i

[PR] move Operations.sameLanguage/subsetOf to AutomatonTestUtil in test-framework [lucene]

2024-09-02 Thread via GitHub
rmuir opened a new pull request, #13708: URL: https://github.com/apache/lucene/pull/13708 These methods run in quadratic time and have been traps in the past: they run in quadratic time. I think originally this was `equals()` but it is so costly, that we factored out into separate `s

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
dweiss commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325203893 I like the brevity of using sameLanguage! :) I keep trying to find a counterexample to the assertion that a deterministic, total automaton must accept full language in each st

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325216584 I like the sameLanguage too, but I don't like the potential quadratic cost, considering we currently expect the calculation to be fast, and it is called on every automaton. I think it

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325222565 I think the "full traversal" suggested by Dawid here would be very fast. The annoying part is probably just the reachability (e.g. regex parser produces automatons with some unreachab

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
dweiss commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325226500 > I like the sameLanguage too, but I don't like the potential quadratic cost, considering we currently expect the calculation to be fast, and it is called on every automaton. I think

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325229488 > I don't think all states need to be considered - only those reachable from the initial state. Tracking which states have been checked already may add some overhead but even with

Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-02 Thread via GitHub
javanna commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1741215790 ## lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollectorManager.java: ## @@ -28,17 +31,77 @@ */ public class TotalHitCountCollectorManager imple

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
dweiss commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325234318 You could probably create an automaton containing states with an insane number of outgoing transitions, for example one transition for each character... then resolving that such a st

Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-02 Thread via GitHub
javanna commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1741216330 ## lucene/test-framework/src/java/org/apache/lucene/tests/search/ScorerIndexSearcher.java: ## @@ -76,4 +77,14 @@ protected void search(List leaves, Weight weight, Col

Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-02 Thread via GitHub
javanna commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1741216647 ## lucene/core/src/java/org/apache/lucene/search/BulkScorer.java: ## @@ -27,18 +27,6 @@ */ public abstract class BulkScorer { - /** - * Scores and collects all

Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-02 Thread via GitHub
javanna commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1741216982 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -732,7 +842,11 @@ private void search( * @param collector to receive hits * @throws

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
dweiss commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325236726 Like so: ``` Automaton c = Operations.repeat( Operations.union( List.of( Automata.makeCharRange(

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325244262 I don't think it is too bad because transitions are already sorted and collapsed for each state when you call `finishState()`. For such an automaton you paid the price at construction

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325264840 @dweiss i coded your idea up like this: ```java /** * Returns true if the given automaton accepts all strings for the specified min/max range of the * alphabet.

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325265996 https://github.com/apache/lucene/pull/13707/commits/ce4cce05538f79418ba4616dda89755975617762 -- This is an automated message from the Apache Git Service. To respond to the message,

Re: [PR] Relax Operations.isTotal() to work with a deterministic automaton [lucene]

2024-09-02 Thread via GitHub
rmuir commented on PR #13707: URL: https://github.com/apache/lucene/pull/13707#issuecomment-2325272520 iterating again, I changed the code here to @dweiss solution and added tests based on his examples. it runs in linear time and space (bitset), and should work generally for any DFA,

Re: [PR] Relax Operations.isTotal() to work with a deterministic automaton [lucene]

2024-09-02 Thread via GitHub
rmuir commented on PR #13707: URL: https://github.com/apache/lucene/pull/13707#issuecomment-2325282540 heh that empty case got found in a different way by the randomized check. it isn't enough to do `a.getNumStates() > 0`, for the situation where all states are dead states :) w

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
rmuir commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325291557 > * automaton must not be empty where "empty" means, at least one reachable state. Automaton can have all dead states, and that doesn't make it total :) -- This is an automat

Re: [PR] Speed up advancing within a block. [lucene]

2024-09-02 Thread via GitHub
jpountz commented on PR #13692: URL: https://github.com/apache/lucene/pull/13692#issuecomment-2325294386 Other data points: if I bias towards the next 2 doc IDs rather than just the next doc ID: ```java static int findNextGEQ(long[] values, long target, int startIndex) { i

Re: [PR] Relax Operations.isTotal() to work with a deterministic automaton [lucene]

2024-09-02 Thread via GitHub
rmuir commented on PR #13707: URL: https://github.com/apache/lucene/pull/13707#issuecomment-2325299722 ok, the tests have found all the fun with empty cases and I think it is ready for review. for background, this problem is similar to `Operations.isEmpty()`: https://github.com/apach

Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-02 Thread via GitHub
dweiss commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325690199 Yes, I like it! I had some time to think about it before I went to bed and this implementation is actually a direct rollout of the definition of accepted language equivalence