epotyom commented on code in PR #13702:
URL: https://github.com/apache/lucene/pull/13702#discussion_r1740595511
##
lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java:
##
@@ -58,15 +61,33 @@ class DrillSidewaysQuery extends Query {
*/
DrillSidewaysQuery(
javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1740601737
##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -890,11 +945,70 @@ public static class LeafSlice {
*
* @lucene.experimental
epotyom commented on code in PR #13702:
URL: https://github.com/apache/lucene/pull/13702#discussion_r1740604447
##
lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java:
##
@@ -344,31 +331,30 @@ public ConcurrentDrillSidewaysResult search(
// Main query
Fac
epotyom commented on code in PR #13702:
URL: https://github.com/apache/lucene/pull/13702#discussion_r1740606822
##
lucene/facet/src/test/org/apache/lucene/facet/TestDrillSideways.java:
##
@@ -284,7 +284,7 @@ public void testCollectionTerminated() throws Exception {
Weig
epotyom commented on code in PR #13702:
URL: https://github.com/apache/lucene/pull/13702#discussion_r1740638891
##
lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java:
##
@@ -480,59 +462,61 @@ private void searchSequentially(
}
Query[] drillDownQueries = q
epotyom commented on code in PR #13702:
URL: https://github.com/apache/lucene/pull/13702#discussion_r1740642814
##
lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java:
##
@@ -414,62 +399,59 @@ public ConcurrentDrillSidewaysResult search(
/**
* Search using
iverase merged PR #13704:
URL: https://github.com/apache/lucene/pull/13704
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apa
iverase closed issue #13703: Narrow polygons close to latitude 90 do not match
any points
URL: https://github.com/apache/lucene/issues/13703
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the speci
epotyom commented on code in PR #13702:
URL: https://github.com/apache/lucene/pull/13702#discussion_r1740698503
##
lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java:
##
@@ -414,62 +399,59 @@ public ConcurrentDrillSidewaysResult search(
/**
* Search using
ChrisHegarty opened a new issue, #13706:
URL: https://github.com/apache/lucene/issues/13706
There are a number of optimization in Elasticsearch that depend upon the
automaton from a `RegExp` being total - accepts all strings - [1] [2]. Changes
in the upcoming Lucene 10, to not minimize aut
ChrisHegarty commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324446714
@mikemccand @rmuir - your thoughts here would be helpful, since I'm less
familiar with this area of code.
--
This is an automated message from the Apache Git Service.
To res
jpountz commented on PR #13692:
URL: https://github.com/apache/lucene/pull/13692#issuecomment-2324658146
I plotted the number of docs that queries need to skip within a block when
advancing to get a better idea of the distribution, for our queries that
bottleneck on advancing. The way you s
dweiss commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324660150
Minimization is a sure way to prove an automaton accepts all input strings
because then the isTotal check is trivial [1]. You could try to trace all
possible transitions, starting fr
jpountz commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324748057
In a similar vein, I wonder if `RegExp` could create more efficient automata
out of the box. For instance, it looks like `Operations#repeat` could be
optimized in the case when ther
jpountz commented on issue #13699:
URL: https://github.com/apache/lucene/issues/13699#issuecomment-2324768570
> Do we have other internal API parameters via system property? I'm
wondering why you think that's preferable to adding Java functions?
I don't think we do indeed, I'm trying
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324827442
@ChrisHegarty implementation of `isTotal()` method requires a minimal DFA.
If the automaton is not minimal, it may return false but it should not create a
problem.
This is th
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324856962
This is just what i'm mulling over now, relaxing `isTotal` to no longer
require a minimal DFA:
```
diff --git
a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.
ChrisHegarty commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324861119
> In a similar vein, I wonder if `RegExp` could create more efficient
automata out of the box. For instance, it looks like `Operations#repeat` could
be optimized in the case wh
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324880374
Here's a round two, to prevent any error on NFA or having transitions to
dead states:
```
diff --git
a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
b/l
jpountz commented on PR #13692:
URL: https://github.com/apache/lucene/pull/13692#issuecomment-2324897665
Another data point: here is what luceneutil gives me when using branchless
binary search:
```
TaskQPS baseline StdDevQPS
my_modified_version
ChrisHegarty commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324899200
@rmuir Just skimming your update to isTotal, it looks good. I think that it
will be more generally useful, given that we minimize less now.
Separately, I might also make
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324907262
> I had a similar thought. Looking at the code it kinda looks a little
tacky, but also kinda makes sone sense, e.g.
>
> ```diff
> case REGEXP_REPEAT:
> +if (
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2324948858
> @rmuir Just skimming your update to isTotal, it looks good. I think that
it will be more generally useful, given that we minimize less now.
need to stare at it some more. I do
rmuir opened a new pull request, #13707:
URL: https://github.com/apache/lucene/pull/13707
Operations.isTotal currently returns `false` unless the DFA is minimal.
This makes the method almost useless and we definitely don't want to
encourage minimization just to make such a check.
C
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325027450
@ChrisHegarty I created draft PR, but I am still not happy with it yet.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to G
rmuir commented on PR #13707:
URL: https://github.com/apache/lucene/pull/13707#issuecomment-2325046919
I feel like with this code there are only two options:
* keep `isTotal` O(1) , to prevent traps and problems. it does document that
the input must be minimized. Maybe it is enough to fix
rmuir commented on PR #13707:
URL: https://github.com/apache/lucene/pull/13707#issuecomment-2325097210
OK I took a third option here in the latest commit.
Javadoc is unchanged, we just detect the kind of "total" automaton being
created by RegExp, too, without involving any scary algor
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325100044
See PR: #13707, I took a different approach which solves the practical
problem without doing scary stuff.
```
public static boolean isTotal(Automaton a, int minAlphabet, i
rmuir opened a new pull request, #13708:
URL: https://github.com/apache/lucene/pull/13708
These methods run in quadratic time and have been traps in the past: they
run in quadratic time.
I think originally this was `equals()` but it is so costly, that we factored
out into separate `s
dweiss commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325203893
I like the brevity of using sameLanguage! :)
I keep trying to find a counterexample to the assertion that a
deterministic, total automaton must accept full language in each st
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325216584
I like the sameLanguage too, but I don't like the potential quadratic cost,
considering we currently expect the calculation to be fast, and it is called on
every automaton. I think it
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325222565
I think the "full traversal" suggested by Dawid here would be very fast. The
annoying part is probably just the reachability (e.g. regex parser produces
automatons with some unreachab
dweiss commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325226500
> I like the sameLanguage too, but I don't like the potential quadratic
cost, considering we currently expect the calculation to be fast, and it is
called on every automaton. I think
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325229488
> I don't think all states need to be considered - only those reachable from
the initial state. Tracking which states have been checked already may add some
overhead but even with
javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1741215790
##
lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollectorManager.java:
##
@@ -28,17 +31,77 @@
*/
public class TotalHitCountCollectorManager
imple
dweiss commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325234318
You could probably create an automaton containing states with an insane
number of outgoing transitions, for example one transition for each
character... then resolving that such a st
javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1741216330
##
lucene/test-framework/src/java/org/apache/lucene/tests/search/ScorerIndexSearcher.java:
##
@@ -76,4 +77,14 @@ protected void search(List leaves, Weight
weight, Col
javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1741216647
##
lucene/core/src/java/org/apache/lucene/search/BulkScorer.java:
##
@@ -27,18 +27,6 @@
*/
public abstract class BulkScorer {
- /**
- * Scores and collects all
javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1741216982
##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -732,7 +842,11 @@ private void search(
* @param collector to receive hits
* @throws
dweiss commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325236726
Like so:
```
Automaton c =
Operations.repeat(
Operations.union(
List.of(
Automata.makeCharRange(
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325244262
I don't think it is too bad because transitions are already sorted and
collapsed for each state when you call `finishState()`. For such an automaton
you paid the price at construction
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325264840
@dweiss i coded your idea up like this:
```java
/**
* Returns true if the given automaton accepts all strings for the
specified min/max range of the
* alphabet.
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325265996
https://github.com/apache/lucene/pull/13707/commits/ce4cce05538f79418ba4616dda89755975617762
--
This is an automated message from the Apache Git Service.
To respond to the message,
rmuir commented on PR #13707:
URL: https://github.com/apache/lucene/pull/13707#issuecomment-2325272520
iterating again, I changed the code here to @dweiss solution and added tests
based on his examples.
it runs in linear time and space (bitset), and should work generally for any
DFA,
rmuir commented on PR #13707:
URL: https://github.com/apache/lucene/pull/13707#issuecomment-2325282540
heh that empty case got found in a different way by the randomized check.
it isn't enough to do `a.getNumStates() > 0`, for the situation where all
states are dead states :)
w
rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325291557
> * automaton must not be empty
where "empty" means, at least one reachable state. Automaton can have all
dead states, and that doesn't make it total :)
--
This is an automat
jpountz commented on PR #13692:
URL: https://github.com/apache/lucene/pull/13692#issuecomment-2325294386
Other data points: if I bias towards the next 2 doc IDs rather than just the
next doc ID:
```java
static int findNextGEQ(long[] values, long target, int startIndex) {
i
rmuir commented on PR #13707:
URL: https://github.com/apache/lucene/pull/13707#issuecomment-2325299722
ok, the tests have found all the fun with empty cases and I think it is
ready for review.
for background, this problem is similar to `Operations.isEmpty()`:
https://github.com/apach
dweiss commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325690199
Yes, I like it!
I had some time to think about it before I went to bed and this
implementation is actually a direct rollout of the definition of accepted
language equivalence
49 matches
Mail list logo