Re: [PR] Remove CollectorOwner class (#13671) [lucene]

2024-09-04 Thread via GitHub


epotyom commented on code in PR #13702:
URL: https://github.com/apache/lucene/pull/13702#discussion_r1743335144


##
lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java:
##
@@ -414,62 +399,59 @@ public  ConcurrentDrillSidewaysResult search(
 
   /**
* Search using DrillDownQuery with custom collectors. This method can be 
used with any {@link
-   * CollectorOwner}s. It doesn't return anything because it is expected that 
you read results from
-   * provided {@link CollectorOwner}s.
-   *
-   * To read the results, run {@link CollectorOwner#getResult()} for drill 
down and all drill
-   * sideways dimensions.
-   *
-   * Note: use {@link Collections#unmodifiableList(List)} to wrap {@code
-   * drillSidewaysCollectorOwners} to convince compiler that it is safe to use 
List here.
-   *
-   * Use {@link MultiCollectorManager} wrapped by {@link CollectorOwner} to 
collect both hits and
-   * facets for the entire query and/or for drill-sideways dimensions.
+   * CollectorManager}s.
*
-   * TODO: Class CollectorOwner was created so that we can ignore 
CollectorManager type C,
-   * because we want each dimensions to be able to use their own types. 
Alternatively, we can use
-   * typesafe heterogeneous container and provide CollectorManager type for 
each dimension to this
-   * method? I do like CollectorOwner approach as it seems more intuitive?
+   * Note: Use {@link MultiCollectorManager} to collect both hits and 
facets for the entire query
+   * and/or for drill-sideways dimensions. You can also use it to wrap 
different types of {@link
+   * CollectorManager} for drill-sideways dimensions.
*/
-  public void search(
+  public  Result search(
   final DrillDownQuery query,
-  CollectorOwner drillDownCollectorOwner,
-  List> drillSidewaysCollectorOwners)
+  CollectorManager drillDownCollectorManager,
+  List> drillSidewaysCollectorManagers)
   throws IOException {
-if (drillDownCollectorOwner == null) {
+if (drillDownCollectorManager == null) {
   throw new IllegalArgumentException(
   "This search method requires client to provide drill down collector 
manager");
 }
-if (drillSidewaysCollectorOwners == null) {
+if (drillSidewaysCollectorManagers == null) {

Review Comment:
   I agree with all points - thanks for reverting the commit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestBoolean2.testRandomQueries fails in CI due to eating up heap space [lucene]

2024-09-04 Thread via GitHub


javanna commented on issue #11754:
URL: https://github.com/apache/lucene/issues/11754#issuecomment-2328396050

   You are correct @rmuir !
   
   In my case, `mulFactor` is `1024`. My branch makes it even more likely to go 
out of heap space, because it may create more slices to be searched 
concurrently, and each slice gets a separate collector with a separate queue of 
that size. We could disable query concurrency for this test, or work on 
limiting the number of slices. On the other hand, I don't understand the 
purpose of `1000 * mulFactor`. If I use `mulFactor` as the size of the queue, 
the test still succeeds and it no longer goes out of heap for my failing seed. 
Maybe that's a reasonable fix?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-04 Thread via GitHub


javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1743463027


##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -890,11 +945,70 @@ public static class LeafSlice {
  *
  * @lucene.experimental
  */
-public final LeafReaderContext[] leaves;
+public final LeafReaderContextPartition[] leaves;
 
-public LeafSlice(List leavesList) {
-  Collections.sort(leavesList, Comparator.comparingInt(l -> l.docBase));
-  this.leaves = leavesList.toArray(new LeafReaderContext[0]);
+public LeafSlice(List 
leafReaderContextPartitions) {
+  leafReaderContextPartitions.sort(Comparator.comparingInt(l -> 
l.ctx.docBase));
+  // TODO should we sort by minDocId too?
+  this.leaves = leafReaderContextPartitions.toArray(new 
LeafReaderContextPartition[0]);
+}
+
+/**
+ * Returns the total number of docs that a slice targets, by summing the 
number of docs that
+ * each of its leaf context partitions targets.
+ */
+public int getNumDocs() {
+  return Arrays.stream(leaves)
+  .map(LeafReaderContextPartition::getNumDocs)
+  .reduce(Integer::sum)
+  .get();
+}
+  }
+
+  /**
+   * Holds information about a specific leaf context and the corresponding 
range of doc ids to
+   * search within.
+   *
+   * @lucene.experimental
+   */
+  public static final class LeafReaderContextPartition {
+private final int minDocId;
+private final int maxDocId;
+private final int numDocs;
+public final LeafReaderContext ctx;
+
+private LeafReaderContextPartition(
+LeafReaderContext leafReaderContext, int minDocId, int maxDocId, int 
numDocs) {
+  this.ctx = leafReaderContext;
+  this.minDocId = minDocId;
+  this.maxDocId = maxDocId;
+  this.numDocs = numDocs;

Review Comment:
   I spent quite a bit of time on this, and concluded that we could use 
`maxDoc` in all cases, also when the partition targets the entire segment, but 
it is not practical. There are places where we special case when min==0 and 
max==NO_MORE_DOCS , hence it is simpler to keep on using NO_MORE_DOCS as an 
upper bound.
   
   That does mean that for this case we need to take maxDoc as a separate 
argument because we cannot compute the number of docs for a partition via 
maxDocId - minDocId.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-04 Thread via GitHub


javanna commented on PR #13542:
URL: https://github.com/apache/lucene/pull/13542#issuecomment-2328473802

   This is ready for a final review. I have no further outstanding items to 
look into. I added some entries to the migrate file, changes entry and updated 
description above to be aligned with the proposed changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestBoolean2.testRandomQueries fails in CI due to eating up heap space [lucene]

2024-09-04 Thread via GitHub


rmuir commented on issue #11754:
URL: https://github.com/apache/lucene/issues/11754#issuecomment-2328792680

   @javanna I don't understand the test, but we need to tone down its 
craziness. Removing 1000 sounds great to me, if the single line change makes 
the test a thousand times less annoying then we succeeded.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestBoolean2.testRandomQueries fails in CI due to eating up heap space [lucene]

2024-09-04 Thread via GitHub


javanna commented on issue #11754:
URL: https://github.com/apache/lucene/issues/11754#issuecomment-2328812477

   Agreed, I opened #13713


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-04 Thread via GitHub


mikemccand commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2328841283

   > I think the relaxation patch is fine as a short first step - it doesn't 
claim to be optimal (PnP, as Mike loves to say).
   
   +1, heh.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] RegExp::toAutomaton no longer minimizes [lucene]

2024-09-04 Thread via GitHub


rmuir commented on issue #13706:
URL: https://github.com/apache/lucene/issues/13706#issuecomment-2329167160

   > Is `subsetOf` really quadratic? I know its javadoc says so, but I'm not 
convinced.
   
   I think this is only one of the scarier parts about it. The other scary part 
is that it may throw `IllegalArgumentException`, or in some cases 
`AssertionError` (if asserts are enabled, if not... UB?). 
   
   For these reasons too, I wanted to avoid its usage in something that gets 
called e.g. by CompiledAutomaton and proposed moving it to AutomatonTestUtil 
for test purposes only: https://github.com/apache/lucene/pull/13708


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce size of queue in TestBoolean2#testRandomQueries [lucene]

2024-09-04 Thread via GitHub


javanna merged PR #13713:
URL: https://github.com/apache/lucene/pull/13713


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] TestBoolean2.testRandomQueries fails in CI due to eating up heap space [lucene]

2024-09-04 Thread via GitHub


javanna closed issue #11754: TestBoolean2.testRandomQueries fails in CI due to 
eating up heap space
URL: https://github.com/apache/lucene/issues/11754


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Relax Operations.isTotal() to work with a deterministic automaton [lucene]

2024-09-04 Thread via GitHub


mikemccand commented on code in PR #13707:
URL: https://github.com/apache/lucene/pull/13707#discussion_r1743876694


##
lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java:
##
@@ -857,22 +857,38 @@ public static boolean isEmpty(Automaton a) {
 return true;
   }
 
-  /** Returns true if the given automaton accepts all strings. The automaton 
must be minimized. */
+  /** Returns true if the given automaton accepts all strings. */

Review Comment:
   OK I see from the title of the PR :)  that indeed this algo expects/requires 
a determinized automaton.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Relax Operations.isTotal() to work with a deterministic automaton [lucene]

2024-09-04 Thread via GitHub


rmuir commented on code in PR #13707:
URL: https://github.com/apache/lucene/pull/13707#discussion_r1743877792


##
lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java:
##
@@ -857,22 +857,38 @@ public static boolean isEmpty(Automaton a) {
 return true;
   }
 
-  /** Returns true if the given automaton accepts all strings. The automaton 
must be minimized. */
+  /** Returns true if the given automaton accepts all strings. */

Review Comment:
   for some NFA, it might return `false` when in fact the NFA is "total", I 
think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Relax Operations.isTotal() to work with a deterministic automaton [lucene]

2024-09-04 Thread via GitHub


mikemccand commented on code in PR #13707:
URL: https://github.com/apache/lucene/pull/13707#discussion_r1743879304


##
lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java:
##
@@ -857,22 +857,38 @@ public static boolean isEmpty(Automaton a) {
 return true;
   }
 
-  /** Returns true if the given automaton accepts all strings. The automaton 
must be minimized. */
+  /** Returns true if the given automaton accepts all strings. */
   public static boolean isTotal(Automaton a) {
 return isTotal(a, Character.MIN_CODE_POINT, Character.MAX_CODE_POINT);
   }
 
   /**
* Returns true if the given automaton accepts all strings for the specified 
min/max range of the
-   * alphabet. The automaton must be minimized.
+   * alphabet.

Review Comment:
   Add that the automaton must be determinized?   And maybe explain the CPU 
cost is `O(T)` (`T` = total number of transitions) -- linear in total 
transition count?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-04 Thread via GitHub


stefanvodita commented on PR #13542:
URL: https://github.com/apache/lucene/pull/13542#issuecomment-2329194920

   Do we have any benchmark results? I'm curious to see what's gotten faster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-04 Thread via GitHub


javanna commented on PR #13542:
URL: https://github.com/apache/lucene/pull/13542#issuecomment-2329216197

   > Do we have any benchmark results? I'm curious to see what's gotten faster.
   
   No. I focused on functionality. I wonder if it makes sense to run benchmarks 
before we address the duplicated work across partitions of the same segment. It 
would give an initial idea of the gain, but I don't know how reliable those 
numbers would be. That's also why segments don't get partitioned by default. 
   
   I am looking forward to working on the next step though :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Make Operations#repeat create simpler automata. [lucene]

2024-09-04 Thread via GitHub


jpountz opened a new pull request, #13714:
URL: https://github.com/apache/lucene/pull/13714

   `Operations#repeat` currently creates an automaton that has one more final 
state, and 2x more transitions for each of the final states. With this change, 
the returned automaton has a single final state and only 2x more transitions 
for state 0.
   
   Currently a draft as I'm still looking into whether this new logic is 
correct.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make Operations#repeat create simpler automata. [lucene]

2024-09-04 Thread via GitHub


jpountz commented on PR #13714:
URL: https://github.com/apache/lucene/pull/13714#issuecomment-2329331978

   I just pushed a commit that avoids creating dead states if the input 
automaton doesn't have dead states itself. So the automaton created by `.*` 
would now have a single state.
   
   For context, I started looking into optimizing the single-transition case 
only (hence the branch name) but as I started looking deeper into it, it looked 
like improving the general case wouldn't be much harder, as the algorithm boils 
down to renumbering final states to 0 and other states to `state + 1`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-04 Thread via GitHub


javanna commented on PR #13542:
URL: https://github.com/apache/lucene/pull/13542#issuecomment-2329360505

   > I think it's important at least to ensure that we "do no harm" to the 
existing use cases.
   
   Intra-segment is not enabled by default and requires additional work. The 
scope of this initial support is to set the foundations and make the necessary 
breaking changes (now listed in the migrate guide). Also the added abstractions 
are experimental.
   
   > do we want to have docid intervals that span segments? There was some 
discussion above about how the intervals are defined and it wasn't clear 
whether or not we may have painted ourselves into a corner. Do we have enough 
flexibility? too much?
   
   The technical approach has been the same since this PR was opened: use the 
existing `BulkScorer#score` method that  takes min and max. That is simple and 
effective. The new experimental abstraction (`LeafReaderContextPartition`) 
allows to partition a segment. If we do so, each partition must get a different 
`LeafSlice` (existing concept). A new `slicesWithPartitions` static method is 
added that is used in tests, and can be entirely customized in how segments are 
partitioned. You can have multiple partitions in the same slice, but they ought 
to be from different segments. Does this clarify your concerns? I don't 
entirely follow how we may have painted ourselves in a corner and how running 
benchmarks relates to this. I will run benchmarks at some point, but is that 
necessary to verify that searching force-merged segments can use more of the 
available resources, same as we do when searching multiple segments 
concurrently?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make Operations#repeat create simpler automata. [lucene]

2024-09-04 Thread via GitHub


mikemccand commented on code in PR #13714:
URL: https://github.com/apache/lucene/pull/13714#discussion_r1743958059


##
lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java:
##
@@ -182,26 +182,44 @@ public static Automaton repeat(Automaton a) {
   // Repeating the empty automata will still only accept the empty 
automata.
   return a;
 }
+
+if (a.isAccept(0) && a.getAcceptStates().nextSetBit(1) == -1) {

Review Comment:
   I guess it is allowed to pass `1` to `nextSetBit` even if the `BitSet` is 
length 1?  Javadoc seems to say it only throws `IndexOutOfBoundsException` if 
you pass a negative index, and the example `for` loop would do what you are 
doing here on a length 1 `BitSet` so I think it's OK.



##
lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java:
##
@@ -182,26 +182,44 @@ public static Automaton repeat(Automaton a) {
   // Repeating the empty automata will still only accept the empty 
automata.
   return a;
 }
+
+if (a.isAccept(0) && a.getAcceptStates().nextSetBit(1) == -1) {
+  // If the only accept state is 0, then this automaton already repeats 
itself. Automata
+  // returned by this function only accept state 0, so this makes this 
function idempotent.
+  return a;

Review Comment:
   This is quite a cool observation (both sentences, separately)!
   
   Maybe swap the words `only accept` to make it more accurate (`only` is 
referring to `state 0`, not really `accept`).  Proper `only` placement is 
tricky!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-04 Thread via GitHub


javanna commented on PR #13542:
URL: https://github.com/apache/lucene/pull/13542#issuecomment-2329387884

   Perhaps one thing to add is that the main focus/direction of this change is 
to set the foundations to be able to finally decouple the index geometry from 
how the search is run and can be parallelized. We have support for intra-query 
concurrency, but it can only be leveraged if you have multiple segments. We 
want to be able to search independently from the number of segments, and apply 
the already existing concurrency support to big/force-merged segments that can 
only be searched sequentially otherwise.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make Operations#repeat create simpler automata. [lucene]

2024-09-04 Thread via GitHub


jpountz commented on PR #13714:
URL: https://github.com/apache/lucene/pull/13714#issuecomment-2329450444

   Good ideas, I'll look into it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Relax Operations.isTotal() to work with a deterministic automaton [lucene]

2024-09-04 Thread via GitHub


rmuir commented on PR #13707:
URL: https://github.com/apache/lucene/pull/13707#issuecomment-2329472492

   I updated the javadocs: "The automaton must be deterministic, or this method 
may return false."
   Previous implementation was: "The automaton must be minimal, or this method 
may return false."


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-04 Thread via GitHub


msokolov commented on PR #13542:
URL: https://github.com/apache/lucene/pull/13542#issuecomment-2329642279

   >  I don't entirely follow how we may have painted ourselves in a corner and 
how running benchmarks relates to this.
   
   They're not related; perhaps I should have broken my comment into multiple 
posts. My main concern is that we should be able, on a single thread, to 
evaluate matches from portions of multiple segments, and it sounds as if this 
API will allow for that.
   
   Regarding the benchmarks, I understand these changes are not intended to 
have any impact on existing search code paths and need to be invoked in a 
special way. I still think testing should be done to verify that.
   
   And this relates to a different concern - that we are proposing to launch a 
large amount of dead code that will have no real purpose until further work is 
done. Again, we might learn something from implementing changes that actually 
*do* move benchmarks. I guess an underlying concern is that we are trying to 
get stuff in before an arbitrary (Lucene 10) release date, and assuming that no 
breaking changes will later be required. Isn't it conceivable that we will need 
to make a change to the Scorer API to support cloning? Maybe "painted into a 
corner" was too strong, but there could certainly be future changes we might 
need. To me it would make more sense to hold on off on merging (even if it 
means missing the impending Lucene 10 release) until we have some positive 
results.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Backport of #13702 [lucene]

2024-09-04 Thread via GitHub


gsmiller opened a new pull request, #13716:
URL: https://github.com/apache/lucene/pull/13716

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] `count()` in LRUQueryCache [lucene]

2024-09-04 Thread via GitHub


LuXugang opened a new issue, #13717:
URL: https://github.com/apache/lucene/issues/13717

   ### Description
   
   Currently, the implementation of `Weight.count()` in `LRUQueryCache` first 
considers whether the wrapped `Weight` can obtain the `count`, and then 
considers the cache. 
   However, for certain queries, such as 
`IndexSortSortedNumericDocValuesRangeQuery` and `PointRangeQuery`, obtaining 
the count from `CacheAndCount` should be faster. Should we adjust the logical 
order?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove CollectorOwner class (#13671) [lucene]

2024-09-04 Thread via GitHub


gsmiller commented on PR #13702:
URL: https://github.com/apache/lucene/pull/13702#issuecomment-2329679349

   No problem @javanna !
   
   @epotyom - I got this back-ported onto 9.x but would appreciate a 
retro-active review if you have time (#13716). As before, I had to do some work 
to `DrillSideways` since it still supports a search method that directly 
accepts a `Collector` (instead of `CollectorManager`) so I had to modify that 
directly (since there's nothing to cherry-pick from main since the method has 
been removed). I'm pretty sure I got it right (and all tests pass, etc.), but 
please let me know if you spot something and we can fix it. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add dynamic range facets [lucene]

2024-09-04 Thread via GitHub


stefanvodita commented on PR #13689:
URL: https://github.com/apache/lucene/pull/13689#issuecomment-2329719474

   It would be great to merge this in time for 9.12. Since it's completely new 
code and marked experimental, I don't expect it to be controversial, but I'll 
wait a bit longer in case anyone was planning to review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] CompiledAutomaton not equal when NFARunAutomaton is non-null [lucene]

2024-09-04 Thread via GitHub


ChrisHegarty commented on issue #13715:
URL: https://github.com/apache/lucene/issues/13715#issuecomment-2329846860

   Before considering possible fixes, do we agree that there is a problem worth 
fixing? I’m particular thinking of the equality of IntervalQuery.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] CompiledAutomaton not equal when NFARunAutomaton is non-null [lucene]

2024-09-04 Thread via GitHub


zhaih commented on issue #13715:
URL: https://github.com/apache/lucene/issues/13715#issuecomment-2329914096

   I think it's a nice to have one. Altho for the specific IntervalQuery I feel 
like maybe we can directly claim they're equal if the pattern is the same and 
not checking the automaton at all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] CompiledAutomaton not equal when NFARunAutomaton is non-null [lucene]

2024-09-04 Thread via GitHub


rmuir commented on issue #13715:
URL: https://github.com/apache/lucene/issues/13715#issuecomment-2329919715

   @ChrisHegarty I think it might be an oversight IntervalQuery is getting an 
NFA, was that really intended?
   
   The following other methods on intervals are powered by automatons and get a 
DFA:
   * Intervals.prefix()
   * Intervals.wildcard()
   * Intervals.range()
   * Intervals.fuzzyTerm()
   
   This happens because they reuse the parsing from the associated Query 
classes, but this Intervals.regexp() neglects to do that and just creates its 
own RegExp and converts it to an NFA. Given that the default of RegexpQuery is 
to determinize, I think we should determinize here to avoid surprises such as 
this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] CompiledAutomaton not equal when NFARunAutomaton is non-null [lucene]

2024-09-04 Thread via GitHub


rmuir commented on issue #13715:
URL: https://github.com/apache/lucene/issues/13715#issuecomment-2329956146

   There are other related problems to fix here separately, just start with 
CompiledAutomaton: 
   * `hashCode()` is inconsistent with `equals()`: either they both consider 
`nfaRunAutomaton` or they do not
   * `ramBytesUsed` doesn't consider `nfaRunAutomaton`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] CompiledAutomaton not equal when NFARunAutomaton is non-null [lucene]

2024-09-04 Thread via GitHub


rmuir commented on issue #13715:
URL: https://github.com/apache/lucene/issues/13715#issuecomment-2330179575

   That's usually how it works, it is good stuff. To be practical, for now, I'd 
recommend just fixing Intervals.regex to `determinize()` so that it matches all 
other queries which are using DFAs. It should be a one-line change to fix your 
problem because then there won't be an NFA :)
   
   I think it is enough for lucene 10 that the queries no longer `minimize()` 
up-front, and I'm hoping we can explore taking that next step soon so that we 
no longer `determinize()` up-front either (just as needed via NFA query), but 
we shouldn't be doing that now.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org