Re: [PR] Add IntervalsSource for range and regexp queries [lucene]

2024-07-14 Thread via GitHub


mayya-sharipova commented on code in PR #13562:
URL: https://github.com/apache/lucene/pull/13562#discussion_r1677132919


##
lucene/queries/src/java/org/apache/lucene/queries/intervals/Intervals.java:
##
@@ -206,6 +210,91 @@ public static IntervalsSource wildcard(BytesRef wildcard, int maxExpansions) {
     return new MultiTermIntervalsSource(ca, maxExpansions, wildcard.utf8ToString());
   }
 
+  /**
+   * Return an {@link IntervalsSource} over the disjunction of all terms that match a regular
+   * expression
+   *
+   * WARNING: Setting {@code maxExpansions} to higher than the default value of {@link
+   * #DEFAULT_MAX_EXPANSIONS} can be both slow and memory-intensive
+   *
+   * @param regexp regula expression
+   * @throws IllegalStateException if the regex expands to more than {@link #DEFAULT_MAX_EXPANSIONS}
+   * terms
+   * @see RegexpQuery for regexp format
+   */
+  public static IntervalsSource regexp(BytesRef regexp) {
+    return regexp(regexp, DEFAULT_MAX_EXPANSIONS);
+  }
+
+  /**
+   * Expert: Return an {@link IntervalsSource} over the disjunction of all terms that match a
+   * regular expression
+   *
+   * WARNING: Setting {@code maxExpansions} to higher than the default value of {@link
+   * #DEFAULT_MAX_EXPANSIONS} can be both slow and memory-intensive
+   *
+   * @param regexp regula expression

Review Comment:
   Addressed in bdc43c7256da8dc178ec33d307d18d47de80ebeb
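
The Javadoc above describes an expansion limit: the regexp is expanded into the terms it matches, and an IllegalStateException is thrown once more terms match than the limit allows. A minimal stand-alone sketch of that behaviour follows. It is not the Lucene implementation: the class name, the `expand` helper, and the in-memory term list are hypothetical, and it uses `java.util.regex` syntax rather than the `RegexpQuery` format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the term-expansion step; this is not Lucene code.
final class ExpansionLimitSketch {

  /** Collects dictionary terms that fully match the pattern, failing once the limit is passed. */
  static List<String> expand(String regex, List<String> termDictionary, int maxExpansions) {
    // Assumption: java.util.regex syntax, which differs from Lucene's regexp format.
    Pattern pattern = Pattern.compile(regex);
    List<String> matches = new ArrayList<>();
    for (String term : termDictionary) {
      if (pattern.matcher(term).matches()) {
        if (matches.size() == maxExpansions) {
          throw new IllegalStateException(
              "regexp [" + regex + "] expands to more than " + maxExpansions + " terms");
        }
        matches.add(term);
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    List<String> terms = List.of("car", "card", "care", "cart", "dog");
    // Three terms match, which stays within the limit of 3.
    System.out.println(expand("car.", terms, 3));
    try {
      // Four terms match, which exceeds the limit of 3.
      expand("car.?", terms, 3);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Failing fast rather than silently truncating keeps the caller aware that the disjunction would have been larger than requested.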



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add IntervalsSource for range and regexp queries [lucene]

2024-07-14 Thread via GitHub


mayya-sharipova commented on code in PR #13562:
URL: https://github.com/apache/lucene/pull/13562#discussion_r1677132855


##
lucene/queries/src/java/org/apache/lucene/queries/intervals/Intervals.java:
##
@@ -206,6 +210,91 @@ public static IntervalsSource wildcard(BytesRef wildcard, int maxExpansions) {
     return new MultiTermIntervalsSource(ca, maxExpansions, wildcard.utf8ToString());
   }
 
+  /**
+   * Return an {@link IntervalsSource} over the disjunction of all terms that match a regular
+   * expression
+   *
+   * WARNING: Setting {@code maxExpansions} to higher than the default value of {@link

Review Comment:
   @romseygeek Thanks for the feedback. Very nice to see you again. Addressed in bdc43c7256da8dc178ec33d307d18d47de80ebeb



##
lucene/queries/src/java/org/apache/lucene/queries/intervals/Intervals.java:
##
@@ -206,6 +210,91 @@ public static IntervalsSource wildcard(BytesRef wildcard, int maxExpansions) {
     return new MultiTermIntervalsSource(ca, maxExpansions, wildcard.utf8ToString());
   }
 
+  /**
+   * Return an {@link IntervalsSource} over the disjunction of all terms that match a regular
+   * expression
+   *
+   * WARNING: Setting {@code maxExpansions} to higher than the default value of {@link
+   * #DEFAULT_MAX_EXPANSIONS} can be both slow and memory-intensive
+   *
+   * @param regexp regula expression
+   * @throws IllegalStateException if the regex expands to more than {@link #DEFAULT_MAX_EXPANSIONS}
+   * terms
+   * @see RegexpQuery for regexp format
+   */
+  public static IntervalsSource regexp(BytesRef regexp) {
+    return regexp(regexp, DEFAULT_MAX_EXPANSIONS);
+  }
+
+  /**
+   * Expert: Return an {@link IntervalsSource} over the disjunction of all terms that match a
+   * regular expression
+   *
+   * WARNING: Setting {@code maxExpansions} to higher than the default value of {@link
+   * #DEFAULT_MAX_EXPANSIONS} can be both slow and memory-intensive
+   *
+   * @param regexp regula expression

Review Comment:
   Addressed in bdc43c7256da8dc178ec33d307d18d47de80ebeb



Re: [PR] Add IntervalsSource for range and regexp queries [lucene]

2024-07-14 Thread via GitHub


mayya-sharipova commented on PR #13562:
URL: https://github.com/apache/lucene/pull/13562#issuecomment-2227356227

   @romseygeek @dweiss Thank you for the review.


Re: [PR] Add IntervalsSource for range and regexp queries [lucene]

2024-07-14 Thread via GitHub


mayya-sharipova merged PR #13562:
URL: https://github.com/apache/lucene/pull/13562


[PR] Add RefCountedSharedArena [lucene]

2024-07-14 Thread via GitHub


ChrisHegarty opened a new pull request, #13570:
URL: https://github.com/apache/lucene/pull/13570

   This commit adds a ref-counted shared arena to support aggregating segment files into a single Arena.
   
   TODO: benchmark, and add better tests.
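
The reference-counting idea can be sketched without any Arena at all. The following is a minimal illustration under stated assumptions, not the code in this PR: the class name `RefCountedResource`, its methods, and the `onClose` callback are hypothetical, and the callback stands in for closing the shared Arena once the last segment file releases it.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of reference counting around a shared resource; not the PR's code.
final class RefCountedResource {
  // Assumption: the creator holds the first reference.
  private final AtomicInteger refCount = new AtomicInteger(1);
  private final Runnable onClose;

  RefCountedResource(Runnable onClose) {
    this.onClose = onClose;
  }

  /** Takes another reference; returns false if the resource has already been closed. */
  boolean tryAcquire() {
    while (true) {
      int current = refCount.get();
      if (current == 0) {
        return false;
      }
      if (refCount.compareAndSet(current, current + 1)) {
        return true;
      }
    }
  }

  /** Drops one reference and runs the close callback when the last one goes away. */
  void release() {
    while (true) {
      int current = refCount.get();
      if (current == 0) {
        throw new IllegalStateException("released more often than acquired");
      }
      if (refCount.compareAndSet(current, current - 1)) {
        if (current == 1) {
          onClose.run();
        }
        return;
      }
    }
  }

  public static void main(String[] args) {
    RefCountedResource shared = new RefCountedResource(() -> System.out.println("closed"));
    shared.tryAcquire(); // a second file of the same segment shares the resource
    shared.release();    // one reference is still outstanding, so nothing is closed
    shared.release();    // last reference released, so the callback runs
  }
}
```

Using compare-and-set loops for both operations keeps the count from ever dropping below zero or being revived after the close callback has run.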


Re: [PR] Group memory arenas by segment to reduce costly `Arena.close()` [lucene]

2024-07-14 Thread via GitHub


ChrisHegarty commented on PR #13555:
URL: https://github.com/apache/lucene/pull/13555#issuecomment-2227465099

   Hi, I don't have access to commit to this branch, so (sorry) I just created an alternative PR to sketch out an idea around a simplification of ref counting the arena. See #13570.


[I] TestSnapshotDeletionPolicy#testMultiThreadedSnapshotting assertion failure [lucene]

2024-07-14 Thread via GitHub


aoli-al opened a new issue, #13571:
URL: https://github.com/apache/lucene/issues/13571

   ### Description
   
   I saw the following assertion failure when running TestSnapshotDeletionPolicy#testMultiThreadedSnapshotting:
   
   ```
   Caused by: java.lang.AssertionError: seqNo=9 vs maxSeqNo=8
       at org.apache.lucene.index.DocumentsWriterDeleteQueue.getNextSequenceNumber(DocumentsWriterDeleteQueue.java:567)
       at org.apache.lucene.index.DocumentsWriterDeleteQueue.updateSlice(DocumentsWriterDeleteQueue.java:286)
       at org.apache.lucene.index.DocumentsWriterPerThread.finishDocuments(DocumentsWriterPerThread.java:344)
       at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:284)
       at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
       at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1552)
       at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1837)
       at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1477)
       at org.apache.lucene.index.TestSnapshotDeletionPolicy$2.run(TestSnapshotDeletionPolicy.java:302)
   ```
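
The shape of the message ("seqNo=9 vs maxSeqNo=8") suggests a counter that hands out sequence numbers and checks each one against an upper bound. A minimal sketch of such a check follows, as an illustration only: the class `BoundedSequenceSketch` is hypothetical, how the real check is written is an assumption, and the sketch throws explicitly instead of using `assert` so that it behaves the same with assertions disabled.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a bounded sequence-number generator; not the Lucene class.
final class BoundedSequenceSketch {
  private final AtomicLong nextSeqNo;
  private final long maxSeqNo;

  BoundedSequenceSketch(long firstSeqNo, long maxSeqNo) {
    this.nextSeqNo = new AtomicLong(firstSeqNo);
    this.maxSeqNo = maxSeqNo;
  }

  /** Hands out the next sequence number and fails if it passes the upper bound. */
  long getNextSequenceNumber() {
    long seqNo = nextSeqNo.getAndIncrement();
    if (seqNo > maxSeqNo) {
      throw new AssertionError("seqNo=" + seqNo + " vs maxSeqNo=" + maxSeqNo);
    }
    return seqNo;
  }

  public static void main(String[] args) {
    BoundedSequenceSketch queue = new BoundedSequenceSketch(8, 8);
    System.out.println(queue.getNextSequenceNumber()); // still within the bound
    try {
      queue.getNextSequenceNumber(); // one request more than the bound allows
    } catch (AssertionError e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Under this reading, the failure means more sequence numbers were requested than the bound had budgeted for, which is the kind of miscount a thread-timing change could expose.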
   
   This failure might also be related to #13446 and #13127. To reproduce it, apply the following patch. The error shows up after about 30 seconds.
   
   ```
   diff --git a/lucene/core/src/java/org/apache/lucene/index/ConcurrentApproximatePriorityQueue.java b/lucene/core/src/java/org/apache/lucene/index/ConcurrentApproximatePriorityQueue.java
   index 8a8fc72ab4c..12e05293a19 100644
   --- a/lucene/core/src/java/org/apache/lucene/index/ConcurrentApproximatePriorityQueue.java
   +++ b/lucene/core/src/java/org/apache/lucene/index/ConcurrentApproximatePriorityQueue.java
   @@ -38,7 +38,7 @@ final class ConcurrentApproximatePriorityQueue {
        int concurrency = coreCount / 4;
        concurrency = Math.max(MIN_CONCURRENCY, concurrency);
        concurrency = Math.min(MAX_CONCURRENCY, concurrency);
   -    return concurrency;
   +    return 3;
      }
    
      final int concurrency;
   diff --git a/lucene/core/src/java/org/apache/lucene/index/DocumentsWriter.java b/lucene/core/src/java/org/apache/lucene/index/DocumentsWriter.java
   index 7955df5630e..ba3c6f0d3f0 100644
   --- a/lucene/core/src/java/org/apache/lucene/index/DocumentsWriter.java
   +++ b/lucene/core/src/java/org/apache/lucene/index/DocumentsWriter.java
   @@ -411,8 +411,33 @@ final class DocumentsWriter implements Closeable, Accountable {
          final DocumentsWriterDeleteQueue.Node delNode)
          throws IOException {
        boolean hasEvents = preUpdate();
   +    if (Thread.currentThread().getName().contains("t9")) {
   +      try {
   +        Thread.sleep(10);
   +      } catch (Exception e) {}
   +    }
    
   +    if (!Thread.currentThread().getName().contains("t0")) {
   +      if (Thread.currentThread().getName().contains("t1")) {
   +        try {
   +          Thread.sleep(5000);
   +        } catch (Exception e) {}
   +      } else if (Thread.currentThread().getName().contains("t2")) {
   +        try {
   +          Thread.sleep(5000);
   +        } catch (Exception e) {}
   +      } else {
   +        try {
   +          Thread.sleep(1000);
   +        } catch (Exception e) {}
   +      }
   +    }
        final DocumentsWriterPerThread dwpt = flushControl.obtainAndLock();
   +    if (!Thread.currentThread().getName().contains("t0") && !Thread.currentThread().getName().contains("t1") && !Thread.currentThread().getName().contains("t2")) {
   ```
   
   
   ### Version and environment details
   
   Commit: 33a4c1d8ef02dacedde9c7f04a3c7e2e78c9
   Java version: openjdk 21.0.3 2024-04-16


Re: [PR] Add levels to DocValues skipper index [lucene]

2024-07-14 Thread via GitHub


jpountz commented on code in PR #13563:
URL: https://github.com/apache/lucene/pull/13563#discussion_r1677204200


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesFormat.java:
##
@@ -194,5 +194,34 @@ public DocValuesProducer fieldsProducer(SegmentReadState state) throws IOExcepti
   static final int TERMS_DICT_REVERSE_INDEX_SIZE = 1 << TERMS_DICT_REVERSE_INDEX_SHIFT;
   static final int TERMS_DICT_REVERSE_INDEX_MASK = TERMS_DICT_REVERSE_INDEX_SIZE - 1;
 
+  // number of documents in an interval
   private static final int DEFAULT_SKIP_INDEX_INTERVAL_SIZE = 4096;
+  // number of intervals represented as a shift to create a new level, this is 1 << 3 == 8
+  // intervals.
+  static final int SKIP_INDEX_LEVEL_SHIFT = 3;
+  // max number of levels
+  // Increasing this number, it increases how much heap we need at index time.
+  // we currently need (1 * 8 * 8 * 8)  = 512 accumulators on heap
+  static final int SKIP_INDEX_MAX_LEVEL = 4;
+  // how many intervals at level 0 are in each level (1 << (SKIP_INDEX_LEVEL_SHIFT * level)).
+  static int[] SKIP_INDEX_NUMBER_INTERVALS_PER_LEVEL = new int[SKIP_INDEX_MAX_LEVEL];
+  // number of bytes to skip when skipping a level. It does not take into account the
+  // current interval that is being read.
+  static long[] SKIP_INDEX_JUMP_LENGTH_PER_LEVEL = new long[SKIP_INDEX_MAX_LEVEL];
+
+  static {
+    for (int level = 0; level < SKIP_INDEX_MAX_LEVEL; level++) {
+      SKIP_INDEX_NUMBER_INTERVALS_PER_LEVEL[level] = 1 << (SKIP_INDEX_LEVEL_SHIFT * level);

Review Comment:
   Nit: It's so cheap to compute that I wonder if we should really use a lookup 
table vs. recomputing it all the time.
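
To make the trade-off concrete, here is a small stand-alone sketch that puts the lookup table from the diff next to an on-the-fly computation. The two constants are copied from the diff; the class name and the `intervalsAtLevel` helper are hypothetical.

```java
// Hypothetical sketch comparing a lookup table with recomputing the value on each call.
final class SkipLevelsSketch {
  static final int SKIP_INDEX_LEVEL_SHIFT = 3;
  static final int SKIP_INDEX_MAX_LEVEL = 4;

  // Lookup-table variant, filled the same way as the static initializer in the diff.
  static final int[] INTERVALS_PER_LEVEL = new int[SKIP_INDEX_MAX_LEVEL];

  static {
    for (int level = 0; level < SKIP_INDEX_MAX_LEVEL; level++) {
      INTERVALS_PER_LEVEL[level] = 1 << (SKIP_INDEX_LEVEL_SHIFT * level);
    }
  }

  // On-the-fly variant: one multiplication and one shift per call.
  static int intervalsAtLevel(int level) {
    return 1 << (SKIP_INDEX_LEVEL_SHIFT * level);
  }

  public static void main(String[] args) {
    for (int level = 0; level < SKIP_INDEX_MAX_LEVEL; level++) {
      System.out.println(
          "level " + level + ": table=" + INTERVALS_PER_LEVEL[level]
              + " computed=" + intervalsAtLevel(level));
    }
  }
}
```

Both variants yield 1, 8, 64 and 512 level-0 intervals for levels 0 through 3, so the choice is only about whether an array load is worth saving a shift.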



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java:
##
@@ -1792,61 +1794,88 @@ public DocValuesSkipper getSkipper(FieldInfo field) throws IOException {
     if (input.length() > 0) {
       input.prefetch(0, 1);
     }
+    // TODO: should we write to disk the actual max level for this segment?
     return new DocValuesSkipper() {
-      int minDocID = -1;
-      int maxDocID = -1;
-      long minValue, maxValue;
-      int docCount;
+      final int[] minDocID = new int[SKIP_INDEX_MAX_LEVEL];
+      final int[] maxDocID = new int[SKIP_INDEX_MAX_LEVEL];
+
+      {
+        for (int i = 0; i < SKIP_INDEX_MAX_LEVEL; i++) {
+          minDocID[i] = maxDocID[i] = -1;
+        }
+      }
+
+      final long[] minValue = new long[SKIP_INDEX_MAX_LEVEL];
+      final long[] maxValue = new long[SKIP_INDEX_MAX_LEVEL];
+      final int[] docCount = new int[SKIP_INDEX_MAX_LEVEL];
+      int levels;
 
       @Override
       public void advance(int target) throws IOException {
         if (target > entry.maxDocId) {
-          minDocID = DocIdSetIterator.NO_MORE_DOCS;
-          maxDocID = DocIdSetIterator.NO_MORE_DOCS;
+          // skipper is exhausted
+          for (int i = 0; i < SKIP_INDEX_MAX_LEVEL; i++) {
+            minDocID[i] = maxDocID[i] = DocIdSetIterator.NO_MORE_DOCS;
+          }
         } else {
+          // find next interval
+          assert target > maxDocID[0] : "target must be bigger that current interval";
           while (true) {
-            maxDocID = input.readInt();
-            if (maxDocID >= target) {
-              minDocID = input.readInt();
-              maxValue = input.readLong();
-              minValue = input.readLong();
-              docCount = input.readInt();
+            levels = input.readByte();

Review Comment:
   I'm a bit confused by this, because this `levels` variable feels like the number of lower levels that are getting updated, while some higher levels may still be valid? But then it would not be correct to return `levels` in `numLevels()` as we'd be missing these higher levels that are still valid?
   
   E.g. if there are 8 intervals, there are two levels. But when we read the second interval, then `levels` is 1 because only the lower level needs updating, but there are still 2 valid levels?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesConsumer.java:
##
@@ -207,65 +210,120 @@ void accumulate(long value) {
       maxValue = Math.max(maxValue, value);
     }
 
+    void accumulate(SkipAccumulator other) {
+      maxDocID = other.maxDocID;
+      minValue = Math.min(minValue, other.minValue);
+      maxValue = Math.max(maxValue, other.maxValue);
+      docCount += other.docCount;
+    }
+
     void nextDoc(int docID) {
       maxDocID = docID;
       ++docCount;
     }
 
-    void writeTo(DataOutput output) throws IOException {
-      output.writeInt(maxDocID);
-      output.writeInt(minDocID);
-      output.writeLong(maxValue);
-      output.writeLong(minValue);
-      output.writeInt(docCount);
+    public static SkipAccumulator merge(List<SkipAccumulator> list, int index, int length) {
+      SkipAccumulator acc = new SkipAccumulator(list.get(index).minDocID);
+      for (int i = 0; i < len

Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

2024-07-14 Thread via GitHub


msokolov commented on PR #13566:
URL: https://github.com/apache/lucene/pull/13566#issuecomment-2227508427

   I tried implementing a "strongly-connected" version of this thing. My implementation is super slow (makes indexing take way longer) and the result in terms of search metrics was about the same. I might see about using another implementation, maybe this one? https://www.geeksforgeeks.org/tarjan-algorithm-find-strongly-connected-components/ but it may not be worth it given this whole thing is best-effort anyway.
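
For reference, Tarjan's algorithm finds all strongly connected components in a single depth-first pass. The following is a compact stand-alone sketch, not code from this PR: the graph is a plain adjacency list of integer node ids rather than an HNSW graph, the class and method names are hypothetical, and the recursion would need to be made iterative for graphs deep enough to overflow the stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of Tarjan's strongly-connected-components algorithm on an adjacency list.
final class TarjanSccSketch {
  private final List<List<Integer>> adjacency;
  private final int[] index;       // discovery order of each node, -1 while unvisited
  private final int[] lowLink;     // smallest discovery index reachable from the node
  private final boolean[] onStack; // whether the node is on the component stack
  private final Deque<Integer> stack = new ArrayDeque<>();
  private final List<List<Integer>> components = new ArrayList<>();
  private int nextIndex = 0;

  private TarjanSccSketch(List<List<Integer>> adjacency) {
    this.adjacency = adjacency;
    int n = adjacency.size();
    this.index = new int[n];
    Arrays.fill(this.index, -1);
    this.lowLink = new int[n];
    this.onStack = new boolean[n];
  }

  /** Returns the strongly connected components of a directed graph with nodes 0..n-1. */
  static List<List<Integer>> stronglyConnectedComponents(List<List<Integer>> adjacency) {
    TarjanSccSketch state = new TarjanSccSketch(adjacency);
    for (int v = 0; v < adjacency.size(); v++) {
      if (state.index[v] == -1) {
        state.visit(v);
      }
    }
    return state.components;
  }

  private void visit(int v) {
    index[v] = lowLink[v] = nextIndex++;
    stack.push(v);
    onStack[v] = true;
    for (int w : adjacency.get(v)) {
      if (index[w] == -1) {
        // Tree edge: recurse, then inherit the smallest index reachable from w.
        visit(w);
        lowLink[v] = Math.min(lowLink[v], lowLink[w]);
      } else if (onStack[w]) {
        // Edge back into the current stack: w belongs to the same component as v.
        lowLink[v] = Math.min(lowLink[v], index[w]);
      }
    }
    if (lowLink[v] == index[v]) {
      // v is the root of a component: pop everything above it, v included.
      List<Integer> component = new ArrayList<>();
      int w;
      do {
        w = stack.pop();
        onStack[w] = false;
        component.add(w);
      } while (w != v);
      components.add(component);
    }
  }

  public static void main(String[] args) {
    // Edges: 0->1, 1->2, 2->0, 2->3, 3->4, 4->3.
    List<List<Integer>> graph =
        List.of(List.of(1), List.of(2), List.of(0, 3), List.of(4), List.of(3));
    System.out.println(stronglyConnectedComponents(graph));
  }
}
```

The pass visits each node and edge once, so its cost is linear in the size of the graph.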

