Re: [PR] [Draft] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]

2023-12-08 Thread via GitHub


kuramitsu commented on PR #12885:
URL: https://github.com/apache/lucene/pull/12885#issuecomment-1846729364

   The modification within the getRomanization function has been dropped.
   Instead, in the incrementToken function, I added a process to treat the 
hiragana OOV term converted to kataka as the reading when it appears.
   This processing allows for correct handling even when the Romaji option is 
false.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


jpountz commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846774383

   I'll check if there is GC activity during the benchmark. In the meantime, I 
looked into using lambdas instead, and it seems like it would work well: 
https://github.com/apache/lucene/commit/8109e639508b6875d657e84634dd27c58942f610
   
   ```
   Benchmark(size)   Mode  Cnt  
 Score   Error   Units
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt  64  thrpt5  
13.174 ± 0.870  ops/us
   GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline  64  thrpt5  
 6.834 ± 0.021  ops/us
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


jpountz commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846779641

   I confirmed there's GC activity happening with the slice approach by using 
`-prof gc`:
   
   ```
   Benchmark
 (size)   Mode  Cnt  ScoreError   Units
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt  
 64  thrpt5 11.482 ±  0.727  ops/us
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt:gc.alloc.rate
 64  thrpt5  0.001 ±  0.001  MB/sec
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt:gc.alloc.rate.norm   
 64  thrpt5 ≈ 10⁻⁴ B/op
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt:gc.count 
 64  thrpt5≈ 0   counts
   GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline  
 64  thrpt5  6.613 ±  0.504  ops/us
   GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline:gc.alloc.rate
 64  thrpt5  0.001 ±  0.001  MB/sec
   GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline:gc.alloc.rate.norm   
 64  thrpt5 ≈ 10⁻⁴ B/op
   GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline:gc.count 
 64  thrpt5≈ 0   counts
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaByteBuffer
 64  thrpt5  0.153 ±  0.002  ops/us
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaByteBuffer:gc.alloc.rate  
 64  thrpt5   2391.134 ± 25.909  MB/sec
   
GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaByteBuffer:gc.alloc.rate.norm
  64  thrpt5  16432.009 ±  0.001B/op
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaByteBuffer:gc.count   
 64  thrpt5202.000   counts
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaByteBuffer:gc.time
 64  thrpt5129.000   ms
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


easyice commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846784038

   +1 for gc overhead, here is the gc output (`-prof gc` ):
   
   ```
   Benchmark
(size)   Mode  Cnt Score Error   Units
   GroupVIntBenchmark.mmap_byteBufferReadGroupVIntUseByteBuffer 
64  thrpt5 1.873 ±   0.269  ops/us
   GroupVIntBenchmark.mmap_byteBufferReadGroupVIntUseByteBuffer:gc.alloc.rate   
64  thrpt5  4800.320 ± 690.606  MB/sec
   
GroupVIntBenchmark.mmap_byteBufferReadGroupVIntUseByteBuffer:gc.alloc.rate.norm 
 64  thrpt5  2688.001 ±   0.001B/op
   GroupVIntBenchmark.mmap_byteBufferReadGroupVIntUseByteBuffer:gc.count
64  thrpt5   273.000counts
   GroupVIntBenchmark.mmap_byteBufferReadGroupVIntUseByteBuffer:gc.time 
64  thrpt5   218.000ms
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt  
64  thrpt5   8.118 ± 5.922  ops/us
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt:gc.alloc.rate
64  thrpt5   0.001 ± 0.001  MB/sec
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt:gc.alloc.rate.norm   
64  thrpt5≈ 10⁻⁴  B/op
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt:gc.count 
64  thrpt5   ≈ 0counts
   ```
   
   > I looked into using lambdas instead, and it seems like it would work well: 
   
   It's so cool!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846881254

   I would still be safe and initialize the IntReader on construction of the 
IndexInput. It can strongly bind to the current segment.
   
   Can we do the same for all other inputs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846883907

   I will nag Maurizio again about the problem with slice(). The reason for 
this was some strange problem with Hotspot. I thought they fixed it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Correct last remaining instances of typo e.g. "Levenstein" -> "Levenshtein" [lucene]

2023-12-08 Thread via GitHub


shaikhu commented on PR #12519:
URL: https://github.com/apache/lucene/pull/12519#issuecomment-1846884628

   Oops I completely forgot about this. Restored forked repo and reopening.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


easyice commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846907147

   > Can we do the same for all other inputs?
   
   I think so, i will do this if @jpountz doesn't mind.
   
   > I will nag Maurizio again about the problem with slice().
   
   Thank you so much for pinpointing the cause of the problem!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Enable CheckIndex to exorcise segments with missing segment infos (.si) (#7820) [lucene]

2023-12-08 Thread via GitHub


gokaai commented on code in PR #12872:
URL: https://github.com/apache/lucene/pull/12872#discussion_r1420277534


##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -957,6 +974,9 @@ private Status.SegmentInfoStatus testSegment(
 SegmentReader reader = null;
 
 try {
+  if (info.info.maxDoc() == 0) {
+throw new CheckIndexException(" this segment has missing or corrupt 
segment info");

Review Comment:
   > Remove the extra space before this
   
   I was keeping it consistent with [this exception 
message](https://github.com/apache/lucene/blob/ba81826951c1f628efb60c6165add27601173bd9/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java#L949)
 - should I change this one too? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


easyice commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1847060669

   I'm running the performance differences between previous commit, it will 
take a moment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420364348


##
lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseDirectoryTestCase.java:
##
@@ -1438,4 +1440,68 @@ public void testListAllIsSorted() throws IOException {
   assertArrayEquals(expected, actual);
 }
   }
+
+  public void testDataTypes() throws IOException {
+final long[] values = new long[] {43, 12345, 123456, 1234567890};
+try (Directory dir = getDirectory(createTempDir("testDataTypes"))) {
+  IndexOutput out = dir.createOutput("test", IOContext.DEFAULT);
+  out.writeByte((byte) 43);
+  out.writeShort((short) 12345);
+  out.writeInt(1234567890);
+  out.writeGroupVInts(values, 4);
+  out.writeLong(1234567890123456789L);
+  out.close();
+
+  long[] restored = new long[4];
+  IndexInput in = dir.openInput("test", IOContext.DEFAULT);
+  assertEquals(43, in.readByte());
+  assertEquals(12345, in.readShort());
+  assertEquals(1234567890, in.readInt());
+  in.readGroupVInts(restored, 4);
+  assertArrayEquals(values, restored);
+  assertEquals(1234567890123456789L, in.readLong());
+  in.close();
+}
+  }
+
+  public void testGroupVInt() throws IOException {
+try (Directory dir = getDirectory(createTempDir("testGroupVInt"))) {
+  // test fallback to default implementation of readGroupVInt
+  doTestGroupVInt(dir, 5, 1, 6, 8);
+
+  // use more iterations to covers all bpv
+  doTestGroupVInt(dir, atLeast(100), 1, 31, 128);
+
+  // we use BaseChunkedDirectoryTestCase#testGroupVIntMultiBlocks cover 
multiple blocks for
+  // ByteBuffersDataInput and MMapDirectory
+}
+  }
+
+  protected void doTestGroupVInt(
+  Directory dir, int iterations, int minBpv, int maxBpv, int maxNumValues) 
throws IOException {
+long[] values = new long[maxNumValues];
+long[] restored = new long[maxNumValues];
+
+for (int i = 0; i < iterations; i++) {

Review Comment:
   I am a bit afraid that those tests kill my SSD at some point (also Jenkin's 
SSD). This creates files over and over. With test directories its not a 
problem, but with real FDDirectory ones we should reduce the iteration counts.
   
   The chunked test should clearly setup a directory with a random chunk size 
once and run all tests on it. Ideally write only one file with multiple 
variants of groupVints behind each other (so it definitely crosses borders) and 
then read all of them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420364348


##
lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseDirectoryTestCase.java:
##
@@ -1438,4 +1440,68 @@ public void testListAllIsSorted() throws IOException {
   assertArrayEquals(expected, actual);
 }
   }
+
+  public void testDataTypes() throws IOException {
+final long[] values = new long[] {43, 12345, 123456, 1234567890};
+try (Directory dir = getDirectory(createTempDir("testDataTypes"))) {
+  IndexOutput out = dir.createOutput("test", IOContext.DEFAULT);
+  out.writeByte((byte) 43);
+  out.writeShort((short) 12345);
+  out.writeInt(1234567890);
+  out.writeGroupVInts(values, 4);
+  out.writeLong(1234567890123456789L);
+  out.close();
+
+  long[] restored = new long[4];
+  IndexInput in = dir.openInput("test", IOContext.DEFAULT);
+  assertEquals(43, in.readByte());
+  assertEquals(12345, in.readShort());
+  assertEquals(1234567890, in.readInt());
+  in.readGroupVInts(restored, 4);
+  assertArrayEquals(values, restored);
+  assertEquals(1234567890123456789L, in.readLong());
+  in.close();
+}
+  }
+
+  public void testGroupVInt() throws IOException {
+try (Directory dir = getDirectory(createTempDir("testGroupVInt"))) {
+  // test fallback to default implementation of readGroupVInt
+  doTestGroupVInt(dir, 5, 1, 6, 8);
+
+  // use more iterations to covers all bpv
+  doTestGroupVInt(dir, atLeast(100), 1, 31, 128);
+
+  // we use BaseChunkedDirectoryTestCase#testGroupVIntMultiBlocks cover 
multiple blocks for
+  // ByteBuffersDataInput and MMapDirectory
+}
+  }
+
+  protected void doTestGroupVInt(
+  Directory dir, int iterations, int minBpv, int maxBpv, int maxNumValues) 
throws IOException {
+long[] values = new long[maxNumValues];
+long[] restored = new long[maxNumValues];
+
+for (int i = 0; i < iterations; i++) {

Review Comment:
   I am a bit afraid that those tests kill my SSD at some point (also Jenkin's 
SSD). This creates files over and over. With test directories its not a 
problem, but with real FSDirectory ones we should reduce the iteration counts.
   
   The chunked test should clearly setup a directory with a random chunk size 
once and run all tests on it. Ideally write only one file with multiple 
variants of groupVints behind each other (so it definitely crosses borders) and 
then read all of them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420373977


##
lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java:
##
@@ -62,4 +62,42 @@ private static long readLongInGroup(DataInput in, int 
numBytesMinus1) throws IOE
 return in.readInt() & 0xL;
 }
   }
+
+  /**
+   * Provides an abstraction for read int values, so that decoding logic can 
be reused in different DataInput.
+   *
+   */
+  public static interface IntReader {

Review Comment:
   add `@FunctionalInterface`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420375351


##
lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java:
##
@@ -62,4 +62,42 @@ private static long readLongInGroup(DataInput in, int 
numBytesMinus1) throws IOE
 return in.readInt() & 0xL;
 }
   }
+
+  /**
+   * Provides an abstraction for read int values, so that decoding logic can 
be reused in different DataInput.
+   *
+   */
+  public static interface IntReader {
+int read(long v);
+  }
+
+  /**
+   * Faster implementation of read single group, It read values from the 
buffer that would not cross
+   * boundaries.
+   *
+   * @param flag the flag of group varint.
+   * @param reader the supplier of read int.
+   * @param dst the array to read ints into.

Review Comment:
   parameter "pos" is missing in javadocs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420377148


##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -324,24 +324,9 @@ private void readGroupVInt(long[] dst, int offset) throws 
IOException {
 
 try {
   final int flag = curSegment.get(LAYOUT_BYTE, curPosition++) & 0xFF;
-
-  final int n1Minus1 = flag >> 6;
-  final int n2Minus1 = (flag >> 4) & 0x03;
-  final int n3Minus1 = (flag >> 2) & 0x03;
-  final int n4Minus1 = flag & 0x03;
-
-  dst[offset] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n1Minus1];
-  curPosition += 1 + n1Minus1;
-  dst[offset + 1] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n2Minus1];
-  curPosition += 1 + n2Minus1;
-  dst[offset + 2] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n3Minus1];
-  curPosition += 1 + n3Minus1;
-  dst[offset + 3] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n4Minus1];
-  curPosition += 1 + n4Minus1;
+  curPosition +=
+  GroupVIntUtil.readGroupVInt(
+  flag, p -> curSegment.get(LAYOUT_LE_INT, p), curPosition, dst, 
offset);

Review Comment:
   isn't the +1 missing here like above?
   
   This is why I'd like to have a test that reads a vint group and other data 
from the same input placed behind each other!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420379683


##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -303,6 +304,34 @@ public byte readByte(long pos) throws IOException {
 }
   }
 
+  @Override
+  public void readGroupVInts(long[] dst, int limit) throws IOException {
+int i;
+for (i = 0; i <= limit - 4; i += 4) {
+  readGroupVInt(dst, i);
+}
+for (; i < limit; ++i) {
+  dst[i] = readVInt();
+}
+  }
+
+  private void readGroupVInt(long[] dst, int offset) throws IOException {
+final MemorySegment curSegment = this.curSegment;

Review Comment:
   This part MUST also be inside they try-catch, as it would throw NPE or 
IllegalStateException when the input was closed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420382794


##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -303,6 +304,48 @@ public byte readByte(long pos) throws IOException {
 }
   }
 
+  @Override
+  public void readGroupVInts(long[] dst, int limit) throws IOException {
+int i;
+for (i = 0; i <= limit - 4; i += 4) {
+  readGroupVInt(dst, i);
+}
+for (; i < limit; ++i) {
+  dst[i] = readVInt();
+}
+  }
+
+  private void readGroupVInt(long[] dst, int offset) throws IOException {
+if (curSegment.byteSize() - curPosition < 
GroupVIntUtil.MAX_LENGTH_PER_GROUP) {
+  GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset);
+  return;
+}
+

Review Comment:
   the bigger problem here is that it is not in the try-catch block. when the 
indexinput is closed this fails with NPE or IllegalStateException.
   
   Everything in this method must be in the try-catch. Look at othe rmethods, 
the NPE/ISE catch block always covers everything.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


easyice commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1847102122

   It looks good on `byteBuffers` and `MMapDirectory`, the benchmark result is 
pretty close to previous commit,
   but a bit slowdon on `NIOFSDirectory`, i will dig it.
   
   * `*ReadGroupVInt` the current commit
   * `*_viaDupCode` the previous commit, using duplicated code.
   * `*Baseline` no change.
   
   java17
   ```
   Benchmark   (size)   Mode  
Cnt  Score   Error   Units
   GroupVIntBenchmark.byteBuffersReadGroupVInt 64  thrpt
5  5.156 ± 0.378  ops/us
   GroupVIntBenchmark.byteBuffersReadGroupVIntBaseline 64  thrpt
5  2.130 ± 0.039  ops/us
   GroupVIntBenchmark.byteBuffersReadGroupVInt_viaDupCode  64  thrpt
5  5.455 ± 0.164  ops/us
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt 64  thrpt
5  7.632 ± 0.377  ops/us
   GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline 64  thrpt
5  7.376 ± 0.485  ops/us
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaDupCode  64  thrpt
5  7.408 ± 0.532  ops/us
   GroupVIntBenchmark.nioReadGroupVInt 64  thrpt
5  6.905 ± 0.589  ops/us
   GroupVIntBenchmark.nioReadGroupVIntBaseline 64  thrpt
5  5.574 ± 0.216  ops/us
   GroupVIntBenchmark.nioReadGroupVInt_viaDupCode  64  thrpt
5  9.014 ± 0.162  ops/us
   ```
   
   java21
   ```
   Benchmark   (size)   Mode  
Cnt   Score   Error   Units
   GroupVIntBenchmark.byteBuffersReadGroupVInt 64  thrpt
5   5.539 ± 0.220  ops/us
   GroupVIntBenchmark.byteBuffersReadGroupVIntBaseline 64  thrpt
5   1.825 ± 0.055  ops/us
   GroupVIntBenchmark.byteBuffersReadGroupVInt_viaDupCode  64  thrpt
5   5.543 ± 0.097  ops/us
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt 64  thrpt
5   9.428 ± 0.189  ops/us
   GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline 64  thrpt
5   5.565 ± 0.080  ops/us
   GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaDupCode  64  thrpt
5  10.476 ± 1.343  ops/us
   GroupVIntBenchmark.nioReadGroupVInt 64  thrpt
5   8.434 ± 1.887  ops/us
   GroupVIntBenchmark.nioReadGroupVIntBaseline 64  thrpt
5   5.431 ± 0.072  ops/us
   GroupVIntBenchmark.nioReadGroupVInt_viaDupCode  64  thrpt
5  10.723 ± 0.399  ops/us
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Removing @lucene.experimental tags in testXXX methods in CheckIndex [lucene]

2023-12-08 Thread via GitHub


slow-J opened a new pull request, #12893:
URL: https://github.com/apache/lucene/pull/12893

   Following up on @mikemccand's comment in previous CheckIndex 
PR:https://github.com/apache/lucene/pull/12876.
   
   >  I do think some of these tags in CheckIndex.java could be removed, e.g. 
on each of the testXXX methods? But likely we should leave it on each of the 
XXXStatus returned classes, so we are free to break/change what metadata these 
classes return? And let's do it separately -- this PR looks like a great 
cleanup already. Progress not perfection!
   
   Targetting Lucene 9.10.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420387856


##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -303,6 +304,34 @@ public byte readByte(long pos) throws IOException {
 }
   }
 
+  @Override
+  public void readGroupVInts(long[] dst, int limit) throws IOException {
+int i;
+for (i = 0; i <= limit - 4; i += 4) {
+  readGroupVInt(dst, i);
+}
+for (; i < limit; ++i) {
+  dst[i] = readVInt();
+}
+  }
+
+  private void readGroupVInt(long[] dst, int offset) throws IOException {
+final MemorySegment curSegment = this.curSegment;

Review Comment:
   please add a test to the exitsing closed tests, reading readGroupVInt from 
an already closed IndexInput



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


easyice commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420418454


##
lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataInput;
+
+/**
+ * This class contains utility methods and constants for group varint
+ *
+ * @lucene.internal
+ */
+public final class GroupVIntUtil {
+  // the maximum length of a single group-varint is 4 integers + 1 byte flag.
+  public static final int MAX_LENGTH_PER_GROUP = 17;
+  private static final int[] MASKS = new int[] {0xFF, 0x, 0xFF, 
0x};
+
+  /**
+   * Default implementation of read single group, for optimal performance, you 
should use {@link
+   * DataInput#readGroupVInts(long[], int)} instead.
+   *
+   * @param dst the array to read ints into.
+   * @param offset the offset in the array to start storing ints.
+   */
+  public static void readGroupVInt(DataInput in, long[] dst, int offset) 
throws IOException {
+final int flag = in.readByte() & 0xFF;
+
+final int n1Minus1 = flag >> 6;
+final int n2Minus1 = (flag >> 4) & 0x03;
+final int n3Minus1 = (flag >> 2) & 0x03;
+final int n4Minus1 = flag & 0x03;
+
+dst[offset] = readLongInGroup(in, n1Minus1);
+dst[offset + 1] = readLongInGroup(in, n2Minus1);
+dst[offset + 2] = readLongInGroup(in, n3Minus1);
+dst[offset + 3] = readLongInGroup(in, n4Minus1);
+  }
+
+  private static long readLongInGroup(DataInput in, int numBytesMinus1) throws 
IOException {
+switch (numBytesMinus1) {
+  case 0:
+return in.readByte() & 0xFFL;
+  case 1:
+return in.readShort() & 0xL;
+  case 2:
+return (in.readShort() & 0xL) | ((in.readByte() & 0xFFL) << 16);
+  default:
+return in.readInt() & 0xL;
+}
+  }
+
+  /**
+   * Provides an abstraction for read int values, so that decoding logic can 
be reused in different
+   * DataInput.
+   */
+  public static interface IntReader {
+int read(long v);
+  }
+
+  /**
+   * Faster implementation of read single group, It read values from the 
buffer that would not cross
+   * boundaries.
+   *
+   * @param flag the flag of group varint.
+   * @param reader the supplier of read int.
+   * @param dst the array to read ints into.
+   * @param offset the offset in the array to start storing ints.
+   * @return the number of bytes read, it is a positive number and less than 
{@link
+   * #MAX_LENGTH_PER_GROUP}
+   */
+  public static int readGroupVInt(int flag, IntReader reader, long pos, long[] 
dst, int offset)
+  throws IOException {

Review Comment:
   @uschindler should we also throws `NullPointerException`, 
`IllegalStateException` exception that maybe happen in 
`MemorySegmentIndexInput`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420447123


##
lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataInput;
+
+/**
+ * This class contains utility methods and constants for group varint
+ *
+ * @lucene.internal
+ */
+public final class GroupVIntUtil {
+  // the maximum length of a single group-varint is 4 integers + 1 byte flag.
+  public static final int MAX_LENGTH_PER_GROUP = 17;
+  private static final int[] MASKS = new int[] {0xFF, 0x, 0xFF, 
0x};
+
+  /**
+   * Default implementation of read single group, for optimal performance, you 
should use {@link
+   * DataInput#readGroupVInts(long[], int)} instead.
+   *
+   * @param dst the array to read ints into.
+   * @param offset the offset in the array to start storing ints.
+   */
+  public static void readGroupVInt(DataInput in, long[] dst, int offset) 
throws IOException {
+final int flag = in.readByte() & 0xFF;
+
+final int n1Minus1 = flag >> 6;
+final int n2Minus1 = (flag >> 4) & 0x03;
+final int n3Minus1 = (flag >> 2) & 0x03;
+final int n4Minus1 = flag & 0x03;
+
+dst[offset] = readLongInGroup(in, n1Minus1);
+dst[offset + 1] = readLongInGroup(in, n2Minus1);
+dst[offset + 2] = readLongInGroup(in, n3Minus1);
+dst[offset + 3] = readLongInGroup(in, n4Minus1);
+  }
+
+  private static long readLongInGroup(DataInput in, int numBytesMinus1) throws 
IOException {
+switch (numBytesMinus1) {
+  case 0:
+return in.readByte() & 0xFFL;
+  case 1:
+return in.readShort() & 0xL;
+  case 2:
+return (in.readShort() & 0xL) | ((in.readByte() & 0xFFL) << 16);
+  default:
+return in.readInt() & 0xL;
+}
+  }
+
+  /**
+   * Provides an abstraction for read int values, so that decoding logic can 
be reused in different
+   * DataInput.
+   */
+  public static interface IntReader {
+int read(long v);
+  }
+
+  /**
+   * Faster implementation of read single group, It read values from the 
buffer that would not cross
+   * boundaries.
+   *
+   * @param flag the flag of group varint.
+   * @param reader the supplier of read int.
+   * @param dst the array to read ints into.
+   * @param offset the offset in the array to start storing ints.
+   * @return the number of bytes read, it is a positive number and less than 
{@link
+   * #MAX_LENGTH_PER_GROUP}
+   */
+  public static int readGroupVInt(int flag, IntReader reader, long pos, long[] 
dst, int offset)
+  throws IOException {

Review Comment:
   No. Why? It is an unchecked exception.
   
   The NPE and ISE are mmap special case (also for the legacy mmap impl). 
Please only cover the whole method in memorysegment's input by the try-catch. 
If anywhere while the method is executed we get an NPE the input is closed. No 
need to do anything outside.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420451580


##
lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataInput;
+
+/**
+ * This class contains utility methods and constants for group varint
+ *
+ * @lucene.internal
+ */
+public final class GroupVIntUtil {
+  // the maximum length of a single group-varint is 4 integers + 1 byte flag.
+  public static final int MAX_LENGTH_PER_GROUP = 17;
+  private static final int[] MASKS = new int[] {0xFF, 0x, 0xFF, 
0x};
+
+  /**
+   * Default implementation of read single group, for optimal performance, you 
should use {@link
+   * DataInput#readGroupVInts(long[], int)} instead.
+   *
+   * @param dst the array to read ints into.
+   * @param offset the offset in the array to start storing ints.
+   */
+  public static void readGroupVInt(DataInput in, long[] dst, int offset) 
throws IOException {
+final int flag = in.readByte() & 0xFF;
+
+final int n1Minus1 = flag >> 6;
+final int n2Minus1 = (flag >> 4) & 0x03;
+final int n3Minus1 = (flag >> 2) & 0x03;
+final int n4Minus1 = flag & 0x03;
+
+dst[offset] = readLongInGroup(in, n1Minus1);
+dst[offset + 1] = readLongInGroup(in, n2Minus1);
+dst[offset + 2] = readLongInGroup(in, n3Minus1);
+dst[offset + 3] = readLongInGroup(in, n4Minus1);
+  }
+
+  private static long readLongInGroup(DataInput in, int numBytesMinus1) throws 
IOException {
+switch (numBytesMinus1) {
+  case 0:
+return in.readByte() & 0xFFL;
+  case 1:
+return in.readShort() & 0xL;
+  case 2:
+return (in.readShort() & 0xL) | ((in.readByte() & 0xFFL) << 16);
+  default:
+return in.readInt() & 0xL;
+}
+  }
+
+  /**
+   * Provides an abstraction for read int values, so that decoding logic can 
be reused in different
+   * DataInput.
+   */
+  public static interface IntReader {
+int read(long v);
+  }
+
+  /**
+   * Faster implementation of read single group, It read values from the 
buffer that would not cross
+   * boundaries.
+   *
+   * @param flag the flag of group varint.
+   * @param reader the supplier of read int.
+   * @param dst the array to read ints into.
+   * @param offset the offset in the array to start storing ints.
+   * @return the number of bytes read, it is a positive number and less than 
{@link
+   * #MAX_LENGTH_PER_GROUP}
+   */
+  public static int readGroupVInt(int flag, IntReader reader, long pos, long[] 
dst, int offset)
+  throws IOException {

Review Comment:
   All public methods in MemorySegmentIndexInput need the try catch. Everything 
private not.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420450756


##
lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataInput;
+
+/**
+ * This class contains utility methods and constants for group varint
+ *
+ * @lucene.internal
+ */
+public final class GroupVIntUtil {
+  // the maximum length of a single group-varint is 4 integers + 1 byte flag.
+  public static final int MAX_LENGTH_PER_GROUP = 17;
+  private static final int[] MASKS = new int[] {0xFF, 0x, 0xFF, 
0x};
+
+  /**
+   * Default implementation of read single group, for optimal performance, you 
should use {@link
+   * DataInput#readGroupVInts(long[], int)} instead.
+   *
+   * @param dst the array to read ints into.
+   * @param offset the offset in the array to start storing ints.
+   */
+  public static void readGroupVInt(DataInput in, long[] dst, int offset) 
throws IOException {
+final int flag = in.readByte() & 0xFF;
+
+final int n1Minus1 = flag >> 6;
+final int n2Minus1 = (flag >> 4) & 0x03;
+final int n3Minus1 = (flag >> 2) & 0x03;
+final int n4Minus1 = flag & 0x03;
+
+dst[offset] = readLongInGroup(in, n1Minus1);
+dst[offset + 1] = readLongInGroup(in, n2Minus1);
+dst[offset + 2] = readLongInGroup(in, n3Minus1);
+dst[offset + 3] = readLongInGroup(in, n4Minus1);
+  }
+
+  private static long readLongInGroup(DataInput in, int numBytesMinus1) throws 
IOException {
+switch (numBytesMinus1) {
+  case 0:
+return in.readByte() & 0xFFL;
+  case 1:
+return in.readShort() & 0xL;
+  case 2:
+return (in.readShort() & 0xL) | ((in.readByte() & 0xFFL) << 16);
+  default:
+return in.readInt() & 0xL;
+}
+  }
+
+  /**
+   * Provides an abstraction for read int values, so that decoding logic can 
be reused in different
+   * DataInput.
+   */
+  public static interface IntReader {
+int read(long v);
+  }
+
+  /**
+   * Faster implementation of read single group, It read values from the 
buffer that would not cross
+   * boundaries.
+   *
+   * @param flag the flag of group varint.
+   * @param reader the supplier of read int.
+   * @param dst the array to read ints into.
+   * @param offset the offset in the array to start storing ints.
+   * @return the number of bytes read, it is a positive number and less than 
{@link
+   * #MAX_LENGTH_PER_GROUP}
+   */
+  public static int readGroupVInt(int flag, IntReader reader, long pos, long[] 
dst, int offset)
+  throws IOException {

Review Comment:
   See my other comment. Just extend the NPE/ISE try on the whole read method. 
Like everywhere else. Just look at the other methods, they are all identically 
structured. There is a reason why it is done like that. Don't change the 
pattern.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix position increment in (Reverse)PathHierarchyTokenizer [lucene]

2023-12-08 Thread via GitHub


lukas-vlcek commented on PR #12875:
URL: https://github.com/apache/lucene/pull/12875#issuecomment-1847241112

   @mikemccand Do you think you can give me some hint about?
   
   > (e.g. `UnifiedHighlighter`, in certain modes)
   
   I am looking at `TestUnifiedHighlighter*` tests. Does it mean that I need to 
use specific fieldType? Can I use any fieldType(s) from existing 
`UHTestHelper.parametersFactoryList()`? Such as `postingsType` and 
`postingsWithTvType`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub


gsmiller commented on code in PR #12812:
URL: https://github.com/apache/lucene/pull/12812#discussion_r1420685472


##
lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java:
##
@@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool {
   super(allocator);
 }
 
+/**
+ * For slices, buffers must be filled with zeros, so that we can find a 
slice's end based on a
+ * non-zero final value.
+ */
+private static boolean assertSliceBuffer(int[] buffer) {
+  for (int value : buffer) {

Review Comment:
   OK just a guess, but I'm wondering if it was written this way to avoid 
conditional checks and branching in the CPU? It looks inefficient to us humans, 
but to a CPU, it might be more efficient to use the older style implementation 
that is branchless?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub


stefanvodita commented on code in PR #12812:
URL: https://github.com/apache/lucene/pull/12812#discussion_r1420699163


##
lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java:
##
@@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool {
   super(allocator);
 }
 
+/**
+ * For slices, buffers must be filled with zeros, so that we can find a 
slice's end based on a
+ * non-zero final value.
+ */
+private static boolean assertSliceBuffer(int[] buffer) {
+  for (int value : buffer) {

Review Comment:
   I think that’s possible, but I have some counterarguments:
   1. Branch prediction will kick in. We should never take the branch, so 
mispredictions will be rare and they will happen when we want to stop execution 
anyway.
   2. We would not enable assertions in performance-critical situations, so 
this method would not run at all.
   3. The previous solution has other problems. The count can overflow. It can 
also be zero at the end if it encounters negative and positive values along the 
way that happened to add up to zero.



##
lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java:
##
@@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool {
   super(allocator);
 }
 
+/**
+ * For slices, buffers must be filled with zeros, so that we can find a 
slice's end based on a
+ * non-zero final value.
+ */
+private static boolean assertSliceBuffer(int[] buffer) {
+  for (int value : buffer) {

Review Comment:
   I think that’s possible, but I have some counterarguments:
   1. Branch prediction will kick in. We should never take the branch, so 
mispredictions will be rare and they will happen when we want to stop execution 
anyway.
   2. We would not enable assertions in performance-critical situations, so 
this method would not run at all.
   3. The previous solution has other problems. The count can overflow. It can 
also be zero at the end if it encounters negative and positive values along the 
way that happened to add up to zero.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub


gsmiller commented on code in PR #12812:
URL: https://github.com/apache/lucene/pull/12812#discussion_r1420707913


##
lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java:
##
@@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool {
   super(allocator);
 }
 
+/**
+ * For slices, buffers must be filled with zeros, so that we can find a 
slice's end based on a
+ * non-zero final value.
+ */
+private static boolean assertSliceBuffer(int[] buffer) {
+  for (int value : buffer) {

Review Comment:
   Yeah, fair points. Was only pointing out why it may have been written the 
way it was. I'm honestly not concerned with how we write this check if it's 
only being used in asserts. I got thrown off since your PR leads with: "This PR 
makes it so we exit faster if we find a non-zero value", and was just pointing 
out that this may not be true. If this is only for assertions, then I don't 
think we really need to "optimize" it :). Avoiding overflow is good. You can do 
it this way, or with a disjunctive accumulator if you want to keep it 
branchless. But again, if only for assertions, whatever is simplest is fine 
with me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub


stefanvodita commented on code in PR #12812:
URL: https://github.com/apache/lucene/pull/12812#discussion_r1420723711


##
lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java:
##
@@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool {
   super(allocator);
 }
 
+/**
+ * For slices, buffers must be filled with zeros, so that we can find a 
slice's end based on a
+ * non-zero final value.
+ */
+private static boolean assertSliceBuffer(int[] buffer) {
+  for (int value : buffer) {

Review Comment:
   Bad choice of words on my part to say "exit faster". Maybe I should have 
said "fail faster" or "short-circuit the check". I guess I don't have a strong 
preference either as long as we are addressing the correctness issues.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [WIP] LUCENE-10002: Deprecate FacetsCollector#search helper methods as they internally use IndexSearcher#search(Query, Collector) API [lucene]

2023-12-08 Thread via GitHub


gsmiller commented on PR #12890:
URL: https://github.com/apache/lucene/pull/12890#issuecomment-1847461557

   IMO we should deprecate these without replacement. I agree that users should 
be able to implement this logic pretty easily in their application layer, and 
would probably be better suited to do so. For example, if a user doesn't want 
to collect actual hits and just do faceting, these methods are a little trappy 
since they internally decide to also do a full (accurate) count of hits. 
They're baking in some assumptions like this that users may not expect.
   
   Just my opinion of course. I don't feel all that strongly about it, and if 
we want some convenience functionality somewhere for this, that's fine. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-12-08 Thread via GitHub


benwtrent commented on code in PR #12699:
URL: https://github.com/apache/lucene/pull/12699#discussion_r1420733099


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -1190,4 +1176,63 @@ public void seekExact(long ord) {
   public long ord() {
 throw new UnsupportedOperationException();
   }
+
+  static class OutputAccumulator extends DataInput {
+
+BytesRef[] outputs = new BytesRef[16];
+BytesRef current;
+int num;
+int outputIndex;
+int index;
+
+void push(BytesRef output) {
+  if (output != Lucene90BlockTreeTermsReader.NO_OUTPUT) {

Review Comment:
   While we have strict contracts, I can see that 
`BytesSequenceOutputs#add(BytesRef, BytesRef)` has assertions that the length 
is > 0. 
   
   Here in `OutputAccumulator` `readByte()` makes a big assumption that a 
`BytesRef` has at least length 1. If it had a length of 0, we would read past 
the ref end and read bytes sitting in a `byte[]` that we shouldn't.
   
   IMO `OutputAccumulator` needs to be way more cautious than 
`BytesSequenceOutputs#add(BytesRef, BytesRef)` because `OutputAccumulator` 
isn't making copies and is relying on the underlying byte arrays not changing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-12-08 Thread via GitHub


benwtrent commented on code in PR #12699:
URL: https://github.com/apache/lucene/pull/12699#discussion_r1420733099


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java:
##
@@ -1190,4 +1176,63 @@ public void seekExact(long ord) {
   public long ord() {
 throw new UnsupportedOperationException();
   }
+
+  static class OutputAccumulator extends DataInput {
+
+BytesRef[] outputs = new BytesRef[16];
+BytesRef current;
+int num;
+int outputIndex;
+int index;
+
+void push(BytesRef output) {
+  if (output != Lucene90BlockTreeTermsReader.NO_OUTPUT) {

Review Comment:
   While we have strict contracts, I can see that 
`BytesSequenceOutputs#add(BytesRef, BytesRef)` has assertions that the length 
is > 0. 
   
   Here in `OutputAccumulator` `readByte()` makes a big assumption that a 
`BytesRef` has at least length 1. If it had a length of 0, we would read past 
the ref end and read bytes sitting in a `byte[]` that we shouldn't.
   
   IMO `OutputAccumulator` needs to be way more cautious than 
``BytesSequenceOutputs#add(BytesRef, BytesRef)` because `OutputAccumulator` 
isn't making copies and is relying on the underlying byte arrays not changing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


easyice commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420565417


##
lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseDirectoryTestCase.java:
##
@@ -1438,4 +1440,68 @@ public void testListAllIsSorted() throws IOException {
   assertArrayEquals(expected, actual);
 }
   }
+
+  public void testDataTypes() throws IOException {
+final long[] values = new long[] {43, 12345, 123456, 1234567890};
+try (Directory dir = getDirectory(createTempDir("testDataTypes"))) {
+  IndexOutput out = dir.createOutput("test", IOContext.DEFAULT);
+  out.writeByte((byte) 43);
+  out.writeShort((short) 12345);
+  out.writeInt(1234567890);
+  out.writeGroupVInts(values, 4);
+  out.writeLong(1234567890123456789L);
+  out.close();
+
+  long[] restored = new long[4];
+  IndexInput in = dir.openInput("test", IOContext.DEFAULT);
+  assertEquals(43, in.readByte());
+  assertEquals(12345, in.readShort());
+  assertEquals(1234567890, in.readInt());
+  in.readGroupVInts(restored, 4);
+  assertArrayEquals(values, restored);
+  assertEquals(1234567890123456789L, in.readLong());
+  in.close();
+}
+  }
+
+  public void testGroupVInt() throws IOException {
+try (Directory dir = getDirectory(createTempDir("testGroupVInt"))) {
+  // test fallback to default implementation of readGroupVInt
+  doTestGroupVInt(dir, 5, 1, 6, 8);
+
+  // use more iterations to covers all bpv
+  doTestGroupVInt(dir, atLeast(100), 1, 31, 128);
+
+  // we use BaseChunkedDirectoryTestCase#testGroupVIntMultiBlocks cover 
multiple blocks for
+  // ByteBuffersDataInput and MMapDirectory
+}
+  }
+
+  protected void doTestGroupVInt(
+  Directory dir, int iterations, int minBpv, int maxBpv, int maxNumValues) 
throws IOException {
+long[] values = new long[maxNumValues];
+long[] restored = new long[maxNumValues];
+
+for (int i = 0; i < iterations; i++) {

Review Comment:
   I changed the all iterations write to a single file, then  do `writeVInt` to 
an other file as expected.



##
lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java:
##
@@ -62,4 +62,42 @@ private static long readLongInGroup(DataInput in, int 
numBytesMinus1) throws IOE
 return in.readInt() & 0xL;
 }
   }
+
+  /**
+   * Provides an abstraction for read int values, so that decoding logic can 
be reused in different DataInput.
+   *
+   */
+  public static interface IntReader {
+int read(long v);
+  }
+
+  /**
+   * Faster implementation of read single group, It read values from the 
buffer that would not cross
+   * boundaries.
+   *
+   * @param flag the flag of group varint.
+   * @param reader the supplier of read int.
+   * @param dst the array to read ints into.

Review Comment:
   sorry for the missing!



##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -324,24 +324,9 @@ private void readGroupVInt(long[] dst, int offset) throws 
IOException {
 
 try {
   final int flag = curSegment.get(LAYOUT_BYTE, curPosition++) & 0xFF;
-
-  final int n1Minus1 = flag >> 6;
-  final int n2Minus1 = (flag >> 4) & 0x03;
-  final int n3Minus1 = (flag >> 2) & 0x03;
-  final int n4Minus1 = flag & 0x03;
-
-  dst[offset] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n1Minus1];
-  curPosition += 1 + n1Minus1;
-  dst[offset + 1] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n2Minus1];
-  curPosition += 1 + n2Minus1;
-  dst[offset + 2] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n3Minus1];
-  curPosition += 1 + n3Minus1;
-  dst[offset + 3] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n4Minus1];
-  curPosition += 1 + n4Minus1;
+  curPosition +=
+  GroupVIntUtil.readGroupVInt(
+  flag, p -> curSegment.get(LAYOUT_LE_INT, p), curPosition, dst, 
offset);

Review Comment:
   the `curPosition` had +1 when read flag, so it don't need +1
   
   > This is why I'd like to have a test that reads a vint group and other data 
from the same input placed behind each other!
   
   In `BaseDirectoryTestCase#testDataTypes` we did the similar test, is that 
enough?



##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -303,6 +304,34 @@ public byte readByte(long pos) throws IOException {
 }
   }
 
+  @Override
+  public void readGroupVInts(long[] dst, int limit) throws IOException {
+int i;
+for (i = 0; i <= limit - 4; i += 4) {
+  readGroupVInt(dst, i);
+}
+for (; i < limit; ++i) {
+  dst[i] = readVInt();
+}
+  }
+
+  private void readGroupVInt(long[] dst, int offset) throws IOException {
+final MemorySeg

Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub


gsmiller commented on code in PR #12812:
URL: https://github.com/apache/lucene/pull/12812#discussion_r1420739649


##
lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java:
##
@@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool {
   super(allocator);
 }
 
+/**
+ * For slices, buffers must be filled with zeros, so that we can find a 
slice's end based on a
+ * non-zero final value.
+ */
+private static boolean assertSliceBuffer(int[] buffer) {
+  for (int value : buffer) {

Review Comment:
   Since it's only used in asserts, I think readability is probably 
preferential to performance? The way you have it here in this change would be 
the most readable IMO, and probably makes the most sense. Let's keep the change 
the way you have it but maybe just update the CHANGES entry to highlight that 
you're making the assertion defensive against overflow and negative values? 
That might leave a better "digital archeology" trail for the future so someone 
doesn't make the same mistake I did, read this change as an optimization and 
then go off wondering if a branchless implementation wouldn't make more sense? 
:)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]

2023-12-08 Thread via GitHub


epotyom commented on code in PR #12862:
URL: https://github.com/apache/lucene/pull/12862#discussion_r1420763238


##
lucene/CHANGES.txt:
##
@@ -67,6 +67,8 @@ API Changes
 
 * GITHUB#11023: Adding -level param to CheckIndex, making the old -fast param 
the default behaviour. (Jakub Slowinski)
 
+* GITHUB#12180: Add Facets#getBulkSpecificValues method to get specific values 
for multiple FacetLabel at once. (Egor Potemkin)

Review Comment:
   Done! I'm never sure if I'm supposed to use plurals for class names in 
sentences like this...



##
lucene/facet/src/java/org/apache/lucene/facet/Facets.java:
##
@@ -44,11 +50,17 @@ public abstract FacetResult getTopChildren(int topN, String 
dim, String... path)
   throws IOException;
 
   /**
-   * Return the count or value for a specific path. Returns -1 if this path 
doesn't exist, else the
-   * count.
+   * Return the count or value for a specific path. Returns -1 {@link 
#MISSING_SPECIFIC_VALUE} if

Review Comment:
   You are right, fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]

2023-12-08 Thread via GitHub


epotyom commented on code in PR #12862:
URL: https://github.com/apache/lucene/pull/12862#discussion_r1420763643


##
lucene/facet/src/java/org/apache/lucene/facet/LongValueFacetCounts.java:
##
@@ -568,6 +568,12 @@ public Number getSpecificValue(String dim, String... path) 
{
 throw new UnsupportedOperationException();
   }
 
+  @Override
+  public Number[] getBulkSpecificValues(FacetLabel[] facetLabels) {
+// TODO: should we impl this?

Review Comment:
   Yes, but this class doesn't implement `getSpecificValue`. We can add it as a 
default implementation maybe, to `Facets` class? But then there will be less 
incentive to write faster version of `getBulkSpecificValues` for future 
`Facets` subclasses.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]

2023-12-08 Thread via GitHub


epotyom commented on code in PR #12862:
URL: https://github.com/apache/lucene/pull/12862#discussion_r1420764008


##
lucene/facet/src/java/org/apache/lucene/facet/MultiFacets.java:
##
@@ -77,6 +80,39 @@ public Number getSpecificValue(String dim, String... path) 
throws IOException {
 return facets.getSpecificValue(dim, path);
   }
 
+  @Override
+  public Number[] getBulkSpecificValues(FacetLabel[] facetLabels) throws 
IOException {
+if (facetLabels.length == 0) {
+  return new Number[0];
+}
+// Split facetLabels into chunks by the Facets they belog to
+Map labelsPerFacet = new HashMap<>();
+for (int i = 0; i < facetLabels.length; i++) {
+  String dim = facetLabels[i].components[0];
+  Facets facets = dimToFacets.get(dim);

Review Comment:
   The idea was to rely on defaults which make each object unique. But I just 
found that there is `IdentityHashMap` class that always uses object ID even if 
equals/hashCode are overriden, which I think is safer to use in this case. 
Thanks for heads up!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]

2023-12-08 Thread via GitHub


epotyom commented on code in PR #12862:
URL: https://github.com/apache/lucene/pull/12862#discussion_r1420764333


##
lucene/facet/src/java/org/apache/lucene/facet/MultiFacets.java:
##
@@ -77,6 +80,39 @@ public Number getSpecificValue(String dim, String... path) 
throws IOException {
 return facets.getSpecificValue(dim, path);
   }
 
+  @Override
+  public Number[] getBulkSpecificValues(FacetLabel[] facetLabels) throws 
IOException {
+if (facetLabels.length == 0) {
+  return new Number[0];
+}
+// Split facetLabels into chunks by the Facets they belog to
+Map labelsPerFacet = new HashMap<>();
+for (int i = 0; i < facetLabels.length; i++) {
+  String dim = facetLabels[i].components[0];
+  Facets facets = dimToFacets.get(dim);
+  if (facets == null) {
+if (defaultFacets == null) {
+  throw new IllegalArgumentException("invalid dim \"" + dim + "\"");
+}
+facets = defaultFacets;
+  }
+  labelsPerFacet.computeIfAbsent(facets, f -> new IntArrayList()).add(i);
+}
+// Call getBulkSpecificValues for each chunk and reconcile results
+Number[] result = new Number[facetLabels.length];
+for (Map.Entry fl : labelsPerFacet.entrySet()) {
+  FacetLabel[] labelsForFacet = new FacetLabel[fl.getValue().size()];
+  for (IntCursor ordI : fl.getValue()) {
+labelsForFacet[ordI.index] = facetLabels[ordI.value];
+  }
+  Number[] ords = fl.getKey().getBulkSpecificValues(labelsForFacet);

Review Comment:
   It is because there are Facet implementations for integer and float values 
that share the same base class `Facets`.
   
   Oh, `ords` is a bad name for this variable - renamed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420760926


##
lucene/core/src/test/org/apache/lucene/store/TestMMapDirectory.java:
##
@@ -114,4 +115,31 @@ public void testNullParamsIndexInput() throws Exception {
   }
 }
   }
+
+  private void doTestReadingClosedInput(Directory dir, long[] values, int 
numValues)

Review Comment:
   You don't need to test all variants and different sizes here. Just create an 
empty file with indexoutput. open it and close it, then call `readGroupVInts`. 
As this should only trigger the AlreadyClosedException theres no need to repeat 
this multiple times.
   These are IO-heavy tests, so only do the required stuff and don't scrub my 
SSD. It also slows down those tests.
   
   Just add it here and try to call that method on "two": 
   - 
https://github.com/apache/lucene/blob/08d24dd9cb10b0808264b7c44e5b620d7d1a398e/lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseChunkedDirectoryTestCase.java#L70-L74
   - 
https://github.com/apache/lucene/blob/08d24dd9cb10b0808264b7c44e5b620d7d1a398e/lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseChunkedDirectoryTestCase.java#L91-L95
   
   Anyways, I don't think we need this test in reality. Otherwise we would need 
a call of all public methods.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub


stefanvodita commented on code in PR #12812:
URL: https://github.com/apache/lucene/pull/12812#discussion_r1420778257


##
lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java:
##
@@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool {
   super(allocator);
 }
 
+/**
+ * For slices, buffers must be filled with zeros, so that we can find a 
slice's end based on a
+ * non-zero final value.
+ */
+private static boolean assertSliceBuffer(int[] buffer) {
+  for (int value : buffer) {

Review Comment:
   Makes sense, I modified the CHANGES entry. Sorry about the confusion!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]

2023-12-08 Thread via GitHub


epotyom commented on code in PR #12862:
URL: https://github.com/apache/lucene/pull/12862#discussion_r1420792687


##
lucene/facet/src/java/org/apache/lucene/facet/MultiFacets.java:
##
@@ -77,6 +80,39 @@ public Number getSpecificValue(String dim, String... path) 
throws IOException {
 return facets.getSpecificValue(dim, path);
   }
 
+  @Override
+  public Number[] getBulkSpecificValues(FacetLabel[] facetLabels) throws 
IOException {
+if (facetLabels.length == 0) {
+  return new Number[0];
+}
+// Split facetLabels into chunks by the Facets they belog to
+Map labelsPerFacet = new HashMap<>();
+for (int i = 0; i < facetLabels.length; i++) {
+  String dim = facetLabels[i].components[0];
+  Facets facets = dimToFacets.get(dim);
+  if (facets == null) {
+if (defaultFacets == null) {
+  throw new IllegalArgumentException("invalid dim \"" + dim + "\"");
+}
+facets = defaultFacets;
+  }
+  labelsPerFacet.computeIfAbsent(facets, f -> new IntArrayList()).add(i);
+}
+// Call getBulkSpecificValues for each chunk and reconcile results
+Number[] result = new Number[facetLabels.length];
+for (Map.Entry fl : labelsPerFacet.entrySet()) {
+  FacetLabel[] labelsForFacet = new FacetLabel[fl.getValue().size()];
+  for (IntCursor ordI : fl.getValue()) {
+labelsForFacet[ordI.index] = facetLabels[ordI.value];
+  }
+  Number[] ords = fl.getKey().getBulkSpecificValues(labelsForFacet);

Review Comment:
   Hmm, re-read your message. I don't think we can change `getSpecificValue` 
behavior as users might not check for `null`. But we can choose to implement 
`null` for the new method. I think it's more intuitive to use `null`, but It 
could be a little weird that 2 sibling methods behave differently?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420793976


##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -324,24 +324,9 @@ private void readGroupVInt(long[] dst, int offset) throws 
IOException {
 
 try {
   final int flag = curSegment.get(LAYOUT_BYTE, curPosition++) & 0xFF;
-
-  final int n1Minus1 = flag >> 6;
-  final int n2Minus1 = (flag >> 4) & 0x03;
-  final int n3Minus1 = (flag >> 2) & 0x03;
-  final int n4Minus1 = flag & 0x03;
-
-  dst[offset] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n1Minus1];
-  curPosition += 1 + n1Minus1;
-  dst[offset + 1] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n2Minus1];
-  curPosition += 1 + n2Minus1;
-  dst[offset + 2] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n3Minus1];
-  curPosition += 1 + n3Minus1;
-  dst[offset + 3] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n4Minus1];
-  curPosition += 1 + n4Minus1;
+  curPosition +=
+  GroupVIntUtil.readGroupVInt(
+  flag, p -> curSegment.get(LAYOUT_LE_INT, p), curPosition, dst, 
offset);

Review Comment:
   Are you sure. It just reads the flag but does not increment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420793976


##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -324,24 +324,9 @@ private void readGroupVInt(long[] dst, int offset) throws 
IOException {
 
 try {
   final int flag = curSegment.get(LAYOUT_BYTE, curPosition++) & 0xFF;
-
-  final int n1Minus1 = flag >> 6;
-  final int n2Minus1 = (flag >> 4) & 0x03;
-  final int n3Minus1 = (flag >> 2) & 0x03;
-  final int n4Minus1 = flag & 0x03;
-
-  dst[offset] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n1Minus1];
-  curPosition += 1 + n1Minus1;
-  dst[offset + 1] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n2Minus1];
-  curPosition += 1 + n2Minus1;
-  dst[offset + 2] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n3Minus1];
-  curPosition += 1 + n3Minus1;
-  dst[offset + 3] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n4Minus1];
-  curPosition += 1 + n4Minus1;
+  curPosition +=
+  GroupVIntUtil.readGroupVInt(
+  flag, p -> curSegment.get(LAYOUT_LE_INT, p), curPosition, dst, 
offset);

Review Comment:
   Are you sure? It just reads the flag but does not increment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420760926


##
lucene/core/src/test/org/apache/lucene/store/TestMMapDirectory.java:
##
@@ -114,4 +115,31 @@ public void testNullParamsIndexInput() throws Exception {
   }
 }
   }
+
+  private void doTestReadingClosedInput(Directory dir, long[] values, int 
numValues)

Review Comment:
   You don't need to test all variants and different sizes here. Just create a 
file with random data using indexoutput. open it and close it, then call 
`readGroupVInts`. As this should only trigger the AlreadyClosedException theres 
no need to repeat this multiple times.
   These are IO-heavy tests, so only do the required stuff and don't scrub my 
SSD. It also slows down those tests.
   
   Just add it here and try to call that method on "two": 
   - 
https://github.com/apache/lucene/blob/08d24dd9cb10b0808264b7c44e5b620d7d1a398e/lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseChunkedDirectoryTestCase.java#L70-L74
   - 
https://github.com/apache/lucene/blob/08d24dd9cb10b0808264b7c44e5b620d7d1a398e/lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseChunkedDirectoryTestCase.java#L91-L95
   
   Anyways, I don't think we need this test in reality. Otherwise we would need 
a call of all public methods.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]

2023-12-08 Thread via GitHub


epotyom commented on PR #12862:
URL: https://github.com/apache/lucene/pull/12862#issuecomment-1847540870

   Thank you fore reviewing @mikemccand ! Resolved your comments in 2nd commit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Fix NPE on off-heap test and FST is null [lucene]

2023-12-08 Thread via GitHub


dungba88 opened a new pull request, #12894:
URL: https://github.com/apache/lucene/pull/12894

   ### Description
   
   The test can throw a NPE when it's using off-heap mode and no nodes are 
accepted


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


easyice commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420812950


##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -324,24 +324,9 @@ private void readGroupVInt(long[] dst, int offset) throws 
IOException {
 
 try {
   final int flag = curSegment.get(LAYOUT_BYTE, curPosition++) & 0xFF;
-
-  final int n1Minus1 = flag >> 6;
-  final int n2Minus1 = (flag >> 4) & 0x03;
-  final int n3Minus1 = (flag >> 2) & 0x03;
-  final int n4Minus1 = flag & 0x03;
-
-  dst[offset] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n1Minus1];
-  curPosition += 1 + n1Minus1;
-  dst[offset + 1] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n2Minus1];
-  curPosition += 1 + n2Minus1;
-  dst[offset + 2] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n3Minus1];
-  curPosition += 1 + n3Minus1;
-  dst[offset + 3] =
-  curSegment.get(LAYOUT_LE_INT, curPosition) & 
GroupVIntUtil.GROUP_VINT_MASKS[n4Minus1];
-  curPosition += 1 + n4Minus1;
+  curPosition +=
+  GroupVIntUtil.readGroupVInt(
+  flag, p -> curSegment.get(LAYOUT_LE_INT, p), curPosition, dst, 
offset);

Review Comment:
   emmm...  we use `curSegment.get(LAYOUT_BYTE, curPosition++)` to read flag 
byte, so `curPosition++`  will increment the read pos , is that right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub


benwtrent opened a new issue, #12895:
URL: https://github.com/apache/lucene/issues/12895

   ### Description
   
   It seems that https://github.com/apache/lucene/pull/12699/ has inadvertantly 
broken reading term dictionaries created in Lucene 9.8<=.
   
   To replicate a bug, one can index wikibigall with LuceneUtil & Lucene 9.8 & 
force-merge.
   
   Then attempt to read the created index using a wildcard query:
   
   ```
   Path path = 
Paths.get("/data/local/lucene/indices/wikibigall.lucene-main.opt.Lucene90.dvfields.nd6.72652M/index");
   try (FSDirectory dir = FSDirectory.open(path);
   DirectoryReader reader = DirectoryReader.open(dir)) {
 IndexSearcher searcher = new IndexSearcher(reader); 
 searcher.count(new WildcardQuery(new Term("body", "*fo*")));
   }
   ```
   
   This will result in a trace similar to below:
   ```
   Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 3 
out of bounds for length 3
at 
org.apache.lucene.store.ByteArrayDataInput.readByte(ByteArrayDataInput.java:136)
at org.apache.lucene.store.DataInput.readVInt(DataInput.java:110)
at 
org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:114)
at 
org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:158)
at 
org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:149)
at 
org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:203)
at 
org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:531)
at 
org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373)
at 
org.apache.lucene.search.MultiTermQueryConstantScoreBlendedWrapper$1.rewriteInner(MultiTermQueryConstantScoreBlendedWrapper.java:111)
at 
org.apache.lucene.search.AbstractMultiTermQueryConstantScoreWrapper$RewritingWeight.rewrite(AbstractMultiTermQueryConstantScoreWrapper.java:179)
at 
org.apache.lucene.search.AbstractMultiTermQueryConstantScoreWrapper$RewritingWeight.bulkScorer(AbstractMultiTermQueryConstantScoreWrapper.java:220)
at 
org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:930)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:678)
at 
org.apache.lucene.search.IndexSearcher.lambda$4(IndexSearcher.java:636)
at 
org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$0(TaskExecutor.java:118)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at 
org.apache.lucene.search.TaskExecutor$TaskGroup.invokeAll(TaskExecutor.java:153)
at org.apache.lucene.search.TaskExecutor.invokeAll(TaskExecutor.java:76)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:640)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:607)
at org.apache.lucene.search.IndexSearcher.count(IndexSearcher.java:423)
at Corruption.main(Corruption.java:18)
   ```
   
   We are currently not sure if this effects Lucene 9.9 created indices & 
reading via Lucene 9.9. 
   
   NOTE: This also fails with just a prefix wildcard query. It seems to be all 
multi-term queries could be affected.
   
   Will provide more example stack traces in issue comments.
   
   ### Version and environment details
   
   Lucene 9.9 reading Lucene 9.8 indices.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub


benwtrent commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847562169

   //cc @gf2121 && @mikemccand 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub


benwtrent commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847568572

   Here are some exceptions ran into when trying to do multi-term queries with 
Lucene 9.9 against an index created in 9.8 or before:
   
   ```
   Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 4 out of bounds 
for length 4
at 
org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:112)
at 
org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:158)
at 
org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:149)
at 
org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:203)
at 
org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:531)
at 
org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373)
at 
org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.next(FilterLeafReader.java:201)
   ```
   
   ```
   Caused by: org.apache.lucene.index.CorruptIndexException: Illegal code for a 
compression algorithm: 3 (resource=SearchIndexInput{[(clone of) 
_3e4i_Lucene90_0.tim], context=IOContext [context=READ, mergeInfo=null, 
flushInfo=null, readOnce=false], 
cacheFile=SharedCacheFile{cacheKey=FileCacheKey[shardId=[.ds-logs-endpoint.events.file-default-2023.11.22-02][0],
 primaryTerm=3, fileName=stateless_commit_61353], length=3174808501}, 
length=308358826, offset=181596})
at 
org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:196)
at 
org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.loadNextFloorBlock(IntersectTermsEnumFrame.java:119)
at 
org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.popPushNext(IntersectTermsEnum.java:339)
at 
org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:543)
at 
org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373)
at 
org.apache.lucene.core@9.9.0/org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.next(FilterLeafReader.java:201)
   ```
   
   ```
   Caused by: java.lang.NullPointerException: Cannot invoke 
"org.apache.lucene.util.fst.FST$Arc.output()" because "arc" is null
at 
org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:196)
at 
org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:531)
at 
org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373)
at 
org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.next(FilterLeafReader.java:201)
   ```
   
   Still digging into this to figure out why reading is now corrupted. 
https://github.com/apache/lucene/pull/12699 has many subtle changes, including 
a very big one where we don't copy the byte arrays any longer.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove DrillSideways#createDrillDownFacetsCollector in favor of the manager-based hook [lucene]

2023-12-08 Thread via GitHub


gsmiller merged PR #12855:
URL: https://github.com/apache/lucene/pull/12855


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Mark DrillSideways#createDrillDownFacetsCollector as @Deprecated [lucene]

2023-12-08 Thread via GitHub


gsmiller merged PR #12854:
URL: https://github.com/apache/lucene/pull/12854


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub


gsmiller commented on code in PR #12812:
URL: https://github.com/apache/lucene/pull/12812#discussion_r1420830252


##
lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java:
##
@@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool {
   super(allocator);
 }
 
+/**
+ * For slices, buffers must be filled with zeros, so that we can find a 
slice's end based on a
+ * non-zero final value.
+ */
+private static boolean assertSliceBuffer(int[] buffer) {
+  for (int value : buffer) {

Review Comment:
   No problem! Thanks for the change!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]

2023-12-08 Thread via GitHub


gsmiller merged PR #12812:
URL: https://github.com/apache/lucene/pull/12812


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub


benwtrent commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847590622

   Possibly related: https://github.com/apache/lucene/pull/12631
   
   NOTE: the read corruption doesn't occur when reading from an index created 
in 9.9.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-08 Thread via GitHub


easyice commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1420843242


##
lucene/core/src/test/org/apache/lucene/store/TestMMapDirectory.java:
##
@@ -114,4 +115,31 @@ public void testNullParamsIndexInput() throws Exception {
   }
 }
   }
+
+  private void doTestReadingClosedInput(Directory dir, long[] values, int 
numValues)

Review Comment:
   Thank you @uschindler , It's more simpler, i like this!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix NPE on off-heap test and FST is null [lucene]

2023-12-08 Thread via GitHub


gsmiller commented on code in PR #12894:
URL: https://github.com/apache/lucene/pull/12894#discussion_r1420835887


##
lucene/test-framework/src/java/org/apache/lucene/tests/util/fst/FSTTester.java:
##
@@ -283,14 +283,17 @@ public FST doTest() throws IOException {
   }
 }
 FST fst = fstCompiler.compile();
-;

Review Comment:
   Don't like @mikemccand see this! 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix NPE on off-heap test and FST is null [lucene]

2023-12-08 Thread via GitHub


gsmiller merged PR #12894:
URL: https://github.com/apache/lucene/pull/12894


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub


benwtrent commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847604189

   Git bisect has confirmed the read corruption occurs with: 
https://github.com/apache/lucene/pull/12699


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub


jpountz commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847604281

   I have a 9.8 index that reproduces the bug and ran a `git bisect` to figure 
out the first commit that fails, it pointed to #12699.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren [lucene]

2023-12-08 Thread via GitHub


gsmiller opened a new issue, #12896:
URL: https://github.com/apache/lucene/issues/12896

   ### Description
   
   Saw this test fail a couple times in automated builds (e.g., 
[here](https://jenkins.thetaphi.de/job/Lucene-main-Windows/13501/testReport/junit/org.apache.lucene.search.join/TestParentBlockJoinByteKnnVectorQuery/testScoringWithMultipleChildren/)).
   
   I was able to reproduce locally against `main` with: `./gradlew test --tests 
TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren 
-Dtests.seed=98B9F347D87BF07E`
   
   ### Version and environment details
   
   Reproduced on the `main` branch


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub


mikemccand commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847782193

   Ugh -- I'll try to look at this later today.  Disappointing that our back 
compat test specifically for reading 9.8 indices failed to catch this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub


mikemccand commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847786318

   It's also curious that it's not happening w/ 9.9 created indices.  #12699 is 
about optimizing how we accumulate the long output while traversing (reading) 
the FST block tree terms index.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]

2023-12-08 Thread via GitHub


zhaih commented on code in PR #12885:
URL: https://github.com/apache/lucene/pull/12885#discussion_r1421011978


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java:
##
@@ -43,10 +43,30 @@ public JapaneseReadingFormFilter(TokenStream input) {
 this(input, false);
   }
 
+  private boolean isAllHiragana(CharSequence s) {
+int len = s.length();
+for (int i = 0; i < len; i++) {
+  char ch = s.charAt(i);
+  if (ch < 0x3040 || ch > 0x3097) {
+return false;
+  }
+}
+return true;
+  }
+
   @Override
   public boolean incrementToken() throws IOException {
 if (input.incrementToken()) {
   String reading = readingAttr.getReading();
+  if (reading == null & isAllHiragana(termAttr)) {

Review Comment:
   What if we have hiragana and katagana mixed text? I feel like we still need 
to convert the hiragana part to katagana?



##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java:
##
@@ -43,10 +43,30 @@ public JapaneseReadingFormFilter(TokenStream input) {
 this(input, false);
   }
 
+  private boolean isAllHiragana(CharSequence s) {
+int len = s.length();
+for (int i = 0; i < len; i++) {
+  char ch = s.charAt(i);
+  if (ch < 0x3040 || ch > 0x3097) {
+return false;
+  }
+}
+return true;
+  }
+
   @Override
   public boolean incrementToken() throws IOException {
 if (input.incrementToken()) {
   String reading = readingAttr.getReading();
+  if (reading == null & isAllHiragana(termAttr)) {

Review Comment:
   Maybe use `&&` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren [lucene]

2023-12-08 Thread via GitHub


zhaih commented on issue #12896:
URL: https://github.com/apache/lucene/issues/12896#issuecomment-1847828807

   I think it might due to the same problem as: 
https://github.com/apache/lucene/pull/12889 
   e.g. a doc reorder merge policy reordered the parent child block
   I haven't check it myself tho.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren [lucene]

2023-12-08 Thread via GitHub


zhaih commented on issue #12896:
URL: https://github.com/apache/lucene/issues/12896#issuecomment-1847830605

   Oh probably not, the test is just using the default merge policy (TMP)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub


benwtrent commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847858897

   @mikemccand I have to use at a minimum: `wikibig1m` for it to replicate.
   
   Couple of weird things I noticed in that optimization PR:
   
- 
https://github.com/apache/lucene/pull/12699/files#diff-c30add12e4170e7905cdb0912fd869d447b5b1772d74c9fcaf0afe41cd58866aL200-R204
 Now instead of giving a complete view of all appended arcs, there is this 
weird assumption that we only give one to `floorDataReader`
- 
https://github.com/apache/lucene/pull/12699/files#diff-c30add12e4170e7905cdb0912fd869d447b5b1772d74c9fcaf0afe41cd58866aR200-L202.
 The frame outputPrefix is then set as the accumulation of all the arc outputs. 
I am not sure if this is relevant as for it to be important `getFrame` would 
have to return a frame already seen before, I am not sure if that is valid or 
not.
   
   
   This makes me think that each frame should have its own `OutputAccumulator` 
But all this code seems incredibly subtle and opaque to me. Been staring at it 
for hours now :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]

2023-12-08 Thread via GitHub


gautamworah96 commented on code in PR #12853:
URL: https://github.com/apache/lucene/pull/12853#discussion_r1421077089


##
lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java:
##
@@ -193,42 +204,29 @@ public BulkScorer bulkScorer(LeafReaderContext context) 
throws IOException {
   FacetsCollector sidewaysCollector = 
drillSidewaysCollectorManagers[dim].newCollector();
   sidewaysCollectors[dim] = sidewaysCollector;
 
-  dims[dim] = new DrillSidewaysScorer.DocsAndCost(scorer, 
sidewaysCollector);
+  dims[dim] =
+  new DrillSidewaysScorer.DocsAndCost(
+  scorer, sidewaysCollector.getLeafCollector(context));
 }
 
-// If more than one dim has no matches, then there
-// are no hits nor drill-sideways counts.  Or, if we
-// have only one dim and that dim has no matches,
-// same thing.

Review Comment:
   unrelated but how exactly is `(nullCount > 1 || (nullCount == 1 && 
dims.length == 1))` equivalent to `nullCount > 1`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]

2023-12-08 Thread via GitHub


gautamworah96 commented on code in PR #12853:
URL: https://github.com/apache/lucene/pull/12853#discussion_r1421069686


##
lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java:
##
@@ -193,42 +204,29 @@ public BulkScorer bulkScorer(LeafReaderContext context) 
throws IOException {
   FacetsCollector sidewaysCollector = 
drillSidewaysCollectorManagers[dim].newCollector();
   sidewaysCollectors[dim] = sidewaysCollector;
 
-  dims[dim] = new DrillSidewaysScorer.DocsAndCost(scorer, 
sidewaysCollector);
+  dims[dim] =
+  new DrillSidewaysScorer.DocsAndCost(
+  scorer, sidewaysCollector.getLeafCollector(context));
 }
 
-// If more than one dim has no matches, then there
-// are no hits nor drill-sideways counts.  Or, if we
-// have only one dim and that dim has no matches,
-// same thing.
-// if (nullCount > 1 || (nullCount == 1 && dims.length == 1)) {
-if (nullCount > 1) {
+// If baseScorer is null or the dim nullCount > 1, then we have 
nothing to score. We return
+// a null scorer in this case, but we need to make sure #finish gets 
called on all facet
+// collectors since IndexSearcher won't handle this for us:
+if (baseScorer == null || nullCount > 1) {
+  if (drillDownCollector != null) {
+drillDownCollector.finish();
+  }
+  for (FacetsCollector fc : sidewaysCollectors) {
+fc.finish();
+  }
   return null;
 }
 
 // Sort drill-downs by most restrictive first:
-Arrays.sort(
-dims,
-new Comparator() {
-  @Override
-  public int compare(DocsAndCost o1, DocsAndCost o2) {
-return Long.compare(o1.approximation.cost(), 
o2.approximation.cost());
-  }
-});
-
-if (baseScorer == null) {
-  return null;
-}
-
-FacetsCollector drillDownCollector;
-if (drillDownCollectorManager != null) {
-  drillDownCollector = drillDownCollectorManager.newCollector();
-  managedDrillDownCollectors.add(drillDownCollector);
-} else {
-  drillDownCollector = null;
-}
+Arrays.sort(dims, Comparator.comparingLong(o -> 
o.approximation.cost()));

Review Comment:
   Thanks for these small cleanups as well. LGTM



##
lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java:
##
@@ -193,42 +204,29 @@ public BulkScorer bulkScorer(LeafReaderContext context) 
throws IOException {
   FacetsCollector sidewaysCollector = 
drillSidewaysCollectorManagers[dim].newCollector();
   sidewaysCollectors[dim] = sidewaysCollector;
 
-  dims[dim] = new DrillSidewaysScorer.DocsAndCost(scorer, 
sidewaysCollector);
+  dims[dim] =
+  new DrillSidewaysScorer.DocsAndCost(
+  scorer, sidewaysCollector.getLeafCollector(context));
 }
 
-// If more than one dim has no matches, then there
-// are no hits nor drill-sideways counts.  Or, if we
-// have only one dim and that dim has no matches,
-// same thing.

Review Comment:
   unlrelated but how exactly is `(nullCount > 1 || (nullCount == 1 && 
dims.length == 1))` equivalent to `nullCount > 1`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]

2023-12-08 Thread via GitHub


gsmiller commented on code in PR #12853:
URL: https://github.com/apache/lucene/pull/12853#discussion_r1421103992


##
lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java:
##
@@ -193,42 +204,29 @@ public BulkScorer bulkScorer(LeafReaderContext context) 
throws IOException {
   FacetsCollector sidewaysCollector = 
drillSidewaysCollectorManagers[dim].newCollector();
   sidewaysCollectors[dim] = sidewaysCollector;
 
-  dims[dim] = new DrillSidewaysScorer.DocsAndCost(scorer, 
sidewaysCollector);
+  dims[dim] =
+  new DrillSidewaysScorer.DocsAndCost(
+  scorer, sidewaysCollector.getLeafCollector(context));
 }
 
-// If more than one dim has no matches, then there
-// are no hits nor drill-sideways counts.  Or, if we
-// have only one dim and that dim has no matches,
-// same thing.

Review Comment:
   Yeah, they're not :). We _do_ need to still run the query in the case where 
there's a single dim but it's null because we need to collect all the 
"sideways" hits for that null dim. Looks like this comment (along with the 
commented out condition) came from @mikemccand back in 2014. I doubt he recalls 
much 9 years later. Maybe he realized this is a bug, commented out the 
condition (and replaced it with the correct one) and left the comment 
untouched? That's my best guess. But no need to keep the comment around anymore.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]

2023-12-08 Thread via GitHub


gsmiller merged PR #12853:
URL: https://github.com/apache/lucene/pull/12853


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]

2023-12-08 Thread via GitHub


gsmiller commented on PR #12853:
URL: https://github.com/apache/lucene/pull/12853#issuecomment-1847976009

   Thanks @gautamworah96  for taking a look!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] IntTaxonomyFacets chooses dense values array when FacetsCollector has no MatchingDocs [lucene]

2023-12-08 Thread via GitHub


gsmiller commented on issue #12558:
URL: https://github.com/apache/lucene/issues/12558#issuecomment-1847976273

   Fixed the root cause of this in #12853


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Reproducible TestDrillSideways failure [lucene]

2023-12-08 Thread via GitHub


gsmiller commented on issue #12418:
URL: https://github.com/apache/lucene/issues/12418#issuecomment-1847976644

   OK merged #12853 which I think fixes the root cause of this randomized test 
failures. I'm going to resolve out this issue and will keep an eye on nightlies 
for any new failures.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]

2023-12-08 Thread via GitHub


kuramitsu commented on code in PR #12885:
URL: https://github.com/apache/lucene/pull/12885#discussion_r1421172943


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java:
##
@@ -43,10 +43,30 @@ public JapaneseReadingFormFilter(TokenStream input) {
 this(input, false);
   }
 
+  private boolean isAllHiragana(CharSequence s) {
+int len = s.length();
+for (int i = 0; i < len; i++) {
+  char ch = s.charAt(i);
+  if (ch < 0x3040 || ch > 0x3097) {
+return false;
+  }
+}
+return true;
+  }
+
   @Override
   public boolean incrementToken() throws IOException {
 if (input.incrementToken()) {
   String reading = readingAttr.getReading();
+  if (reading == null & isAllHiragana(termAttr)) {

Review Comment:
   Thank you. I fixed it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]

2023-12-08 Thread via GitHub


kuramitsu commented on code in PR #12885:
URL: https://github.com/apache/lucene/pull/12885#discussion_r1421180794


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java:
##
@@ -43,10 +43,30 @@ public JapaneseReadingFormFilter(TokenStream input) {
 this(input, false);
   }
 
+  private boolean isAllHiragana(CharSequence s) {
+int len = s.length();
+for (int i = 0; i < len; i++) {
+  char ch = s.charAt(i);
+  if (ch < 0x3040 || ch > 0x3097) {
+return false;
+  }
+}
+return true;
+  }
+
   @Override
   public boolean incrementToken() throws IOException {
 if (input.incrementToken()) {
   String reading = readingAttr.getReading();
+  if (reading == null & isAllHiragana(termAttr)) {

Review Comment:
   Fixed to support partial hiragana. 
   However, it is difficult to describe a test for oov terms that contain a 
mixture of hiragana and katakana, because the system seems to basically split 
different character types for oov terms.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]

2023-12-08 Thread via GitHub


kuramitsu commented on code in PR #12885:
URL: https://github.com/apache/lucene/pull/12885#discussion_r1421180794


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java:
##
@@ -43,10 +43,30 @@ public JapaneseReadingFormFilter(TokenStream input) {
 this(input, false);
   }
 
+  private boolean isAllHiragana(CharSequence s) {
+int len = s.length();
+for (int i = 0; i < len; i++) {
+  char ch = s.charAt(i);
+  if (ch < 0x3040 || ch > 0x3097) {
+return false;
+  }
+}
+return true;
+  }
+
   @Override
   public boolean incrementToken() throws IOException {
 if (input.incrementToken()) {
   String reading = readingAttr.getReading();
+  if (reading == null & isAllHiragana(termAttr)) {

Review Comment:
   Fixed to support partial hiragana. 
   However, it is difficult to describe a test for OOV terms that contain a 
mixture of hiragana and katakana, because the system seems to basically split 
different character types for OOV terms.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub


benwtrent commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848037016

   I think if a fix for this isn't found early next week, we should consider 
reverting it.
   
   No user should upgrade to Lucene 9.9.0 with this bug.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Output well-formed UTF-8 bytes in SimpleTextCodec's segmentinfos [lucene]

2023-12-08 Thread via GitHub


msfroh opened a new pull request, #12897:
URL: https://github.com/apache/lucene/pull/12897

   ### Description
   
   The SimpleTextSegmentInfoFormat was writing the random byte array used as a 
segment's ID directly -- not converting to a simple text representation of the 
byte array. As a result, the segment infos were often malformed (as UTF-8 text).
   
   The included test was failing before the change to write out the text 
representation of the byte array.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Output well-formed UTF-8 bytes in SimpleTextCodec's segmentinfos [lucene]

2023-12-08 Thread via GitHub


msfroh commented on PR #12897:
URL: https://github.com/apache/lucene/pull/12897#issuecomment-1848071583

   If needed, I'm happy to add versions of `testFileIsUTF8()` for the other 
SimpleTextCodec format unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-08 Thread via GitHub


mikemccand commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848198927

   I am travelling this weekend and unlikely to make much progress on this 
until early next week.
   
   Maybe we just revert and release 9.9.1 now?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]

2023-12-08 Thread via GitHub


dungba88 commented on code in PR #12831:
URL: https://github.com/apache/lucene/pull/12831#discussion_r1420866991


##
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##
@@ -503,9 +518,7 @@ public FSTMetadata getMetadata() {
   }
 
   /**
-   * Save the FST to DataOutput. If you use an {@link 
org.apache.lucene.store.IndexOutput} to build
-   * the FST, then you should not and do not need to call this method, as the 
FST is already saved.
-   * Doing so will throw an {@link UnsupportedOperationException}.
+   * Save the FST to DataOutput.

Review Comment:
   It's no longer possible for user to create a FST that uses `NullFSTReader`, 
so this Javadoc will no longer holds. If users are able to create FST, then it 
will always be readable.



##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -175,7 +176,7 @@ private FSTCompiler(
 fst =
 new FST<>(
 new FST.FSTMetadata<>(inputType, outputs, null, -1, 
VERSION_CURRENT, 0),
-toFSTReader(dataOutput));
+NULL_FST_READER);

Review Comment:
   As we can't pass actual null to FSTReader, `NullFSTReader` will be used 
instead.



##
lucene/analysis/common/src/java/org/apache/lucene/analysis/charfilter/NormalizeCharMap.java:
##
@@ -111,7 +111,7 @@ public NormalizeCharMap build() {
 for (Map.Entry ent : pendingPairs.entrySet()) {
   fstCompiler.add(Util.toUTF16(ent.getKey(), scratch), new 
CharsRef(ent.getValue()));
 }
-map = fstCompiler.compile();
+map = FST.fromFSTReader(fstCompiler.compile(), 
fstCompiler.getFSTReader());

Review Comment:
   This `fromFSTReader` is there to avoid the boilerplate null-check that each 
consumer must now do.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org