[I] Reproducible error in TestLucene90HnswVectorsFormat.testIndexedValueNotAliased [lucene]

2023-11-24 Thread via GitHub


iverase opened a new issue, #12840:
URL: https://github.com/apache/lucene/issues/12840

   Command to reproduce:
   
   ```
   ./gradlew test --tests TestLucene90HnswVectorsFormat.testIndexedValueNotAliased -Dtests.seed=611EEBD0148F03C7
   ```
   
   error:
   ```
   org.apache.lucene.backward_codecs.lucene90.TestLucene90HnswVectorsFormat > testIndexedValueNotAliased FAILED
   java.lang.AssertionError: expected:<1.0> but was:<2.0>
   at __randomizedtesting.SeedInfo.seed([611EEBD0148F03C7:651A742B93C1394]:0)
   at junit@4.13.1/org.junit.Assert.fail(Assert.java:89)
   at junit@4.13.1/org.junit.Assert.failNotEquals(Assert.java:835)
   at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:577)
   at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:701)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Skip decoding tail freqs when they are not needed. [lucene]

2023-11-24 Thread via GitHub


jpountz commented on PR #12832:
URL: https://github.com/apache/lucene/pull/12832#issuecomment-1825371734

   This seems to have further helped [`prefix` queries](http://people.apache.org/~mikemccand/lucenebench/Prefix3.html). I'll add an annotation.





Re: [I] Move group-varint encoding/decoding logic to DataOutput/DataInput? [lucene]

2023-11-24 Thread via GitHub


jpountz commented on issue #12826:
URL: https://github.com/apache/lucene/issues/12826#issuecomment-1825393392

   Let's move your branch to a PR?





Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-24 Thread via GitHub


gf2121 commented on code in PR #12838:
URL: https://github.com/apache/lucene/pull/12838#discussion_r1404145355


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99SkipReader.java:
##
@@ -48,7 +48,7 @@
  *
  * Therefore, we'll trim df before passing it to the interface. see trim(int)

Review Comment:
   This Javadoc explains in detail why we need this `trim`. We need to update it if we plan to remove this :)






Re: [I] Move group-varint encoding/decoding logic to DataOutput/DataInput? [lucene]

2023-11-24 Thread via GitHub


easyice commented on issue #12826:
URL: https://github.com/apache/lucene/issues/12826#issuecomment-1825397930

   Okay!





Re: [I] add dedicated test to assert internals of LZ4 hashtable [LUCENE-9190] [lucene]

2023-11-24 Thread via GitHub


slow-J commented on issue #10230:
URL: https://github.com/apache/lucene/issues/10230#issuecomment-1825550388

   Already implemented in https://github.com/apache/lucene-solr/pull/1236, this 
issue can be closed.





Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-24 Thread via GitHub


gf2121 commented on code in PR #12838:
URL: https://github.com/apache/lucene/pull/12838#discussion_r1404145355


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99SkipReader.java:
##
@@ -48,7 +48,7 @@
  *
  * Therefore, we'll trim df before passing it to the interface. see trim(int)

Review Comment:
   This Javadoc explains in detail why we need this `trim`. We need to update it :)






Re: [PR] Hide the internal data structure of HeapPointWriter [lucene]

2023-11-24 Thread via GitHub


iverase merged PR #12762:
URL: https://github.com/apache/lucene/pull/12762





Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-11-24 Thread via GitHub


jpountz commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1404341416


##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -303,6 +304,30 @@ public byte readByte(long pos) throws IOException {
     }
   }
 
+  @Override
+  public void readGroupVInt(long[] docs, int pos) throws IOException {
+    if (curSegment.byteSize() - curPosition < 17) {
+      super.readGroupVInt(docs, pos);
+      return;
+    }
+
+    final int flag = readByte() & 0xFF;
+
+    final int n1Minus1 = flag >> 6;
+    final int n2Minus1 = (flag >> 4) & 0x03;
+    final int n3Minus1 = (flag >> 2) & 0x03;
+    final int n4Minus1 = flag & 0x03;
+
+    docs[pos] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n1Minus1];
+    curPosition += 1 + n1Minus1;
+    docs[pos + 1] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n2Minus1];
+    curPosition += 1 + n2Minus1;
+    docs[pos + 2] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n3Minus1];
+    curPosition += 1 + n3Minus1;
+    docs[pos + 3] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n4Minus1];
+    curPosition += 1 + n4Minus1;
+  }

Review Comment:
   Can you add the same `catch (NullPointerException | IllegalStateException 
e)` that `readInt()` and other read methods have, for the case when the index 
input is closed?
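For readers following the hunk above, the flag-byte layout it decodes can be sketched in isolation. This is a minimal sketch, assuming the layout visible in the diff (one flag byte holding four 2-bit "length minus 1" fields, followed by little-endian payload bytes). The class `GroupVIntSketch` and its methods are hypothetical and read byte by byte; they do not reproduce the masked 4-byte reads of the PR.

```java
import java.util.Arrays;

final class GroupVIntSketch {
  /** Encodes exactly four ints as one group: a flag byte, then 1 to 4 little-endian bytes per value. */
  static byte[] encodeGroup(int[] values) {
    if (values.length != 4) {
      throw new IllegalArgumentException("expected exactly 4 values");
    }
    byte[] buf = new byte[17]; // 1 flag byte + at most 4 * 4 payload bytes
    int pos = 1;
    int flag = 0;
    for (int i = 0; i < 4; i++) {
      int v = values[i];
      // Bytes needed for v, treating it as unsigned; zero still takes one byte.
      int numBytes = Math.max(1, (32 - Integer.numberOfLeadingZeros(v) + 7) / 8);
      flag |= (numBytes - 1) << (6 - 2 * i); // first value uses the top two bits
      for (int b = 0; b < numBytes; b++) {
        buf[pos++] = (byte) (v >>> (8 * b)); // little-endian payload
      }
    }
    buf[0] = (byte) flag;
    return Arrays.copyOf(buf, pos);
  }

  /** Decodes one group produced by encodeGroup. */
  static int[] decodeGroup(byte[] encoded) {
    int flag = encoded[0] & 0xFF;
    int pos = 1;
    int[] values = new int[4];
    for (int i = 0; i < 4; i++) {
      int numBytes = ((flag >> (6 - 2 * i)) & 0x03) + 1;
      int v = 0;
      for (int b = 0; b < numBytes; b++) {
        v |= (encoded[pos++] & 0xFF) << (8 * b);
      }
      values[i] = v;
    }
    return values;
  }

  public static void main(String[] args) {
    int[] input = {1, 300, 70000, Integer.MAX_VALUE};
    byte[] encoded = encodeGroup(input);
    System.out.println(encoded.length + " bytes, round trip ok: "
        + Arrays.equals(input, decodeGroup(encoded)));
  }
}
```

The worst case is four 4-byte values plus the flag byte, which is where the constant 17 in the guard comes from.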



##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -303,6 +304,30 @@ public byte readByte(long pos) throws IOException {
     }
   }
 
+  @Override
+  public void readGroupVInt(long[] docs, int pos) throws IOException {
+    if (curSegment.byteSize() - curPosition < 17) {
+      super.readGroupVInt(docs, pos);
+      return;
+    }

Review Comment:
   I don't think we have a test that covers this case well at the moment.



##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -49,6 +49,7 @@ abstract class MemorySegmentIndexInput extends IndexInput implements RandomAcces
   final int chunkSizePower;
   final Arena arena;
   final MemorySegment[] segments;
+  private static final int[] MASKS = new int[] {0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF};

Review Comment:
   Maybe rename to `GROUP_VINT_MASKS` or something along these lines, now that this logic has moved to a class which is not only about group vint?
   
   Also, in general I prefer having constants before instance members in the class definition.



##
lucene/core/src/java/org/apache/lucene/store/DataOutput.java:
##
@@ -29,6 +29,7 @@
  * internal state like file position).
  */
 public abstract class DataOutput {
+  BytesRef groupVIntBytes;

Review Comment:
   BytesRefBuilder feels like a better fit for how you're using it (using 
`length` rather than `offset` to track the number of written bytes). Also let's 
make it `private`?



##
lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestGroupVInt.java:
##
@@ -31,9 +34,7 @@ public void testEncodeDecode() throws IOException {
     long[] values = new long[ForUtil.BLOCK_SIZE];
     long[] restored = new long[ForUtil.BLOCK_SIZE];
     final int iterations = atLeast(100);
-
-    final GroupVIntWriter w = new GroupVIntWriter();
-    byte[] encoded = new byte[(int) (Integer.BYTES * ForUtil.BLOCK_SIZE * 1.25)];
+    Directory dir = FSDirectory.open(createTempDir());

Review Comment:
   Let's use `newFSDirectory` to add coverage for all Directory implementations?
   
   ```suggestion
   Directory dir = newFSDirectory(createTempDir());
   ```



##
lucene/core/src/java/org/apache/lucene/store/DataOutput.java:
##
@@ -324,4 +325,45 @@ public void writeSetOfStrings(Set<String> set) throws IOException {
       writeString(value);
     }
   }
+
+  /**
+   * Encode integers using group-varint. It uses VInt to encode tail values that are not enough for
+   * a group
+   *
+   * @param values the values to write
+   * @param limit the number of values to write.
+   */
+  public void writeGroupVInts(long[] values, int limit) throws IOException {
+    if (groupVIntBytes == null) {
+      // the maximum size of one group is 4 integers + 1 byte flag.
+      groupVIntBytes = new BytesRef(17);
+    }
+    int off = 0;
+
+    // encode each group
+    while ((limit - off) >= 4) {
+      byte flag = 0;
+      groupVIntBytes.offset = 1;
+      flag |= (encodeGroupValue((int) values[off++]) - 1) << 6;
+      flag |= (encodeGroupValue((int) values[off++]) - 1) << 4;
+      flag |= (encodeGroupValue((int) values[off++]) - 1) << 2;
+      flag |= (encodeGroupValue((int) values[off++]) - 1);
+      groupVIntBytes.bytes[0] = flag;
+      writeBytes(groupVIntBytes.bytes, groupVIntBytes.offset);
+    }
+
+    // tail vints
+    for (; off < limit; off++) {
+      writeVInt((int) values[off]);

Review Comment:
   Now that we're moving this to `DataOutput`, we probably need to check these 
casts, e.g. with `Math.toIntExact`.
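To illustrate the concern about unchecked casts: a plain `(int)` cast silently keeps only the low 32 bits of a `long`, whereas `Math.toIntExact` throws `ArithmeticException` when the value does not fit. The class and method names below are hypothetical and are not part of the PR; this is only a sketch of the difference.

```java
final class CastCheckDemo {
  /** Unchecked narrowing: silently keeps only the low 32 bits. */
  static int uncheckedCast(long value) {
    return (int) value;
  }

  /** Checked narrowing: throws ArithmeticException if the value does not fit in an int. */
  static int checkedCast(long value) {
    return Math.toIntExact(value);
  }

  /** Returns true if the checked cast rejects the value. */
  static boolean overflows(long value) {
    try {
      checkedCast(value);
      return false;
    } catch (ArithmeticException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    long tooBig = (1L << 32) + 5; // needs more than 32 bits
    System.out.println("unchecked: " + uncheckedCast(tooBig));
    System.out.println("checked cast rejects it: " + overflows(tooBig));
  }
}
```

With the checked variant, a caller that passes an out-of-range `long` gets an exception at write time instead of a silently corrupted value.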



##

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-11-24 Thread via GitHub


jpountz commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1825685057

   And maybe `BufferedIndexInput` too for folks using `NIOFSDirectory`?





Re: [PR] Simplify advancing on postings/impacts enums [lucene]

2023-11-24 Thread via GitHub


jpountz commented on code in PR #12838:
URL: https://github.com/apache/lucene/pull/12838#discussion_r1404365624


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99SkipReader.java:
##
@@ -48,7 +48,7 @@
  *
  * Therefore, we'll trim df before passing it to the interface. see trim(int)

Review Comment:
   Indeed! OK, this change is bigger than I thought it'd be, I won't try to 
fold it into 9.9.






Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-11-24 Thread via GitHub


easyice commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1404387618


##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -303,6 +304,30 @@ public byte readByte(long pos) throws IOException {
     }
   }
 
+  @Override
+  public void readGroupVInt(long[] docs, int pos) throws IOException {
+    if (curSegment.byteSize() - curPosition < 17) {
+      super.readGroupVInt(docs, pos);
+      return;
+    }
+
+    final int flag = readByte() & 0xFF;
+
+    final int n1Minus1 = flag >> 6;
+    final int n2Minus1 = (flag >> 4) & 0x03;
+    final int n3Minus1 = (flag >> 2) & 0x03;
+    final int n4Minus1 = flag & 0x03;
+
+    docs[pos] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n1Minus1];
+    curPosition += 1 + n1Minus1;
+    docs[pos + 1] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n2Minus1];
+    curPosition += 1 + n2Minus1;
+    docs[pos + 2] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n3Minus1];
+    curPosition += 1 + n3Minus1;
+    docs[pos + 3] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n4Minus1];
+    curPosition += 1 + n4Minus1;
+  }

Review Comment:
   +1, Thanks!



##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -49,6 +49,7 @@ abstract class MemorySegmentIndexInput extends IndexInput implements RandomAcces
   final int chunkSizePower;
   final Arena arena;
   final MemorySegment[] segments;
+  private static final int[] MASKS = new int[] {0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF};

Review Comment:
   +1, great suggestion!



##
lucene/core/src/java/org/apache/lucene/store/DataOutput.java:
##
@@ -29,6 +29,7 @@
  * internal state like file position).
  */
 public abstract class DataOutput {
+  BytesRef groupVIntBytes;

Review Comment:
   +1, Thanks for the suggestion!



##
lucene/core/src/java/org/apache/lucene/store/DataOutput.java:
##
@@ -324,4 +325,45 @@ public void writeSetOfStrings(Set<String> set) throws IOException {
       writeString(value);
     }
   }
+
+  /**
+   * Encode integers using group-varint. It uses VInt to encode tail values that are not enough for
+   * a group
+   *
+   * @param values the values to write
+   * @param limit the number of values to write.
+   */
+  public void writeGroupVInts(long[] values, int limit) throws IOException {
+    if (groupVIntBytes == null) {
+      // the maximum size of one group is 4 integers + 1 byte flag.
+      groupVIntBytes = new BytesRef(17);
+    }
+    int off = 0;
+
+    // encode each group
+    while ((limit - off) >= 4) {
+      byte flag = 0;
+      groupVIntBytes.offset = 1;
+      flag |= (encodeGroupValue((int) values[off++]) - 1) << 6;
+      flag |= (encodeGroupValue((int) values[off++]) - 1) << 4;
+      flag |= (encodeGroupValue((int) values[off++]) - 1) << 2;
+      flag |= (encodeGroupValue((int) values[off++]) - 1);
+      groupVIntBytes.bytes[0] = flag;
+      writeBytes(groupVIntBytes.bytes, groupVIntBytes.offset);
+    }
+
+    // tail vints
+    for (; off < limit; off++) {
+      writeVInt((int) values[off]);

Review Comment:
   Good idea, I like that!



##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -303,6 +304,30 @@ public byte readByte(long pos) throws IOException {
     }
   }
 
+  @Override
+  public void readGroupVInt(long[] docs, int pos) throws IOException {
+    if (curSegment.byteSize() - curPosition < 17) {
+      super.readGroupVInt(docs, pos);
+      return;
+    }

Review Comment:
   In `TestGroupVInt#testEncodeDecode` we use a range of [1-31] for `bpv` and a range of [1-128] for `numValues`. For instance, if `bpv==2` and `numValues==4`, will it cover this case?
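The arithmetic behind that question can be checked with a small sketch. It assumes the layout shown in the diff (one flag byte plus 1 to 4 payload bytes per value); `GroupVIntSize` and its methods are hypothetical names, not part of the PR. Whether a given input actually reaches the `super.readGroupVInt` branch also depends on how many bytes remain in the current segment, which this sketch does not model.

```java
final class GroupVIntSize {
  /** Number of bytes (1 to 4) needed to store a non-negative int. */
  static int bytesFor(int value) {
    if (value < 0) {
      throw new IllegalArgumentException("negative value: " + value);
    }
    if (value == 0) {
      return 1; // zero still occupies one payload byte
    }
    int bits = 32 - Integer.numberOfLeadingZeros(value);
    return (bits + 7) / 8;
  }

  /** Size of one encoded group: 1 flag byte plus the payload bytes of its four values. */
  static int groupSize(int v1, int v2, int v3, int v4) {
    return 1 + bytesFor(v1) + bytesFor(v2) + bytesFor(v3) + bytesFor(v4);
  }

  public static void main(String[] args) {
    // Four 2-bit values (bpv == 2, numValues == 4) encode to a single short group.
    System.out.println("bpv=2 group: " + groupSize(3, 3, 3, 3) + " bytes");
    // Four maximal values give the worst case that the guard in the diff checks against.
    int max = Integer.MAX_VALUE;
    System.out.println("worst case: " + groupSize(max, max, max, max) + " bytes");
  }
}
```

A file that holds only one such short group has fewer than 17 bytes in total, so a read from its start would fail the size check and take the fallback path.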



##
lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestGroupVInt.java:
##
@@ -31,9 +34,7 @@ public void testEncodeDecode() throws IOException {
     long[] values = new long[ForUtil.BLOCK_SIZE];
     long[] restored = new long[ForUtil.BLOCK_SIZE];
     final int iterations = atLeast(100);
-
-    final GroupVIntWriter w = new GroupVIntWriter();
-    byte[] encoded = new byte[(int) (Integer.BYTES * ForUtil.BLOCK_SIZE * 1.25)];
+    Directory dir = FSDirectory.open(createTempDir());

Review Comment:
   +1



##
lucene/core/src/java/org/apache/lucene/store/DataInput.java:
##
@@ -98,6 +98,55 @@ public int readInt() throws IOException {
     return ((b4 & 0xFF) << 24) | ((b3 & 0xFF) << 16) | ((b2 & 0xFF) << 8) | (b1 & 0xFF);
   }
 
+  /**
+   * Read all the group varints, including the tail vints.
+   *
+   * @param docs the array to read ints into.
+   * @param limit the number of int values to read.
+   */
+  public void readGroupVInts(long[] docs, int limit) throws IOException {
+    int i;
+    for (i = 0; i <= limit - 4; i += 4) {
+      readGroupVInt(docs, i);
+    }
+    for (; i < limit; ++i) {
+      docs[i] = readVInt();
+    }
+  }
+
+  /

[PR] Use group-varint encode the positions [lucene]

2023-11-24 Thread via GitHub


easyice opened a new pull request, #12842:
URL: https://github.com/apache/lucene/pull/12842

   Thanks for the suggestion from @jpountz, as discussed in https://github.com/apache/lucene/issues/12826.
   
   This PR uses group-varint to encode some vint values if `storeOffsets` is true. It still uses the `GroupVIntReader` and `GroupVIntWriter` classes; I will update it after https://github.com/apache/lucene/pull/12841 is finished.
   
   Currently I don't use group-varint if `(storeOffsets==false && storePayload==false)`, which means only `token` is stored, because I'm worried that it will use extra memory when bulk decoding. Feel free to correct me.
   
   I'll add the benchmark and file size changes next week.





Re: [I] Use LinkedList instead of manual array re-sizing for better throughput. [LUCENE-9432] [lucene]

2023-11-24 Thread via GitHub


slow-J commented on issue #10472:
URL: https://github.com/apache/lucene/issues/10472#issuecomment-1825871537

   I took a quick look at this 3 years on.
   I took @mohammadsadiq's patch and applied it to `IDVersionSegmentTermsEnum` and `OrdsSegmentTermsEnum`.
   
   I then changed the `LinkedList` to an `ArrayDeque`.
   I ran 2 benchmarks, both with wikibigall and JDK19 on an m5.12xlarge EC2 host.
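As a rough illustration of the kind of change being benchmarked, the sketch below contrasts a stack that re-sizes its backing array by hand with one built on `ArrayDeque`. The names `FrameStackSketch`, `ManualStack`, and `pushThenPopAll` are hypothetical; this is not the code from the patch or from the terms enums mentioned above.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

final class FrameStackSketch {
  /** Manual approach: grow the backing array by hand when it fills up. */
  static final class ManualStack {
    private int[] items = new int[2];
    private int size;

    void push(int item) {
      if (size == items.length) {
        items = Arrays.copyOf(items, items.length * 2); // manual re-sizing
      }
      items[size++] = item;
    }

    int pop() {
      if (size == 0) {
        throw new IllegalStateException("empty stack");
      }
      return items[--size];
    }
  }

  /** Library approach: ArrayDeque grows its internal array on its own. */
  static int[] pushThenPopAll(int... values) {
    ArrayDeque<Integer> stack = new ArrayDeque<>();
    for (int v : values) {
      stack.push(v);
    }
    int[] popped = new int[values.length];
    for (int i = 0; i < popped.length; i++) {
      popped[i] = stack.pop();
    }
    return popped;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(pushThenPopAll(1, 2, 3)));
  }
}
```

Unlike `LinkedList`, `ArrayDeque` keeps its elements in a single array and allocates no node per element, which is one plausible reason to try it as the replacement.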
   
   
   Test 1:
   
https://github.com/slow-J/lucene/commit/e2f5e745f6523688f8bdff09e901aa346ac14d57
   ```
                          Task  QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                  p-value
                        IntNRQ        411.92   (3.2%)                   399.44   (4.1%)     -3.0% ( -10% -    4%)  0.009
                    HighPhrase         39.83   (8.4%)                    38.94  (11.2%)     -2.2% ( -20% -   18%)  0.473
                    AndHighMed        123.34   (2.6%)                   121.53   (2.1%)     -1.5% (  -6% -    3%)  0.051
                    OrHighHigh         53.33   (5.7%)                    52.57   (1.7%)     -1.4% (  -8% -    6%)  0.282
                   AndHighHigh         42.16   (4.3%)                    41.69   (1.1%)     -1.1% (  -6% -    4%)  0.260
                  OrNotHighMed        247.13   (3.0%)                   244.44   (3.6%)     -1.1% (  -7% -    5%)  0.295
               MedSloppyPhrase         11.55   (7.5%)                    11.44   (6.8%)     -0.9% ( -14% -   14%)  0.681
                     MedPhrase         70.95   (3.5%)                    70.35   (3.6%)     -0.8% (  -7% -    6%)  0.451
              HighSloppyPhrase         30.95   (4.9%)                    30.75   (4.2%)     -0.6% (  -9% -    8%)  0.653
         BrowseMonthSSDVFacets         21.44  (11.2%)                    21.34  (13.2%)     -0.5% ( -22% -   26%)  0.907
     BrowseDayOfYearSSDVFacets         21.19  (15.9%)                    21.10  (15.0%)     -0.4% ( -27% -   36%)  0.927
               LowSloppyPhrase         50.43   (4.3%)                    50.21   (4.0%)     -0.4% (  -8% -    8%)  0.738
             HighTermTitleSort        105.90   (1.6%)                   105.44   (2.1%)     -0.4% (  -4% -    3%)  0.453
                       Respell         29.29   (1.3%)                    29.17   (1.4%)     -0.4% (  -3% -    2%)  0.339
                      PKLookup        173.09   (2.1%)                   172.41   (1.8%)     -0.4% (  -4% -    3%)  0.527
                  OrNotHighLow        882.21   (2.7%)                   879.44   (3.3%)     -0.3% (  -6% -    5%)  0.744
           MedIntervalsOrdered         28.81   (3.4%)                    28.72   (3.4%)     -0.3% (  -6% -    6%)  0.779
                       Prefix3         60.35   (3.3%)                    60.18   (3.4%)     -0.3% (  -6% -    6%)  0.800
         BrowseMonthTaxoFacets         13.18   (1.0%)                    13.15   (1.5%)     -0.2% (  -2% -    2%)  0.578
                        Fuzzy2         52.06   (1.4%)                    51.95   (1.3%)     -0.2% (  -2% -    2%)  0.630
                        Fuzzy1         72.08   (1.1%)                    71.93   (1.3%)     -0.2% (  -2% -    2%)  0.595
                    AndHighLow        635.21   (1.9%)                   633.97   (2.5%)     -0.2% (  -4% -    4%)  0.785
                      Wildcard         75.55   (2.3%)                    75.42   (1.9%)     -0.2% (  -4% -    4%)  0.807
                 OrNotHighHigh        140.51   (4.8%)                   140.33   (4.9%)     -0.1% (  -9% -   10%)  0.936
       AndHighMedDayTaxoFacets         44.82   (1.7%)                    44.76   (2.3%)     -0.1% (  -4% -    3%)  0.847
         HighTermDayOfYearSort        345.49   (1.7%)                   345.27   (1.5%)     -0.1% (  -3% -    3%)  0.898
   BrowseRandomLabelTaxoFacets         11.78   (0.7%)                    11.77   (0.8%)     -0.0% (  -1% -    1%)  0.872
                   LowSpanNear         45.65   (1.0%)                    45.64   (1.3%)     -0.0% (  -2% -    2%)  0.929
      AndHighHighDayTaxoFacets         15.07   (2.5%)                    15.07   (2.8%)     -0.0% (  -5% -    5%)  0.971
          HighTermTitleBDVSort         15.95   (5.4%)                    15.95   (5.9%)     -0.0% ( -10% -   11%)  0.987
                   MedSpanNear         10.13   (1.2%)                    10.13   (1.2%)     -0.0% (  -2% -    2%)  0.959
          BrowseDateTaxoFacets         13.34   (0.6%)                    13.34   (0.9%)     -0.0% (  -1% -    1%)  0.970
                     OrHighMed        220.09   (2.8%)                   220.08   (2.2%)     -0.0% (  -4% -    5%)  0.999
          HighIntervalsOrdered          7.05   (2.5%)                     7.05   (2.6%)      0.0% (  -4% -    5%)  1.000
             HighTermMonthSort       2701.34   (3.4%)                  2701.37   (3.2%)      0.0% (  -6% -    6%)  0.999
           LowIntervalsOrdered         40.23   (2.2%)                    40.24   (2.3%)      0.0% (  -4% -    4%)  0.979
          BrowseDateSSDVFacets          4.68   (4.2%)                     4.68   (4.3%)      0.0% (  -8% -    8%)  0.984
                 OrHighNotHigh        152.28   (4.4%)                   152.38   (4.6%)      0.1% (  -8% -    9

Re: [PR] Use group-varint encode the positions [lucene]

2023-11-24 Thread via GitHub


jpountz commented on PR #12842:
URL: https://github.com/apache/lucene/pull/12842#issuecomment-1825874597

   Thanks for looking. Unfortunately, the case I'm most interested in is when 
`storeOffsets` is false and there are no payloads, since this is the default. :)





Re: [PR] Faster prefix sum for bitsPerValue up to 9. [lucene]

2023-11-24 Thread via GitHub


jpountz commented on PR #12843:
URL: https://github.com/apache/lucene/pull/12843#issuecomment-1825884854

   luceneutil doesn't see a noticeable difference (all p-values are high) but 
the micro-benchmark that is attached to this PR seems to see an improvement:
   
   ```
   main
   
   Benchmark                            (bpv)   Mode  Cnt   Score   Error   Units
   ForUtilBenchmark.decodeAndPrefixSum      6  thrpt   25  18.762 ± 0.739  ops/us
   ForUtilBenchmark.decodeAndPrefixSum      7  thrpt   25  18.075 ± 0.220  ops/us
   ForUtilBenchmark.decodeAndPrefixSum      8  thrpt   25  21.040 ± 0.285  ops/us
   ForUtilBenchmark.decodeAndPrefixSum      9  thrpt   25  16.790 ± 0.896  ops/us
   ForUtilBenchmark.decodeAndPrefixSum     10  thrpt   25  17.441 ± 1.260  ops/us
   ForUtilBenchmark.decodeAndPrefixSum     11  thrpt   25  16.697 ± 0.883  ops/us
   
   PR
   
   Benchmark                            (bpv)   Mode  Cnt   Score   Error   Units
   ForUtilBenchmark.decodeAndPrefixSum      6  thrpt   25  19.171 ± 0.277  ops/us
   ForUtilBenchmark.decodeAndPrefixSum      7  thrpt   25  18.875 ± 0.203  ops/us
   ForUtilBenchmark.decodeAndPrefixSum      8  thrpt   25  22.075 ± 0.497  ops/us
   ForUtilBenchmark.decodeAndPrefixSum      9  thrpt   25  18.689 ± 0.792  ops/us
   ForUtilBenchmark.decodeAndPrefixSum     10  thrpt   25  17.696 ± 0.252  ops/us
   ForUtilBenchmark.decodeAndPrefixSum     11  thrpt   25  16.623 ± 0.856  ops/us
   ```
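For readers unfamiliar with the operation being measured, the prefix-sum step turns a block of doc ID deltas into absolute doc IDs. The sketch below is a plain scalar loop under that assumption; `PrefixSumSketch` is a hypothetical name, and this is not how `ForUtil` implements it on packed values.

```java
import java.util.Arrays;

final class PrefixSumSketch {
  /** Turns deltas into absolute values in place, starting from the given base. */
  static void prefixSum(long[] deltas, long base) {
    long sum = base;
    for (int i = 0; i < deltas.length; i++) {
      sum += deltas[i];
      deltas[i] = sum;
    }
  }

  public static void main(String[] args) {
    long[] block = {3, 1, 4, 1, 5};
    prefixSum(block, 10);
    System.out.println(Arrays.toString(block));
  }
}
```

Each output depends on the previous one, which is why the layout of the encoded deltas matters for how well the loop can be vectorized.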





Re: [I] Grow arrays up to a given limit to avoid overallocation where possible [lucene]

2023-11-24 Thread via GitHub


jpountz commented on issue #12839:
URL: https://github.com/apache/lucene/issues/12839#issuecomment-1825928704

   If I'm not mistaken, the `NeighborArray` class we use for vector search may have similar needs (it should probably not size its data structure to `maxSize` in the constructor?).





Re: [I] MultiSimilarity.MultiSimScorer should sum up scores into a double [lucene]

2023-11-24 Thread via GitHub


jpountz closed issue #12675: MultiSimilarity.MultiSimScorer should sum up scores into a double
URL: https://github.com/apache/lucene/issues/12675
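The motivation in the issue title can be seen with a small sketch: accumulating `float` scores in a `float` rounds after every addition, while accumulating in a `double` keeps more precision until the final result. `ScoreSumSketch` and its methods are hypothetical and are not the `MultiSimScorer` code.

```java
final class ScoreSumSketch {
  /** Accumulates in a float: every addition rounds to 24 bits of precision. */
  static float sumAsFloat(float[] scores) {
    float sum = 0f;
    for (float s : scores) {
      sum += s;
    }
    return sum;
  }

  /** Accumulates in a double and leaves any final narrowing to the caller. */
  static double sumAsDouble(float[] scores) {
    double sum = 0d;
    for (float s : scores) {
      sum += s;
    }
    return sum;
  }

  public static void main(String[] args) {
    // 2^24 is the point where a float can no longer represent every integer.
    float[] scores = {16777216f, 1f};
    System.out.println("float sum:  " + sumAsFloat(scores));
    System.out.println("double sum: " + sumAsDouble(scores));
  }
}
```

In this deliberately extreme input the second score is lost entirely in the `float` accumulator; real scores are much smaller, so the effect there is a rounding drift rather than a dropped term.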





Re: [I] MultiSimilarity.MultiSimScorer should sum up scores into a double [lucene]

2023-11-24 Thread via GitHub


jpountz commented on issue #12675:
URL: https://github.com/apache/lucene/issues/12675#issuecomment-1825930715

   @shubhamvishu Yes indeed!





Re: [PR] Faster prefix sum for bitsPerValue up to 9. [lucene]

2023-11-24 Thread via GitHub


jpountz commented on PR #12843:
URL: https://github.com/apache/lucene/pull/12843#issuecomment-1826052610

   Actually we can do even better by better tuning the disk layout for the 
prefix sum. Converting this PR to a draft until this is implemented.





Re: [I] Grow arrays up to a given limit to avoid overallocation where possible [lucene]

2023-11-24 Thread via GitHub


stefanvodita commented on issue #12839:
URL: https://github.com/apache/lucene/issues/12839#issuecomment-1826057193

   Thank you for the pointer @jpountz! I'll put together a PR.





Re: [PR] Improve set deletions percentage javadoc [lucene]

2023-11-24 Thread via GitHub


yugushihuang commented on code in PR #12828:
URL: https://github.com/apache/lucene/pull/12828#discussion_r1404662302


##
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##
@@ -150,9 +150,10 @@ public double getMaxMergedSegmentMB() {
   }
 
   /**
-   * Controls the maximum percentage of deleted documents that is tolerated in the index. Lower
-   * values make the index more space efficient at the expense of increased CPU and I/O activity.
-   * Values must be between 5 and 50. Default value is 20.
+   * Sets the maximum percentage of deleted documents that is tolerated in the index. The

Review Comment:
   Thanks for the review, I will modify the wording.






[PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-24 Thread via GitHub


stefanvodita opened a new pull request, #12844:
URL: https://github.com/apache/lucene/pull/12844

   In cases where we know there is an upper limit to the potential size of an 
array, we can use `growInRange` to avoid allocating beyond that limit.
   
   We address such cases in `DirectoryTaxonomyReader` and `NeighborArray`.
   
   Closes #12839 
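The idea can be sketched as follows. This is a minimal illustration, assuming a simple "grow by about one eighth, then clamp" policy; the helper name `growBounded` and the growth factor are assumptions made for this example and do not describe the actual `growInRange` implementation in the PR.

```java
import java.util.Arrays;

final class BoundedGrowSketch {
  /**
   * Hypothetical helper: returns an array of at least minLength elements, over-allocating a
   * little to amortize future growth but never allocating more than maxLength elements.
   */
  static int[] growBounded(int[] array, int minLength, int maxLength) {
    if (minLength > maxLength) {
      throw new IllegalArgumentException(
          "minLength " + minLength + " exceeds maxLength " + maxLength);
    }
    if (array.length >= minLength) {
      return array; // already large enough
    }
    // Over-allocate (growth policy is an assumption), then clamp to the known upper limit.
    long oversized = (long) minLength + (minLength >> 3) + 3;
    int newLength = (int) Math.min(oversized, maxLength);
    return Arrays.copyOf(array, newLength);
  }

  public static void main(String[] args) {
    int[] neighbors = new int[4];
    // Without a limit the array would grow to 14 here; the limit of 12 caps it.
    System.out.println(growBounded(neighbors, 10, 100).length);
    System.out.println(growBounded(neighbors, 10, 12).length);
  }
}
```

The clamp is what avoids over-allocation: once the array reaches the known maximum, later calls can never push it past that size.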
   





Re: [I] Grow arrays up to a given limit to avoid overallocation where possible [lucene]

2023-11-24 Thread via GitHub


stefanvodita commented on issue #12839:
URL: https://github.com/apache/lucene/issues/12839#issuecomment-1826125298

   I added the new method and used it for `DirectoryTaxonomyReader` and 
`NeighborArray` (#12844). There might be other places where it makes sense to 
use, but I thought it best to get some feedback before going and hunting down 
more of those cases.





Re: [PR] Use group-varint encode the positions [lucene]

2023-11-24 Thread via GitHub


easyice commented on PR #12842:
URL: https://github.com/apache/lucene/pull/12842#issuecomment-1826180124

   Thanks for your suggestion. I'm thinking about that too; I will continue working on this.
   

