Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]

2023-12-05 Thread via GitHub


shubhamvishu commented on code in PR #12868:
URL: https://github.com/apache/lucene/pull/12868#discussion_r1414510733


##
lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java:
##
@@ -150,9 +150,10 @@ private FuzzySet(FixedBitSet filter, int bloomSize, int hashCount) {
* @return NO or MAYBE
*/
   public ContainsResult contains(BytesRef value) {
-long hash = hashFunction.hash(value);
-int msb = (int) (hash >>> Integer.SIZE);
-int lsb = (int) hash;
+long[] hash = hashFunction.hash128(value);
+
+int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1;
+int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1;

Review Comment:
   I can try to run benchmarks with that and see how it turns out, but 
with the current logic (below) we see 7-9% better performance consistently (see 
the benchmark results above). Could that be misleading by any chance, because this 
is more likely to produce a 0 value? Or is it worth sticking with the current logic, 
since the benchmarks tell a different story?
   
   ```java
   int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1;
   int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1;
   ```
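
   As an aside, a minimal, self-contained check of how the expression above parses 
(the `Demo` class name is just for illustration): in Java, `+` binds more tightly 
than `>>>`, and int shift counts are taken mod 32, so `(int) hash[0] >>> Integer.SIZE` 
is a no-op shift.
   
   ```java
   public class Demo {
     public static void main(String[] args) {
       long[] hash = {0x123456789ABCDEF0L, 0x0FEDCBA987654321L};
       // the expression from the diff above...
       int a = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1;
       // ...parses as this, because '+' has higher precedence than '>>>' and
       // shifting an int by Integer.SIZE (32) leaves it unchanged:
       int b = (int) hash[0] >>> (1 + (int) hash[1]) >>> 1;
       System.out.println(a == b); // prints: true
     }
   }
   ```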






Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on PR #12831:
URL: https://github.com/apache/lucene/pull/12831#issuecomment-1840269923

   I'll put the CHANGES.txt entry for #12624 together with this one





Re: [I] Add a bulk scorer for disjunctions that does dynamic pruning [LUCENE-9335] [lucene]

2023-12-05 Thread via GitHub


jpountz closed issue #10375: Add a bulk scorer for disjunctions that does 
dynamic pruning [LUCENE-9335]
URL: https://github.com/apache/lucene/issues/10375





Re: [I] Add a bulk scorer for disjunctions that does dynamic pruning [LUCENE-9335] [lucene]

2023-12-05 Thread via GitHub


jpountz commented on issue #10375:
URL: https://github.com/apache/lucene/issues/10375#issuecomment-1840276633

   Implemented.





[I] Add a MergePolicy wrapper that preserves search concurrency? [lucene]

2023-12-05 Thread via GitHub


jpountz opened a new issue, #12877:
URL: https://github.com/apache/lucene/issues/12877

   ### Description
   
   We have an issue about decoupling search concurrency from index geometry 
(#9721), but this comes with trade-offs as the per-segment bit of search is 
hard to parallelize. Maybe we should also introduce a merge policy wrapper that 
tries to preserve a search concurrency of `N` by preventing the creation of 
segments of more than `maxDoc/N` docs?
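
   A rough sketch of what such a wrapper could look like (hypothetical class and 
names, simplified to only filter `findMerges` and to ignore deletions), to make 
the idea concrete:
   
   ```java
   import java.io.IOException;
   import org.apache.lucene.index.FilterMergePolicy;
   import org.apache.lucene.index.MergePolicy;
   import org.apache.lucene.index.MergeTrigger;
   import org.apache.lucene.index.SegmentCommitInfo;
   import org.apache.lucene.index.SegmentInfos;
   
   class PreserveConcurrencyMergePolicy extends FilterMergePolicy {
     private final int n; // desired search concurrency
   
     PreserveConcurrencyMergePolicy(MergePolicy in, int n) {
       super(in);
       this.n = n;
     }
   
     @Override
     public MergeSpecification findMerges(
         MergeTrigger trigger, SegmentInfos infos, MergeContext context) throws IOException {
       MergeSpecification spec = super.findMerges(trigger, infos, context);
       if (spec == null) {
         return null;
       }
       long totalMaxDoc = 0;
       for (SegmentCommitInfo sci : infos) {
         totalMaxDoc += sci.info.maxDoc();
       }
       long budget = totalMaxDoc / n; // largest segment we are willing to create
       MergeSpecification filtered = new MergeSpecification();
       for (OneMerge merge : spec.merges) {
         long mergedDocs = 0;
         for (SegmentCommitInfo sci : merge.segments) {
           mergedDocs += sci.info.maxDoc();
         }
         if (mergedDocs <= budget) { // drop merges that would exceed maxDoc/N
           filtered.add(merge);
         }
       }
       return filtered.merges.isEmpty() ? null : filtered;
     }
   }
   ```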





Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on code in PR #12844:
URL: https://github.com/apache/lucene/pull/12844#discussion_r1415164737


##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java:
##
@@ -351,7 +351,8 @@ public int[] getBulkOrdinals(FacetLabel... categoryPaths) throws IOException {
  }
    } else {
  indexesMissingFromCache =
-ArrayUtil.grow(indexesMissingFromCache, numberOfMissingFromCache + 1);
+ArrayUtil.growInRange(

Review Comment:
   I think we could add a comment here saying that indexesMissingFromCache 
cannot grow beyond categoryPaths.
   
   I was also wondering whether an assert is needed here, but the check in 
growInRange should be enough.
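
   For reference, a sketch of the clamped-growth semantics discussed here (the 
signature and the use of `ArrayUtil.oversize`/`growExact` are assumptions for 
illustration; see the PR for the actual implementation):
   
   ```java
   // Grow with the usual oversizing, but never beyond maxLength, since the
   // array can never legitimately need more than maxLength entries.
   static int[] growInRange(int[] array, int minLength, int maxLength) {
     assert minLength >= 0 && minLength <= maxLength;
     if (array.length >= minLength) {
       return array; // already large enough
     }
     int newLength = Math.min(ArrayUtil.oversize(minLength, Integer.BYTES), maxLength);
     return ArrayUtil.growExact(array, newLength);
   }
   ```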






Re: [PR] Add support for index sorting with document blocks [lucene]

2023-12-05 Thread via GitHub


s1monw commented on code in PR #12829:
URL: https://github.com/apache/lucene/pull/12829#discussion_r1415196168


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:
##
@@ -262,6 +277,73 @@ long updateDocuments(
 }
   }
 
+  private interface DocValidator extends Consumer {

Review Comment:
   I removed it and went down a different route






Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-05 Thread via GitHub


jpountz commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1415213374


##
lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java:
##
@@ -212,6 +213,46 @@ public long readLong() throws IOException {
 }
   }
 
+  @Override
+  public void readGroupVInts(long[] dst, int limit) throws IOException {
+int i;
+for (i = 0; i <= limit - 4; i += 4) {
+  readGroupVInt(dst, i);
+}
+for (; i < limit; ++i) {
+  dst[i] = readVInt();
+}
+  }
+
+  private void readGroupVInt(long[] dst, int offset) throws IOException {
+ByteBuffer block = blocks[blockIndex(pos)];
+int blockOffset = blockOffset(pos);
+if (block.limit() - blockOffset < GroupVIntUtil.MAX_LENGTH_PER_GROUP) {
+  GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset);
+  return;
+}
+
+final int flag = readByte() & 0xFF;
+
+final int n1Minus1 = flag >> 6;
+final int n2Minus1 = (flag >> 4) & 0x03;
+final int n3Minus1 = (flag >> 2) & 0x03;
+final int n4Minus1 = flag & 0x03;
+
+blockOffset = blockOffset(pos);

Review Comment:
   We already computed `blockOffset()` above, can we avoid computing it twice, 
e.g. by replacing the `readByte()` call with `block.get(blockOffset++)`?



##
lucene/core/src/java/org/apache/lucene/store/DataOutput.java:
##
@@ -29,6 +30,7 @@
  * internal state like file position).
  */
 public abstract class DataOutput {
+  private BytesRefBuilder groupVIntBytes = new BytesRefBuilder();

Review Comment:
   Can it be made final?



##
lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseDirectoryTestCase.java:
##
@@ -1438,4 +1440,68 @@ public void testListAllIsSorted() throws IOException {
   assertArrayEquals(expected, actual);
 }
   }
+
+  public void testDataTypes() throws IOException {
+final long[] values = new long[] {43, 12345, 123456, 1234567890};
+try (Directory dir = getDirectory(createTempDir("testDataTypes"))) {
+  IndexOutput out = dir.createOutput("test", IOContext.DEFAULT);
+  out.writeByte((byte) 43);
+  out.writeShort((short) 12345);
+  out.writeInt(1234567890);
+  out.writeGroupVInts(values, 4);
+  out.writeLong(1234567890123456789L);
+  out.close();
+
+  long[] restored = new long[4];
+  IndexInput in = dir.openInput("test", IOContext.DEFAULT);
+  assertEquals(43, in.readByte());
+  assertEquals(12345, in.readShort());
+  assertEquals(1234567890, in.readInt());
+  in.readGroupVInts(restored, 4);
+  assertArrayEquals(values, restored);
+  assertEquals(1234567890123456789L, in.readLong());
+  in.close();
+}
+  }
+
+  public void testGroupVInt() throws IOException {
+try (Directory dir = getDirectory(createTempDir("testGroupVInt"))) {
+  // test fallbackReadGroupVInt
+  doTestGroupVInt(dir, 5, 1, 6, 8);
+
+  // test large data to covers multiple blocks in ByteBuffersDataInput

Review Comment:
   adding it explicitly since this is probably the most important case to cover
   
   ```suggestion
 // test large data to covers multiple blocks in ByteBuffersDataInput 
and MMapDirectory via TestMultiMMap
   ```



##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -303,6 +304,48 @@ public byte readByte(long pos) throws IOException {
 }
   }
 
+  @Override
+  public void readGroupVInts(long[] dst, int limit) throws IOException {
+int i;
+for (i = 0; i <= limit - 4; i += 4) {
+  readGroupVInt(dst, i);
+}
+for (; i < limit; ++i) {
+  dst[i] = readVInt();
+}
+  }
+
+  private void readGroupVInt(long[] dst, int offset) throws IOException {
+if (curSegment.byteSize() - curPosition < GroupVIntUtil.MAX_LENGTH_PER_GROUP) {
+  GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset);
+  return;
+}
+
+try {
+  final int flag = readByte() & 0xFF;

Review Comment:
   Since we already did the job of checking that we have enough bytes left, we 
don't need `readByte()` to do it for us?
   
   ```suggestion
 final int flag = curSegment.get(LAYOUT_BYTE, curPosition++) & 0xFF;
   ```



##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -303,6 +304,48 @@ public byte readByte(long pos) throws IOException {
 }
   }
 
+  @Override
+  public void readGroupVInts(long[] dst, int limit) throws IOException {
+int i;
+for (i = 0; i <= limit - 4; i += 4) {
+  readGroupVInt(dst, i);
+}
+for (; i < limit; ++i) {
+  dst[i] = readVInt();
+}
+  }
+
+  private void readGroupVInt(long[] dst, int offset) throws IOException {
+if (curSegment.byteSize() - curPosition < GroupVIntUtil.MAX_LENGTH_PER_GROUP) {
+  GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset);
+  return;
+}
+

Review Comment:
   Maybe copy `curSegment` to a local variable here, e.g. `MemorySegm

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-05 Thread via GitHub


jpountz commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1415234949


##
lucene/core/src/java/org/apache/lucene/store/DataOutput.java:
##
@@ -324,4 +326,42 @@ public void writeSetOfStrings(Set<String> set) throws IOException {
   writeString(value);
 }
   }
+
+  /**
+   * Encode integers using group-varint. It uses {@link DataOutput#writeVInt 
VInt} to encode tail
+   * values that are not enough for a group. we need a long[] because this is 
what postings are
+   * using, all longs are actually required to be integers.
+   *
+   * @param values the values to write
+   * @param limit the number of values to write.
+   */
+  public void writeGroupVInts(long[] values, int limit) throws IOException {

Review Comment:
   I'm tempted to mark it `@lucene.experimental` to allow us to change the 
signature to use ints in the future. And likewise for its counterpart on 
`DataInput`?
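
   As background, a compact illustration of the group-varint idea discussed in this 
thread (a sketch of the general technique, not necessarily Lucene's exact wire 
format or byte order): one flag byte holds four 2-bit `(length - 1)` fields, 
followed by 1..4 bytes per value.
   
   ```java
   static int encodeGroup(int[] values, byte[] out) {
     int flag = 0;
     int pos = 1; // out[0] is reserved for the flag byte
     for (int i = 0; i < 4; i++) {
       int v = values[i];
       int len = 1;
       if ((v & 0xFFFFFF00) != 0) len = 2;
       if ((v & 0xFFFF0000) != 0) len = 3;
       if ((v & 0xFF000000) != 0) len = 4;
       flag |= (len - 1) << (6 - 2 * i); // first value uses the two high bits
       for (int b = 0; b < len; b++) {
         out[pos++] = (byte) (v >>> (8 * b)); // low-order byte first (illustrative)
       }
     }
     out[0] = (byte) flag;
     return pos; // total bytes written for this group
   }
   ```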






Re: [PR] Rewrite JavaScriptCompiler to use modern JVM features (Java 17) [lucene]

2023-12-05 Thread via GitHub


uschindler merged PR #12873:
URL: https://github.com/apache/lucene/pull/12873





Re: [I] Maybe expression compiler should cache recently compiled expressions? [LUCENE-7882] [lucene]

2023-12-05 Thread via GitHub


uschindler closed issue #8933: Maybe expression compiler should cache recently 
compiled expressions? [LUCENE-7882]
URL: https://github.com/apache/lucene/issues/8933





Re: [I] Maybe expression compiler should cache recently compiled expressions? [LUCENE-7882] [lucene]

2023-12-05 Thread via GitHub


uschindler commented on issue #8933:
URL: https://github.com/apache/lucene/issues/8933#issuecomment-1840518911

   The new code also fixes the problem with hidden stack frames caused by the 
hidden-class feature.
   
   I am still hoping to have an option in the future to turn on some stack frames, 
as is possible inside the JDK with annotations.





Re: [PR] LUCENE-10641: IndexSearcher#setTimeout should also abort query rewrites, point ranges and vector searches [lucene]

2023-12-05 Thread via GitHub


mikemccand commented on code in PR #12345:
URL: https://github.com/apache/lucene/pull/12345#discussion_r1415450166


##
lucene/core/src/java/org/apache/lucene/index/ExitableIndexReader.java:
##
@@ -0,0 +1,436 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import org.apache.lucene.util.Bits;
+
+/**
+ * The {@link ExitableIndexReader} is used to timeout I/O operation which is 
done during query
+ * rewrite. After this time is exceeded, the search thread is stopped by 
throwing a {@link
+ * ExitableIndexReader.TimeExceededException}
+ */
+public final class ExitableIndexReader extends IndexReader {
+  private final IndexReader indexReader;
+  private final QueryTimeout queryTimeout;
+
+  /**
+   * Create a ExitableIndexReader wrapper over another {@link IndexReader} 
with a specified timeout.
+   *
+   * @param indexReader the wrapped {@link IndexReader}
+   * @param queryTimeout max time allowed for collecting hits after which 
{@link
+   * ExitableIndexReader.TimeExceededException} is thrown
+   */
+  public ExitableIndexReader(IndexReader indexReader, QueryTimeout queryTimeout) {
+this.indexReader = indexReader;
+this.queryTimeout = queryTimeout;
+doWrapIndexReader(indexReader, queryTimeout);
+  }
+
+  /** Thrown when elapsed search time exceeds allowed search time. */
+  @SuppressWarnings("serial")
+  static class TimeExceededException extends RuntimeException {
+private TimeExceededException() {
+  super("TimeLimit Exceeded");
+}
+
+private TimeExceededException(Exception e) {
+  super(e);
+}
+  }
+
+  @Override
+  public TermVectors termVectors() throws IOException {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.termVectors();
+  }
+
+  @Override
+  public int numDocs() {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.numDocs();
+  }
+
+  @Override
+  public int maxDoc() {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.maxDoc();
+  }
+
+  @Override
+  public StoredFields storedFields() throws IOException {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.storedFields();
+  }
+
+  @Override
+  protected void doClose() throws IOException {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+indexReader.doClose();
+  }
+
+  @Override
+  public IndexReaderContext getContext() {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.getContext();
+  }
+
+  @Override
+  public CacheHelper getReaderCacheHelper() {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.getReaderCacheHelper();
+  }
+
+  @Override
+  public int docFreq(Term term) throws IOException {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.docFreq(term);
+  }
+
+  @Override
+  public long totalTermFreq(Term term) throws IOException {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.totalTermFreq(term);
+  }
+
+  @Override
+  public long getSumDocFreq(String field) throws IOException {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.getSumDocFreq(field);
+  }
+
+  @Override
+  public int getDocCount(String field) throws IOException {
+if (queryTimeout.shouldExit()) {
+  throw new ExitableIndexReader.TimeExceededException();
+}
+return indexReader.getDocCount(field);
+  }
+
+  @Override
+  public long getSumTotalTermFreq(String field) throws IOException {
+

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


mikemccand commented on PR #12624:
URL: https://github.com/apache/lucene/pull/12624#issuecomment-1840646081

   > > I tested just how much slower the ByteBuffer based store is than the 
FST's BytesStore:
   > 
   > I assume this is before the last iteration that does the freeze, is that 
right? What do you think about the last results?
   
   Sorry yes -- I'll try to come back to this soon, and re-test.  Thanks 
@dungba88 





Re: [I] Do we suboptimally call ByteBuffersDataOutput.toDataInput()? [lucene]

2023-12-05 Thread via GitHub


mikemccand commented on issue #12852:
URL: https://github.com/apache/lucene/issues/12852#issuecomment-1840648672

   At least we should add a warning to the javadocs here.
   
   And we should audit other places that are frequently calling 
`.toDataInput()` (e.g. block tree terms dict writing?).
   
   This method is surprisingly slow compared to simply juggling our own 
`byte[]` under the hood, at least in FST's (well, `Test2BFSTs`) usage.





Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]

2023-12-05 Thread via GitHub


shubhamvishu commented on code in PR #12868:
URL: https://github.com/apache/lucene/pull/12868#discussion_r1415528939


##
lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java:
##
@@ -150,9 +150,10 @@ private FuzzySet(FixedBitSet filter, int bloomSize, int hashCount) {
* @return NO or MAYBE
*/
   public ContainsResult contains(BytesRef value) {
-long hash = hashFunction.hash(value);
-int msb = (int) (hash >>> Integer.SIZE);
-int lsb = (int) hash;
+long[] hash = hashFunction.hash128(value);
+
+int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1;
+int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1;

Review Comment:
   Ok, I tried using the simple `(hash[0] >>> Integer.SIZE)` and `(int) hash[0]`, but 
I still see the same 2.6% regression. I think the current way seems to produce a 
better positive hash value and is more random than directly using the two long 
values, which is why we are seeing `PKLookup` performance improve by 
7-9%.
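
   For reference, the simpler combination mentioned above looks roughly like this 
(illustrative snippet reusing the names from the diff):
   
   ```java
   long[] hash = hashFunction.hash128(value);
   int msb = (int) (hash[0] >>> Integer.SIZE); // high 32 bits of the first half
   int lsb = (int) hash[0];                    // low 32 bits of the first half
   ```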






Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


mikemccand commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1415505550


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -153,6 +180,40 @@ private FSTCompiler(
 }
   }
 
+  // Get the respective FSTReader of the DataOutput. If the DataOutput is also 
a FSTReader then we
+  // will use it, otherwise we will return a NullFSTReader. Attempting to read 
from a FST with
+  // NullFSTReader
+  // will throw UnsupportedOperationException
+  private FSTReader toFSTReader(DataOutput dataOutput) {
+if (dataOutput instanceof FSTReader) {
+  return (FSTReader) dataOutput;
+}
+return NULL_FST_READER;
+  }
+
+  /**
+   * This class is used for FST backed by non-FSTReader DataOutput. It does 
not allow getting the
+   * reverse BytesReader nor writing to a DataOutput.
+   */
+  private static final class NullFSTReader implements FSTReader {
+
+@Override
+public long ramBytesUsed() {
+  return 0;
+}
+
+@Override
+public FST.BytesReader getReverseBytesReader() {
+  throw new UnsupportedOperationException(
+  "NullFSTReader does not support getReverseBytesReader()");
+}
+
+@Override
+public void writeTo(DataOutput out) {
+  throw new UnsupportedOperationException("NullFSTReader does not support writeTo(DataOutput)");

Review Comment:
   Could we improve this message, e.g. something like `FST was not constructed 
with getOnHeapReaderWriter` or so?  `NullFSTReader` isn't something the user 
should even know about (it's a private class implementation detail).



##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -218,13 +279,19 @@ public Builder allowFixedLengthArcs(boolean allowFixedLengthArcs) {
 }
 
 /**
- * How many bits wide to make each byte[] block in the BytesStore; if you know the FST will be
- * large then make this larger. For example 15 bits = 32768 byte pages.
+ * Set the {@link DataOutput} which is used for low-level writing of FST. If you want the FST to
+ * be immediately readable, you need to use a DataOutput that also implements {@link FSTReader},

Review Comment:
   Does `FSTReader` need to be public and known to users?  Could we make it 
package private and change this to say "you need to use 
`FSTCompiler#getOnHeapReaderWriter`"?



##
lucene/test-framework/src/java/org/apache/lucene/tests/util/fst/FSTTester.java:
##
@@ -316,6 +313,15 @@ public FST doTest() throws IOException {
 return fst;
   }
 
+  protected FST compile(FSTCompiler fstCompiler) throws IOException {
+return fstCompiler.compile();
+  }
+
+  protected FSTCompiler.Builder getFSTBuilder() {
+return new FSTCompiler.Builder<>(

Review Comment:
   Could we randomize whether the "on disk" vs "on heap" `DataOutput` is used, 
so that `TestFSTs` randomly picks? We should randomize all three possibilities:
 * Create & use on-heap FST
 * Create on heap, save to disk, load FST from disk
 * Stream to disk, then load FST from disk



##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -165,7 +226,7 @@ public static class Builder {
 private final Outputs outputs;
 private double suffixRAMLimitMB = 32.0;
 private boolean allowFixedLengthArcs = true;
-private int bytesPageBits = 15;
+private DataOutput dataOutput = getOnHeapReaderWriter(15);

Review Comment:
   Hmm can we avoid even creating this, unless the caller didn't pass a 
`Builder.dataOutput` themselves?



##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -120,22 +125,44 @@ public class FSTCompiler {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // the DataOutput to stream the FST bytes to
+  final DataOutput dataOutput;
+
+  // buffer to store bytes for the one node we are currently writing
+  final GrowableByteArrayDataOutput scratchBytes = new GrowableByteArrayDataOutput();
+
+  private long numBytesWritten;
+
+  /**
+   * Get an on-heap DataOutput that allows the FST to be read immediately 
after writing.

Review Comment:
   Add `, and also optionally saved to an external DataOutput`?
   
   I.e. with these changes we support these possible FST workflows:
 * Build FST and use it immediately entirely in RAM and then discard it
 * Build FST and use it immediately entirely in RAM and also save it to 
disk (any other `DataOutput`), and load it later and use it
 * Build FST but stream it immediately to disk (except the `FSTMetaData`, 
saved at the end) and you cannot use it when done unless you go and open your 
own `DataInput` on the backing store and make a new FST from that
   
   Could we include this enumeration in `FSTCompiler`'s class javadocs?
   
   Also: could you update `Test2BFSTs` to also test the 3rd bullet above?  
Right now it onl

Re: [PR] LUCENE-10236: Update field-weight used in CombinedFieldQuery scoring calculation (9.0.1 Backporting) [lucene]

2023-12-05 Thread via GitHub


mikemccand commented on PR #587:
URL: https://github.com/apache/lucene/pull/587#issuecomment-1840761468

   @zacharymorn given that it's been so long, and we are unlikely to release 
another 9.0.x bugfix release, I think we should close this.  The fix is present 
in Lucene 9.1.0 and above.





Re: [PR] LUCENE-10236: Update field-weight used in CombinedFieldQuery scoring calculation (9.0.1 Backporting) [lucene]

2023-12-05 Thread via GitHub


mikemccand closed pull request #587: LUCENE-10236: Update field-weight used in 
CombinedFieldQuery scoring calculation (9.0.1 Backporting)
URL: https://github.com/apache/lucene/pull/587





Re: [PR] Add support for index sorting with document blocks [lucene]

2023-12-05 Thread via GitHub


mikemccand commented on PR #12829:
URL: https://github.com/apache/lucene/pull/12829#issuecomment-1840782006

   Thanks @s1monw -- I'll try to review soon.
   
   > The only thing I am torn on is: if we set the number of children as a value 
for the DV field, then I guess we should have a good use case for that?
   
   One thing this would be helpful for is a stronger consistency check in 
`CheckIndex`?  Without it we can only validate that the parent docs are 
sorted correctly, whereas today we verify that all docs are sorted correctly 
(since the user must provide a sort congruent with their doc blocks).  So we are 
weakening `CheckIndex` a bit by skipping the sort check for the child docs 
... if we stored the number of children in the new DV field, we could gain that 
back?
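
   A sketch of that idea (hypothetical field name and shape, not code from the PR): 
if each parent doc's DV value recorded its number of children, a checker could 
recover block boundaries and re-verify the sort inside each block as well.
   
   ```java
   static void checkBlocks(LeafReader reader, String parentField) throws IOException {
     NumericDocValues parents = reader.getNumericDocValues(parentField); // assumes the field exists
     int blockStart = 0;
     for (int parentDoc = parents.nextDoc();
         parentDoc != DocIdSetIterator.NO_MORE_DOCS;
         parentDoc = parents.nextDoc()) {
       long numChildren = parents.longValue();
       if (parentDoc - blockStart != numChildren) {
         throw new IllegalStateException("block length mismatch at doc " + parentDoc);
       }
       // ... re-verify the index sort for the child docs in [blockStart, parentDoc) here ...
       blockStart = parentDoc + 1;
     }
   }
   ```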





Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]

2023-12-05 Thread via GitHub


jpountz commented on code in PR #12868:
URL: https://github.com/apache/lucene/pull/12868#discussion_r1415603528


##
lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java:
##
@@ -150,9 +150,10 @@ private FuzzySet(FixedBitSet filter, int bloomSize, int hashCount) {
* @return NO or MAYBE
*/
   public ContainsResult contains(BytesRef value) {
-long hash = hashFunction.hash(value);
-int msb = (int) (hash >>> Integer.SIZE);
-int lsb = (int) hash;
+long[] hash = hashFunction.hash128(value);
+
+int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1;
+int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1;

Review Comment:
   If what you are saying is true, this would be a bug in murmur3. I'm assuming 
that murmur3 doesn't have such a bug, so either there is actually a bug in our 
murmur3 implementation, or this regression is "bad luck"?






Re: [I] Do we suboptimally call ByteBuffersDataOutput.toDataInput()? [lucene]

2023-12-05 Thread via GitHub


dweiss commented on issue #12852:
URL: https://github.com/apache/lucene/issues/12852#issuecomment-1840786092

   I don't think it was ever meant to be called repeatedly in fast bursts. It 
was meant to provide a direct reader for previously written buffers, once the 
writing is completed.





Re: [I] Do we suboptimally call ByteBuffersDataOutput.toDataInput()? [lucene]

2023-12-05 Thread via GitHub


dweiss commented on issue #12852:
URL: https://github.com/apache/lucene/issues/12852#issuecomment-1840793857

   Also, not sure why it's slow - perhaps because of the wrapping in read-only 
buffers when toBufferList is called? If so, then this could be internally 
tweaked by calling toWriteableBufferList inside toDataInput (since we know data 
input is not going to be messing with the buffers...).
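
   A sketch of that tweak (assumed shape, not a committed change): build the 
`ByteBuffersDataInput` from writeable buffers so `toDataInput()` skips the 
per-buffer read-only copies, since the DataInput never mutates the buffers it 
is given.
   
   ```java
   // hypothetical replacement body for ByteBuffersDataOutput#toDataInput()
   public ByteBuffersDataInput toDataInput() {
     return new ByteBuffersDataInput(toWriteableBufferList());
   }
   ```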





Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1415625009


##
lucene/core/src/test/org/apache/lucene/util/fst/TestFSTDataOutputWriter.java:
##
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import static org.apache.lucene.tests.util.fst.FSTTester.toIntsRef;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Random;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.InputStreamDataInput;
+import org.apache.lucene.store.OutputStreamDataOutput;
+import org.apache.lucene.tests.store.MockDirectoryWrapper;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.TestUtil;
+import org.apache.lucene.tests.util.fst.FSTTester;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRef;
+
+public class TestFSTDataOutputWriter extends LuceneTestCase {
+
+  private MockDirectoryWrapper dir;
+
+  @Override
+  public void setUp() throws Exception {
+super.setUp();
+dir = newMockDirectory();
+  }
+
+  @Override
+  public void tearDown() throws Exception {
+// can be null if we force simpletext (funky, some kind of bug in test 
runner maybe)
+if (dir != null) {
+  dir.close();
+}
+super.tearDown();
+  }
+
+  public void testRandom() throws Exception {
+
+final int iters = atLeast(10);
+final int maxBytes = TEST_NIGHTLY ? 20 : 2;
+for (int iter = 0; iter < iters; iter++) {
+  final int numBytes = TestUtil.nextInt(random(), 1, maxBytes);
+  final byte[] expected = new byte[numBytes];
+  final ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  final DataOutput dataOutput = new OutputStreamDataOutput(baos);

Review Comment:
   Ah I think it was left over code from previous implementation. Will remove 
it.






Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1415627354


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import org.apache.lucene.store.ByteBuffersDataInput;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataOutput;
+
+/**
+ * An adapter class to use {@link ByteBuffersDataOutput} as a {@link FSTReader}. It allows the FST
+ * to be readable immediately after writing
+ */
+final class ReadWriteDataOutput extends DataOutput implements FSTReader, Freezable {
+
+  private final ByteBuffersDataOutput dataOutput;
+  // the DataInput to read from in case the DataOutput has multiple blocks
+  private ByteBuffersDataInput dataInput;
+  // the ByteBuffers to read from in case the DataOutput has a single block
+  private ByteBuffer byteBuffer;
+
+  public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) {
+this.dataOutput = dataOutput;
+  }
+
+  @Override
+  public void writeByte(byte b) {
+dataOutput.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) {
+dataOutput.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public long ramBytesUsed() {
+return dataOutput.ramBytesUsed();
+  }
+
+  @Override
+  public void freeze() {
+// these operations are costly, so we want to compute it once and cache
+List<ByteBuffer> byteBuffers = dataOutput.toWriteableBufferList();
+if (byteBuffers.size() == 1) {

Review Comment:
   I think it didn't handle the single buffer use case, and 
ByteBuffersDataInput would fall into the same performance regression problem as 
BytesStore with multiple blocks.






Re: [PR] Add support for index sorting with document blocks [lucene]

2023-12-05 Thread via GitHub


mikemccand commented on code in PR #12829:
URL: https://github.com/apache/lucene/pull/12829#discussion_r1415586966


##
lucene/core/src/test/org/apache/lucene/index/TestAddIndexes.java:
##
@@ -1678,6 +1678,51 @@ public void testIllegalIndexSortChange2() throws 
Exception {
 IOUtils.close(r1, dir1, w2, dir2);
   }
 
+  public void testIllegalIndexSortChange3() throws Exception {
+Directory dir1 = newDirectory();
+IndexWriterConfig iwc1 = newIndexWriterConfig(new MockAnalyzer(random()));
+iwc1.setIndexSort(new Sort("foobar", new SortField("foo", SortField.Type.INT)));
+
+RandomIndexWriter w1 = new RandomIndexWriter(random(), dir1, iwc1);
+Document parent = new Document();
+w1.addDocuments(Arrays.asList(new Document(), new Document(), parent));
+w1.commit();
+w1.addDocuments(Arrays.asList(new Document(), new Document(), parent));
+w1.commit();
+// so the index sort is in fact burned into the index:
+w1.forceMerge(1);
+w1.close();
+
+Directory dir2 = newDirectory();
+IndexWriterConfig iwc2 = newIndexWriterConfig(new MockAnalyzer(random()));
+iwc2.setIndexSort(new Sort(new SortField("foo", SortField.Type.INT)));
+RandomIndexWriter w2 = new RandomIndexWriter(random(), dir2, iwc2);
+
+IndexReader r1 = DirectoryReader.open(dir1);
+String message =
+expectThrows(
+IllegalArgumentException.class,
+() -> {
+  w2.addIndexes((SegmentReader) getOnlyLeafReader(r1));
+})
+.getMessage();
+assertEquals(
+"cannot change index sort from parent field: foobar  to 
",
+message);
+
+message =
+expectThrows(
+IllegalArgumentException.class,
+() -> {
+  w2.addIndexes(dir1);

Review Comment:
   Can we also test that if you `.addIndexes` with the correct configuration 
(same parent sort field) things are good?



##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -1198,33 +1198,42 @@ public static Status.IndexSortStatus testSort(
 comparators[i] = fields[i].getComparator(1, 
Pruning.NONE).getLeafComparator(readerContext);
   }
 
-  int maxDoc = reader.maxDoc();
-
   try {
-
-for (int docID = 1; docID < maxDoc; docID++) {
-
+LeafMetaData metaData = reader.getMetaData();
+if (metaData.hasBlocks()
+&& sort.getParentField() == null
+&& metaData.getCreatedVersionMajor() >= Version.LUCENE_10_0_0.major) {
+  throw new IllegalStateException(
+  "parent field is not set but the index has block and was created with version: "

Review Comment:
   `has block` -> `has document blocks`?



##
lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java:
##
@@ -2164,6 +2166,83 @@ public void testSortedIndex() throws Exception {
 }
   }
 
+  public void testSortedIndexAddDocBlocks() throws Exception {
+for (String name : oldSortedNames) {
+  Path path = createTempDir("sorted");
+  InputStream resource = 
TestBackwardsCompatibility.class.getResourceAsStream(name + ".zip");
+  assertNotNull("Sorted index index " + name + " not found", resource);
+  TestUtil.unzip(resource, path);
+
+  try (Directory dir = newFSDirectory(path)) {
+final Sort sort;
+try (DirectoryReader reader = DirectoryReader.open(dir)) {
+  assertEquals(1, reader.leaves().size());
+  sort = reader.leaves().get(0).reader().getMetaData().getSort();
+  assertNotNull(sort);
+  searchExampleIndex(reader);
+}
+// open writer
+try (IndexWriter writer =
+new IndexWriter(
+dir,
+newIndexWriterConfig(new MockAnalyzer(random()))
+.setOpenMode(OpenMode.APPEND)
+.setIndexSort(sort)
+.setMergePolicy(newLogMergePolicy( {
+  // add 10 docs
+  for (int i = 0; i < 10; i++) {
+Document child = new Document();
+child.add(new StringField("relation", "child", Field.Store.NO));
+child.add(new StringField("bid", "" + i, Field.Store.NO));
+child.add(new NumericDocValuesField("dateDV", i));
+Document parent = new Document();
+parent.add(new StringField("relation", "parent", Field.Store.NO));
+parent.add(new StringField("bid", "" + i, Field.Store.NO));
+parent.add(new NumericDocValuesField("dateDV", i));
+writer.addDocuments(Arrays.asList(child, child, parent));
+if (random().nextBoolean()) {
+  writer.flush();
+}
+  }
+  if (random().nextBoolean()) {
+writer.forceMerge(1);
+  }
+  writer.commit();
+  try (IndexReader reader = DirectoryReader.open(dir)) {
+I

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1415627354


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import org.apache.lucene.store.ByteBuffersDataInput;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataOutput;
+
+/**
+ * An adapter class to use {@link ByteBuffersDataOutput} as a {@link FSTReader}. It allows the FST
+ * to be readable immediately after writing
+ */
+final class ReadWriteDataOutput extends DataOutput implements FSTReader, Freezable {
+
+  private final ByteBuffersDataOutput dataOutput;
+  // the DataInput to read from in case the DataOutput has multiple blocks
+  private ByteBuffersDataInput dataInput;
+  // the ByteBuffers to read from in case the DataOutput has a single block
+  private ByteBuffer byteBuffer;
+
+  public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) {
+this.dataOutput = dataOutput;
+  }
+
+  @Override
+  public void writeByte(byte b) {
+dataOutput.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) {
+dataOutput.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public long ramBytesUsed() {
+return dataOutput.ramBytesUsed();
+  }
+
+  @Override
+  public void freeze() {
+// these operations are costly, so we want to compute it once and cache
+List<ByteBuffer> byteBuffers = dataOutput.toWriteableBufferList();
+if (byteBuffers.size() == 1) {

Review Comment:
   I think it didn't handle the single-buffer use case, and 
ByteBuffersDataInput would fall into the same performance regression problem as 
BytesStore with multiple blocks (due to the fact that the method size 
is larger and it won't be inlined by the JVM).
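
   Roughly, the caching idea described here could look like this (assumed shape 
based on the fields shown in the diff, not the actual patch): keep the 
single-buffer fast path separate so reads go through the `ByteBuffer` directly 
and the read method stays small enough for JIT inlining.
   
   ```java
   @Override
   public void freeze() {
     // these operations are costly, so compute once and cache the result
     List<ByteBuffer> byteBuffers = dataOutput.toWriteableBufferList();
     if (byteBuffers.size() == 1) {
       // single block: read straight from the ByteBuffer
       byteBuffer = byteBuffers.get(0);
     } else {
       // multiple blocks: fall back to a ByteBuffersDataInput
       dataInput = new ByteBuffersDataInput(byteBuffers);
     }
   }
   ```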






Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1415641634


##
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##
@@ -500,6 +502,12 @@ public FSTMetadata getMetadata() {
 return metadata;
   }
 
+  /**
+   * Save the FST to DataOutput.
+   *
+   * @param metaOut the DataOutput to write the metadata to
+   * @param out the DataOutput to write the FST bytes to
+   */
   public void save(DataOutput metaOut, DataOutput out) throws IOException {

Review Comment:
   Yeah. It's also not supported (throwing exception). I'll add a comment.






Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1415666711


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -218,13 +279,19 @@ public Builder allowFixedLengthArcs(boolean allowFixedLengthArcs) {
 }
 
 /**
- * How many bits wide to make each byte[] block in the BytesStore; if you know the FST will be
- * large then make this larger. For example 15 bits = 32768 byte pages.
+ * Set the {@link DataOutput} which is used for low-level writing of FST. If you want the FST to
+ * be immediately readable, you need to use a DataOutput that also implements {@link FSTReader},

Review Comment:
   I think when we change the compile() to only return metadata, users need to 
create the FST with the on-heap FSTReader, and thus it needs to be public.






Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-05 Thread via GitHub


easyice commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1415479977


##
lucene/core/src/java/org/apache/lucene/store/DataOutput.java:
##
@@ -324,4 +326,42 @@ public void writeSetOfStrings(Set<String> set) throws IOException {
   writeString(value);
 }
   }
+
+  /**
+   * Encode integers using group-varint. It uses {@link DataOutput#writeVInt 
VInt} to encode tail
+   * values that are not enough for a group. we need a long[] because this is 
what postings are
+   * using, all longs are actually required to be integers.
+   *
+   * @param values the values to write
+   * @param limit the number of values to write.
+   */
+  public void writeGroupVInts(long[] values, int limit) throws IOException {

Review Comment:
   okay!



##
lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java:
##
@@ -212,6 +213,46 @@ public long readLong() throws IOException {
 }
   }
 
+  @Override
+  public void readGroupVInts(long[] dst, int limit) throws IOException {
+int i;
+for (i = 0; i <= limit - 4; i += 4) {
+  readGroupVInt(dst, i);
+}
+for (; i < limit; ++i) {
+  dst[i] = readVInt();
+}
+  }
+
+  private void readGroupVInt(long[] dst, int offset) throws IOException {
+ByteBuffer block = blocks[blockIndex(pos)];
+int blockOffset = blockOffset(pos);
+if (block.limit() - blockOffset < GroupVIntUtil.MAX_LENGTH_PER_GROUP) {
+  GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset);
+  return;
+}
+
+final int flag = readByte() & 0xFF;
+
+final int n1Minus1 = flag >> 6;
+final int n2Minus1 = (flag >> 4) & 0x03;
+final int n3Minus1 = (flag >> 2) & 0x03;
+final int n4Minus1 = flag & 0x03;
+
+blockOffset = blockOffset(pos);

Review Comment:
   +1, i did the similar change in `BufferedIndexInput#readGroupVInt`



##
lucene/core/src/java/org/apache/lucene/store/DataOutput.java:
##
@@ -29,6 +30,7 @@
  * internal state like file position).
  */
 public abstract class DataOutput {
+  private BytesRefBuilder groupVIntBytes = new BytesRefBuilder();

Review Comment:
   +1, I will pay more attention to this issue.



##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -303,6 +304,48 @@ public byte readByte(long pos) throws IOException {
 }
   }
 
+  @Override
+  public void readGroupVInts(long[] dst, int limit) throws IOException {
+int i;
+for (i = 0; i <= limit - 4; i += 4) {
+  readGroupVInt(dst, i);
+}
+for (; i < limit; ++i) {
+  dst[i] = readVInt();
+}
+  }
+
+  private void readGroupVInt(long[] dst, int offset) throws IOException {
+if (curSegment.byteSize() - curPosition < GroupVIntUtil.MAX_LENGTH_PER_GROUP) {
+  GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset);
+  return;
+}
+
+try {
+  final int flag = readByte() & 0xFF;

Review Comment:
   +1



##
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:
##
@@ -303,6 +304,48 @@ public byte readByte(long pos) throws IOException {
 }
   }
 
+  @Override
+  public void readGroupVInts(long[] dst, int limit) throws IOException {
+int i;
+for (i = 0; i <= limit - 4; i += 4) {
+  readGroupVInt(dst, i);
+}
+for (; i < limit; ++i) {
+  dst[i] = readVInt();
+}
+  }
+
+  private void readGroupVInt(long[] dst, int offset) throws IOException {
+if (curSegment.byteSize() - curPosition < GroupVIntUtil.MAX_LENGTH_PER_GROUP) {
+  GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset);
+  return;
+}
+

Review Comment:
   Done. I wonder what the impact on the JVM is if the variable may be updated? Thanks!



##
lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseDirectoryTestCase.java:
##
@@ -1438,4 +1440,68 @@ public void testListAllIsSorted() throws IOException {
   assertArrayEquals(expected, actual);
 }
   }
+
+  public void testDataTypes() throws IOException {
+final long[] values = new long[] {43, 12345, 123456, 1234567890};
+try (Directory dir = getDirectory(createTempDir("testDataTypes"))) {
+  IndexOutput out = dir.createOutput("test", IOContext.DEFAULT);
+  out.writeByte((byte) 43);
+  out.writeShort((short) 12345);
+  out.writeInt(1234567890);
+  out.writeGroupVInts(values, 4);
+  out.writeLong(1234567890123456789L);
+  out.close();
+
+  long[] restored = new long[4];
+  IndexInput in = dir.openInput("test", IOContext.DEFAULT);
+  assertEquals(43, in.readByte());
+  assertEquals(12345, in.readShort());
+  assertEquals(1234567890, in.readInt());
+  in.readGroupVInts(restored, 4);
+  assertArrayEquals(values, restored);
+  assertEquals(1234567890123456789L, in.readLong());
+  in.close();
+}
+  }
+
+  public void testGroupVInt() throws IOException {
+try (Directory dir = getDirectory(createTempDir("t

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-05 Thread via GitHub


rmuir commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1840885611

   > Thank you very much for your suggestions, I have fixed the comments from 
@jpountz, but the changes related to MMapDirectory will come later (such as Java 19 
and Java 20 support), because we need to confirm whether to change it. I agree that 
we should be cautious here. In my understanding the optimized code is faster than 
the main branch, probably because the optimized code doesn't need a `switch` 
statement in `GroupVIntUtil#readLongInGroup`, although it was inlined.
   > 
   > The current JMH output is similar to what we had before we replaced
   > `VarHandle` with `ByteBuffer#getInt()`. Java 17:
   > 
   > ```
   > Benchmark                                                  (size)   Mode  Cnt   Score   Error   Units
   > GroupVIntBenchmark.byteBuffersReadGroupVInt                    64  thrpt    5   4.698 ± 2.039  ops/us
   > GroupVIntBenchmark.byteBuffersReadGroupVIntBaseline            64  thrpt    5   2.167 ± 0.064  ops/us
   > GroupVIntBenchmark.mmap_byteBufferReadGroupVInt                64  thrpt    5   7.386 ± 0.524  ops/us
   > GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline        64  thrpt    5   7.358 ± 1.316  ops/us
   > GroupVIntBenchmark.nioReadGroupVInt                            64  thrpt    5   8.236 ± 1.052  ops/us
   > GroupVIntBenchmark.nioReadGroupVIntBaseline                    64  thrpt    5   5.381 ± 1.134  ops/us
   > ```
   > 
   > Java 21:
   > 
   > ```
   > Benchmark                                                  (size)   Mode  Cnt   Score   Error   Units
   > GroupVIntBenchmark.byteBuffersReadGroupVInt                    64  thrpt    5   4.622 ± 0.880  ops/us
   > GroupVIntBenchmark.byteBuffersReadGroupVIntBaseline            64  thrpt    5   1.674 ± 0.245  ops/us
   > GroupVIntBenchmark.mmap_byteBufferReadGroupVInt                64  thrpt    5  10.165 ± 1.083  ops/us
   > GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline        64  thrpt    5   5.104 ± 0.388  ops/us
   > GroupVIntBenchmark.nioReadGroupVInt                            64  thrpt    5   9.527 ± 1.556  ops/us
   > GroupVIntBenchmark.nioReadGroupVIntBaseline                    64  thrpt    5   5.351 ± 0.621  ops/us
   > ```
   > 
   > @uschindler Thank you for taking a look at this. Here is the assembly
   > output for `MemorySegmentIndexInput#readGroupVInt`, generated with the
   > command: `java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
   > --module-path lucene/benchmark-jmh/build/benchmarks --module
   > org.apache.lucene.benchmark.jmh GroupVIntBenchmark.mmap`
   > 
   > assembly code
   
   You need to install hsdis to be able to see instructions rather than just
binary data from PrintAssembly. See
https://github.com/apache/lucene/blob/main/help/jmh.txt


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Fix the declared Exceptions of Expression#evaluate to match those of DoubleValues#doubleValue [lucene]

2023-12-05 Thread via GitHub


uschindler opened a new pull request, #12878:
URL: https://github.com/apache/lucene/pull/12878

   This PR fixes an issue found while writing the benchmark for the expressions 
module in #12873: the expressions module looks up variables via the 
`DoubleValues#doubleValue` method, which throws `IOException`. Because the JVM 
does not check exceptions at runtime (only the javac compiler's type system 
does), code calling `Expression#evaluate` could suddenly be confronted with a 
checked `IOException`, even though the method does not declare it. It is not 
even possible to catch the exception, because the compiler complains that it 
is not declared.
   
   Ideally this should also be fixed in 9.x, but it changes the method signature 
of `Expression#evaluate` and the impact is not too high, so it is targeted at 
Lucene 10 only.
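   To make the mismatch concrete, here is a minimal, self-contained sketch 
(hypothetical `Values`/`Expr` types standing in for `DoubleValues` and 
`Expression`, not the actual Lucene sources) of what the signatures look like 
once `evaluate` declares the same exception as the values it reads from:
   
   ```java
   import java.io.IOException;
   
   // Hypothetical stand-in for DoubleValues: the value lookup may throw IOException.
   interface Values {
     double doubleValue() throws IOException;
   }
   
   // Hypothetical stand-in for Expression: after the fix, evaluate declares the
   // same checked exception, so callers can catch or propagate it normally.
   abstract class Expr {
     public abstract double evaluate(Values[] functionValues) throws IOException;
   }
   
   class IdentityExpr extends Expr {
     @Override
     public double evaluate(Values[] functionValues) throws IOException {
       // No wrapping in UncheckedIOException is needed once the exception is declared.
       return functionValues[0].doubleValue();
     }
   }
   ```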


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Introduce dynamic segment efSearch to Knn{Byte|Float}VectorQuery [lucene]

2023-12-05 Thread via GitHub


jimczi commented on PR #12551:
URL: https://github.com/apache/lucene/pull/12551#issuecomment-1841063019

   Superseded by https://github.com/apache/lucene/pull/12794


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Introduce dynamic segment efSearch to Knn{Byte|Float}VectorQuery [lucene]

2023-12-05 Thread via GitHub


jimczi closed pull request #12551: Introduce dynamic segment efSearch to 
Knn{Byte|Float}VectorQuery
URL: https://github.com/apache/lucene/pull/12551


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1415869383


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:
##
@@ -120,22 +125,44 @@ public class FSTCompiler {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;
 
-  final BytesStore bytes;
+  // the DataOutput to stream the FST bytes to
+  final DataOutput dataOutput;
+
+  // buffer to store bytes for the one node we are currently writing
+  final GrowableByteArrayDataOutput scratchBytes = new GrowableByteArrayDataOutput();
+
+  private long numBytesWritten;
+
+  /**
+   * Get an on-heap DataOutput that allows the FST to be read immediately after writing.

Review Comment:
   I added a new Test2BFSTOffHeap and ran it:
   
   ```
   10: 424 RAM bytes used; 39257811 FST bytes; 19189176 nodes; took 23 seconds
   20: 424 RAM bytes used; 78522623 FST bytes; 38378071 nodes; took 49 seconds
   30: 424 RAM bytes used; 117788163 FST bytes; 57567190 nodes; took 80 seconds
   40: 424 RAM bytes used; 157053095 FST bytes; 76756389 nodes; took 107 seconds
   50: 424 RAM bytes used; 196318494 FST bytes; 95945639 nodes; took 133 seconds
   60: 424 RAM bytes used; 235583412 FST bytes; 115134691 nodes; took 170 seconds
   70: 480 RAM bytes used; 274866378 FST bytes; 134324199 nodes; took 198 seconds
   80: 480 RAM bytes used; 314246540 FST bytes; 153513668 nodes; took 222 seconds
   90: 480 RAM bytes used; 353626848 FST bytes; 172703151 nodes; took 245 seconds
   100: 480 RAM bytes used; 393006717 FST bytes; 191892620 nodes; took 277 seconds
   110: 480 RAM bytes used; 432387052 FST bytes; 211082115 nodes; took 311 seconds
   120: 480 RAM bytes used; 471766692 FST bytes; 230271461 nodes; took 334 seconds
   130: 480 RAM bytes used; 511147081 FST bytes; 249461034 nodes; took 357 seconds
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1415871453


##
lucene/test-framework/src/java/org/apache/lucene/tests/util/fst/FSTTester.java:
##
@@ -316,6 +313,15 @@ public FST doTest() throws IOException {
 return fst;
   }
 
+  protected FST compile(FSTCompiler fstCompiler) throws IOException {
+return fstCompiler.compile();
+  }
+
+  protected FSTCompiler.Builder getFSTBuilder() {
+return new FSTCompiler.Builder<>(

Review Comment:
   I added this test along with the 2B nodes off-heap



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1415876855


##
lucene/core/src/test/org/apache/lucene/util/fst/Test2BFSTOffHeap.java:
##
@@ -0,0 +1,341 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
+import java.util.Arrays;
+import java.util.Random;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.MMapDirectory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.LuceneTestCase.SuppressSysoutChecks;
+import org.apache.lucene.tests.util.TimeUnits;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.IntsRef;
+
+// Similar to Test2BFST but will build and read the FST off-heap and can be 
run with small heap
+
+// Run something like this:
+//./gradlew test --tests Test2BFSTOffHeap -Dtests.verbose=true 
--max-workers=1
+
+// @Ignore("Requires tons of heap to run (30 GB hits OOME but 35 GB passes 
after ~4.5 hours)")

Review Comment:
   Ah, I should have enabled this @Ignore in the PR!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-05 Thread via GitHub


easyice commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1841102583

   
   Thank you @rmuir, the new assembly output is here:)   CC @uschindler 
   
   
https://github.com/easyice/lucene_files/blob/main/MemorySegmentIndexInput.readGroupVInt.asm.txt


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-05 Thread via GitHub


uschindler commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1841347419

   Hi,
   Did you create this one with the jmh option `-perfasm`? This looks like 
unoptimized code generated only by C1, not by the C2 compiler.
   
   It would be good to have 2 asm dumps: one with your patch and one where the 
benchmark just calls the inherited default impl on Mmapdir.
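   For reference, something along these lines should produce the two dumps 
(assuming hsdis is installed and JMH's perfasm profiler is available on the 
machine; the benchmark names are taken from the output above, and the profiler 
is typically passed through to JMH as `-prof perfasm`):
   
   ```
   java --module-path lucene/benchmark-jmh/build/benchmarks \
     --module org.apache.lucene.benchmark.jmh \
     GroupVIntBenchmark.mmap_byteBufferReadGroupVInt -prof perfasm
   
   java --module-path lucene/benchmark-jmh/build/benchmarks \
     --module org.apache.lucene.benchmark.jmh \
     GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline -prof perfasm
   ```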


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix the declared Exceptions of Expression#evaluate to match those of DoubleValues#doubleValue [lucene]

2023-12-05 Thread via GitHub


uschindler commented on PR #12878:
URL: https://github.com/apache/lucene/pull/12878#issuecomment-1841418704

   > LGTM. I think an explicit IOException is better than wrapping in 
UncheckedIOException.
   
   The UncheckedIOException was just a workaround I used when I implemented the 
benchmark code. Wrapping inside the ASM bytecode generation would be a 
disaster, so it is better to fix the bug properly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix the declared Exceptions of Expression#evaluate to match those of DoubleValues#doubleValue [lucene]

2023-12-05 Thread via GitHub


uschindler commented on code in PR #12878:
URL: https://github.com/apache/lucene/pull/12878#discussion_r1416143382


##
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/ExpressionsBenchmark.java:
##
@@ -83,12 +82,8 @@ private static double ident(double v) {
   private static final Expression NATIVE_IDENTITY_EXPRESSION =
   new Expression(NATIVE_IDENTITY_NAME, new String[] {"x"}) {
 @Override
-public double evaluate(DoubleValues[] functionValues) {
-  try {
-return functionValues[0].doubleValue();
-  } catch (IOException e) {
-throw new UncheckedIOException(e);

Review Comment:
   this was just the workaround I applied that made me aware of this bug in the 
expressions module.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


dweiss commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1416181846


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import org.apache.lucene.store.ByteBuffersDataInput;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataOutput;
+
+/**
+ * An adapter class to use {@link ByteBuffersDataOutput} as a {@link 
FSTReader}. It allows the FST
+ * to be readable immediately after writing
+ */
+final class ReadWriteDataOutput extends DataOutput implements FSTReader, 
Freezable {
+
+  private final ByteBuffersDataOutput dataOutput;
+  // the DataInput to read from in case the DataOutput has multiple blocks
+  private ByteBuffersDataInput dataInput;
+  // the ByteBuffers to read from in case the DataOutput has a single block
+  private ByteBuffer byteBuffer;
+
+  public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) {
+this.dataOutput = dataOutput;
+  }
+
+  @Override
+  public void writeByte(byte b) {
+dataOutput.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) {
+dataOutput.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public long ramBytesUsed() {
+return dataOutput.ramBytesUsed();
+  }
+
+  @Override
+  public void freeze() {
+// these operations are costly, so we want to compute it once and cache
+List<ByteBuffer> byteBuffers = dataOutput.toWriteableBufferList();
+if (byteBuffers.size() == 1) {

Review Comment:
   I don't think it has anything to do with method sizes. I think it's got more 
to do with ByteBuffersDataInput wrapping input buffers with asReadOnlyBuffer:
   ```
     public ByteBuffersDataInput(List<ByteBuffer> buffers) {
       ensureAssumptions(buffers);
   
       this.blocks = buffers.toArray(ByteBuffer[]::new);
       for (int i = 0; i < blocks.length; ++i) {
         blocks[i] = blocks[i].asReadOnlyBuffer().order(ByteOrder.LITTLE_ENDIAN);
   }
   ```
   It also eagerly preallocates arrays for float and long buffers - something 
that was added later (I wasn't aware), which may also contribute to the 
overhead if you call this method millions of times. But this is only my gut 
feeling; it would have to be verified with a profiling run.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Rewrite JavaScriptCompiler to use modern JVM features (Java 17) [lucene]

2023-12-05 Thread via GitHub


uschindler commented on PR #12873:
URL: https://github.com/apache/lucene/pull/12873#issuecomment-1841494970

   > While writing the benchmark I figured out that the Expression base class
   > has a problem with a missing `IOException` on the `evaluate(DoubleValues[])`
   > method. As bytecode knows no exceptions in throws clauses, the evaluation
   > method may throw `IOException` unexpectedly, because
   > `DoubleValues#doubleValue()` is defined to throw `IOException`. This is a
   > bug and should be fixed separately. I will open an issue to fix this in main.
   
   This is the follow-up PR: #12878


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [10.0] Removing TermInSetQuery varargs ctor [lucene]

2023-12-05 Thread via GitHub


gsmiller commented on PR #12837:
URL: https://github.com/apache/lucene/pull/12837#issuecomment-1841521890

   I think what you said sounds right @slow-J. Having another look at this and 
the associated bp PR today. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [9.10] Mark TermInSetQuery ctors with varargs terms as deprecated [lucene]

2023-12-05 Thread via GitHub


gsmiller commented on code in PR #12864:
URL: https://github.com/apache/lucene/pull/12864#discussion_r1416215309


##
lucene/CHANGES.txt:
##
@@ -7,7 +7,8 @@ http://s.apache.org/luceneversions
 
 API Changes
 -
-(No changes)
+* GITHUB#12243: Mark TermInSetQuery ctors with varargs terms as @Deprecated. 
SortedSetDocValuesField#newSlowSetQuery,
+SortedDocValuesField#newSlowSetQuery, KeywordField#newSetQuery now take a 
collection of terms as a param. (Jakub Slowinski)

Review Comment:
   minor: let's left-pad with some whitespace like we do with other changes 
entries



##
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##
@@ -170,7 +171,9 @@ public static Query newExactQuery(String field, String 
value) {
* @param values the set of values to match
* @throws NullPointerException if {@code field} is null.
* @return a query matching documents with this exact value
+   * @deprecated Use the method with a collection of values instead.

Review Comment:
   minor: let's use the `{@link ...}` javadoc to reference the methods users 
should use (in all places where we're using this annotation)?
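   For instance, a minimal sketch of the suggested style (hypothetical class and 
signatures, not the actual Lucene source), where the deprecated varargs 
overload links to its Collection-based replacement:
   
   ```java
   import java.util.Collection;
   import java.util.List;
   
   public class SetQueryExample {
     /**
      * @deprecated Use {@link #newSetQuery(String, Collection)} instead.
      */
     @Deprecated
     public static String newSetQuery(String field, String... values) {
       return newSetQuery(field, List.of(values));
     }
   
     public static String newSetQuery(String field, Collection<String> values) {
       return field + ":" + values;
     }
   }
   ```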



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [10.0] Removing TermInSetQuery varargs ctor [lucene]

2023-12-05 Thread via GitHub


gsmiller commented on code in PR #12837:
URL: https://github.com/apache/lucene/pull/12837#discussion_r1416221972


##
lucene/CHANGES.txt:
##
@@ -67,6 +67,8 @@ API Changes
 
 * GITHUB#11023: Adding -level param to CheckIndex, making the old -fast param 
the default behaviour. (Jakub Slowinski)
 
+* GITHUB#12243: Remove TermInSetQuery ctors taking varargs param. 
KeywordField#newSetQuery now takes a collection. (Jakub Slowinski)

Review Comment:
   I wonder if we should "forward-port" the CHANGES entry from your associated 
back-port PR as well? Otherwise the CHANGES message under 9.10 noting the 
deprecation will get lost with 10.0.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [9.10] Mark TermInSetQuery ctors with varargs terms as deprecated [lucene]

2023-12-05 Thread via GitHub


slow-J commented on PR #12864:
URL: https://github.com/apache/lucene/pull/12864#issuecomment-1841640724

   > Mark TermInSetQuery ctors with varargs terms as @Deprecated.
   
   Ah, I initially thought of doing this but didn't know which way was right. I'll add it now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [9.10] Mark TermInSetQuery ctors with varargs terms as deprecated [lucene]

2023-12-05 Thread via GitHub


slow-J commented on code in PR #12864:
URL: https://github.com/apache/lucene/pull/12864#discussion_r1416299738


##
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##
@@ -170,7 +171,9 @@ public static Query newExactQuery(String field, String 
value) {
* @param values the set of values to match
* @throws NullPointerException if {@code field} is null.
* @return a query matching documents with this exact value
+   * @deprecated Use the method with a collection of values instead.

Review Comment:
   Will do!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [9.10] Mark TermInSetQuery ctors with varargs terms as deprecated [lucene]

2023-12-05 Thread via GitHub


slow-J commented on code in PR #12864:
URL: https://github.com/apache/lucene/pull/12864#discussion_r1416300070


##
lucene/CHANGES.txt:
##
@@ -7,7 +7,8 @@ http://s.apache.org/luceneversions
 
 API Changes
 -
-(No changes)
+* GITHUB#12243: Mark TermInSetQuery ctors with varargs terms as @Deprecated. 
SortedSetDocValuesField#newSlowSetQuery,
+SortedDocValuesField#newSlowSetQuery, KeywordField#newSetQuery now take a 
collection of terms as a param. (Jakub Slowinski)

Review Comment:
   Good catch :D You mean on the second line, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [9.10] Mark TermInSetQuery ctors with varargs terms as deprecated [lucene]

2023-12-05 Thread via GitHub


slow-J commented on PR #12864:
URL: https://github.com/apache/lucene/pull/12864#issuecomment-1841663134

   Thanks for the review @gsmiller !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [10.0] Removing TermInSetQuery varargs ctor [lucene]

2023-12-05 Thread via GitHub


slow-J commented on code in PR #12837:
URL: https://github.com/apache/lucene/pull/12837#discussion_r1416313463


##
lucene/CHANGES.txt:
##
@@ -67,6 +67,8 @@ API Changes
 
 * GITHUB#11023: Adding -level param to CheckIndex, making the old -fast param 
the default behaviour. (Jakub Slowinski)
 
+* GITHUB#12243: Remove TermInSetQuery ctors taking varargs param. 
KeywordField#newSetQuery now takes a collection. (Jakub Slowinski)

Review Comment:
   Good point! Can add now while resolving conflicts :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [10.0] Removing TermInSetQuery varargs ctor [lucene]

2023-12-05 Thread via GitHub


slow-J commented on PR #12837:
URL: https://github.com/apache/lucene/pull/12837#issuecomment-1841683335

   > Looks good! One minor question about CHANGES. Thanks!
   
   Fixed!
   
   Thanks for the review and for all the help with backporting!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-05 Thread via GitHub


jpountz commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1841712905

   @uschindler FYI this is what I'm getting: 
https://gist.github.com/jpountz/be81b1eb93c6118aac65c3679911f1d8. There are two 
files: baseline.txt for the default impl, and contender.txt for the optimized 
impl.
   
   > In general I am not sure if we really need specialization in Mmapdir.
   > We do not even implement readVInt for memory segments. Why should we 
implement it for this complex case?
   
   To be clear, this specialization is not about doing the same kind of 
optimization that the JVM would do by inlining; it's about taking advantage of 
the fact that many of our Directory implementations can do unaligned random 
reads under the hood, which lets them decode group-varints more efficiently, 
with fewer conditionals specifically.
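   As a rough illustration of the idea (a simplified sketch over a plain byte[], 
not the actual MemorySegment code, and assuming at least 4 readable bytes 
remain past the current position, which is why the real code falls back near 
the end of a block): with unaligned little-endian loads, each of the four 
values in a group can be decoded with one 4-byte read plus a mask, instead of 
a per-byte loop or switch:
   
   ```java
   import java.lang.invoke.MethodHandles;
   import java.lang.invoke.VarHandle;
   import java.nio.ByteOrder;
   
   // Simplified sketch of branch-free group-varint decoding over a byte[].
   final class GroupVIntDecodeSketch {
     private static final VarHandle LE_INT =
         MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
   
     /** Decodes one group of 4 values starting at pos and returns the new position. */
     static int readGroup(byte[] in, int pos, long[] dst, int offset) {
       final int flag = in[pos++] & 0xFF;
       for (int i = 0; i < 4; i++) {
         // Two bits per value in the flag byte: number of bytes minus one.
         final int numBytes = ((flag >>> (6 - 2 * i)) & 0x03) + 1;
         // One unaligned 4-byte little-endian load, then mask down to numBytes bytes.
         final long value = (int) LE_INT.get(in, pos) & 0xFFFFFFFFL;
         dst[offset + i] = value & ((1L << (numBytes << 3)) - 1);
         pos += numBytes;
       }
       return pos;
     }
   }
   ```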


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [9.10] Mark TermInSetQuery ctors with varargs terms as deprecated [lucene]

2023-12-05 Thread via GitHub


gsmiller merged PR #12864:
URL: https://github.com/apache/lucene/pull/12864


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [10.0] Removing TermInSetQuery varargs ctor [lucene]

2023-12-05 Thread via GitHub


gsmiller merged PR #12837:
URL: https://github.com/apache/lucene/pull/12837


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Add a new static method for KeywordField#newSetQuery to support collections parameter [lucene]

2023-12-05 Thread via GitHub


gsmiller closed issue #12243: Add a new static method for 
KeywordField#newSetQuery to support collections parameter
URL: https://github.com/apache/lucene/issues/12243


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1416379828


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import org.apache.lucene.store.ByteBuffersDataInput;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataOutput;
+
+/**
+ * An adapter class to use {@link ByteBuffersDataOutput} as a {@link 
FSTReader}. It allows the FST
+ * to be readable immediately after writing
+ */
+final class ReadWriteDataOutput extends DataOutput implements FSTReader, 
Freezable {
+
+  private final ByteBuffersDataOutput dataOutput;
+  // the DataInput to read from in case the DataOutput has multiple blocks
+  private ByteBuffersDataInput dataInput;
+  // the ByteBuffers to read from in case the DataOutput has a single block
+  private ByteBuffer byteBuffer;
+
+  public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) {
+this.dataOutput = dataOutput;
+  }
+
+  @Override
+  public void writeByte(byte b) {
+dataOutput.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) {
+dataOutput.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public long ramBytesUsed() {
+return dataOutput.ramBytesUsed();
+  }
+
+  @Override
+  public void freeze() {
+// these operations are costly, so we want to compute it once and cache
+List<ByteBuffer> byteBuffers = dataOutput.toWriteableBufferList();
+if (byteBuffers.size() == 1) {

Review Comment:
   Sorry for the missing context, I was referring to a regression I found at 
https://github.com/apache/lucene/issues/10520 instead. 
   
   Maybe it was different for ByteBuffersDataInput, or there would be no 
regression if the block size is just 1. I think we need to run some tests to 
verify. Let me know if there's a process for this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1416403950


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import org.apache.lucene.store.ByteBuffersDataInput;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataOutput;
+
+/**
+ * An adapter class to use {@link ByteBuffersDataOutput} as a {@link 
FSTReader}. It allows the FST
+ * to be readable immediately after writing
+ */
+final class ReadWriteDataOutput extends DataOutput implements FSTReader, 
Freezable {
+
+  private final ByteBuffersDataOutput dataOutput;
+  // the DataInput to read from in case the DataOutput has multiple blocks
+  private ByteBuffersDataInput dataInput;
+  // the ByteBuffers to read from in case the DataOutput has a single block
+  private ByteBuffer byteBuffer;
+
+  public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) {
+this.dataOutput = dataOutput;
+  }
+
+  @Override
+  public void writeByte(byte b) {
+dataOutput.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) {
+dataOutput.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public long ramBytesUsed() {
+return dataOutput.ramBytesUsed();
+  }
+
+  @Override
+  public void freeze() {
+// these operations are costly, so we want to compute it once and cache
+List<ByteBuffer> byteBuffers = dataOutput.toWriteableBufferList();
+if (byteBuffers.size() == 1) {

Review Comment:
   Even for ByteBuffersDataInput in multi-block cases, there's a slight 
regression over the BytesStore implementation (note that we already called 
`freeze()`). The diff is https://github.com/dungba88/lucene/pull/13/files.
   
   ### ReverseRandomBytesReader with ByteBuffersDataInput
   
   ```
  1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585]
 1> 0...: took 0 seconds
 1> 100...: took 27 seconds
 1> 200...: took 54 seconds
 1> 300...: took 82 seconds
 1> 400...: took 109 seconds
 1> 500...: took 137 seconds
 1> 600...: took 165 seconds
 1> 700...: took 192 seconds
 1> 800...: took 219 seconds
 1> 900...: took 247 seconds
 1> 1000...: took 275 seconds
 1> 1100...: took 300 seconds
   ```
   
   ### BytesStore-like BytesReader
   
   ```
  1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585]
 1> 0...: took 0 seconds
 1> 100...: took 22 seconds
 1> 200...: took 44 seconds
 1> 300...: took 66 seconds
 1> 400...: took 89 seconds
 1> 500...: took 111 seconds
 1> 600...: took 133 seconds
 1> 700...: took 155 seconds
 1> 800...: took 178 seconds
 1> 900...: took 200 seconds
 1> 1000...: took 222 seconds
 1> 1100...: took 245 seconds
   ```
   
   As this PR is already rather long, I can make the first implementation 
simple by solely using ByteBuffersDataInput, then open an issue to benchmark 
different alternatives thoroughly. WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-12-05 Thread via GitHub


stefanvodita commented on PR #12844:
URL: https://github.com/apache/lucene/pull/12844#issuecomment-1841841701

   @zhaih - I also tried your 
[idea](https://github.com/apache/lucene/pull/12844#discussion_r1412533096) 
about `beamWidth`. I passed `beamWidth` into `NeighborArray` and then changed 
the initial capacity of the arrays there, like so:
   ```
   - int arraySize = Math.min(INITIAL_CAPACITY, maxSize);
   + int arraySize = Math.min(maxSize, beamWidth);
   ```
   
   In the profiler, it doesn't look very different from baseline. I didn't see 
any resizing. If I understood you correctly, you were expecting there to be 
occasional resizing of arrays when `beamWidth < maxSize`.
   
   The latency isn't measurably better either. Results are averaged from 5 runs 
(baseline is 
[here](https://github.com/apache/lucene/pull/12844#issuecomment-1836064137)).
   ```
   recall   latency nDoc  fanout  maxConn beamWidth   visited index ms
   0.720 0.16   1   0   64  250 100 25311.00 post-filter
   0.542 0.23   10  0   64  250 100 43960   1.00 post-filter
   0.512 0.28   20  0   64  250 100 111647  1.00 post-filter
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Allow FST builder to use different writer (alternative reverse BytesReader) [lucene]

2023-12-05 Thread via GitHub


dungba88 opened a new pull request, #12879:
URL: https://github.com/apache/lucene/pull/12879

   ### Description
   
   This is the same as #12624, except for a slight change in the implementation 
of `ReadWriteDataOutput`. This PR keeps the original approach that BytesStore 
used, to make sure there is no regression. My theory is that 
ReverseRandomAccessReader needs to seek to the correct buffer on every 
`readByte`, and reading from the ByteBuffer has some (small) overhead compared 
to accessing the byte array directly. These should have an insignificant 
impact on performance, but a more extensive benchmark is needed.
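   To make that intuition concrete, here is a minimal sketch (hypothetical, not 
the actual Lucene FST classes) of a reverse reader backed directly by a single 
byte[] block, where each `readByte` is a plain array access plus a decrement 
rather than a per-byte buffer lookup:
   
   ```java
   // Hypothetical reverse reader over a single byte[] block; FST arcs are
   // written forwards but read backwards, hence the decrementing position.
   final class ReverseArrayBytesReader {
     private final byte[] block;
     private int pos;
   
     ReverseArrayBytesReader(byte[] block) {
       this.block = block;
     }
   
     void setPosition(long pos) {
       this.pos = (int) pos;
     }
   
     long getPosition() {
       return pos;
     }
   
     byte readByte() {
       // Plain array access: no ByteBuffer call and no block lookup per byte.
       return block[pos--];
     }
   }
   ```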
   
   Using Test2BFST, there is some improvement over the `ByteBuffersDataInput`:
   
   ### ReverseRandomAccessReader with ByteBuffersDataInput
   
   ```
  1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585]
 1> 0...: took 0 seconds
 1> 100...: took 27 seconds
 1> 200...: took 54 seconds
 1> 300...: took 82 seconds
 1> 400...: took 109 seconds
 1> 500...: took 137 seconds
 1> 600...: took 165 seconds
 1> 700...: took 192 seconds
 1> 800...: took 219 seconds
 1> 900...: took 247 seconds
 1> 1000...: took 275 seconds
 1> 1100...: took 300 seconds
   ```
   
   ### This PR
   
   ```
  1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585]
 1> 0...: took 0 seconds
 1> 100...: took 22 seconds
 1> 200...: took 44 seconds
 1> 300...: took 66 seconds
 1> 400...: took 89 seconds
 1> 500...: took 111 seconds
 1> 600...: took 133 seconds
 1> 700...: took 155 seconds
 1> 800...: took 178 seconds
 1> 900...: took 200 seconds
 1> 1000...: took 222 seconds
 1> 1100...: took 245 seconds
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1416403950


##
lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import org.apache.lucene.store.ByteBuffersDataInput;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataOutput;
+
+/**
+ * An adapter class to use {@link ByteBuffersDataOutput} as a {@link 
FSTReader}. It allows the FST
+ * to be readable immediately after writing
+ */
+final class ReadWriteDataOutput extends DataOutput implements FSTReader, 
Freezable {
+
+  private final ByteBuffersDataOutput dataOutput;
+  // the DataInput to read from in case the DataOutput has multiple blocks
+  private ByteBuffersDataInput dataInput;
+  // the ByteBuffers to read from in case the DataOutput has a single block
+  private ByteBuffer byteBuffer;
+
+  public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) {
+this.dataOutput = dataOutput;
+  }
+
+  @Override
+  public void writeByte(byte b) {
+dataOutput.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) {
+dataOutput.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public long ramBytesUsed() {
+return dataOutput.ramBytesUsed();
+  }
+
+  @Override
+  public void freeze() {
+// these operations are costly, so we want to compute it once and cache
+List<ByteBuffer> byteBuffers = dataOutput.toWriteableBufferList();
+if (byteBuffers.size() == 1) {

Review Comment:
   Even for ByteBuffersDataInput in multi-block cases, there's a slight 
regression over the BytesStore implementation (note that we already called 
`freeze()`). The diff is https://github.com/dungba88/lucene/pull/13/files.
   
   ### ReverseRandomBytesReader with ByteBuffersDataInput
   
   ```
  1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585]
 1> 0...: took 0 seconds
 1> 100...: took 27 seconds
 1> 200...: took 54 seconds
 1> 300...: took 82 seconds
 1> 400...: took 109 seconds
 1> 500...: took 137 seconds
 1> 600...: took 165 seconds
 1> 700...: took 192 seconds
 1> 800...: took 219 seconds
 1> 900...: took 247 seconds
 1> 1000...: took 275 seconds
 1> 1100...: took 300 seconds
   ```
   
   ### BytesStore-like BytesReader
   
   ```
  1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585]
 1> 0...: took 0 seconds
 1> 100...: took 22 seconds
 1> 200...: took 44 seconds
 1> 300...: took 66 seconds
 1> 400...: took 89 seconds
 1> 500...: took 111 seconds
 1> 600...: took 133 seconds
 1> 700...: took 155 seconds
 1> 800...: took 178 seconds
 1> 900...: took 200 seconds
 1> 1000...: took 222 seconds
 1> 1100...: took 245 seconds
   ```
   
   As this PR is already rather long, I can make the first implementation 
simple by solely using ByteBuffersDataInput, then open an issue to benchmark 
different alternatives thoroughly. WDYT?
   
   For this PR, I'll use the simpler version that solely uses 
ByteBuffersDataInput. I opened another PR for the alternative approach: 
https://github.com/apache/lucene/pull/12879. I did see an improvement, as 
shown in the log above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow FST builder to use different writer (alternative reverse BytesReader) [lucene]

2023-12-05 Thread via GitHub


dungba88 commented on PR #12879:
URL: https://github.com/apache/lucene/pull/12879#issuecomment-1841967956

   I think we'll likely go with the approach in #12624 (simpler, despite some 
potential regression). This PR is mainly here for comparison, and in case 
someone has a strong argument for favoring the lower latency.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org