Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]
shubhamvishu commented on code in PR #12868: URL: https://github.com/apache/lucene/pull/12868#discussion_r1414510733

## lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java:

```diff
@@ -150,9 +150,10 @@ private FuzzySet(FixedBitSet filter, int bloomSize, int hashCount) {
    * @return NO or MAYBE
    */
   public ContainsResult contains(BytesRef value) {
-    long hash = hashFunction.hash(value);
-    int msb = (int) (hash >>> Integer.SIZE);
-    int lsb = (int) hash;
+    long[] hash = hashFunction.hash128(value);
+
+    int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1;
+    int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1;
```

Review Comment: I can try to run benchmarks with that and see how that turns out. But with the current logic (below) we see 7-9% better performance consistently (see above benchmark results). Could that be misleading by any chance, because this is more probable to produce a 0 value, or is it worth sticking with since the benchmarks tell a different story?

```java
int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1;
int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1;
```

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
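Two Java operator subtleties are worth double-checking in the expression quoted above (this is a stand-alone demo, independent of the Lucene code under review): shift distances on `int` are taken mod 32, so `(int) hash[0] >>> Integer.SIZE` shifts by zero and is a no-op; and `+` binds tighter than `>>>`, so `a >>> 1 + b >>> 1` does not compute `(a >>> 1) + (b >>> 1)`.

```java
public class ShiftPitfalls {
  public static void main(String[] args) {
    long h = 0xDEADBEEFCAFEBABEL;

    // Pitfall 1: int shift distances are masked to 5 bits in Java, so
    // shifting an int by Integer.SIZE (32) shifts by 0 and is a no-op.
    int noOp = (int) h >>> Integer.SIZE;        // same as (int) h
    int upperHalf = (int) (h >>> Integer.SIZE); // the actual top 32 bits

    System.out.println(noOp == (int) h);               // true
    System.out.println(upperHalf == (int) (h >>> 32)); // true

    // Pitfall 2: '+' has higher precedence than '>>>', so
    // a >>> 1 + b >>> 1 parses as (a >>> (1 + b)) >>> 1.
    int a = 8, b = 2;
    System.out.println((a >>> 1 + b >>> 1) == ((a >>> (1 + b)) >>> 1)); // true
    System.out.println((a >>> 1) + (b >>> 1));                          // 5, a different value
  }
}
```

Either of these could explain why the combined value behaves unlike a straight use of the two murmur3 longs.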
Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]
dungba88 commented on PR #12831: URL: https://github.com/apache/lucene/pull/12831#issuecomment-1840269923

I'll put the CHANGES.txt entry for #12624 together with this.
Re: [I] Add a bulk scorer for disjunctions that does dynamic pruning [LUCENE-9335] [lucene]
jpountz closed issue #10375: Add a bulk scorer for disjunctions that does dynamic pruning [LUCENE-9335] URL: https://github.com/apache/lucene/issues/10375
Re: [I] Add a bulk scorer for disjunctions that does dynamic pruning [LUCENE-9335] [lucene]
jpountz commented on issue #10375: URL: https://github.com/apache/lucene/issues/10375#issuecomment-1840276633

Implemented.
[I] Add a MergePolicy wrapper that preserves search concurrency? [lucene]
jpountz opened a new issue, #12877: URL: https://github.com/apache/lucene/issues/12877

### Description

We have an issue about decoupling search concurrency from index geometry (#9721), but this comes with trade-offs as the per-segment bit of search is hard to parallelize. Maybe we should also introduce a merge policy wrapper that tries to preserve a search concurrency of `N` by preventing the creation of segments of more than `maxDoc/N` docs?
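The constraint proposed above can be sketched in isolation (this is a hypothetical helper, not a real Lucene `MergePolicy` API; the name and shape are made up for illustration): for a target concurrency of `N`, a wrapper would veto any merge whose resulting segment would exceed `maxDoc/N` docs.

```java
public class ConcurrencyPreservingMergeCheck {
  // Hypothetical check: would merging segments of the given doc counts
  // break a target search concurrency of n? The wrapper idea in the issue
  // would veto merges whose result exceeds totalMaxDoc / n docs.
  static boolean mergeAllowed(long totalMaxDoc, int n, long[] candidateSegmentDocs) {
    long merged = 0;
    for (long docs : candidateSegmentDocs) {
      merged += docs;
    }
    return merged <= totalMaxDoc / n;
  }

  public static void main(String[] args) {
    // 1M docs, target concurrency 8 -> merged segments must stay <= 125000 docs
    System.out.println(mergeAllowed(1_000_000, 8, new long[] {60_000, 50_000}));  // true
    System.out.println(mergeAllowed(1_000_000, 8, new long[] {100_000, 50_000})); // false
  }
}
```

The trade-off the issue mentions follows directly: capping segment size preserves `N` roughly equal-sized slices, at the cost of keeping more, smaller segments around.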
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
dungba88 commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1415164737

## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java:

```diff
@@ -351,7 +351,8 @@ public int[] getBulkOrdinals(FacetLabel... categoryPaths) throws IOException {
       }
     } else {
       indexesMissingFromCache =
-          ArrayUtil.grow(indexesMissingFromCache, numberOfMissingFromCache + 1);
+          ArrayUtil.growInRange(
```

Review Comment: I think we could add a comment here saying that `indexesMissingFromCache` cannot grow beyond `categoryPaths`. I was also wondering whether an assert is needed here, but the check in `growInRange` should be enough.
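For readers following along, the grow-with-cap semantics being relied on here can be sketched as follows (a rough stand-alone approximation; the real `ArrayUtil.growInRange` in Lucene may differ in its overallocation details):

```java
import java.util.Arrays;

public class GrowInRangeSketch {
  // Sketch of grow-with-cap semantics: ensure the array can hold at least
  // minLength entries, overallocate ~1.5x for amortized growth, but never
  // exceed maxLength (here, the number of category paths).
  static int[] growInRange(int[] array, int minLength, int maxLength) {
    if (minLength > maxLength) {
      throw new IllegalArgumentException("minLength must be <= maxLength");
    }
    if (array.length >= minLength) {
      return array; // already big enough, no copy
    }
    int overallocated = array.length + (array.length >> 1) + 1;
    int newLength = Math.min(Math.max(minLength, overallocated), maxLength);
    return Arrays.copyOf(array, newLength);
  }

  public static void main(String[] args) {
    System.out.println(growInRange(new int[4], 5, 8).length); // prints 7 (overallocated)
    System.out.println(growInRange(new int[4], 5, 6).length); // prints 6 (capped)
  }
}
```

With the cap in place, the array can never outgrow `categoryPaths.length`, which is why the extra assert discussed above would be redundant.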
Re: [PR] Add support for index sorting with document blocks [lucene]
s1monw commented on code in PR #12829: URL: https://github.com/apache/lucene/pull/12829#discussion_r1415196168

## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java:

```diff
@@ -262,6 +277,73 @@ long updateDocuments(
     }
   }

+  private interface DocValidator extends Consumer {
```

Review Comment: I removed it and went down a different route.
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
jpountz commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1415213374

## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java:

```diff
@@ -212,6 +213,46 @@ public long readLong() throws IOException {
     }
   }

+  @Override
+  public void readGroupVInts(long[] dst, int limit) throws IOException {
+    int i;
+    for (i = 0; i <= limit - 4; i += 4) {
+      readGroupVInt(dst, i);
+    }
+    for (; i < limit; ++i) {
+      dst[i] = readVInt();
+    }
+  }
+
+  private void readGroupVInt(long[] dst, int offset) throws IOException {
+    ByteBuffer block = blocks[blockIndex(pos)];
+    int blockOffset = blockOffset(pos);
+    if (block.limit() - blockOffset < GroupVIntUtil.MAX_LENGTH_PER_GROUP) {
+      GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset);
+      return;
+    }
+
+    final int flag = readByte() & 0xFF;
+
+    final int n1Minus1 = flag >> 6;
+    final int n2Minus1 = (flag >> 4) & 0x03;
+    final int n3Minus1 = (flag >> 2) & 0x03;
+    final int n4Minus1 = flag & 0x03;
+
+    blockOffset = blockOffset(pos);
```

Review Comment: We already computed `blockOffset()` above, can we avoid computing it twice, e.g. by replacing the `readByte()` call with `block.get(blockOffset++)`?

## lucene/core/src/java/org/apache/lucene/store/DataOutput.java:

```diff
@@ -29,6 +30,7 @@
  * internal state like file position).
  */
 public abstract class DataOutput {
+  private BytesRefBuilder groupVIntBytes = new BytesRefBuilder();
```

Review Comment: Can it be made final?

## lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseDirectoryTestCase.java:

```diff
@@ -1438,4 +1440,68 @@ public void testListAllIsSorted() throws IOException {
       assertArrayEquals(expected, actual);
     }
   }
+
+  public void testDataTypes() throws IOException {
+    final long[] values = new long[] {43, 12345, 123456, 1234567890};
+    try (Directory dir = getDirectory(createTempDir("testDataTypes"))) {
+      IndexOutput out = dir.createOutput("test", IOContext.DEFAULT);
+      out.writeByte((byte) 43);
+      out.writeShort((short) 12345);
+      out.writeInt(1234567890);
+      out.writeGroupVInts(values, 4);
+      out.writeLong(1234567890123456789L);
+      out.close();
+
+      long[] restored = new long[4];
+      IndexInput in = dir.openInput("test", IOContext.DEFAULT);
+      assertEquals(43, in.readByte());
+      assertEquals(12345, in.readShort());
+      assertEquals(1234567890, in.readInt());
+      in.readGroupVInts(restored, 4);
+      assertArrayEquals(values, restored);
+      assertEquals(1234567890123456789L, in.readLong());
+      in.close();
+    }
+  }
+
+  public void testGroupVInt() throws IOException {
+    try (Directory dir = getDirectory(createTempDir("testGroupVInt"))) {
+      // test fallbackReadGroupVInt
+      doTestGroupVInt(dir, 5, 1, 6, 8);
+
+      // test large data to covers multiple blocks in ByteBuffersDataInput
```

Review Comment: adding it explicitly since this is probably the most important case to cover

```suggestion
      // test large data to covers multiple blocks in ByteBuffersDataInput and MMapDirectory via TestMultiMMap
```

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:

```diff
@@ -303,6 +304,48 @@ public byte readByte(long pos) throws IOException {
     }
   }

+  @Override
+  public void readGroupVInts(long[] dst, int limit) throws IOException {
+    int i;
+    for (i = 0; i <= limit - 4; i += 4) {
+      readGroupVInt(dst, i);
+    }
+    for (; i < limit; ++i) {
+      dst[i] = readVInt();
+    }
+  }
+
+  private void readGroupVInt(long[] dst, int offset) throws IOException {
+    if (curSegment.byteSize() - curPosition < GroupVIntUtil.MAX_LENGTH_PER_GROUP) {
+      GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset);
+      return;
+    }
+
+    try {
+      final int flag = readByte() & 0xFF;
```

Review Comment: Since we already did the job of checking that we have enough bytes left, we don't need `readByte()` to do it for us?

```suggestion
      final int flag = curSegment.get(LAYOUT_BYTE, curPosition++) & 0xFF;
```

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java:

```diff
@@ -303,6 +304,48 @@ public byte readByte(long pos) throws IOException {
     }
   }

+  @Override
+  public void readGroupVInts(long[] dst, int limit) throws IOException {
+    int i;
+    for (i = 0; i <= limit - 4; i += 4) {
+      readGroupVInt(dst, i);
+    }
+    for (; i < limit; ++i) {
+      dst[i] = readVInt();
+    }
+  }
+
+  private void readGroupVInt(long[] dst, int offset) throws IOException {
+    if (curSegment.byteSize() - curPosition < GroupVIntUtil.MAX_LENGTH_PER_GROUP) {
+      GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset);
+      return;
+    }
+
```

Review Comment: Maybe copy `curSegment` to a local variable here, e.g. `MemorySegm
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
jpountz commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1415234949

## lucene/core/src/java/org/apache/lucene/store/DataOutput.java:

```diff
@@ -324,4 +326,42 @@ public void writeSetOfStrings(Set set) throws IOException {
       writeString(value);
     }
   }
+
+  /**
+   * Encode integers using group-varint. It uses {@link DataOutput#writeVInt VInt} to encode tail
+   * values that are not enough for a group. We need a long[] because this is what postings are
+   * using, all longs are actually required to be integers.
+   *
+   * @param values the values to write
+   * @param limit the number of values to write.
+   */
+  public void writeGroupVInts(long[] values, int limit) throws IOException {
```

Review Comment: I'm tempted to mark it `@lucene.experimental` to allow us to change the signature to use ints in the future. And likewise for its counterpart on `DataInput`?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
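The flag-byte layout that keeps coming up in these reviews (`flag >> 6`, `(flag >> 4) & 0x03`, ...) can be illustrated with a stand-alone sketch. This is not Lucene's `GroupVIntUtil`; it is a simplified illustration of the general group-varint idea: one flag byte holds four 2-bit "length minus one" fields, followed by each value little-endian in that many bytes.

```java
import java.util.Arrays;

public class GroupVIntSketch {
  // Encode exactly 4 ints: 1 flag byte (2 bits per value encoding
  // byteLength - 1, first value in the high bits), then each value
  // little-endian using that many bytes. Returns bytes written.
  static int encode(int[] values, byte[] out) {
    int pos = 1;
    int flag = 0;
    for (int i = 0; i < 4; i++) {
      int v = values[i];
      // number of bytes needed for v (v | 1 so that v == 0 still takes 1 byte)
      int numBytes = (32 - Integer.numberOfLeadingZeros(v | 1) + 7) / 8;
      flag |= (numBytes - 1) << ((3 - i) * 2);
      for (int b = 0; b < numBytes; b++) {
        out[pos++] = (byte) (v >>> (8 * b));
      }
    }
    out[0] = (byte) flag;
    return pos;
  }

  // Decode 4 ints written by encode(). Returns bytes consumed.
  static int decode(byte[] in, int[] dst) {
    int flag = in[0] & 0xFF;
    int pos = 1;
    for (int i = 0; i < 4; i++) {
      int numBytes = ((flag >>> ((3 - i) * 2)) & 0x03) + 1;
      int v = 0;
      for (int b = 0; b < numBytes; b++) {
        v |= (in[pos++] & 0xFF) << (8 * b);
      }
      dst[i] = v;
    }
    return pos;
  }

  public static void main(String[] args) {
    int[] values = {43, 12345, 123456, 1234567890};
    byte[] buf = new byte[17]; // worst case: 1 flag byte + 4 * 4 value bytes
    int written = encode(values, buf);
    int[] restored = new int[4];
    decode(buf, restored);
    System.out.println(written + " " + Arrays.toString(restored));
    // prints: 11 [43, 12345, 123456, 1234567890]
  }
}
```

A group always holds exactly four values; trailing values that do not fill a group are what the `writeVInt` tail handling in the javadoc above covers.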
Re: [PR] Rewrite JavaScriptCompiler to use modern JVM features (Java 17) [lucene]
uschindler merged PR #12873: URL: https://github.com/apache/lucene/pull/12873
Re: [I] Maybe expression compiler should cache recently compiled expressions? [LUCENE-7882] [lucene]
uschindler closed issue #8933: Maybe expression compiler should cache recently compiled expressions? [LUCENE-7882] URL: https://github.com/apache/lucene/issues/8933
Re: [I] Maybe expression compiler should cache recently compiled expressions? [LUCENE-7882] [lucene]
uschindler commented on issue #8933: URL: https://github.com/apache/lucene/issues/8933#issuecomment-1840518911

The new code also fixes the problem with hidden stack frames caused by the hidden-class feature. I am still hoping for an option in the future to turn on some stack frames, like is possible inside the JDK with annotations.
Re: [PR] LUCENE-10641: IndexSearcher#setTimeout should also abort query rewrites, point ranges and vector searches [lucene]
mikemccand commented on code in PR #12345: URL: https://github.com/apache/lucene/pull/12345#discussion_r1415450166

## lucene/core/src/java/org/apache/lucene/index/ExitableIndexReader.java:

```diff
@@ -0,0 +1,436 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import org.apache.lucene.util.Bits;
+
+/**
+ * The {@link ExitableIndexReader} is used to timeout I/O operation which is done during query
+ * rewrite. After this time is exceeded, the search thread is stopped by throwing a {@link
+ * ExitableIndexReader.TimeExceededException}
+ */
+public final class ExitableIndexReader extends IndexReader {
+  private final IndexReader indexReader;
+  private final QueryTimeout queryTimeout;
+
+  /**
+   * Create a ExitableIndexReader wrapper over another {@link IndexReader} with a specified timeout.
+   *
+   * @param indexReader the wrapped {@link IndexReader}
+   * @param queryTimeout max time allowed for collecting hits after which {@link
+   *     ExitableIndexReader.TimeExceededException} is thrown
+   */
+  public ExitableIndexReader(IndexReader indexReader, QueryTimeout queryTimeout) {
+    this.indexReader = indexReader;
+    this.queryTimeout = queryTimeout;
+    doWrapIndexReader(indexReader, queryTimeout);
+  }
+
+  /** Thrown when elapsed search time exceeds allowed search time. */
+  @SuppressWarnings("serial")
+  static class TimeExceededException extends RuntimeException {
+    private TimeExceededException() {
+      super("TimeLimit Exceeded");
+    }
+
+    private TimeExceededException(Exception e) {
+      super(e);
+    }
+  }
+
+  @Override
+  public TermVectors termVectors() throws IOException {
+    if (queryTimeout.shouldExit()) {
+      throw new ExitableIndexReader.TimeExceededException();
+    }
+    return indexReader.termVectors();
+  }
+
+  @Override
+  public int numDocs() {
+    if (queryTimeout.shouldExit()) {
+      throw new ExitableIndexReader.TimeExceededException();
+    }
+    return indexReader.numDocs();
+  }
+
+  @Override
+  public int maxDoc() {
+    if (queryTimeout.shouldExit()) {
+      throw new ExitableIndexReader.TimeExceededException();
+    }
+    return indexReader.maxDoc();
+  }
+
+  @Override
+  public StoredFields storedFields() throws IOException {
+    if (queryTimeout.shouldExit()) {
+      throw new ExitableIndexReader.TimeExceededException();
+    }
+    return indexReader.storedFields();
+  }
+
+  @Override
+  protected void doClose() throws IOException {
+    if (queryTimeout.shouldExit()) {
+      throw new ExitableIndexReader.TimeExceededException();
+    }
+    indexReader.doClose();
+  }
+
+  @Override
+  public IndexReaderContext getContext() {
+    if (queryTimeout.shouldExit()) {
+      throw new ExitableIndexReader.TimeExceededException();
+    }
+    return indexReader.getContext();
+  }
+
+  @Override
+  public CacheHelper getReaderCacheHelper() {
+    if (queryTimeout.shouldExit()) {
+      throw new ExitableIndexReader.TimeExceededException();
+    }
+    return indexReader.getReaderCacheHelper();
+  }
+
+  @Override
+  public int docFreq(Term term) throws IOException {
+    if (queryTimeout.shouldExit()) {
+      throw new ExitableIndexReader.TimeExceededException();
+    }
+    return indexReader.docFreq(term);
+  }
+
+  @Override
+  public long totalTermFreq(Term term) throws IOException {
+    if (queryTimeout.shouldExit()) {
+      throw new ExitableIndexReader.TimeExceededException();
+    }
+    return indexReader.totalTermFreq(term);
+  }
+
+  @Override
+  public long getSumDocFreq(String field) throws IOException {
+    if (queryTimeout.shouldExit()) {
+      throw new ExitableIndexReader.TimeExceededException();
+    }
+    return indexReader.getSumDocFreq(field);
+  }
+
+  @Override
+  public int getDocCount(String field) throws IOException {
+    if (queryTimeout.shouldExit()) {
+      throw new ExitableIndexReader.TimeExceededException();
+    }
+    return indexReader.getDocCount(field);
+  }
+
+  @Override
+  public long getSumTotalTermFreq(String field) throws IOException {
```
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
mikemccand commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1840646081

> > I tested just how much slower the ByteBuffer based store is than the FST's BytesStore:
>
> I assume this is before the last iteration that does the freeze, is that right? What do you think about the last results?

Sorry yes -- I'll try to come back to this soon, and re-test. Thanks @dungba88
Re: [I] Do we suboptimally call ByteBuffersDataOutput.toDataInput()? [lucene]
mikemccand commented on issue #12852: URL: https://github.com/apache/lucene/issues/12852#issuecomment-1840648672

At least we should add a warning to the javadocs here. And we should audit other places that are frequently calling `.toDataInput()` (e.g. block tree terms dict writing?). This method is surprisingly slow compared to simply juggling our own `byte[]` under the hood, at least in FST's (well, `Test2BFSTs`) usage.
Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]
shubhamvishu commented on code in PR #12868: URL: https://github.com/apache/lucene/pull/12868#discussion_r1415528939

## lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java:

```diff
@@ -150,9 +150,10 @@ private FuzzySet(FixedBitSet filter, int bloomSize, int hashCount) {
    * @return NO or MAYBE
    */
   public ContainsResult contains(BytesRef value) {
-    long hash = hashFunction.hash(value);
-    int msb = (int) (hash >>> Integer.SIZE);
-    int lsb = (int) hash;
+    long[] hash = hashFunction.hash128(value);
+
+    int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1;
+    int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1;
```

Review Comment: Ok, I tried using simple `(hash[0] >>> Integer.SIZE)` and `(int) hash[0]`, but I still see the same 2.6% regression. I think the current way seems to produce better positive hash values, and more randomness, than directly using the 2 long values, which is why we see `PKLookup` performance improving by 7-9%.
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
mikemccand commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1415505550

## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:

```diff
@@ -153,6 +180,40 @@ private FSTCompiler(
     }
   }

+  // Get the respective FSTReader of the DataOutput. If the DataOutput is also a FSTReader then we
+  // will use it, otherwise we will return a NullFSTReader. Attempting to read from a FST with
+  // NullFSTReader will throw UnsupportedOperationException
+  private FSTReader toFSTReader(DataOutput dataOutput) {
+    if (dataOutput instanceof FSTReader) {
+      return (FSTReader) dataOutput;
+    }
+    return NULL_FST_READER;
+  }
+
+  /**
+   * This class is used for FST backed by non-FSTReader DataOutput. It does not allow getting the
+   * reverse BytesReader nor writing to a DataOutput.
+   */
+  private static final class NullFSTReader implements FSTReader {
+
+    @Override
+    public long ramBytesUsed() {
+      return 0;
+    }
+
+    @Override
+    public FST.BytesReader getReverseBytesReader() {
+      throw new UnsupportedOperationException(
+          "NullFSTReader does not support getReverseBytesReader()");
+    }
+
+    @Override
+    public void writeTo(DataOutput out) {
+      throw new UnsupportedOperationException("NullFSTReader does not support writeTo(DataOutput)");
```

Review Comment: Could we improve this message, e.g. something like `FST was not constructed with getOnHeapReaderWriter` or so? `NullFSTReader` isn't something the user should even know about (it's a private class implementation detail).

## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:

```diff
@@ -218,13 +279,19 @@ public Builder allowFixedLengthArcs(boolean allowFixedLengthArcs) {
     }

     /**
-     * How many bits wide to make each byte[] block in the BytesStore; if you know the FST will be
-     * large then make this larger. For example 15 bits = 32768 byte pages.
+     * Set the {@link DataOutput} which is used for low-level writing of FST. If you want the FST to
+     * be immediately readable, you need to use a DataOutput that also implements {@link FSTReader},
```

Review Comment: Does `FSTReader` need to be public and known to users? Could we make it package private and change this to say "you need to use `FSTCompiler#getOnHeapReaderWriter`"?

## lucene/test-framework/src/java/org/apache/lucene/tests/util/fst/FSTTester.java:

```diff
@@ -316,6 +313,15 @@ public FST doTest() throws IOException {
     return fst;
   }

+  protected FST compile(FSTCompiler fstCompiler) throws IOException {
+    return fstCompiler.compile();
+  }
+
+  protected FSTCompiler.Builder getFSTBuilder() {
+    return new FSTCompiler.Builder<>(
```

Review Comment: Could we randomize whether the "on disk" vs "on heap" `DataOutput` is used, so that `TestFSTs` randomly picks? We should randomize all three possibilities:

* Create & use on-heap FST
* Create on heap, save to disk, load FST from disk
* Stream to disk, then load FST from disk

## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:

```diff
@@ -165,7 +226,7 @@ public static class Builder {
     private final Outputs outputs;
     private double suffixRAMLimitMB = 32.0;
     private boolean allowFixedLengthArcs = true;
-    private int bytesPageBits = 15;
+    private DataOutput dataOutput = getOnHeapReaderWriter(15);
```

Review Comment: Hmm can we avoid even creating this, unless the caller didn't pass a `Builder.dataOutput` themselves?

## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java:

```diff
@@ -120,22 +125,44 @@ public class FSTCompiler {
   final float directAddressingMaxOversizingFactor;
   long directAddressingExpansionCredit;

-  final BytesStore bytes;
+  // the DataOutput to stream the FST bytes to
+  final DataOutput dataOutput;
+
+  // buffer to store bytes for the one node we are currently writing
+  final GrowableByteArrayDataOutput scratchBytes = new GrowableByteArrayDataOutput();
+
+  private long numBytesWritten;
+
+  /**
+   * Get an on-heap DataOutput that allows the FST to be read immediately after writing.
```

Review Comment: Add `, and also optionally saved to an external DataOutput`? I.e. with these changes we support these possible FST workflows:

* Build FST and use it immediately entirely in RAM and then discard it
* Build FST and use it immediately entirely in RAM and also save it to disk (any other `DataOutput`), and load it later and use it
* Build FST but stream it immediately to disk (except the `FSTMetaData`, saved at the end) and you cannot use it when done unless you go and open your own `DataInput` on the backing store and make a new FST from that

Could we include this enumeration in `FSTCompiler`'s class javadocs?

Also: could you update `Test2BFSTs` to also test the 3rd bullet above? Right now it onl
Re: [PR] LUCENE-10236: Update field-weight used in CombinedFieldQuery scoring calculation (9.0.1 Backporting) [lucene]
mikemccand commented on PR #587: URL: https://github.com/apache/lucene/pull/587#issuecomment-1840761468

@zacharymorn given that it's been so long, and we are unlikely to release another 9.0.x bugfix release, I think we should close this. The fix is present in Lucene 9.1.0 and above.
Re: [PR] LUCENE-10236: Update field-weight used in CombinedFieldQuery scoring calculation (9.0.1 Backporting) [lucene]
mikemccand closed pull request #587: LUCENE-10236: Update field-weight used in CombinedFieldQuery scoring calculation (9.0.1 Backporting) URL: https://github.com/apache/lucene/pull/587
Re: [PR] Add support for index sorting with document blocks [lucene]
mikemccand commented on PR #12829: URL: https://github.com/apache/lucene/pull/12829#issuecomment-1840782006

Thanks @s1monw -- I'll try to review soon.

> The only thing I am torn on is, if we set the num of children as a value for the DV field then I guess we should have a good usecase for that?

One thing this would be helpful for is a stronger consistency check in `CheckIndex`? Without doing this we can just validate that the parent docs are sorted correctly, whereas today we verify that all docs are sorted correctly (since the user must provide a sort congruent with their doc-blocks). So we are weakening `CheckIndex` a bit by skipping checking the sort of the child docs ... if we stored the number of children in the new DV field, we could gain that back?
Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]
jpountz commented on code in PR #12868: URL: https://github.com/apache/lucene/pull/12868#discussion_r1415603528

## lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java:

```diff
@@ -150,9 +150,10 @@ private FuzzySet(FixedBitSet filter, int bloomSize, int hashCount) {
    * @return NO or MAYBE
    */
   public ContainsResult contains(BytesRef value) {
-    long hash = hashFunction.hash(value);
-    int msb = (int) (hash >>> Integer.SIZE);
-    int lsb = (int) hash;
+    long[] hash = hashFunction.hash128(value);
+
+    int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1;
+    int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1;
```

Review Comment: If what you are saying is true, this would be a bug in murmur3. I'm assuming that murmur3 doesn't have such a bug, so either there is actually a bug in our murmur3 implementation, or this regression is "bad luck"?
Re: [I] Do we suboptimally call ByteBuffersDataOutput.toDataInput()? [lucene]
dweiss commented on issue #12852: URL: https://github.com/apache/lucene/issues/12852#issuecomment-1840786092 I don't think it was ever meant to be called repeatedly in fast bursts. It was meant to provide a direct reader for previously written buffers, once the writing is completed.
Re: [I] Do we suboptimally call ByteBuffersDataOutput.toDataInput()? [lucene]
dweiss commented on issue #12852: URL: https://github.com/apache/lucene/issues/12852#issuecomment-1840793857 Also, not sure why it's slow - perhaps because of the wrapping in read-only buffers when toBufferList is called? If so, then this could be internally tweaked by calling toWriteableBufferList inside toDataInput (since we know data input is not going to be messing with the buffers...).
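The pattern described above, building the reader view once after writing completes instead of re-wrapping every buffer per call, can be sketched with plain `java.nio` buffers. This is a hypothetical illustration (the class and method names are mine, not Lucene's `ByteBuffersDataOutput` API):

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class CachedReaderSketch {
    private final List<ByteBuffer> blocks = new ArrayList<>();
    private List<ByteBuffer> cachedView; // computed once, reused until the next write

    void write(byte[] data) {
        blocks.add(ByteBuffer.wrap(data));
        cachedView = null; // invalidate the cached view on write
    }

    // Analogous to calling toDataInput() in fast bursts: every call wraps
    // every block in a fresh read-only view.
    List<ByteBuffer> viewPerCall() {
        List<ByteBuffer> out = new ArrayList<>(blocks.size());
        for (ByteBuffer b : blocks) out.add(b.asReadOnlyBuffer());
        return out;
    }

    // The cheaper pattern: compute the view once, after writing is done.
    List<ByteBuffer> viewCached() {
        if (cachedView == null) cachedView = viewPerCall();
        return cachedView;
    }
}
```

The caching here is just one option; as suggested above, skipping the read-only wrapping entirely (the `toWriteableBufferList` route) would remove the per-call allocation as well.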
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1415625009 ## lucene/core/src/test/org/apache/lucene/util/fst/TestFSTDataOutputWriter.java: ## @@ -0,0 +1,230 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.util.fst; + +import static org.apache.lucene.tests.util.fst.FSTTester.toIntsRef; + +import java.io.ByteArrayInputStream; +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.Random; +import org.apache.lucene.store.DataInput; +import org.apache.lucene.store.DataOutput; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.InputStreamDataInput; +import org.apache.lucene.store.OutputStreamDataOutput; +import org.apache.lucene.tests.store.MockDirectoryWrapper; +import org.apache.lucene.tests.util.LuceneTestCase; +import org.apache.lucene.tests.util.TestUtil; +import org.apache.lucene.tests.util.fst.FSTTester; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.IntsRef; + +public class TestFSTDataOutputWriter extends LuceneTestCase { + + private MockDirectoryWrapper dir; + + @Override + public void setUp() throws Exception { +super.setUp(); +dir = newMockDirectory(); + } + + @Override + public void tearDown() throws Exception { +// can be null if we force simpletext (funky, some kind of bug in test runner maybe) +if (dir != null) { + dir.close(); +} +super.tearDown(); + } + + public void testRandom() throws Exception { + +final int iters = atLeast(10); +final int maxBytes = TEST_NIGHTLY ? 20 : 2; +for (int iter = 0; iter < iters; iter++) { + final int numBytes = TestUtil.nextInt(random(), 1, maxBytes); + final byte[] expected = new byte[numBytes]; + final ByteArrayOutputStream baos = new ByteArrayOutputStream(); + final DataOutput dataOutput = new OutputStreamDataOutput(baos); Review Comment: Ah, I think it was leftover code from the previous implementation. Will remove it.
Re: [PR] Add support for index sorting with document blocks [lucene]
mikemccand commented on code in PR #12829: URL: https://github.com/apache/lucene/pull/12829#discussion_r1415586966 ## lucene/core/src/test/org/apache/lucene/index/TestAddIndexes.java: ## @@ -1678,6 +1678,51 @@ public void testIllegalIndexSortChange2() throws Exception { IOUtils.close(r1, dir1, w2, dir2); } + public void testIllegalIndexSortChange3() throws Exception { +Directory dir1 = newDirectory(); +IndexWriterConfig iwc1 = newIndexWriterConfig(new MockAnalyzer(random())); +iwc1.setIndexSort(new Sort("foobar", new SortField("foo", SortField.Type.INT))); + +RandomIndexWriter w1 = new RandomIndexWriter(random(), dir1, iwc1); +Document parent = new Document(); +w1.addDocuments(Arrays.asList(new Document(), new Document(), parent)); +w1.commit(); +w1.addDocuments(Arrays.asList(new Document(), new Document(), parent)); +w1.commit(); +// so the index sort is in fact burned into the index: +w1.forceMerge(1); +w1.close(); + +Directory dir2 = newDirectory(); +IndexWriterConfig iwc2 = newIndexWriterConfig(new MockAnalyzer(random())); +iwc2.setIndexSort(new Sort(new SortField("foo", SortField.Type.INT))); +RandomIndexWriter w2 = new RandomIndexWriter(random(), dir2, iwc2); + +IndexReader r1 = DirectoryReader.open(dir1); +String message = +expectThrows( +IllegalArgumentException.class, +() -> { + w2.addIndexes((SegmentReader) getOnlyLeafReader(r1)); +}) +.getMessage(); +assertEquals( +"cannot change index sort from parent field: foobar to ", +message); + +message = +expectThrows( +IllegalArgumentException.class, +() -> { + w2.addIndexes(dir1); Review Comment: Can we also test that if you `.addIndexes` with the correct configuration (same parent sort field) things are good? 
## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -1198,33 +1198,42 @@ public static Status.IndexSortStatus testSort( comparators[i] = fields[i].getComparator(1, Pruning.NONE).getLeafComparator(readerContext); } - int maxDoc = reader.maxDoc(); - try { - -for (int docID = 1; docID < maxDoc; docID++) { - +LeafMetaData metaData = reader.getMetaData(); +if (metaData.hasBlocks() +&& sort.getParentField() == null +&& metaData.getCreatedVersionMajor() >= Version.LUCENE_10_0_0.major) { + throw new IllegalStateException( + "parent field is not set but the index has block and was created with version: " Review Comment: `has block` -> `has document blocks`? ## lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java: ## @@ -2164,6 +2166,83 @@ public void testSortedIndex() throws Exception { } } + public void testSortedIndexAddDocBlocks() throws Exception { +for (String name : oldSortedNames) { + Path path = createTempDir("sorted"); + InputStream resource = TestBackwardsCompatibility.class.getResourceAsStream(name + ".zip"); + assertNotNull("Sorted index index " + name + " not found", resource); + TestUtil.unzip(resource, path); + + try (Directory dir = newFSDirectory(path)) { +final Sort sort; +try (DirectoryReader reader = DirectoryReader.open(dir)) { + assertEquals(1, reader.leaves().size()); + sort = reader.leaves().get(0).reader().getMetaData().getSort(); + assertNotNull(sort); + searchExampleIndex(reader); +} +// open writer +try (IndexWriter writer = +new IndexWriter( +dir, +newIndexWriterConfig(new MockAnalyzer(random())) +.setOpenMode(OpenMode.APPEND) +.setIndexSort(sort) +.setMergePolicy(newLogMergePolicy( { + // add 10 docs + for (int i = 0; i < 10; i++) { +Document child = new Document(); +child.add(new StringField("relation", "child", Field.Store.NO)); +child.add(new StringField("bid", "" + i, Field.Store.NO)); +child.add(new NumericDocValuesField("dateDV", i)); +Document parent = new Document(); 
+parent.add(new StringField("relation", "parent", Field.Store.NO)); +parent.add(new StringField("bid", "" + i, Field.Store.NO)); +parent.add(new NumericDocValuesField("dateDV", i)); +writer.addDocuments(Arrays.asList(child, child, parent)); +if (random().nextBoolean()) { + writer.flush(); +} + } + if (random().nextBoolean()) { +writer.forceMerge(1); + } + writer.commit(); + try (IndexReader reader = DirectoryReader.open(dir)) { +I
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1415627354 ## lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java: ## @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.fst; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import org.apache.lucene.store.ByteBuffersDataInput; +import org.apache.lucene.store.ByteBuffersDataOutput; +import org.apache.lucene.store.DataOutput; + +/** + * An adapter class to use {@link ByteBuffersDataOutput} as a {@link FSTReader}. 
It allows the FST + * to be readable immediately after writing + */ +final class ReadWriteDataOutput extends DataOutput implements FSTReader, Freezable { + + private final ByteBuffersDataOutput dataOutput; + // the DataInput to read from in case the DataOutput has multiple blocks + private ByteBuffersDataInput dataInput; + // the ByteBuffers to read from in case the DataOutput has a single block + private ByteBuffer byteBuffer; + + public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) { +this.dataOutput = dataOutput; + } + + @Override + public void writeByte(byte b) { +dataOutput.writeByte(b); + } + + @Override + public void writeBytes(byte[] b, int offset, int length) { +dataOutput.writeBytes(b, offset, length); + } + + @Override + public long ramBytesUsed() { +return dataOutput.ramBytesUsed(); + } + + @Override + public void freeze() { +// these operations are costly, so we want to compute it once and cache +List byteBuffers = dataOutput.toWriteableBufferList(); +if (byteBuffers.size() == 1) { Review Comment: I think it didn't handle the single-buffer use case, and ByteBuffersDataInput would fall into the same performance regression problem as BytesStore with multiple blocks (which is due to the fact that the method size is larger and it won't be inlined by the JVM).
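The single-block fast path discussed above can be sketched self-contained. This is a hypothetical illustration (names are mine, not Lucene's): after `freeze()`, a read goes through a tiny method against one `ByteBuffer` when there is exactly one block, which is much easier for the JIT to inline than the general multi-block lookup.

```java
import java.nio.ByteBuffer;
import java.util.List;

public class FreezeSketch {
    private final List<ByteBuffer> blocks; // assumed equal capacity per block
    private ByteBuffer single;             // non-null iff there is exactly one block

    public FreezeSketch(List<ByteBuffer> blocks) {
        this.blocks = blocks;
    }

    // Compute the cached view once, after writing has completed.
    public void freeze() {
        if (blocks.size() == 1) {
            single = blocks.get(0);
        }
    }

    public byte readByte(long pos) {
        if (single != null) {
            return single.get((int) pos); // tiny fast path, inlinable
        }
        // general path: locate the block, then the offset within it
        int blockSize = blocks.get(0).capacity();
        return blocks.get((int) (pos / blockSize)).get((int) (pos % blockSize));
    }
}
```

The design point is that the fast-path branch keeps the hot method body small; the actual regression numbers quoted in the thread would need benchmarking to attribute precisely to inlining.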
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1415641634 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -500,6 +502,12 @@ public FSTMetadata getMetadata() { return metadata; } + /** + * Save the FST to DataOutput. + * + * @param metaOut the DataOutput to write the metadata to + * @param out the DataOutput to write the FST bytes to + */ public void save(DataOutput metaOut, DataOutput out) throws IOException { Review Comment: Yeah. It's also not supported (it throws an exception). I'll add a comment.
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1415666711 ## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ## @@ -218,13 +279,19 @@ public Builder allowFixedLengthArcs(boolean allowFixedLengthArcs) { } /** - * How many bits wide to make each byte[] block in the BytesStore; if you know the FST will be - * large then make this larger. For example 15 bits = 32768 byte pages. + * Set the {@link DataOutput} which is used for low-level writing of FST. If you want the FST to + * be immediately readable, you need to use a DataOutput that also implements {@link FSTReader}, Review Comment: I think when we change the compile() to only return metadata, users need to create the FST with the on-heap FSTReader, and thus it needs to be public.
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
easyice commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1415479977 ## lucene/core/src/java/org/apache/lucene/store/DataOutput.java: ## @@ -324,4 +326,42 @@ public void writeSetOfStrings(Set set) throws IOException { writeString(value); } } + + /** + * Encode integers using group-varint. It uses {@link DataOutput#writeVInt VInt} to encode tail + * values that are not enough for a group. we need a long[] because this is what postings are + * using, all longs are actually required to be integers. + * + * @param values the values to write + * @param limit the number of values to write. + */ + public void writeGroupVInts(long[] values, int limit) throws IOException { Review Comment: okay! ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java: ## @@ -212,6 +213,46 @@ public long readLong() throws IOException { } } + @Override + public void readGroupVInts(long[] dst, int limit) throws IOException { +int i; +for (i = 0; i <= limit - 4; i += 4) { + readGroupVInt(dst, i); +} +for (; i < limit; ++i) { + dst[i] = readVInt(); +} + } + + private void readGroupVInt(long[] dst, int offset) throws IOException { +ByteBuffer block = blocks[blockIndex(pos)]; +int blockOffset = blockOffset(pos); +if (block.limit() - blockOffset < GroupVIntUtil.MAX_LENGTH_PER_GROUP) { + GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset); + return; +} + +final int flag = readByte() & 0xFF; + +final int n1Minus1 = flag >> 6; +final int n2Minus1 = (flag >> 4) & 0x03; +final int n3Minus1 = (flag >> 2) & 0x03; +final int n4Minus1 = flag & 0x03; + +blockOffset = blockOffset(pos); Review Comment: +1, i did the similar change in `BufferedIndexInput#readGroupVInt` ## lucene/core/src/java/org/apache/lucene/store/DataOutput.java: ## @@ -29,6 +30,7 @@ * internal state like file position). 
*/ public abstract class DataOutput { + private BytesRefBuilder groupVIntBytes = new BytesRefBuilder(); Review Comment: +1, I will pay more attention to this issue. ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -303,6 +304,48 @@ public byte readByte(long pos) throws IOException { } } + @Override + public void readGroupVInts(long[] dst, int limit) throws IOException { +int i; +for (i = 0; i <= limit - 4; i += 4) { + readGroupVInt(dst, i); +} +for (; i < limit; ++i) { + dst[i] = readVInt(); +} + } + + private void readGroupVInt(long[] dst, int offset) throws IOException { +if (curSegment.byteSize() - curPosition < GroupVIntUtil.MAX_LENGTH_PER_GROUP) { + GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset); + return; +} + +try { + final int flag = readByte() & 0xFF; Review Comment: +1 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -303,6 +304,48 @@ public byte readByte(long pos) throws IOException { } } + @Override + public void readGroupVInts(long[] dst, int limit) throws IOException { +int i; +for (i = 0; i <= limit - 4; i += 4) { + readGroupVInt(dst, i); +} +for (; i < limit; ++i) { + dst[i] = readVInt(); +} + } + + private void readGroupVInt(long[] dst, int offset) throws IOException { +if (curSegment.byteSize() - curPosition < GroupVIntUtil.MAX_LENGTH_PER_GROUP) { + GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset); + return; +} + Review Comment: Done. I wonder what the impact on the JVM would be if the variable may be updated? Thanks!
## lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseDirectoryTestCase.java: ## @@ -1438,4 +1440,68 @@ public void testListAllIsSorted() throws IOException { assertArrayEquals(expected, actual); } } + + public void testDataTypes() throws IOException { +final long[] values = new long[] {43, 12345, 123456, 1234567890}; +try (Directory dir = getDirectory(createTempDir("testDataTypes"))) { + IndexOutput out = dir.createOutput("test", IOContext.DEFAULT); + out.writeByte((byte) 43); + out.writeShort((short) 12345); + out.writeInt(1234567890); + out.writeGroupVInts(values, 4); + out.writeLong(1234567890123456789L); + out.close(); + + long[] restored = new long[4]; + IndexInput in = dir.openInput("test", IOContext.DEFAULT); + assertEquals(43, in.readByte()); + assertEquals(12345, in.readShort()); + assertEquals(1234567890, in.readInt()); + in.readGroupVInts(restored, 4); + assertArrayEquals(values, restored); + assertEquals(1234567890123456789L, in.readLong()); + in.close(); +} + } + + public void testGroupVInt() throws IOException { +try (Directory dir = getDirectory(createTempDir("t
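The flag-byte layout used by the `readGroupVInt` implementations quoted in this thread (four 2-bit "length minus 1" fields, one per value) can be sketched self-contained. This is a hypothetical round-trip example, not Lucene's actual encoder: the field extraction matches the quoted decoders, but the little-endian byte order and the absence of bounds/fallback handling are my simplifications.

```java
import java.io.ByteArrayOutputStream;

public class GroupVIntSketch {

    // Number of bytes (1..4) needed to encode v, treated as unsigned.
    static int numBytes(int v) {
        if ((v & 0xFFFFFF00) == 0) return 1;
        if ((v & 0xFFFF0000) == 0) return 2;
        if ((v & 0xFF000000) == 0) return 3;
        return 4;
    }

    // Encode exactly 4 ints: 1 flag byte, then each value little-endian.
    public static byte[] encodeGroup(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int flag = 0;
        for (int v : values) flag = (flag << 2) | (numBytes(v) - 1);
        out.write(flag);
        for (int v : values) {
            int n = numBytes(v);
            for (int i = 0; i < n; i++) out.write((v >>> (8 * i)) & 0xFF);
        }
        return out.toByteArray();
    }

    public static int[] decodeGroup(byte[] bytes) {
        int flag = bytes[0] & 0xFF;
        // Same field extraction as the quoted readGroupVInt code:
        // n1Minus1 = flag >> 6, n2Minus1 = (flag >> 4) & 0x03, ...
        int[] lens = {
            (flag >> 6) + 1, ((flag >> 4) & 0x03) + 1,
            ((flag >> 2) & 0x03) + 1, (flag & 0x03) + 1
        };
        int[] values = new int[4];
        int pos = 1;
        for (int k = 0; k < 4; k++) {
            int v = 0;
            for (int i = 0; i < lens[k]; i++) v |= (bytes[pos++] & 0xFF) << (8 * i);
            values[k] = v;
        }
        return values;
    }
}
```

A group of small values thus costs 1 flag byte plus 1 byte per value, which is where the speed-up over plain VInt decoding comes from: the lengths are known up front instead of being discovered byte by byte.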
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
rmuir commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1840885611 > Thank you very much for your suggestions, i had fixed the comments from @jpountz, but the related to Mmapdir will be later(such as java19, java20 support), because we need to confirm whether to change it. i agree that we should be cautious here. in my understanding the optimized code is faster than the main branch probably because the optimized code doesn't need a `switch` statement in `GroupVIntUtil#readLongInGroup`, although it was inlined. > > The JMH output for currently, it's similar to before we replaced `varHander` with `ByteBuffer#getInt()` java17 > > ``` > Benchmark (size) Mode Cnt Score Error Units > GroupVIntBenchmark.byteBuffersReadGroupVInt 64 thrpt 5 4.698 ± 2.039 ops/us > GroupVIntBenchmark.byteBuffersReadGroupVIntBaseline 64 thrpt 5 2.167 ± 0.064 ops/us > GroupVIntBenchmark.mmap_byteBufferReadGroupVInt 64 thrpt 5 7.386 ± 0.524 ops/us > GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline 64 thrpt 5 7.358 ± 1.316 ops/us > GroupVIntBenchmark.nioReadGroupVInt 64 thrpt 5 8.236 ± 1.052 ops/us > GroupVIntBenchmark.nioReadGroupVIntBaseline 64 thrpt 5 5.381 ± 1.134 ops/us > ``` > > java21 > > ``` > Benchmark (size) Mode Cnt Score Error Units > GroupVIntBenchmark.byteBuffersReadGroupVInt 64 thrpt 5 4.622 ± 0.880 ops/us > GroupVIntBenchmark.byteBuffersReadGroupVIntBaseline 64 thrpt 5 1.674 ± 0.245 ops/us > GroupVIntBenchmark.mmap_byteBufferReadGroupVInt 64 thrpt 5 10.165 ± 1.083 ops/us > GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline 64 thrpt 5 5.104 ± 0.388 ops/us > GroupVIntBenchmark.nioReadGroupVInt 64 thrpt 5 9.527 ± 1.556 ops/us > GroupVIntBenchmark.nioReadGroupVIntBaseline 64 thrpt 5 5.351 ± 0.621 ops/us > ``` > > @uschindler Thank you for take a look at this, here is the assembly output for `MemorySegmentIndexInput#readGroupVInt`, using command: `java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly --module-path 
lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh GroupVIntBenchmark.mmap` > > assembly code you need to install hsdis to be able to see instructions and not just binary data from printassembly. see https://github.com/apache/lucene/blob/main/help/jmh.txt
[PR] Fix the declared Exceptions of Expression#evaluate to match those of DoubleValues#doubleValue [lucene]
uschindler opened a new pull request, #12878: URL: https://github.com/apache/lucene/pull/12878 This PR fixes the issue found while coding the benchmark of the Expressions module in #12873: The expressions module looks up variables using the `DoubleValues#doubleValue` method, which throws `IOException`. Because the JVM does not perform any checked-exception checks (only the type system of the javac compiler does), code calling `Expression#evaluate` could suddenly be confronted with a checked `IOException`, although it is not declared. It is not even possible to catch the exception, because the compiler complains that it is not declared. Actually this should also be fixed in 9.x, but it changes the method signature of `Expression#evaluate`, and the impact is not too high. So it is for Lucene 10 only.
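The point above, that checked exceptions are enforced only by javac and not by the JVM, can be demonstrated with the well-known generic "sneaky throw" trick. This is an illustrative sketch (class and method names are mine, not the expressions module's code): a method that declares no checked exception still throws an `IOException` at runtime, and callers cannot even write `catch (IOException e)` around it because javac rejects catching an undeclared checked exception.

```java
import java.io.IOException;

public class SneakyThrowDemo {

    // Generic trick: T is inferred as RuntimeException at the call site,
    // so javac requires no throws-clause, yet the JVM happily throws t.
    @SuppressWarnings("unchecked")
    static <T extends Throwable> RuntimeException sneaky(Throwable t) throws T {
        throw (T) t;
    }

    // Declares no checked exception, but an IOException escapes at runtime,
    // analogous to DoubleValues#doubleValue throwing under evaluate().
    static double evaluate() {
        throw sneaky(new IOException("from DoubleValues.doubleValue"));
    }

    public static void main(String[] args) {
        try {
            evaluate();
        } catch (Exception e) { // catch(IOException) here would not compile
            System.out.println(e.getClass().getSimpleName()); // IOException
        }
    }
}
```

Declaring `IOException` on `Expression#evaluate`, as the PR does, makes the signature honest again so callers can handle the exception normally.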
Re: [PR] Introduce dynamic segment efSearch to Knn{Byte|Float}VectorQuery [lucene]
jimczi commented on PR #12551: URL: https://github.com/apache/lucene/pull/12551#issuecomment-1841063019 Superseded by https://github.com/apache/lucene/pull/12794
Re: [PR] Introduce dynamic segment efSearch to Knn{Byte|Float}VectorQuery [lucene]
jimczi closed pull request #12551: Introduce dynamic segment efSearch to Knn{Byte|Float}VectorQuery URL: https://github.com/apache/lucene/pull/12551
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1415869383 ## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ## @@ -120,22 +125,44 @@ public class FSTCompiler { final float directAddressingMaxOversizingFactor; long directAddressingExpansionCredit; - final BytesStore bytes; + // the DataOutput to stream the FST bytes to + final DataOutput dataOutput; + + // buffer to store bytes for the one node we are currently writing + final GrowableByteArrayDataOutput scratchBytes = new GrowableByteArrayDataOutput(); + + private long numBytesWritten; + + /** + * Get an on-heap DataOutput that allows the FST to be read immediately after writing. Review Comment: I added a new Test2BFSTOffHeap and ran it:
```
10: 424 RAM bytes used; 39257811 FST bytes; 19189176 nodes; took 23 seconds
20: 424 RAM bytes used; 78522623 FST bytes; 38378071 nodes; took 49 seconds
30: 424 RAM bytes used; 117788163 FST bytes; 57567190 nodes; took 80 seconds
40: 424 RAM bytes used; 157053095 FST bytes; 76756389 nodes; took 107 seconds
50: 424 RAM bytes used; 196318494 FST bytes; 95945639 nodes; took 133 seconds
60: 424 RAM bytes used; 235583412 FST bytes; 115134691 nodes; took 170 seconds
70: 480 RAM bytes used; 274866378 FST bytes; 134324199 nodes; took 198 seconds
80: 480 RAM bytes used; 314246540 FST bytes; 153513668 nodes; took 222 seconds
90: 480 RAM bytes used; 353626848 FST bytes; 172703151 nodes; took 245 seconds
100: 480 RAM bytes used; 393006717 FST bytes; 191892620 nodes; took 277 seconds
110: 480 RAM bytes used; 432387052 FST bytes; 211082115 nodes; took 311 seconds
120: 480 RAM bytes used; 471766692 FST bytes; 230271461 nodes; took 334 seconds
130: 480 RAM bytes used; 511147081 FST bytes; 249461034 nodes; took 357 seconds
```
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1415871453 ## lucene/test-framework/src/java/org/apache/lucene/tests/util/fst/FSTTester.java: ## @@ -316,6 +313,15 @@ public FST doTest() throws IOException { return fst; } + protected FST compile(FSTCompiler fstCompiler) throws IOException { +return fstCompiler.compile(); + } + + protected FSTCompiler.Builder getFSTBuilder() { +return new FSTCompiler.Builder<>( Review Comment: I added this test along with the 2B nodes off-heap
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1415876855 ## lucene/core/src/test/org/apache/lucene/util/fst/Test2BFSTOffHeap.java: ## @@ -0,0 +1,341 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.util.fst; + +import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite; +import java.util.Arrays; +import java.util.Random; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.store.MMapDirectory; +import org.apache.lucene.tests.util.LuceneTestCase; +import org.apache.lucene.tests.util.LuceneTestCase.SuppressSysoutChecks; +import org.apache.lucene.tests.util.TimeUnits; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.IntsRef; + +// Similar to Test2BFST but will build and read the FST off-heap and can be run with small heap + +// Run something like this: +//./gradlew test --tests Test2BFSTOffHeap -Dtests.verbose=true --max-workers=1 + +// @Ignore("Requires tons of heap to run (30 GB hits OOME but 35 GB passes after ~4.5 hours)") Review Comment: Ah, I should have turned this @Ignore on in the PR!
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1841102583 Thank you @rmuir, the new assembly output is here :) CC @uschindler https://github.com/easyice/lucene_files/blob/main/MemorySegmentIndexInput.readGroupVInt.asm.txt
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1841347419 Hi, Did you create this one with the jmh option `-perfasm`? This looks like unoptimized code generated only by C1, not by the C2 compiler. It would be good to have 2 asm dumps: one with your patch and one where the benchmark just calls the inherited default impl on Mmapdir.
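For anyone reproducing this, a sketch of how such a dump can be captured with JMH's `perfasm` profiler so that it shows C2-compiled (hottest) code rather than C1 output. The gradle task and jar path below are assumptions, not the exact Lucene build layout, and `perfasm` needs the hsdis disassembler library installed to show real assembly:

```shell
# Build the JMH benchmark jar, then run the benchmark with -prof perfasm.
# Paths and the benchmark name are assumptions; adjust to the local build.
./gradlew -p lucene/benchmark-jmh assemble
java -jar lucene/benchmark-jmh/build/benchmarks/*.jar \
    readGroupVInt -prof perfasm > contender.asm.txt
```

Running the same command against a branch without the patch gives the matching baseline dump.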
Re: [PR] Fix the declared Exceptions of Expression#evaluate to match those of DoubleValues#doubleValue [lucene]
uschindler commented on PR #12878: URL: https://github.com/apache/lucene/pull/12878#issuecomment-1841418704 > LGTM. I think an explicit IOException is better than wrapping in UncheckedIOException. The UncheckedIOException was just a workaround when I implemented the benchmark code. Wrapping inside ASM bytecode generation would be a disaster, so fixing the bug properly is the right call.
Re: [PR] Fix the declared Exceptions of Expression#evaluate to match those of DoubleValues#doubleValue [lucene]
uschindler commented on code in PR #12878: URL: https://github.com/apache/lucene/pull/12878#discussion_r1416143382 ## lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/ExpressionsBenchmark.java: ## @@ -83,12 +82,8 @@ private static double ident(double v) { private static final Expression NATIVE_IDENTITY_EXPRESSION = new Expression(NATIVE_IDENTITY_NAME, new String[] {"x"}) { @Override -public double evaluate(DoubleValues[] functionValues) { - try { -return functionValues[0].doubleValue(); - } catch (IOException e) { -throw new UncheckedIOException(e); Review Comment: this was just the workaround I applied that made me aware of this bug in the expressions module.
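To illustrate the shape of the fix with hypothetical mirror classes (these are illustrative stand-ins, not Lucene's actual `Expression` and `DoubleValues`): once `evaluate` declares `IOException` to match `doubleValue()`, an implementation can propagate the checked exception directly instead of wrapping it in `UncheckedIOException`:

```java
import java.io.IOException;

// Hypothetical stand-ins for DoubleValues and Expression; names and
// shapes are illustrative only.
abstract class MiniDoubleValues {
  abstract double doubleValue() throws IOException; // may perform I/O
}

abstract class MiniExpression {
  // Declaring IOException here matches doubleValue(), so generated or
  // hand-written subclasses no longer need a try/catch wrapper.
  abstract double evaluate(MiniDoubleValues[] values) throws IOException;
}

class MiniIdentity extends MiniExpression {
  @Override
  double evaluate(MiniDoubleValues[] values) throws IOException {
    return values[0].doubleValue(); // propagates IOException as-is
  }
}
```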
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dweiss commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1416181846 ## lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java: ## @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.fst; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import org.apache.lucene.store.ByteBuffersDataInput; +import org.apache.lucene.store.ByteBuffersDataOutput; +import org.apache.lucene.store.DataOutput; + +/** + * An adapter class to use {@link ByteBuffersDataOutput} as a {@link FSTReader}. 
It allows the FST + to be readable immediately after writing + */ +final class ReadWriteDataOutput extends DataOutput implements FSTReader, Freezable { + + private final ByteBuffersDataOutput dataOutput; + // the DataInput to read from in case the DataOutput has multiple blocks + private ByteBuffersDataInput dataInput; + // the ByteBuffers to read from in case the DataOutput has a single block + private ByteBuffer byteBuffer; + + public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) { +this.dataOutput = dataOutput; + } + + @Override + public void writeByte(byte b) { +dataOutput.writeByte(b); + } + + @Override + public void writeBytes(byte[] b, int offset, int length) { +dataOutput.writeBytes(b, offset, length); + } + + @Override + public long ramBytesUsed() { +return dataOutput.ramBytesUsed(); + } + + @Override + public void freeze() { +// these operations are costly, so we want to compute it once and cache +List byteBuffers = dataOutput.toWriteableBufferList(); +if (byteBuffers.size() == 1) { Review Comment: I don't think it has anything to do with method sizes. I think it's got more to do with ByteBuffersDataInput wrapping input buffers with asReadOnlyBuffer: ``` public ByteBuffersDataInput(List buffers) { ensureAssumptions(buffers); this.blocks = buffers.toArray(ByteBuffer[]::new); for (int i = 0; i < blocks.length; ++i) { blocks[i] = blocks[i].asReadOnlyBuffer().order(ByteOrder.LITTLE_ENDIAN); } ``` It also eagerly preallocates arrays for float and long buffers - something that was added later (I wasn't aware), which may also contribute to performance if you call this method millions of times. But this is only my gut feeling, it'd have to be verified with a profile run.
Re: [PR] Rewrite JavaScriptCompiler to use modern JVM features (Java 17) [lucene]
uschindler commented on PR #12873: URL: https://github.com/apache/lucene/pull/12873#issuecomment-1841494970 > While writing the benchmark I figured out that the Expression base class has some problems with a missing `IOException` on the `evaluate(DoubleValues[])` method. As bytecode knows no exceptions in throws clauses, the evaluation method may throw `IOException` unexpected because `DoubleValues#doubleValue()` is defined to throw `IOException`. This is a bug and should be fixed separately. I will open an issue to fix this in main. This is the follow-up PR: #12878
Re: [PR] [10.0] Removing TermInSetQuery varargs ctor [lucene]
gsmiller commented on PR #12837: URL: https://github.com/apache/lucene/pull/12837#issuecomment-1841521890 I think what you said sounds right @slow-J. Having another look at this and the associated bp PR today. Thanks!
Re: [PR] [9.10] Mark TermInSetQuery ctors with varargs terms as deprecated [lucene]
gsmiller commented on code in PR #12864: URL: https://github.com/apache/lucene/pull/12864#discussion_r1416215309 ## lucene/CHANGES.txt: ## @@ -7,7 +7,8 @@ http://s.apache.org/luceneversions API Changes - -(No changes) +* GITHUB#12243: Mark TermInSetQuery ctors with varargs terms as @Deprecated. SortedSetDocValuesField#newSlowSetQuery, +SortedDocValuesField#newSlowSetQuery, KeywordField#newSetQuery now take a collection of terms as a param. (Jakub Slowinski) Review Comment: minor: let's left-pad with some whitespace like we do with other changes entries ## lucene/core/src/java/org/apache/lucene/document/KeywordField.java: ## @@ -170,7 +171,9 @@ public static Query newExactQuery(String field, String value) { * @param values the set of values to match * @throws NullPointerException if {@code field} is null. * @return a query matching documents with this exact value + * @deprecated Use the method with a collection of values instead. Review Comment: minor: let's use the `{@link ...}` javadoc to reference the methods users should use (in all places where we're using this annotation)?
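As a sketch of the suggested javadoc style (a toy class, not the real `KeywordField`; all names here are illustrative), the deprecated varargs overload can point readers at its replacement with `{@link}`:

```java
import java.util.Arrays;
import java.util.Collection;

// Toy stand-in for a KeywordField-style API; names are illustrative only.
class SetQueryExample {
  /**
   * @deprecated Use {@link #newSetQuery(String, Collection)} instead, which
   *     avoids copying when callers already hold a collection of terms.
   */
  @Deprecated
  static String newSetQuery(String field, String... values) {
    return newSetQuery(field, Arrays.asList(values));
  }

  /** Preferred overload: accepts any collection of terms. */
  static String newSetQuery(String field, Collection<String> values) {
    return field + " IN (" + values.size() + " terms)";
  }
}
```

With `{@link #newSetQuery(String, Collection)}` in the `@deprecated` tag, generated javadoc hyperlinks directly to the replacement method.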
Re: [PR] [10.0] Removing TermInSetQuery varargs ctor [lucene]
gsmiller commented on code in PR #12837: URL: https://github.com/apache/lucene/pull/12837#discussion_r1416221972 ## lucene/CHANGES.txt: ## @@ -67,6 +67,8 @@ API Changes * GITHUB#11023: Adding -level param to CheckIndex, making the old -fast param the default behaviour. (Jakub Slowinski) +* GITHUB#12243: Remove TermInSetQuery ctors taking varargs param. KeywordField#newSetQuery now takes a collection. (Jakub Slowinski) Review Comment: I wonder if we should "forward-port" the CHANGES entry from your associated back-port PR as well? Otherwise the CHANGES message under 9.10 noting the deprecation will get lost with 10.0.
Re: [PR] [9.10] Mark TermInSetQuery ctors with varargs terms as deprecated [lucene]
slow-J commented on PR #12864: URL: https://github.com/apache/lucene/pull/12864#issuecomment-1841640724 > Mark TermInSetQuery ctors with varargs terms as @Deprecated. Ah, I initially thought of doing this, but didn't know which was the right way; I'll add it now.
Re: [PR] [9.10] Mark TermInSetQuery ctors with varargs terms as deprecated [lucene]
slow-J commented on code in PR #12864: URL: https://github.com/apache/lucene/pull/12864#discussion_r1416299738 ## lucene/core/src/java/org/apache/lucene/document/KeywordField.java: ## @@ -170,7 +171,9 @@ public static Query newExactQuery(String field, String value) { * @param values the set of values to match * @throws NullPointerException if {@code field} is null. * @return a query matching documents with this exact value + * @deprecated Use the method with a collection of values instead. Review Comment: Will do!
Re: [PR] [9.10] Mark TermInSetQuery ctors with varargs terms as deprecated [lucene]
slow-J commented on code in PR #12864: URL: https://github.com/apache/lucene/pull/12864#discussion_r1416300070 ## lucene/CHANGES.txt: ## @@ -7,7 +7,8 @@ http://s.apache.org/luceneversions API Changes - -(No changes) +* GITHUB#12243: Mark TermInSetQuery ctors with varargs terms as @Deprecated. SortedSetDocValuesField#newSlowSetQuery, +SortedDocValuesField#newSlowSetQuery, KeywordField#newSetQuery now take a collection of terms as a param. (Jakub Slowinski) Review Comment: Good catch :D You mean on the second line, right?
Re: [PR] [9.10] Mark TermInSetQuery ctors with varargs terms as deprecated [lucene]
slow-J commented on PR #12864: URL: https://github.com/apache/lucene/pull/12864#issuecomment-1841663134 Thanks for the review @gsmiller!
Re: [PR] [10.0] Removing TermInSetQuery varargs ctor [lucene]
slow-J commented on code in PR #12837: URL: https://github.com/apache/lucene/pull/12837#discussion_r1416313463 ## lucene/CHANGES.txt: ## @@ -67,6 +67,8 @@ API Changes * GITHUB#11023: Adding -level param to CheckIndex, making the old -fast param the default behaviour. (Jakub Slowinski) +* GITHUB#12243: Remove TermInSetQuery ctors taking varargs param. KeywordField#newSetQuery now takes a collection. (Jakub Slowinski) Review Comment: Good point! Can add now while resolving conflicts :)
Re: [PR] [10.0] Removing TermInSetQuery varargs ctor [lucene]
slow-J commented on PR #12837: URL: https://github.com/apache/lucene/pull/12837#issuecomment-1841683335 > Looks good! One minor question about CHANGES. Thanks! Fixed! Thanks for the review and for all the help with backporting!
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
jpountz commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1841712905 @uschindler FYI this is what I'm getting: https://gist.github.com/jpountz/be81b1eb93c6118aac65c3679911f1d8. There are two files: baseline.txt for the default impl, and contender.txt for the optimized impl. > In general I am not sure if we really need specialization in Mmapdir. > We do not even implement readVInt for memory segments. Why should we implement it for this complex case? To be clear, this specialization is not about doing the same kind of optimization that the JVM would do by inlining, it's about taking advantage of the fact that many of our Directory implementations can do unaligned random reads under the hood to decode group-varints more efficiently, fewer conditionals specifically.
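To make the "fewer conditionals" point concrete, here is a hedged sketch of group-varint coding (illustrative only, not Lucene's actual implementation): each of the four values is recovered with one unaligned little-endian `int` read plus a mask, instead of a per-byte length loop. This is only safe when the reader supports unaligned random reads and leaves a few slack bytes after the encoded group, which is exactly what the specialized Directory implementations provide cheaply:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class GroupVIntSketch {
  // Encode four ints as one flag byte (two bits per value = byteLength - 1)
  // followed by each value in little-endian order using its minimal length.
  static int encode(int[] values, byte[] dest, int off) {
    int flagPos = off++;
    int flags = 0;
    for (int i = 0; i < 4; i++) {
      int v = values[i];
      int len = (32 - Integer.numberOfLeadingZeros(v | 1) + 7) / 8; // 1..4
      flags |= (len - 1) << (2 * i);
      for (int b = 0; b < len; b++) {
        dest[off++] = (byte) (v >>> (8 * b));
      }
    }
    dest[flagPos] = (byte) flags;
    return off;
  }

  // Branch-light decode: one unaligned little-endian getInt plus a mask per
  // value. Requires at least 3 readable slack bytes after the encoded group.
  static int decode(byte[] src, int off, int[] dest) {
    ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
    int flags = src[off++] & 0xFF;
    for (int i = 0; i < 4; i++) {
      int len = ((flags >>> (2 * i)) & 3) + 1;
      int mask = (int) ((1L << (8 * len)) - 1); // len=4 gives 0xFFFFFFFF
      dest[i] = bb.getInt(off) & mask;
      off += len;
    }
    return off;
  }
}
```

A heap-array fallback without unaligned reads would need a per-byte loop (or a branch per possible length) in the decode path, which is the cost the specialization avoids.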
Re: [PR] [9.10] Mark TermInSetQuery ctors with varargs terms as deprecated [lucene]
gsmiller merged PR #12864: URL: https://github.com/apache/lucene/pull/12864
Re: [PR] [10.0] Removing TermInSetQuery varargs ctor [lucene]
gsmiller merged PR #12837: URL: https://github.com/apache/lucene/pull/12837
Re: [I] Add a new static method for KeywordField#newSetQuery to support collections parameter [lucene]
gsmiller closed issue #12243: Add a new static method for KeywordField#newSetQuery to support collections parameter URL: https://github.com/apache/lucene/issues/12243
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1416379828 ## lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java: ## @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.fst; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import org.apache.lucene.store.ByteBuffersDataInput; +import org.apache.lucene.store.ByteBuffersDataOutput; +import org.apache.lucene.store.DataOutput; + +/** + * An adapter class to use {@link ByteBuffersDataOutput} as a {@link FSTReader}. 
It allows the FST + to be readable immediately after writing + */ +final class ReadWriteDataOutput extends DataOutput implements FSTReader, Freezable { + + private final ByteBuffersDataOutput dataOutput; + // the DataInput to read from in case the DataOutput has multiple blocks + private ByteBuffersDataInput dataInput; + // the ByteBuffers to read from in case the DataOutput has a single block + private ByteBuffer byteBuffer; + + public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) { +this.dataOutput = dataOutput; + } + + @Override + public void writeByte(byte b) { +dataOutput.writeByte(b); + } + + @Override + public void writeBytes(byte[] b, int offset, int length) { +dataOutput.writeBytes(b, offset, length); + } + + @Override + public long ramBytesUsed() { +return dataOutput.ramBytesUsed(); + } + + @Override + public void freeze() { +// these operations are costly, so we want to compute it once and cache +List byteBuffers = dataOutput.toWriteableBufferList(); +if (byteBuffers.size() == 1) { Review Comment: Sorry for the missing context, I was referring to a regression I found at https://github.com/apache/lucene/issues/10520 instead. Maybe it was different for ByteBuffersDataInput, or there would be no regression if the block size is just 1. I think we need to run some tests to verify. Let me know if there's a process for this.
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1416403950 ## lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java: ## @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.fst; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import org.apache.lucene.store.ByteBuffersDataInput; +import org.apache.lucene.store.ByteBuffersDataOutput; +import org.apache.lucene.store.DataOutput; + +/** + * An adapter class to use {@link ByteBuffersDataOutput} as a {@link FSTReader}. 
It allows the FST + * to be readable immediately after writing + */ +final class ReadWriteDataOutput extends DataOutput implements FSTReader, Freezable { + + private final ByteBuffersDataOutput dataOutput; + // the DataInput to read from in case the DataOutput has multiple blocks + private ByteBuffersDataInput dataInput; + // the ByteBuffers to read from in case the DataOutput has a single block + private ByteBuffer byteBuffer; + + public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) { +this.dataOutput = dataOutput; + } + + @Override + public void writeByte(byte b) { +dataOutput.writeByte(b); + } + + @Override + public void writeBytes(byte[] b, int offset, int length) { +dataOutput.writeBytes(b, offset, length); + } + + @Override + public long ramBytesUsed() { +return dataOutput.ramBytesUsed(); + } + + @Override + public void freeze() { +// these operations are costly, so we want to compute it once and cache +List byteBuffers = dataOutput.toWriteableBufferList(); +if (byteBuffers.size() == 1) { Review Comment: Even for ByteBuffersDataInput in multi-block cases, there's a slight regression over the BytesStore implementation (note that we already called `freeze()`). The diff is https://github.com/dungba88/lucene/pull/13/files. 
### ReverseRandomBytesReader with ByteBuffersDataInput ``` 1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585] 1> 0...: took 0 seconds 1> 100...: took 27 seconds 1> 200...: took 54 seconds 1> 300...: took 82 seconds 1> 400...: took 109 seconds 1> 500...: took 137 seconds 1> 600...: took 165 seconds 1> 700...: took 192 seconds 1> 800...: took 219 seconds 1> 900...: took 247 seconds 1> 1000...: took 275 seconds 1> 1100...: took 300 seconds ``` ### BytesStore-like BytesReader ``` 1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585] 1> 0...: took 0 seconds 1> 100...: took 22 seconds 1> 200...: took 44 seconds 1> 300...: took 66 seconds 1> 400...: took 89 seconds 1> 500...: took 111 seconds 1> 600...: took 133 seconds 1> 700...: took 155 seconds 1> 800...: took 178 seconds 1> 900...: took 200 seconds 1> 1000...: took 222 seconds 1> 1100...: took 245 seconds ``` As this PR is already rather long, I can make the first implementation simple by solely using ByteBuffersDataInput, then open an issue to benchmark different alternatives thoroughly. WDYT?
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
stefanvodita commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1841841701 @zhaih - I also tried your [idea](https://github.com/apache/lucene/pull/12844#discussion_r1412533096) about `beamWidth`. I passed `beamWidth` into `NeighborArray` and then changed the initial capacity of the arrays there, like so: ``` - int arraySize = Math.min(INITIAL_CAPACITY, maxSize); + int arraySize = Math.min(maxSize, beamWidth); ``` In the profiler, it doesn't look very different from baseline. I didn't see any resizing. If I understood you correctly, you were expecting there to be occasional resizing of arrays when `beamWidth < maxSize`. The latency isn't measurably better either. Results are averaged from 5 runs (baseline is [here](https://github.com/apache/lucene/pull/12844#issuecomment-1836064137)). ``` recall latency nDoc fanout maxConn beamWidth visited index ms 0.720 0.16 1 0 64 250 100 25311.00 post-filter 0.542 0.23 10 0 64 250 100 43960 1.00 post-filter 0.512 0.28 20 0 64 250 100 111647 1.00 post-filter ```
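For context, the idea behind a `growInRange`-style helper (a hypothetical sketch; the real `ArrayUtil` code differs in its growth policy and supported array types) is to keep geometric growth for amortized O(1) appends while clamping at a known upper bound, so the array is never overallocated past what the caller can ever use:

```java
import java.util.Arrays;

class GrowSketch {
  // Grow `array` to hold at least minLength elements, but never allocate
  // beyond maxLength. Hypothetical helper mirroring the intent of the PR.
  static int[] growInRange(int[] array, int minLength, int maxLength) {
    assert 0 <= minLength && minLength <= maxLength;
    if (array.length >= minLength) {
      return array; // already big enough: no copy, no overallocation
    }
    // grow by ~1.5x as usual, then clamp to the caller-provided bound
    int grown = Math.max(minLength, array.length + (array.length >> 1));
    return Arrays.copyOf(array, Math.min(maxLength, grown));
  }
}
```

When `maxLength` equals the known maximum (e.g. the maximum neighbor count), the final resize allocates exactly that much instead of overshooting.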
[PR] Allow FST builder to use different writer (alternative reverse BytesReader) [lucene]
dungba88 opened a new pull request, #12879: URL: https://github.com/apache/lucene/pull/12879 ### Description This is the same as #12624, except for a slight change in the implementation of `ReadWriteDataOutput`. This PR keeps the original implementation that BytesStore used, to make sure there is no regression. My theory is that ReverseRandomAccessReader needs to seek to the correct buffer on every `readByte`, and reading from the ByteBuffer has some (small) overhead compared to accessing the byte array directly. But these should have an insignificant impact on performance; a more extensive benchmark is needed. Using Test2BFST, there is some improvement over the `ByteBuffersDataInput`: ### ReverseRandomAccessReader with ByteBuffersDataInput ``` 1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585] 1> 0...: took 0 seconds 1> 100...: took 27 seconds 1> 200...: took 54 seconds 1> 300...: took 82 seconds 1> 400...: took 109 seconds 1> 500...: took 137 seconds 1> 600...: took 165 seconds 1> 700...: took 192 seconds 1> 800...: took 219 seconds 1> 900...: took 247 seconds 1> 1000...: took 275 seconds 1> 1100...: took 300 seconds ``` ### This PR ``` 1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585] 1> 0...: took 0 seconds 1> 100...: took 22 seconds 1> 200...: took 44 seconds 1> 300...: took 66 seconds 1> 400...: took 89 seconds 1> 500...: took 111 seconds 1> 600...: took 133 seconds 1> 700...: took 155 seconds 1> 800...: took 178 seconds 1> 900...: took 200 seconds 1> 1000...: took 222 seconds 1> 1100...: took 245 seconds ```
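The theory in the PR description — that a multi-block reader pays for a buffer lookup on every `readByte` while a single backing array is indexed directly — can be sketched like this. The class and methods are hypothetical helpers for illustration, not Lucene's actual `ReverseRandomAccessReader`:

```java
// Illustrative sketch (hypothetical class, not Lucene's actual readers) of
// why per-readByte block lookup can cost more than direct array indexing.
class ReadPathSketch {
  // multi-block path: locate the block, then the offset, on every readByte
  static byte readFromBlocks(byte[][] blocks, int blockBits, long pos) {
    int blockIndex = (int) (pos >>> blockBits);          // which block holds pos
    int offset = (int) (pos & ((1L << blockBits) - 1));  // position within block
    return blocks[blockIndex][offset];
  }

  // single-array path: one bounds-checked index, no block arithmetic
  static byte readFromArray(byte[] bytes, long pos) {
    return bytes[(int) pos];
  }
}
```

Per call the extra work is only a shift and a mask, but an FST traversal issues an enormous number of `readByte` calls, which is presumably where the measured gap comes from.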
Re: [PR] Allow FST builder to use different writer (#12543) [lucene]
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1416403950

## lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java:
```
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import org.apache.lucene.store.ByteBuffersDataInput;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataOutput;
+
+/**
+ * An adapter class to use {@link ByteBuffersDataOutput} as a {@link FSTReader}. It allows the FST
+ * to be readable immediately after writing
+ */
+final class ReadWriteDataOutput extends DataOutput implements FSTReader, Freezable {
+
+  private final ByteBuffersDataOutput dataOutput;
+  // the DataInput to read from in case the DataOutput has multiple blocks
+  private ByteBuffersDataInput dataInput;
+  // the ByteBuffers to read from in case the DataOutput has a single block
+  private ByteBuffer byteBuffer;
+
+  public ReadWriteDataOutput(ByteBuffersDataOutput dataOutput) {
+    this.dataOutput = dataOutput;
+  }
+
+  @Override
+  public void writeByte(byte b) {
+    dataOutput.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) {
+    dataOutput.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public long ramBytesUsed() {
+    return dataOutput.ramBytesUsed();
+  }
+
+  @Override
+  public void freeze() {
+    // these operations are costly, so we want to compute it once and cache
+    List<ByteBuffer> byteBuffers = dataOutput.toWriteableBufferList();
+    if (byteBuffers.size() == 1) {
```

Review Comment: Even for ByteBuffersDataInput in multi-block cases, there's a slight regression over the BytesStore implementation (note that we already called `freeze()`). The diff is https://github.com/dungba88/lucene/pull/13/files.
### ReverseRandomBytesReader with ByteBuffersDataInput
```
1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585]
1> 0...: took 0 seconds
1> 100...: took 27 seconds
1> 200...: took 54 seconds
1> 300...: took 82 seconds
1> 400...: took 109 seconds
1> 500...: took 137 seconds
1> 600...: took 165 seconds
1> 700...: took 192 seconds
1> 800...: took 219 seconds
1> 900...: took 247 seconds
1> 1000...: took 275 seconds
1> 1100...: took 300 seconds
```

### BytesStore-like BytesReader
```
1> TEST: now verify [fst size=4621076364; nodeCount=2252341486; arcCount=2264078585]
1> 0...: took 0 seconds
1> 100...: took 22 seconds
1> 200...: took 44 seconds
1> 300...: took 66 seconds
1> 400...: took 89 seconds
1> 500...: took 111 seconds
1> 600...: took 133 seconds
1> 700...: took 155 seconds
1> 800...: took 178 seconds
1> 900...: took 200 seconds
1> 1000...: took 222 seconds
1> 1100...: took 245 seconds
```

As this PR is already rather long, I can make the first implementation simple by solely using ByteBuffersDataInput, then open an issue to benchmark different alternatives thoroughly. WDYT?

For this PR, I'll use the simplified version that solely uses ByteBuffersDataInput. I opened another PR for the alternative approach: https://github.com/apache/lucene/pull/12879. I did see improvement there, as shown in the log above.
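For context on what a "BytesStore-like BytesReader" does here: FST nodes are written forward but traversed in reverse, so the reader walks backward through the bytes. A minimal sketch over a single frozen array — the single-block fast path discussed in this thread — might look like this (hypothetical class and method names, not the actual implementation):

```java
// Hedged sketch of a reverse byte reader over one backing array, as in the
// single-block case above; names are illustrative, not Lucene's.
class ReverseByteReaderSketch {
  private final byte[] bytes;
  private long pos;

  ReverseByteReaderSketch(byte[] bytes) {
    this.bytes = bytes;
  }

  void setPosition(long pos) {
    this.pos = pos;
  }

  // return the byte at the cursor, then step toward the start of the buffer
  byte readByte() {
    return bytes[(int) pos--];
  }
}
```

With a single array, each `readByte` is one index plus a decrement; a multi-block variant would additionally have to find the right block per call, which is the overhead the benchmarks above are probing.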
Re: [PR] Allow FST builder to use different writer (alternative reverse BytesReader) [lucene]
dungba88 commented on PR #12879: URL: https://github.com/apache/lucene/pull/12879#issuecomment-1841967956

I think we'll likely go with the approach in #12624 (simpler, despite some potential regression), but this PR gives us a point of comparison, in case someone has a strong argument for favoring latency.