[I] java.lang.AssertionError in backward compat tests ("failed to parse ... as date") [lucene]
dweiss opened a new issue, #13073: URL: https://github.com/apache/lucene/issues/13073 ### Description These failures do reproduce but you need the linedocsfile. I'll take a look. ### Version and environment details _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] java.lang.AssertionError in backward compat tests ("failed to parse ... as date") [lucene]
dweiss commented on issue #13073: URL: https://github.com/apache/lucene/issues/13073#issuecomment-1926587718 The embedded "linedocsfile" (europarl) has a different date field format compared to the "large" enwiki used on jenkins. ``` (1) 2004-03-30 Istituzioni europee proteggerlo in tutti i campi. [...] ``` ``` APL (codepage) 11-JAN-2010 16:54:37.000 languages. From a user's standpoint ``` After #13046, TestIndexSortBackwardsCompatibility tries to parse the date and fails because it only recognizes "-MM-dd". I wonder what the best fix here should be - make the test recognize both date format or correct the test-resource-creation script and make dates there more consistent? I think correcting the test will be simpler... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]
vsop-479 commented on PR #11888: URL: https://github.com/apache/lucene/pull/11888#issuecomment-1926589435 @jpountz Can we push on this change by checking whether our test case has covered all the status, that `TermsEnum.seekExact` or `TermsEnum.seekCeil` may emit? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]
dweiss commented on PR #13075: URL: https://github.com/apache/lucene/pull/13075#issuecomment-1926668465 This makes date parsing accept both europarl and enwiki. The tests still assume the docs come from europarl though and fail on the large enwiki.random.lines.txt: ``` gradlew -p lucene\backward-codecs test -Dtests.linedocsfile=/../enwiki.random.lines.txt ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] java.lang.AssertionError in backward compat tests ("failed to parse ... as date") [lucene]
dweiss commented on issue #13073: URL: https://github.com/apache/lucene/issues/13073#issuecomment-1926670843 @s1monw - some of those assertions added in #13046 only hold for the built-in europarl. I fixed date parsing but I'm not sure how to deal with the problem that the line resource can be different. Thoughts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] java.lang.AssertionError in backward compat tests ("failed to parse ... as date") [lucene]
dweiss commented on issue #13073: URL: https://github.com/apache/lucene/issues/13073#issuecomment-1926679038 There was actually just a single assertion that caused the problems. I've just remove it since it seems to be duplicated anyway with a term (body:the) that exists in both data sets. There is a marginal possibility of these tests failing in the future if the test resources change. Don't know if it's worth considering. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]
uschindler commented on code in PR #13075: URL: https://github.com/apache/lucene/pull/13075#discussion_r1478028235 ## lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestIndexSortBackwardsCompatibility.java: ## @@ -147,6 +150,36 @@ public void testSortedIndex() throws Exception { } } + /** + * Converts date formats for europarl ("2023-02-23") and enwiki ("12-JAN-2010 12:32:45.000") into + * {@link LocalDateTime}. + */ + static Function DATE_STRING_TO_LOCALDATETIME = + new Function<>() { +final DateTimeFormatter euroParl = +new DateTimeFormatterBuilder() +.parseStrict() +.parseCaseInsensitive() +.appendPattern("-MM-dd") +.toFormatter(Locale.ROOT); + +final DateTimeFormatter enwiki = +new DateTimeFormatterBuilder() +.parseStrict() +.parseCaseInsensitive() +.appendPattern("dd-MMM- HH:mm:ss['.'SSS]") +.toFormatter(Locale.ROOT); + +@Override +public LocalDateTime apply(String s) { + if (s.matches("^[0-9]{4}-[0-9]{2}-[0-9]{2}$")) { +return euroParl.parse(s, LocalDate::from).atTime(LocalTime.MIDNIGHT); Review Comment: `.atStartOfDay()` would be shorter. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]
uschindler commented on PR #13075: URL: https://github.com/apache/lucene/pull/13075#issuecomment-1926758404 P.S.: Maybe add a quick test for the parsing logic to check that both formats are accepted. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]
dweiss commented on PR #13075: URL: https://github.com/apache/lucene/pull/13075#issuecomment-1926938848 > Looks fine. Why did you create a `Function` for parsing instead of a simple static method? I think you did this to hide the formatter instances? Yes, correct. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
pmpailis opened a new pull request, #13076: URL: https://github.com/apache/lucene/pull/13076 This PR adds support for binary Hamming distance as a similarity metric for byte vectors. The drive behind this is that there is an increasing interest in applying hashing techniques for embeddings (both in text & image based search applications e.g. [Binary Passage Retriever](https://arxiv.org/abs/2106.00882)) due to their much reduced size & performance gains that one might get. A natural way to compare binary vectors is to use Hamming distance by computing the population count of the XOR result of two embeddings, i.e. count how many different bits exist in the two binary vectors. In Lucene, we can leverage the existing `byte[]` support, and use that to store the binary vectors. The size of the byte vectors would be `d / 8` or `(d / 8) + 1` if `d % 8 > 0`, where d is the dimension of a binary vector. So for example, a binary vector of 64 bits, could use a `KnnByteVectorField` with `vectorDimension=8`. However, this transformation is currently outside of the scope of this PR. To compute the Hamming distance, since we have a bounded pool of values ranging from `Byte.MIN_VALUE` to `Byte.MAX_VALUE`, this PR makes use of a look up table to retrieve the appropriate population count. Similarly, for the Panama implementation, we rely on the approach discussed [here](https://arxiv.org/abs/1611.07612) to compute the population count of low & high bits of the vectors' XOR result using a look-up table as well. To convert the computed distance to a similarity score we finaly normalize through `1 / (1 + hamming_distance)` Benchmarks for the scalar & vectorized implementations running on my M2 Pro (Neon) dev machine: ``` Benchmark(size) Mode CntScore Error Units VectorUtilBenchmark.binaryHammingDistanceScalar 1 thrpt 15 514.613 ± 6.496 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 8 thrpt 15 216.716 ± 1.682 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 16 thrpt 15 135.528 ± 1.606 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 32 thrpt 15 76.745 ± 0.654 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 50 thrpt 15 52.226 ± 0.444 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 64 thrpt 15 41.246 ± 0.139 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 100 thrpt 15 29.119 ± 0.086 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 128 thrpt 15 22.639 ± 0.138 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 207 thrpt 15 14.382 ± 0.058 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 256 thrpt 15 11.813 ± 0.060 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 300 thrpt 15 10.253 ± 0.257 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 512 thrpt 156.145 ± 0.015 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 702 thrpt 154.461 ± 0.133 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar1024 thrpt 153.091 ± 0.003 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 1 thrpt 15 499.861 ± 0.476 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 8 thrpt 15 191.430 ± 3.243 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 16 thrpt 15 298.697 ± 4.448 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 32 thrpt 15 222.700 ± 5.461 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 50 thrpt 15 129.853 ± 0.325 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 64 thrpt 15 156.657 ± 4.337 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 100 thrpt 15 83.879 ± 1.864 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 128 thrpt 15 88.028 ± 2.156 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 207 thrpt 15 43.573 ± 1.085 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 256 thrpt 15 49.415 ± 0.865 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 300 thrpt 15 34.535 ± 0.408 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 512 thrpt 15 25.953 ± 0.197 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 702 thrpt 15 17.483 ± 0.033 ops/us VectorUtilBenchmark.binaryHammingDistanceVector1024 thrpt 15 13.305 ± 0.085 ops/us ``` `BINARY_HAMMING_DISTANCE` is currently only supported for `byte[]` vectors so I had to slightly refactor some tests to distinguish between float and byte versions of knn-search. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mai
Re: [PR] clean up smoketester GPG leaks [lucene]
hurutoriya commented on PR #12882: URL: https://github.com/apache/lucene/pull/12882#issuecomment-1926941192 @janhoy Thank you for suggestion. I've never do the QA. > If not, we need to QA that the script is not broken by this change. OK, let me try the QA 🙏 . How should I do for QA? (I will try understand a concrete step) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]
dweiss commented on PR #13075: URL: https://github.com/apache/lucene/pull/13075#issuecomment-1926946487 > P.S.: Maybe add a quick test for the parsing logic to check that both formats are accepted. Maybe it'd be better to move this logic into LineFileDocs so that the date field's value is normalized somehow? Then we can add TestLineFileDocs and test it there. Adding a test method to TestIndexSortBackwardsCompatibility will multiply it over all the parameters, which doesn't make sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
benwtrent commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1478196742 ## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java: ## @@ -24,15 +24,8 @@ import org.apache.lucene.index.LeafReaderContext; import org.apache.lucene.index.VectorEncoding; import org.apache.lucene.index.VectorSimilarityFunction; -import org.apache.lucene.search.DocIdSetIterator; -import org.apache.lucene.search.HitQueue; -import org.apache.lucene.search.KnnByteVectorQuery; -import org.apache.lucene.search.KnnCollector; -import org.apache.lucene.search.Query; -import org.apache.lucene.search.ScoreDoc; -import org.apache.lucene.search.TopDocs; -import org.apache.lucene.search.TopDocsCollector; -import org.apache.lucene.search.TotalHits; +import org.apache.lucene.search.*; Review Comment: Don't think we want `*` here -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]
dweiss commented on PR #13075: URL: https://github.com/apache/lucene/pull/13075#issuecomment-1927005366 I've moved that utility function to LineFileDocs and added a basic test case to TestLineFileDocs (good idea). I feel tempted to normalize the date field's value in LineFileDocs but this may break other stuff so better to leave it for a follow-up issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Enable parent field in sorted bwc tests [lucene]
jpountz commented on code in PR #13067: URL: https://github.com/apache/lucene/pull/13067#discussion_r1478307290 ## lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestIndexSortBackwardsCompatibility.java: ## @@ -82,14 +84,20 @@ public void testSortedIndexAddDocBlocks() throws Exception { .setOpenMode(IndexWriterConfig.OpenMode.APPEND) .setIndexSort(sort) .setMergePolicy(newLogMergePolicy()); +if (this.version.onOrAfter(FIRST_PARENT_DOC_VERSION)) { + indexWriterConfig.setParentField(PARENT_FIELD_NAME); +} // open writer try (IndexWriter writer = new IndexWriter(directory, indexWriterConfig)) { // add 10 docs for (int i = 0; i < 10; i++) { Document child = new Document(); child.add(new StringField("relation", "child", Field.Store.NO)); child.add(new StringField("bid", "" + i, Field.Store.NO)); -child.add(new NumericDocValuesField("dateDV", i)); +if (version.onOrAfter(FIRST_PARENT_DOC_VERSION) +== false) { // only add this to earlier versions + child.add(new NumericDocValuesField("dateDV", i)); +} Review Comment: Why do we need this change? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
benwtrent commented on code in PR #13076: URL: https://github.com/apache/lucene/pull/13076#discussion_r1478274441 ## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ## @@ -214,4 +214,11 @@ public static float[] checkFinite(float[] v) { } return v; } + + public static float binaryHammingDistance(byte[] a, byte[] b) { +if (a.length != b.length) { + throw new IllegalArgumentException("vector dimensions differ: " + a.length + "!=" + b.length); +} +return 1f / (1 + IMPL.binaryHammingDistance(a, b)); Review Comment: This should return `IMPL.binaryHammingDistance(a, b)`. The user's of this function will have to transform the distance to a score separately. ## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ## @@ -116,4 +143,14 @@ public float compare(byte[] v1, byte[] v2) { * @return the value of the similarity function applied to the two vectors */ public abstract float compare(byte[] v1, byte[] v2); + + /** + * Defines which encodings are supported by the similarity function - used in tests to control + * randomization + * + * @return a list of all supported VectorEncodings for the given similarity + */ + public Set supportedVectorEncodings() { +return EnumSet.of(VectorEncoding.BYTE, VectorEncoding.FLOAT32); + } Review Comment: Do we do anything with the set except check membership? I think a simple `bool supportsVectorEncoding(VectorEncoding)` function would be preferred. ## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ## @@ -94,6 +98,29 @@ public float compare(float[] v1, float[] v2) { public float compare(byte[] v1, byte[] v2) { return scaleMaxInnerProductScore(dotProduct(v1, v2)); } + }, + /** + * Binary Hamming distance; Computes how many bits are different in two bytes. + * + * Only supported for bytes. To convert the distance to a similarity score we normalize using 1 + * / (1 + hammingDistance) + */ + BINARY_HAMMING_DISTANCE { +@Override +public float compare(float[] v1, float[] v2) { + throw new UnsupportedOperationException( + BINARY_HAMMING_DISTANCE.name() + " is only supported for byte vectors"); +} + +@Override +public float compare(byte[] v1, byte[] v2) { + return binaryHammingDistance(v1, v2); Review Comment: The score transformation for the distance should be here. Note `squareDistance` ## lucene/core/src/test/org/apache/lucene/index/KnnGraphTestCase.java: ## @@ -54,35 +55,65 @@ import org.apache.lucene.util.Bits; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.IOUtils; -import org.apache.lucene.util.VectorUtil; import org.apache.lucene.util.hnsw.HnswGraph; -import org.apache.lucene.util.hnsw.HnswGraph.NodesIterator; import org.apache.lucene.util.hnsw.HnswGraphBuilder; import org.junit.After; import org.junit.Before; /** Tests indexing of a knn-graph */ -public class TestKnnGraph extends LuceneTestCase { Review Comment: This is a fairly large refactor adding test coverage, but I think it detracts from this PR. Could you change all this back and restrict the similarity function when using floats? There are so many abstract functions, etc. it seems like this entire test case should be rewritten from the ground up if we were to 100% test byte vs. float for it. ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizedVectorSimilarity.java: ## @@ -40,6 +40,8 @@ static ScalarQuantizedVectorSimilarity fromVectorSimilarity( case EUCLIDEAN -> new Euclidean(constMultiplier); case COSINE, DOT_PRODUCT -> new DotProduct(constMultiplier); case MAXIMUM_INNER_PRODUCT -> new MaximumInnerProduct(constMultiplier); + case BINARY_HAMMING_DISTANCE -> throw new IllegalArgumentException( + "Cannot use Hamming distance with scalar quantization"); }; Review Comment: It would be good if the error message had the literal `BINARY_HAMMING_DISTANCE` string to indicate the similarity enumeration we failed at. ## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ## @@ -94,6 +98,29 @@ public float compare(float[] v1, float[] v2) { public float compare(byte[] v1, byte[] v2) { return scaleMaxInnerProductScore(dotProduct(v1, v2)); } + }, + /** + * Binary Hamming distance; Computes how many bits are different in two bytes. + * + * Only supported for bytes. To convert the distance to a similarity score we normalize using 1 + * / (1 + hammingDistance) + */ + BINARY_HAMMING_DISTANCE { +@Override +public float compare(float[] v1, float[] v2) { + throw new UnsupportedOperationException( + BINARY_HAMMING_DISTANCE.name() + " is only supported for byte vectors"); +} + +@Override +public float compare(byte[] v1, byte[
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
benwtrent commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927115056 @pmpailis could you also push a `CHANGES.txt` update? It is would be under `New Features` for `Lucene 9.10.0` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Bump release to Java 21 [lucene]
mikemccand commented on PR #12753: URL: https://github.com/apache/lucene/pull/12753#issuecomment-1927152977 Is this ready to go? Thank you for all the hard work here @ChrisHegarty and @rmuir! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
rmuir commented on code in PR #13076: URL: https://github.com/apache/lucene/pull/13076#discussion_r1478382295 ## lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -576,4 +578,114 @@ private int squareDistanceBody128(byte[] a, byte[] b, int limit) { // reduce return acc1.add(acc2).reduceLanes(ADD); } + + private static final ByteVector lookup128 = + ByteVector.fromArray( + ByteVector.SPECIES_128, + new byte[] { +0, 1, 1, 2, 1, 2, 2, 3, +1, 2, 2, 3, 2, 3, 3, 4, + }, + 0); + + private static final ByteVector lookup256 = + (ByteVector) lookup128.castShape(ByteVector.SPECIES_256, 0); + private static final ByteVector lookup512 = + (ByteVector) lookup128.castShape(ByteVector.SPECIES_512, 0); + + @Override + public int binaryHammingDistance(byte[] a, byte[] b) { +int res = 0; +int i = 0; +if (a.length >= 16) { + if (VECTOR_BITSIZE >= 512) { +i += ByteVector.SPECIES_512.loopBound(a.length); +res += binaryHammingDistanceBody512(a, b, i); + } else if (VECTOR_BITSIZE == 256) { +i += ByteVector.SPECIES_256.loopBound(a.length); +res += binaryHammingDistanceBody256(a, b, i); + } else { +i += ByteVector.SPECIES_128.loopBound(a.length); +res += binaryHammingDistanceBody128(a, b, i); + } +} + +// scalar tail +for (; i < a.length; i++) { + res += HAMMING_DISTANCE_LOOKUP_TABLE[(a[i] ^ b[i]) & 0xFF]; +} +return res; + } + + private int binaryHammingDistanceBody512(byte[] a, byte[] b, int limit) { +int res = 0; +for (int i = 0; i < limit; i += ByteVector.SPECIES_512.length()) { + ByteVector bva64 = ByteVector.fromArray(ByteVector.SPECIES_512, a, i); + ByteVector bvb64 = ByteVector.fromArray(ByteVector.SPECIES_512, b, i); + ByteVector xor64 = bva64.lanewise(XOR, bvb64); + + ByteVector low_mask = ByteVector.broadcast(ByteVector.SPECIES_512, 0x0f); + ByteVector low_bits = xor64.and(low_mask); + ByteVector high_bits = xor64.lanewise(LSHR, 4).and(low_mask); + + var popcnt1 = lookup512.rearrange(low_bits.toShuffle()); + var popcnt2 = lookup512.rearrange(high_bits.toShuffle()); + + var total = popcnt1.add(popcnt2); + + // Need to break up the total ByteVector as the result might not + // fit in a byte + var acc1 = total.castShape(ShortVector.SPECIES_512, 0); + var acc2 = total.castShape(ShortVector.SPECIES_512, 1); Review Comment: Vector castShape() with part number > 0 really needs to be avoided. It is incredibly slow. Have you benchmarked non-mac machines with 256 or 512-bit vectors? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Compute multiple float aggregations in one go [lucene]
mikemccand commented on code in PR #12547: URL: https://github.com/apache/lucene/pull/12547#discussion_r1478386039 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java: ## @@ -37,33 +37,43 @@ abstract class FloatTaxonomyFacets extends TaxonomyFacets { // TODO: also use native hash map for sparse collection, like IntTaxonomyFacets - /** Aggregation function used for combining values. */ - final AssociationAggregationFunction aggregationFunction; + /** Aggregation functions used for combining values. */ + final List aggregationFunctions; /** Per-ordinal value. */ - float[] values; + float[][] values; + + @Override + boolean hasValues() { +return values != null; + } + + void initializeValueCounters() { +if (values == null) { + values = new float[aggregationFunctions.size()][taxoReader.getSize()]; Review Comment: Maybe open a spinoff to implement sparse values for `FloatTaxonomyFacets`? But let's not block this otherwise great change? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
rmuir commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927213508 I'm confused about the use of lookup table. naively, i'd try to just xor + popcnt: https://docs.oracle.com/en/java/javase/21/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#XOR https://docs.oracle.com/en/java/javase/21/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#BIT_COUNT I'm curious if any explicit vector code is needed actually at all. Integer.bitCount() has autovectorization support in hotspot. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
rmuir commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927228480 even if it doesn't autovectorize, i suspect just gathering e.g. 4/8 bytes at a time with BitUtil varhandle and using single int/long xor + popcount would perform very well as a baseline. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Bump release to Java 21 [lucene]
rmuir commented on PR #12753: URL: https://github.com/apache/lucene/pull/12753#issuecomment-1927294356 @mikemccand I think it is heading in the right direction. There are a few more tasks to do here I think though, e.g. we still need to update `releaseWizard.py` and `smokeTestRelease.py`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]
mikemccand commented on PR #12831: URL: https://github.com/apache/lucene/pull/12831#issuecomment-1927303409 This is technically an API break, but `FSTCompiler` is an experimental API and effectively an internal Lucene datastructure, so I think we can safely backport to 9.x without deprecation path. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]
mikemccand merged PR #12831: URL: https://github.com/apache/lucene/pull/12831 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927329824 Hi, I don't want to discuss about sense/nonsense of this disatance, but the implementation could been made very simple and then we may not even need to have a Panama Vector variant: - This is just `Long#bitCount(a ^ b)` - a, b should be longs to work effective, so the best would be to replace the loop by using https://lucene.apache.org/core/9_9_1/core/org/apache/lucene/util/BitUtil.html#VH_LE_LONG and iterate using the varhandle. We have another PR already that provides native byte order VarHandles (see #888). As the bitcount of longs does not depend on byte order we can use native byte order. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests) [lucene]
uschindler commented on PR #888: URL: https://github.com/apache/lucene/pull/888#issuecomment-1927340132 I forgot about this PR, we should really apply it. #13076 is another candidate that could make use of this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Forbidden Thread.sleep API [lucene]
shubhamvishu commented on PR #13001: URL: https://github.com/apache/lucene/pull/13001#issuecomment-1927343001 @uschindler I have removed the added file. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Forbidden Thread.sleep API [lucene]
uschindler merged PR #13001: URL: https://github.com/apache/lucene/pull/13001 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Can we ban `Thread.sleep`? [lucene]
uschindler closed issue #12946: Can we ban `Thread.sleep`? URL: https://github.com/apache/lucene/issues/12946 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Forbidden Thread.sleep API [lucene]
uschindler commented on PR #13001: URL: https://github.com/apache/lucene/pull/13001#issuecomment-1927410583 I moved the changes entry to Lucene 10.0, as to me it makes no sense to apply this to Lucene 9.x which is not used for active development. New code enters main first and I had trouble in merging this (and 9.x has many more sleeps). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-10641: IndexSearcher#setTimeout should also abort query rewrites, point ranges and vector searches [lucene]
mikemccand commented on code in PR #12345: URL: https://github.com/apache/lucene/pull/12345#discussion_r1478503121 ## lucene/core/src/java/org/apache/lucene/index/ExitableIndexReader.java: ## @@ -0,0 +1,539 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Objects; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.automaton.CompiledAutomaton; + +/** + * The {@link ExitableIndexReader} is used to timeout I/O operation which is done during query + * rewrite. After this time is exceeded, the search thread is stopped by throwing a {@link + * ExitableIndexReader.TimeExceededException} + */ +public final class ExitableIndexReader extends IndexReader { + private final IndexReader indexReader; + private final QueryTimeout queryTimeout; + + /** + * Create a ExitableIndexReader wrapper over another {@link IndexReader} with a specified timeout. + * + * @param indexReader the wrapped {@link IndexReader} + * @param queryTimeout max time allowed for collecting hits after which {@link + * ExitableIndexReader.TimeExceededException} is thrown + */ + public ExitableIndexReader(IndexReader indexReader, QueryTimeout queryTimeout) + throws IOException { +this.indexReader = new ExitableIndexReaderWrapper((DirectoryReader) indexReader, queryTimeout); +this.queryTimeout = queryTimeout; + } + + /** Returns queryTimeout instance. */ + public QueryTimeout getQueryTimeout() { +return queryTimeout; + } + + /** Thrown when elapsed search time exceeds allowed search time. */ + @SuppressWarnings("serial") + static class TimeExceededException extends RuntimeException { +private TimeExceededException() { + super("TimeLimit Exceeded"); +} + +private TimeExceededException(Exception e) { + super(e); +} + } + + @Override + public TermVectors termVectors() throws IOException { +if (queryTimeout.shouldExit()) { + throw new ExitableIndexReader.TimeExceededException(); +} +return indexReader.termVectors(); + } + + @Override + public int numDocs() { +if (queryTimeout.shouldExit()) { + throw new ExitableIndexReader.TimeExceededException(); +} +return indexReader.numDocs(); + } + + @Override + public int maxDoc() { +if (queryTimeout.shouldExit()) { + throw new ExitableIndexReader.TimeExceededException(); +} +return indexReader.maxDoc(); + } + + @Override + public StoredFields storedFields() throws IOException { +if (queryTimeout.shouldExit()) { Review Comment: Can we factor out a `private` helper method that does this `if` and the `throw`? Just like `ExitableDirectoryReader.checkAndThrow`. ## lucene/core/src/java/org/apache/lucene/index/ExitableIndexReader.java: ## @@ -0,0 +1,539 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Objects; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.automaton.CompiledAuto
Re: [PR] LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests) [lucene]
uschindler merged PR #888: URL: https://github.com/apache/lucene/pull/888 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927511826 The native order PR was merged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927626207 Hi, I modified the scalar variant like that: ```java @Override public int binaryHammingDistance(byte[] a, byte[] b) { int distance = 0, i = 0; for (; i < a.length + 1 - Long.BYTES; i += Long.BYTES) { distance += Long.bitCount(((long) BitUtil.VH_NATIVE_LONG.get(a, i) ^ (long) BitUtil.VH_NATIVE_LONG.get(b, i)) & 0xL); } // int tail for (; i < a.length + 1 - Integer.BYTES; i += Integer.BYTES) { distance += Integer.bitCount(((int) BitUtil.VH_NATIVE_INT.get(a, i) ^ (int) BitUtil.VH_NATIVE_INT.get(b, i)) & 0x); } // byte tail for (; i < a.length; i++) { distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF); } return distance; } ``` This one only uses popcnt CPU instruction. I then ran your benchmakrk to compare the panama-vectorized one with my new implementation as above: ``` Benchmark(size) Mode CntScore Error Units VectorUtilBenchmark.binaryHammingDistanceScalar 1 thrpt 15 258,511 ± 17,969 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 128 thrpt 15 62,364 ± 0,723 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 207 thrpt 15 40,302 ± 0,703 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 256 thrpt 15 42,025 ± 0,891 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 300 thrpt 15 35,065 ± 3,125 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 512 thrpt 15 24,391 ± 1,987 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 702 thrpt 15 17,505 ± 0,684 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar1024 thrpt 15 13,806 ± 0,102 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 1 thrpt 15 231,651 ± 9,975 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 128 thrpt 15 16,760 ± 0,251 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 207 thrpt 15 10,317 ± 0,251 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 256 thrpt 158,887 ± 0,559 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 300 thrpt 157,466 ± 0,345 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 512 thrpt 154,706 ± 0,080 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 702 thrpt 153,062 ± 0,566 ops/us VectorUtilBenchmark.binaryHammingDistanceVector1024 thrpt 152,404 ± 0,024 ops/us Please not the SCALAR one here is the above variant. As this one four (!) times faster than the Panama variant, there is no need to have a panama-vectorized one and it looks like it is not working at all. I think the lookup table is a bad idea. To make it short: - Remove the impl and the table from VectorUtilSupport (scalar and vectorized) - Implement my above methoid directly in VectorUtil class This will be a short PR. Make sure to add more tests (there's only one for the VectorUtil method). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]
dweiss merged PR #13075: URL: https://github.com/apache/lucene/pull/13075 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]
dweiss commented on PR #13075: URL: https://github.com/apache/lucene/pull/13075#issuecomment-1927650998 I'll merge this in so that we can avoid jenkins failures. If there has to be a follow-up, I'll open another issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] java.lang.AssertionError in backward compat tests ("failed to parse ... as date") [lucene]
dweiss closed issue #13073: java.lang.AssertionError in backward compat tests ("failed to parse ... as date") URL: https://github.com/apache/lucene/issues/13073 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Throw CorruptSegmentInfoException on encountering missing segment info (_N.si) file in CheckIndex [lucene]
mikemccand commented on code in PR #12872: URL: https://github.com/apache/lucene/pull/12872#discussion_r1478678480 ## lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java: ## @@ -389,13 +386,25 @@ private static void parseSegmentInfos( } long totalDocs = 0; +SegmentInfo info; for (int seg = 0; seg < numSegments; seg++) { String segName = input.readString(); byte[] segmentID = new byte[StringHelper.ID_LENGTH]; input.readBytes(segmentID, 0, segmentID.length); Codec codec = readCodec(input); - SegmentInfo info = - codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ); + try { +info = codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ); + } catch (ThreadInterruptedException e) { +throw e; + } catch (Exception e) { Review Comment: Thanks @gokaai! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927678421 I am not sure if we really need the Integer tail. Mabye only implement the Long variant and the tail. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Throw CorruptSegmentInfoException on encountering missing segment info (_N.si) file in CheckIndex [lucene]
mikemccand commented on PR #12872: URL: https://github.com/apache/lucene/pull/12872#issuecomment-1927677782 Thanks @gokaai -- I'll try to review soon! If possible please try not to force-push: it removes the history of the past commits and makes it harder to see what changed on this iteration :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
rmuir commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927739974 Seems to autovectorize just fine, i took uwe's branch and dumped assembly on my AVX2 machine and see e.g. 256-bit xor and population count logic. I checked the logic in openjdk and it will use `vpopcntdq` on AVX-512 if available, etc. So this solution is much better than some explicit vector stuff because it will do the right thing depending on CPU. ``` ... 0.35%0x7fffe0141fa3: vmovdqu 0x10(%rax,%r8,1),%ymm9 0x7fffe0141faa: vpxor 0x10(%rdx,%r8,1),%ymm9,%ymm9 0.04%0x7fffe0141fb1: movabs $0xf0f0f0f,%r8 0.35%0x7fffe0141fbb: vmovq %r8,%xmm10 0x7fffe0141fc0: vpbroadcastd %xmm10,%ymm10 0x7fffe0141fc5: vpsrlw $0x4,%ymm9,%ymm11 0x7fffe0141fcb: vpand %ymm10,%ymm11,%ymm11 0.40%0x7fffe0141fd0: vpand %ymm10,%ymm9,%ymm10 0x7fffe0141fd5: vmovdqu -0x59829d(%rip),%ymm12 # Stub::popcount_lut ; {external_word} 0x7fffe0141fdd: vpshufb %ymm10,%ymm12,%ymm10 0.02%0x7fffe0141fe2: vpshufb %ymm11,%ymm12,%ymm11 0.48%0x7fffe0141fe7: vpaddb %ymm10,%ymm11,%ymm11 0x7fffe0141fec: vpxor %ymm12,%ymm12,%ymm12 0x7fffe0141ff1: vpsadbw %ymm12,%ymm11,%ymm10 0.07%0x7fffe0141ff6: vpermilps $0x8,%ymm10,%ymm9 0.35%0x7fffe0141ffc: vpermpd $0x8,%ymm9,%ymm9 0x7fffe0142002: vpaddd %xmm9,%xmm1,%xmm1 ... ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-10393: Unify binary dictionary and dictionary writer in kuromoji and nori [lucene]
mikemccand commented on PR #740: URL: https://github.com/apache/lucene/pull/740#issuecomment-1927759464 @mocobeta just checking: it looks like this was never backported to 9.x (I hit unexpected merge conflicts while backporting an FST change) -- was that intentional? Were there API breaks that prevented backport maybe? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927790053 I removed the integer tail and have see no difference (especially looked also at the non-aligned sizes): ```java @Override public int binaryHammingDistance(byte[] a, byte[] b) { int distance = 0, i = 0; for (int upperBound = a.length & ~(Long.BYTES - 1); i < upperBound; i += Long.BYTES) { distance += Long.bitCount(((long) BitUtil.VH_NATIVE_LONG.get(a, i) ^ (long) BitUtil.VH_NATIVE_LONG.get(b, i)) & 0xL); } // tail: for (; i < a.length; i++) { distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF); } return distance; } ``` I think this code is simpliest and most effective. It may not be the best for vector `(length % 8) == 5...7`, but I think we can live with that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927824571 Here's my branch: https://github.com/apache/lucene/compare/main...uschindler:lucene:binary_hamming_distance I can merge this into this branch, but the code cleanup and removal of useless vectorization and those (public) lookup tables needs to be done after the merge. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-10393: Unify binary dictionary and dictionary writer in kuromoji and nori [lucene]
uschindler commented on PR #740: URL: https://github.com/apache/lucene/pull/740#issuecomment-1927857893 Yes this was intentional. It breaks API. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
rmuir commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927946636 Thanks @uschindler , this is the way to go: compiler does a good job. java already has all the necessary logic here to autovectorize and use e.g. `vpopcntdq` or AVX2 lookup-table counting algorithm depending on the cpu features detected. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927995019 I figured that the `& 0x` is useless. You only need it when widening into int. Will update my branch and paste code here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1928027541 This my final code: ```java @Override public int binaryHammingDistance(byte[] a, byte[] b) { int distance = 0, i = 0; for (final int upperBound = a.length & ~(Long.BYTES - 1); i < upperBound; i += Long.BYTES) { distance += Long.bitCount((long) BitUtil.VH_NATIVE_LONG.get(a, i) ^ (long) BitUtil.VH_NATIVE_LONG.get(b, i)); } // tail: for (; i < a.length; i++) { distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF); } return distance; } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Add getter for SynonymQuery#field [lucene]
AndreyBozhko opened a new pull request, #13077: URL: https://github.com/apache/lucene/pull/13077 ### Description Since all the query terms must have the same field, the value is exposed anyway via ```java synonymQuery.getTerms().get(0).field() ``` but it's cleaner if one could instead do ```java synonymQuery.getField() ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Backport SOLR-14765 to branch_8_11 [lucene-solr]
risdenk commented on PR #2682: URL: https://github.com/apache/lucene-solr/pull/2682#issuecomment-1928651515 Sorry for delayed response @HoustonPutman no need to wait for me -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]
dungba88 commented on PR #12831: URL: https://github.com/apache/lucene/pull/12831#issuecomment-1928655191 Thank you for merging @mikemccand ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add getter for SynonymQuery#field [lucene]
AndreyBozhko commented on PR #13077: URL: https://github.com/apache/lucene/pull/13077#issuecomment-1928773466 Thanks for the review @dungba88 - I added the javadoc as well (tried to match the style of other javadocs in the file). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] No static method version() crash in android project [lucene]
bjhexn opened a new issue, #13078: URL: https://github.com/apache/lucene/issues/13078 ### Description private fun createIndex(): IndexWriter { val path = Path(getDir("lunece", 0).path) val fsDirectory = FSDirectory.open(path) val analyzer = StandardAnalyzer() val config = IndexWriterConfig(analyzer) return IndexWriter(fsDirectory, config) } java.lang.NoSuchMethodError: No static method version()Ljava/lang/Runtime$Version; in class Ljava/lang/Runtime; or its super classes (declaration of 'java.lang.Runtime' appears in /apex/com.android.art/javalib/core-oj.jar) at org.apache.lucene.util.Constants.(Constants.java:40) at org.apache.lucene.store.FSDirectory.open(FSDirectory.java:161) at org.apache.lucene.store.FSDirectory.open(FSDirectory.java:156) at com.myapp.newest.MainActivity.createIndex(MainActivity.kt:46) ### Version and environment details version: implementation("org.apache.lucene:lucene-core:9.9.2") implementation("org.apache.lucene:lucene-queryparser:9.9.2") implementation("org.apache.lucene:lucene-analysis-common:9.9.2") environment: Android project -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] No static method version() crash in android project [lucene]
bjhexn commented on issue #13078: URL: https://github.com/apache/lucene/issues/13078#issuecomment-1928929465 Android gradle compileOptions { sourceCompatibility = JavaVersion.VERSION_17 targetCompatibility = JavaVersion.VERSION_17 } kotlinOptions { jvmTarget = "17" } -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] [Suggestion] Short circuit check for queued flushes in preUpdate() when checkPendingFlushOnUpdate is disabled [lucene]
CaptainDredge opened a new issue, #13079: URL: https://github.com/apache/lucene/issues/13079 ### Description `numQueuedFlushes()` is a blocking function which gets called in `DocumentWriter#preUpdate()` to check if there are any queued flushes. If `checkPendingFlushOnUpdate` is disabled i.e. its false then the overall condition `flushControl.numQueuedFlushes() > 0 && config.checkPendingFlushOnUpdate` could be short circuited if we reverse the check to `config.checkPendingFlushOnUpdate && flushControl.numQueuedFlushes() > 0`. This will prevent a blocking call. Code pointer: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/DocumentsWriter.java#L387 Currently while benchmarking I can see this particular issue significantly contributing to locking of write threads which is increasing indexing latency. https://github.com/apache/lucene/assets/20185657/05e6a52d-0811-443b-8637-43e47c67dc6c";> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] [Suggestion] Short circuit check for queued flushes in preUpdate() when checkPendingFlushOnUpdate is disabled [lucene]
CaptainDredge commented on issue #13079: URL: https://github.com/apache/lucene/issues/13079#issuecomment-1928949795 cc: @mgodwan, @backslasht -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org