[GitHub] [lucene] jpountz opened a new issue, #11915: Make Lucene smarter about long runs of matches
jpountz opened a new issue, #11915: URL: https://github.com/apache/lucene/issues/11915 ### Description Lucene's abstractions are good at dealing with long runs of documents that do not match a query, but much less at dealing with long runs of documents that match a query. In such cases, Lucene still needs to linearly scan these matches and do some amount of work for each of the matches. Is it actually common to have long runs of matches? For full-text indexes, maybe not so much, only stop words may have runs of adjacent matches. For string fields, this may happen if the field has a default value that is the value of most documents in the collection. Also it's possible for users to use index sorting in order to cluster similar documents together, which increases the likelihood to have long runs of adjacent matches. One idea would be to augment the `DocIdSetIterator` API to add a new `peekNextNonMatchingDocID` method, which would return the next doc ID that may not be a match. The default implementation could return `docID() + 1`. We'd need to implement this API in postings, doc-value iterators and a few other important `DocIdSetIterators` like `BitSetIterator` and `DocIdSetIterator#all`. And propagate this information in disjunctions and conjunctions. Then we could leverage this API in a number of places: - `ReqExclScorer` could ask the prohibited clause for the next doc ID that is worth evaluating on the required clause. - Conjunctions could ignore non-scoring clauses over ranges of doc IDs that all match. - `FixedBitSet#or` could set ranges of doc IDs at once, which would in-turn speed up `DocIdSetBuilder`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rendel commented on issue #11702: Multi-Value Support for Binary DocValues [LUCENE-10666]
rendel commented on issue #11702: URL: https://github.com/apache/lucene/issues/11702#issuecomment-1309968917 > I don't think ESQL is going to be different from existing faceting support: it will still want to use ordinals when it makes sense such as grouping by term. @jpountz This may be correct for aggregate operation, however, if you wish to support join operation in ESQL at some point, then you'll need to perform a scan of the binary values and not of the ordinal values (as they are not compatible with a join operation). > Would you be willing to contribute your multi-valued binary doc value implementation here? I think having multi-value parity with other doc value types is good to support multiple use cases like this. @nknize We do not see any problem in sharing the code, but our implementation is based on the Elasticsearch framework (on the `BinaryFieldMapper.CustomBinaryDocValuesField` to be exact), not Lucene, so likely it will be not relevant for this thread. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #11888: [Fix] Binary search the entries when all suffixes have the same length in a leaf block.
jpountz commented on PR #11888: URL: https://github.com/apache/lucene/pull/11888#issuecomment-1310043887 Thanks @vsop-479. Do you know if the test you added to terms can be improved in such a way that it would have caught this bug? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jfboeuf commented on a diff in pull request #11900: Reduce bloom filter size by using the optimal count for hash functions.
jfboeuf commented on code in PR #11900: URL: https://github.com/apache/lucene/pull/11900#discussion_r1019029965 ## lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java: ## @@ -46,7 +46,9 @@ public class FuzzySet implements Accountable { public static final int VERSION_SPI = 1; // HashFunction used to be loaded through a SPI public static final int VERSION_START = VERSION_SPI; - public static final int VERSION_CURRENT = 2; + public static final int VERSION_MURMUR2 = 2; + private static final int VERSION_MULTI_HASH = 3; + public static final int VERSION_CURRENT = VERSION_MULTI_HASH; Review Comment: Awesome, breaking backward compatibility will allow lots of cleaning up! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jfboeuf commented on a diff in pull request #11900: Reduce bloom filter size by using the optimal count for hash functions.
jfboeuf commented on code in PR #11900: URL: https://github.com/apache/lucene/pull/11900#discussion_r1019030241 ## lucene/codecs/src/java/org/apache/lucene/codecs/bloom/MurmurHash64.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.codecs.bloom; + +import org.apache.lucene.util.BytesRef; + +/** + * This is a very fast, non-cryptographic hash suitable for general hash-based lookup. See + * http://murmurhash.googlepages.com/ for more details. + * + * The code from Apache Commons was adapted in the form here to work with BytesRefs with offsets + * and lengths rather than raw byte arrays. + */ +public class MurmurHash64 extends HashFunction { + private static final long M64 = 0xc6a4a7935bd1e995L; + private static final int R64 = 47; + public static final HashFunction INSTANCE = new MurmurHash64(); + + /** + * Generates a 64-bit hash from byte array of the given length and seed. + * + * @param data The input byte array + * @param seed The initial seed value + * @param length The length of the array + * @return The 64-bit hash of the given array + */ + public static long hash64(byte[] data, int seed, int offset, int length) { +long h = (seed & 0xL) ^ (length * M64); + +final int nblocks = length >> 3; + +// body +for (int i = 0; i < nblocks; i++) { + + long k = getLittleEndianLong(data, offset); Review Comment: Thanks, I will do it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on issue #11911: improve checkindex to be more thorough for vectors (e.g. test seeking)
rmuir commented on issue #11911: URL: https://github.com/apache/lucene/issues/11911#issuecomment-1310217848 "read every byte of the index" is the promise that checkindex makes. So this bug is really important. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir closed pull request #11906: Add monster test for many knn docs
rmuir closed pull request #11906: Add monster test for many knn docs URL: https://github.com/apache/lucene/pull/11906 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11906: Add monster test for many knn docs
rmuir commented on PR #11906: URL: https://github.com/apache/lucene/pull/11906#issuecomment-1310221605 this test is folded into #11905 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir merged pull request #11905: Fix integer overflow when seeking the vector index for connections
rmuir merged PR #11905: URL: https://github.com/apache/lucene/pull/11905 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #907: LUCENE-10357 Ghost fields and postings/points
jpountz commented on PR #907: URL: https://github.com/apache/lucene/pull/907#issuecomment-1310330397 Apologies Luca, but after looking more at your changes, I'm getting worried that this change is harder than I had anticipated. I was optimistically hoping that never returning null PointValues would automatically help save some `NullPointerException`s but looking at your change it looks like there are many places where we just need to replace these null checks with other checks if we want to keep the logic correct, so things wouldn't magically work on segments that have ghost fields unless we explicitly test ghost fields. So I would suggest closing this PR, apologies for the time you spent on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on issue #11393: Ghost fields and postings/points [LUCENE-10357]
jpountz commented on issue #11393: URL: https://github.com/apache/lucene/issues/11393#issuecomment-1310333691 I had hoped that getting rid of ghost fields would automatically help avoid some bugs but after looking into it for both postings and points (thanks @javanna and @shahrs87 !) it looks like it's not actually possible to make things significantly better. Hence closing this issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz closed issue #11393: Ghost fields and postings/points [LUCENE-10357]
jpountz closed issue #11393: Ghost fields and postings/points [LUCENE-10357] URL: https://github.com/apache/lucene/issues/11393 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #11793: Prevent PointValues from returning null for ghost fields
jpountz commented on PR #11793: URL: https://github.com/apache/lucene/pull/11793#issuecomment-1310334991 Apologies @javanna, but after looking more at your changes, I'm getting worried that this change is harder than I had anticipated. I was optimistically hoping that never returning null PointValues would automatically help save some NullPointerExceptions but looking at your change it looks like there are many places where we just need to replace these null checks with other checks if we want to keep the logic correct, so things wouldn't magically work on segments that have ghost fields unless we explicitly test ghost fields. So I would suggest closing this PR, apologies for the time you spent on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna closed pull request #11793: Prevent PointValues from returning null for ghost fields
javanna closed pull request #11793: Prevent PointValues from returning null for ghost fields URL: https://github.com/apache/lucene/pull/11793 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on pull request #11793: Prevent PointValues from returning null for ghost fields
javanna commented on PR #11793: URL: https://github.com/apache/lucene/pull/11793#issuecomment-1310370411 Agreed @jpountz I think it was a good experiment to spend some time on, and I have also been thinking along the same lines, that the changes I ended up making were not solving the problem like we initially thought. No problem at all. I will open a follow-up PR to clarify the current behaviour and expectations. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent opened a new pull request, #11916: GITHUB#11911: improve checkindex to be more thorough for vectors
benwtrent opened a new pull request, #11916: URL: https://github.com/apache/lucene/pull/11916 Checkindex with vectors should exercise the graph and seek operations. These are exposed via the search interface. There is the option to search EVERY stored vector value as we iterate it, but this seemed excessive. So, only the first 64 vectors are searched, and we don't limit the search to stop early. closes: https://github.com/apache/lucene/issues/11911 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors
benwtrent commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310547475 @rmuir I took a stab at it. I am unfamiliar with checkindex, but this will search 64 vectors, seeking the graph to catch if there is something obscene broken. A more complicated check would require inner interface changes (allowing seek to be hit directly). But search seems nice as it is a common path and actually exercising the thing as it is typically used. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors
rmuir commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310577261 Thank you, yeah this is fine as a start! I think, it would be an improvement in the future to not just search the first 64 vectors but maybe every n'th (just a different form of sampling). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz opened a new pull request, #11917: Automatically preload index files that are both tiny and very hot.
jpountz opened a new pull request, #11917: URL: https://github.com/apache/lucene/pull/11917 The default codec has a number of small and hot files, that actually used to be fully loaded in memory before we moved them off-heap. In the general case, these files are expected to fully fit into the page cache for things to work well. Should we give control over preloading to codecs? This is what this commit does for the following files: - Terms index (`tip`) - Points index (`kdi`) - Stored fields index (`fdx`) - Terms vector index (`tvx`) This only has an effect on `MMapDirectory`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler opened a new pull request, #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput
uschindler opened a new pull request, #11918: URL: https://github.com/apache/lucene/pull/11918 This also adds incorrect (e.g., negative) positions to exception message. This also fixes some wrong exception messages (seek vs. read) in ByteBufferIndexInput. Sometimes it said "seek" although it should say "read". In MemorySegmentIndexInput I already fixed this last year. This closes #11912. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput
uschindler commented on PR #11918: URL: https://github.com/apache/lucene/pull/11918#issuecomment-1310726064 The new test is a bit bad, but unfortunately, MMapDirectory's multi-input only has an assert in seek(). If that hits, test also passses. In reality negative offsets on slices should also throw correct exception. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on issue #11863: Add large-scale test for kNN vectors
jtibshirani commented on issue #11863: URL: https://github.com/apache/lucene/issues/11863#issuecomment-1310797975 In https://github.com/apache/lucene/pull/11905 we added a test for a large number of documents (with a tiny dimension). It'd also be good to clean up and merge something like https://github.com/apache/lucene/pull/11867, which indexes a large number of bytes (maybe fewer vectors but high dimension). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors
rmuir commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310890561 thanks! I like it. Feel free to add a CHANGES entry if you want, it is a good one for that, because checkindex is user-visible and important. I would suggest in the 9.4.2 section as that's where I want to backport this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors
benwtrent commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310907074 pushed CHANGES under 9.4.2 as an `Improvement` @rmuir -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir merged pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors
rmuir merged PR #11916: URL: https://github.com/apache/lucene/pull/11916 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir closed issue #11911: improve checkindex to be more thorough for vectors (e.g. test seeking)
rmuir closed issue #11911: improve checkindex to be more thorough for vectors (e.g. test seeking) URL: https://github.com/apache/lucene/issues/11911 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput
uschindler commented on PR #11918: URL: https://github.com/apache/lucene/pull/11918#issuecomment-1310952543 Should I backport this also to 9.4.2 when it gets released next week. I am afraid of more horrible bugs in vectors and I'd like to give people a chance to report it. Problem of Elasticsearch is (LOL): they did not enable `--enable-preview`, so they did not get good exception messages (for that I have a solution where I will come with a PR soon, which enables MemorySegment support without `--enable-preview` by some tricks). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors
rmuir commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310961095 @benwtrent I hit issue upon backporting to branch_9x: it may be nothing specific to 9.x but just a random seed that hasn't been encountered yet on master? The checkindex error message is a bit generic, maybe we can improve it, as I can't tell immediately what went wrong. It seems that it got 0 results when searching for nearest neighbors on a vector... but I'm guessing this happens because we don't check Bits for deleted docs before issuing the query? If we search on a vector that only exists in a deleted doc, its possible all the results could be deleted ones too. So I think we are just missing an "if statement" before issuing the query? ``` think:lucene[branch_9x]$ git cherry-pick 3a506ec87a01556a530eee5eb54ada49fe3cde3f Auto-merging lucene/CHANGES.txt [branch_9x cfbb7e9bd35] GITHUB#11911: improve checkindex to be more thorough for vectors (#11916) Author: Benjamin Trent Date: Thu Nov 10 16:45:47 2022 -0500 2 files changed, 31 insertions(+), 6 deletions(-) think:lucene[branch_9x]$ ./gradlew check ... > Task :lucene:core:test org.apache.lucene.codecs.lucene94.TestLucene94HnswVectorsFormat > testRandomBytes FAILED org.apache.lucene.index.CheckIndex$CheckIndexException: Field "field" failed to search k nearest neighbors at __randomizedtesting.SeedInfo.seed([45F1ECAB3ABC3C78:8EF8167273CC4F95]:0) at app//org.apache.lucene.index.CheckIndex.testVectors(CheckIndex.java:2603) at app//org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:1011) at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:710) at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:548) at app//org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:343) at app//org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:909) at app//org.apache.lucene.tests.index.BaseKnnVectorsFormatTestCase.testRandomBytes(BaseKnnVectorsFormatTestCase.java:975) at java.base@17.0.5/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base@17.0.5/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base@17.0.5/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base@17.0.5/java.lang.reflect.Method.invoke(Method.java:568) at app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758) at app//com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946) at app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982) at app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996) at app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48) at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) at app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45) at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60) at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44) at app//org.junit.rules.RunRules.evaluate(RunRules.java:20) at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390) at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843) at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490) at app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955) at app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840) at app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891) at app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902) at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) at app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38) at
[GitHub] [lucene] rmuir commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors
rmuir commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310962304 Another idea, perhaps even simpler, is not to filter deleteddocs at all here in this logic. Because checkindex doesnt normally exclude deleted docs and just checks everything. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors
benwtrent commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310966373 @rmuir 100%. I reproduced it with that seed, then removed the deleted docs check and it cleared up. I bet its because ALL the docs were deleted or something. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent opened a new pull request, #11919: Follow up to GITHUB#11916, remove deleted docs check
benwtrent opened a new pull request, #11919: URL: https://github.com/apache/lucene/pull/11919 There is a chance that all the docs are deleted. This is ok in a checkindex scenario and other checks don't bother with verifying deleted docs like this. Removing the check. This reproduce line failed before this commit: ``` ./gradlew test --tests TestLucene94HnswVectorsFormat.testRandomBytes -Dtests.seed=45F1ECAB3ABC3C78 -Dtests.badapples=true -Dtests.locale=es-BO -Dtests.timezone=America/Fortaleza -Dtests.asserts=true -Dtests.file.encoding=UTF-8 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11919: Follow up to GITHUB#11916, remove deleted docs check
rmuir commented on PR #11919: URL: https://github.com/apache/lucene/pull/11919#issuecomment-1310970223 looks good, thank you for making the PR so fast. The test failure reproduces and with this change it passes again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #11919: Follow up to GITHUB#11916, remove deleted docs check
benwtrent commented on PR #11919: URL: https://github.com/apache/lucene/pull/11919#issuecomment-1310971424 Ran ``` ./gradlew test --tests TestLucene94HnswVectorsFormat -Dtests.iters=1000 ``` Just to be sure we are good. All green locally. @rmuir -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #11919: Follow up to GITHUB#11916, remove deleted docs check
benwtrent commented on PR #11919: URL: https://github.com/apache/lucene/pull/11919#issuecomment-1310971694 Apologies for the noise! Still learning all of Lucene's edges :D -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11917: Automatically preload index files that are both tiny and very hot.
rmuir commented on PR #11917: URL: https://github.com/apache/lucene/pull/11917#issuecomment-1310974545 I think preload is different from mlock, mlock needs way more discussion and personally I'm against it. mlock would be an operational hassle because of default resource limits on linux as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #11917: Automatically preload index files that are both tiny and very hot.
uschindler commented on code in PR #11917: URL: https://github.com/apache/lucene/pull/11917#discussion_r1019662247 ## lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java: ## @@ -235,7 +235,7 @@ public IndexInput openInput(String name, IOContext context) throws IOException { ensureOpen(); ensureCanRead(name); Path path = directory.resolve(name); -return PROVIDER.openInput(path, context, chunkSizePower, preload, useUnmapHack); +return PROVIDER.openInput(path, context, chunkSizePower, preload || context.load, useUnmapHack); Review Comment: See my last message: It may be better in future to leave the decission to preload or map to the implementation of the provider. For exactly that reason the context is passed to the provider! So I would move the `|| context.load` to somewhere in MappedByteBufferIndexInputProvider and MemorySegmentIndexInputProvider so they can have different code. Maybe also rename "load" to something else because it is up to implementation how to guarantee how something is "sticked to memory". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #11917: Automatically preload index files that are both tiny and very hot.
uschindler commented on PR #11917: URL: https://github.com/apache/lucene/pull/11917#issuecomment-1310976902 > I think preload is different from mlock, mlock needs way more discussion and personally I'm against it. mlock would be an operational hassle because of default resource limits on linux as well. This would of course be optional (it must be optional as it also requires extra JVM command line params). I think we should give users the option to do it. Personally I also tend to be against "load". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput
uschindler commented on PR #11918: URL: https://github.com/apache/lucene/pull/11918#issuecomment-1310991790 Ah you added milestone 9.4.2 already to issue. Will do same here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #912: MR-JAR rewrite of MMapDirectory with JDK-19 preview Panama APIs (>= JDK-19-ea+23)
uschindler commented on PR #912: URL: https://github.com/apache/lucene/pull/912#issuecomment-1311033936 After Mike switched to preview mode the results look good. The speed with MemorySegmentIndexInput ist similar to old ByteBuffer code. https://home.apache.org/~mikemccand/lucenebench/ See change EZ. Of course what we can't test is highly concurrent reopening of index and therefore closing shared MemorySessions at high ratio. CC @mcimadamore -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir merged pull request #11919: Follow up to GITHUB#11916, remove deleted docs check
rmuir merged PR #11919: URL: https://github.com/apache/lucene/pull/11919 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput
rmuir commented on PR #11918: URL: https://github.com/apache/lucene/pull/11918#issuecomment-1311107401 yes, +1 to backport. This way if there is another problem, it might be easier to debug. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org