Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]
kuramitsu commented on PR #12885: URL: https://github.com/apache/lucene/pull/12885#issuecomment-1886834358

@zhaih
> could you please add a CHANGES.txt entry under Lucene 9.10?

Thank you. I added it.
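For readers unfamiliar with the file, a Lucene CHANGES.txt bug-fix entry generally looks roughly like the sketch below; the section header and wording here are illustrative, not the exact entry that was committed:

```
Bug Fixes
---------------------

* GITHUB#12885: JapaneseReadingFormFilter could not convert some hiragana to
  romaji. (kuramitsu)
```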
Re: [PR] Clean up unused code & variables [lucene]
dungba88 commented on PR #12994: URL: https://github.com/apache/lucene/pull/12994#issuecomment-1887052636 Thanks @dweiss for approving! Can you help me merge it if it looks okay? Thank you
Re: [PR] Lazily write the FST padding byte [lucene]
mikemccand merged PR #12981: URL: https://github.com/apache/lucene/pull/12981
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
jpountz commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1887173459 FYI I pushed an annotation to the nightly benchmarks; it should show up tomorrow.
Re: [PR] Make FSTPostingFormat to build FST off-heap [lucene]
dungba88 commented on PR #12980: URL: https://github.com/apache/lucene/pull/12980#issuecomment-1887211104 I'm not sure why FSTPostingsFormat is different from the rest in that it writes both the metadata and the data to the same file. I think writing them to separate files would be cleaner and more consistent.
Re: [PR] Make Lucene90 postings format to write FST off heap [lucene]
dungba88 commented on code in PR #12985: URL: https://github.com/apache/lucene/pull/12985#discussion_r1448927527

lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java:

```
@@ -795,7 +783,12 @@ void writeBlocks(int prefixLength, int count) throws IOException {
       assert firstBlock.isFloor || newBlocks.size() == 1;
-      firstBlock.compileIndex(newBlocks, scratchBytes, scratchIntsRef);
+      boolean isRootBlock = prefixLength == 0 && count == pending.size();
```

Review Comment: I can make this a parameter to be clearer instead of inferring from prefixLength and count
Re: [PR] Avoid reset BlockDocsEnum#freqBuffer when indexHasFreq is false [lucene]
jpountz merged PR #12997: URL: https://github.com/apache/lucene/pull/12997
Re: [PR] Output well-formed UTF-8 bytes in SimpleTextCodec's segmentinfos [lucene]
jpountz merged PR #12897: URL: https://github.com/apache/lucene/pull/12897
Re: [PR] Output binary doc values as hex array in SimpleTextCodec [lucene]
jpountz commented on code in PR #12987: URL: https://github.com/apache/lucene/pull/12987#discussion_r1448975940

lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextDocValuesReader.java:

```
@@ -329,9 +330,15 @@ public BytesRef apply(int docID) {
         } catch (ParseException pe) {
           throw new CorruptIndexException("failed to parse int length", in, pe);
         }
-        term.grow(len);
-        term.setLength(len);
-        in.readBytes(term.bytes(), 0, len);
+        termByteArray.grow(len);
+        termByteArray.setLength(len);
+        in.readBytes(termByteArray.bytes(), 0, len);
+        if (len > 2) {
+          term.copyBytes(
+              SimpleTextUtil.fromBytesRefString(termByteArray.get().utf8ToString()));
+        } else {
+          term.setLength(0);
+        }
```

Review Comment: Why do we need this condition? Can't `fromBytesRefString` read empty arrays?
Re: [PR] Add support for index sorting with document blocks [lucene]
s1monw merged PR #12829: URL: https://github.com/apache/lucene/pull/12829
Re: [PR] Make sure `ConcurrentApproximatePriorityQueue#poll` never returns `null` on a non-empty queue. [lucene]
jpountz commented on PR #12959: URL: https://github.com/apache/lucene/pull/12959#issuecomment-1887441748 If there are no concerns, I plan on merging this PR soon.
Re: [PR] Make sure `ConcurrentApproximatePriorityQueue#poll` never returns `null` on a non-empty queue. [lucene]
uschindler commented on PR #12959: URL: https://github.com/apache/lucene/pull/12959#issuecomment-1887464862 Hi, sorry I missed testing this. I started my beasting and will try to reproduce it. Will report back.
Re: [PR] Clean up unused code & variables [lucene]
dweiss merged PR #12994: URL: https://github.com/apache/lucene/pull/12994
Re: [PR] Clean up unused code & variables [lucene]
dweiss commented on PR #12994: URL: https://github.com/apache/lucene/pull/12994#issuecomment-1887664512 Thank you and apologies for the delay!
Re: [PR] Make sure `ConcurrentApproximatePriorityQueue#poll` never returns `null` on a non-empty queue. [lucene]
uschindler commented on PR #12959: URL: https://github.com/apache/lucene/pull/12959#issuecomment-1887730699 FYI, I did not try to run the beasting for hours because previously it was failing very fast on a heavily loaded Ryzen CPU with both Hotspot and OpenJ9. Now it's stable, so I trust the results after 2 hours (1 hour with Hotspot, 1 hour with OpenJ9).
Re: [I] Concurrency bug `DocumentsWriterPerThreadPool.getAndLock()` uncovered by OpenJ9 test failures? [lucene]
uschindler commented on issue #12916: URL: https://github.com/apache/lucene/issues/12916#issuecomment-1887733274 The latest PR seems to fix the issue.
Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]
zhaih merged PR #12885: URL: https://github.com/apache/lucene/pull/12885
Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]
zhaih commented on PR #12885: URL: https://github.com/apache/lucene/pull/12885#issuecomment-1888066363 Merged and backported, thanks for the contribution @kuramitsu !
Re: [PR] Output binary doc values as hex array in SimpleTextCodec [lucene]
msfroh commented on code in PR #12987: URL: https://github.com/apache/lucene/pull/12987#discussion_r1449507398

lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextDocValuesReader.java:

```
@@ -329,9 +330,15 @@ public BytesRef apply(int docID) {
         } catch (ParseException pe) {
           throw new CorruptIndexException("failed to parse int length", in, pe);
         }
-        term.grow(len);
-        term.setLength(len);
-        in.readBytes(term.bytes(), 0, len);
+        termByteArray.grow(len);
+        termByteArray.setLength(len);
+        in.readBytes(termByteArray.bytes(), 0, len);
+        if (len > 2) {
+          term.copyBytes(
+              SimpleTextUtil.fromBytesRefString(termByteArray.get().utf8ToString()));
+        } else {
+          term.setLength(0);
+        }
```

Review Comment:

~~The issue is that on the writing side, if `stringVal == null`, we write the length as `0` and don't output anything. So `termByteArray.get().utf8ToString()` is the empty string, and `fromBytesRefString` throws an exception on it. It was showing up as a unit test failure.~~

~~I added this condition here, but I suppose it would also be fine to output `[]` followed by `F` for missing binary values (though we'll waste a couple of bytes). Of course, now I'm wondering how we get here if the doc doesn't have the given field...~~

Nope... after debugging it, it wasn't the `null` case (which seems to work fine and skips any doc where the "has value" marker is `F`). It turns out that `fromBytesRefString` **can't** read empty arrays. The logic extracts the content from within the `[]`, then tries to split it by " ". The result is the empty string, so it thinks it has 1 "part". I'll follow up with a fix for `fromBytesRefString` instead.
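As an editorial aside, the `String.split` behavior described above is easy to miss. A tiny standalone illustration (not the Lucene code itself, just the JDK behavior the parser trips over):

```java
public class EmptySplitDemo {
  public static void main(String[] args) {
    // "[]" with the brackets stripped leaves an empty string.
    String inner = "[]".substring(1, "[]".length() - 1);

    // Splitting the empty string does NOT return a zero-length array:
    // it returns one "part" that is itself the empty string.
    String[] parts = inner.split(" ");
    System.out.println(parts.length);        // prints 1
    System.out.println(parts[0].isEmpty());  // prints true

    // A parser that allocates parts.length bytes and parses each part
    // would therefore try to parse "" as a byte value and fail.
  }
}
```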
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1888208867

> How is segment merging implemented by Lucene Vamana?

I didn't do anything special for Vamana in these experiments; the index construction and merging are practically identical to HNSW. The implication, of course, is that the vectors need to be in memory when building the merged graph, or performance falls off a cliff. To make this useful operationally you'd want to ensure your max segment size does not exceed available memory, and ideally you would build the index on a dedicated node so as not to displace pages being used to serve queries.

The DiskANN paper suggests clustering the vectors and assigning each vector to the two closest clusters, building a graph for each cluster, then overlaying the graphs and doing another round of pruning. You can then limit the amount of memory required based on the number of vectors and clusters. I didn't explore this, but it'd be a worthwhile improvement in the long term.

> Correct me if I'm wrong, looks like the Java ANN implementations examine the node ids in a more or less random order, filtering requires a `Bits` object. I've been wondering if SPANN solves this by enabling ascending doc id intersection.

That's correct, the graph is explored (roughly) in proximity order rather than doc id order. It's an interesting question about SPANN; intuitively it seems like you may be able to do as you describe by keeping each vector postings list sorted by doc id, then essentially doing a streaming merge sort on the relevant centroids' postings as you consume the candidates and iterate through the passed-in sorted doc id iterator (a rough sketch of that merge follows below).

> I have general concerns about large copy overheads

I believe all of today's IO layer implementations in Lucene ([including mmap on java 21](https://github.com/apache/lucene/blob/8bee41880e41024109bf5729584ebc5dd1003717/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java#L214)) copy the bytes from the page cache onto the heap one way or another. This could potentially be improved in the future by passing a `MemorySegment` around, but it seems to have gotten things pretty far so far without problems.

> I know that is a newer Linux kernel module/interface, so that would lead me to believe actually taking this implementation forward would require a bit of comprehensive testing. ... The other concern I have is that I suspect the community would probably need to keep an eye on io_uring given its age.

Yeah, I wouldn't suggest that this be the default implementation, or perhaps even included in Lucene core itself at all given how platform specific it is (and this POC at least required a small amount of C glue code, ~50 LOC, linking in `liburing` to make things smooth). My main goal here was to highlight to the community a few things that may be relevant beyond vector indexes and DiskANN:

- modern parallel I/O can provide significant performance improvements if you're able to remove I/O dependencies in the algorithms
- the kernel is not doing a great job in this case of determining the best pages to keep in memory

There's clearly a ton of perf left on the table by using sequential I/O and only the page cache in this case, but it's of course also much simpler both in terms of implementation and operational complexity.
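As an editorial aside, here is a minimal, self-contained sketch of the doc-id-ordered streaming merge idea mentioned above for SPANN-style centroid postings. The class and method names are hypothetical and this is not Lucene's API; it only illustrates a k-way merge over doc-id-sorted lists:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Streams doc ids from several doc-id-sorted postings lists in ascending order. */
final class PostingsMergeDemo {

  /** K-way merge: repeatedly emit the smallest head across all lists, de-duplicated. */
  static List<Integer> mergeByDocId(List<int[]> sortedPostings) {
    // Each queue entry is {docId, listIndex, offsetWithinList}.
    PriorityQueue<int[]> heads = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
    for (int i = 0; i < sortedPostings.size(); i++) {
      if (sortedPostings.get(i).length > 0) {
        heads.add(new int[] {sortedPostings.get(i)[0], i, 0});
      }
    }
    List<Integer> merged = new ArrayList<>();
    int lastEmitted = -1;
    while (!heads.isEmpty()) {
      int[] head = heads.poll();
      int docId = head[0];
      if (docId != lastEmitted) { // a vector may appear in two clusters; emit it once
        merged.add(docId);
        lastEmitted = docId;
      }
      int list = head[1];
      int next = head[2] + 1;
      if (next < sortedPostings.get(list).length) {
        heads.add(new int[] {sortedPostings.get(list)[next], list, next});
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    // Two hypothetical centroid postings lists, each sorted by doc id.
    List<int[]> postings = List.of(new int[] {2, 5, 9}, new int[] {3, 5, 11});
    System.out.println(mergeByDocId(postings)); // [2, 3, 5, 9, 11]
  }
}
```

In a real filter intersection you would advance this merged stream and the query's sorted doc id iterator in lockstep, only scoring vectors whose doc ids appear in both.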
[I] org.apache.lucene.search.TestByteVectorSimilarityQuery.testApproximate failing intermittently [lucene]
msfroh opened a new issue, #13009: URL: https://github.com/apache/lucene/issues/13009

### Description

### Failure

```
org.apache.lucene.search.TestByteVectorSimilarityQuery > testApproximate FAILED
    java.lang.UnsupportedOperationException
        at __randomizedtesting.SeedInfo.seed([5D3CBCC19F394DBA:55EAB2739217D945]:0)
        at org.apache.lucene.search.TestByteVectorSimilarityQuery$1.createVectorScorer(TestByteVectorSimilarityQuery.java:81)
        at org.apache.lucene.search.AbstractVectorSimilarityQuery$1.scorer(AbstractVectorSimilarityQuery.java:148)
        at org.apache.lucene.search.Weight.scorerSupplier(Weight.java:135)
        at org.apache.lucene.search.Weight.bulkScorer(Weight.java:167)
        at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:930)
        at org.apache.lucene.tests.search.AssertingWeight.bulkScorer(AssertingWeight.java:122)
        at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:930)
        at org.apache.lucene.tests.search.AssertingWeight.bulkScorer(AssertingWeight.java:122)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:678)
        at org.apache.lucene.tests.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:79)
        at org.apache.lucene.search.IndexSearcher.lambda$search$2(IndexSearcher.java:636)
        at org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$createTask$0(TaskExecutor.java:117)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
```

### Reproduce with

```
gradlew :lucene:core:test --tests "org.apache.lucene.search.TestByteVectorSimilarityQuery.testApproximate" -Ptests.jvms=2 \
  -Ptests.jvmargs= -Ptests.seed=5D3CBCC19F394DBA -Ptests.gui=true -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=512
```
Re: [PR] Output binary doc values as hex array in SimpleTextCodec [lucene]
msfroh commented on PR #12987: URL: https://github.com/apache/lucene/pull/12987#issuecomment-1888270133 Test failure is reproducible on main: https://github.com/apache/lucene/issues/13009
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
rmuir commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1888419082

i would be extremely careful around io_uring, it is disabled in many environments (e.g. by default in container environments) for security reasons:

* https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
* https://github.com/moby/moby/pull/46762
* https://github.com/containerd/containerd/pull/9320

To me, use of io_uring seems like a totally separate issue from KNN search. i don't think it is a good idea to entangle the two things.