Re: [PR] deps(java): bump org.owasp.dependencycheck from 12.1.2 to 12.1.3 [lucene]
dweiss merged PR #14805: URL: https://github.com/apache/lucene/pull/14805 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add a linter flag to suppress warning about incubating vector module. [lucene]
dweiss merged PR #14802: URL: https://github.com/apache/lucene/pull/14802
Re: [PR] Convert more PriorityQueues to use Comparator [lucene]
dweiss merged PR #14761: URL: https://github.com/apache/lucene/pull/14761
Re: [PR] Convert more PriorityQueues to use Comparator [lucene]
dweiss commented on PR #14761: URL: https://github.com/apache/lucene/pull/14761#issuecomment-2982845136 I've merged this into main. Perhaps we should add a marker to benchmarks, @mikemccand ?
Re: [PR] Remove all security manager and java security references [lucene]
dweiss merged PR #14801: URL: https://github.com/apache/lucene/pull/14801
Re: [I] Compression cache of numeric docvalues [lucene]
gf2121 commented on issue #14803: URL: https://github.com/apache/lucene/issues/14803#issuecomment-2982794733 Thanks for the feedback! I agree that a transparent compression filesystem is pretty straightforward and helpful. But I suspect it is hard for users to know when Lucene can take charge of compression and when it should be delegated to the filesystem. So what I was wondering about was the "default behavior". To be honest, many of our users are moving away to save storage costs. We are implementing our own custom codec, but it would be great if Lucene could be improved as well, though I understand it is not easy to introduce this in Lucene :)
Re: [I] Expand TieredMergePolicy deletePctAllowed limits [lucene]
jpountz commented on issue #11761: URL: https://github.com/apache/lucene/issues/11761#issuecomment-2980907072 I think I'd be ok with any lower bound that is strictly greater than 0. However, I am curious if the improvement that you are seeing is actually due to reducing the number of deletions, or if it is due to something else - for instance the index having fewer segments on average as a result of expunging deletes more aggressively. In theory, going from 5% deletes to 2% deletes should bring an improvement of _at most_ 1-(100-2)/(100-5) ~=3%. The fact that you are seeing more may suggest that this speedup may be due to having fewer/bigger segments? If this is the case, then you should be able to get similar results by bumping the floor and max merged segment sizes?
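A back-of-the-envelope reading of the bound quoted above, under two assumptions that are not stated in the thread: per-query work scales with the total number of documents (live plus deleted), and the number of live documents N stays fixed while the delete fraction d changes. The first line is the resulting cost ratio; the second is the expression from the comment, which is negative as written, so its magnitude is taken.

```latex
\[
\frac{\text{cost}(d = 0.02)}{\text{cost}(d = 0.05)}
  = \frac{N / (1 - 0.02)}{N / (1 - 0.05)}
  = \frac{0.95}{0.98} \approx 0.969
  \quad\Longrightarrow\quad \text{about } 3.1\% \text{ less work at most},
\]
\[
\left| 1 - \frac{100 - 2}{100 - 5} \right| = \frac{98}{95} - 1 \approx 0.032 \approx 3\%.
\]
```

Either way the ceiling is roughly 3%, so a noticeably larger speedup would have to come from somewhere else, such as the segment-count effect suggested in the comment.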
Re: [PR] Fix assemble source release [lucene]
github-actions[bot] commented on PR #14800: URL: https://github.com/apache/lucene/pull/14800#issuecomment-2979172131 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
Re: [I] Fix regression in assembleSourceTgz [lucene]
dweiss closed issue #14796: Fix regression in assembleSourceTgz URL: https://github.com/apache/lucene/issues/14796
Re: [PR] Fix assemble source release [lucene]
dweiss merged PR #14800: URL: https://github.com/apache/lucene/pull/14800
Re: [I] org.apache.lucene.search.TestPatienceFloatVectorQuery.testFindAll failed [lucene]
tteofili commented on issue #14694: URL: https://github.com/apache/lucene/issues/14694#issuecomment-2979300604 @benwtrent yeah, exactly, I think that's what we're seeing here.
Re: [I] Revert back to jgit for collecting git status [lucene]
dweiss commented on issue #14785: URL: https://github.com/apache/lucene/issues/14785#issuecomment-2979190779 Thanks, Uwe.

> The "working copy clean" check was faster and better implemented with jgit

It's not that bad, really - the format of the git tool's status may be a bit odd but it's ok once you read through the docs [1]. We just parse the output of native git [2]. Should work with all sorts of git extensions, respect local user settings, etc.

[1] https://git-scm.com/docs/git-status#_porcelain_format_version_2
[2] https://github.com/apache/lucene/blob/main/build-tools/build-infra/src/main/java/org/apache/lucene/gradle/plugins/gitinfo/GitInfoValueSource.java#L63-L91
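To illustrate what parsing the porcelain v2 output can look like, here is a minimal sketch. It is not the code in `GitInfoValueSource`; the class and method names are hypothetical, and it assumes the lines come from `git status --porcelain=v2 --branch`, where header lines start with `# ` and every other line describes a changed, unmerged, untracked or ignored entry.

```java
import java.util.List;

/** Hypothetical helper, not Lucene's implementation: interprets porcelain v2 status lines. */
final class PorcelainV2Status {
  private PorcelainV2Status() {}

  /**
   * Treats the working copy as clean when only header lines ("# ...") are present.
   * Assumption of this sketch: untracked ("? path") entries also count as dirty.
   */
  static boolean isClean(List<String> lines) {
    for (String line : lines) {
      if (line.isEmpty() || line.startsWith("# ")) {
        continue; // blank line or branch header, not a file entry
      }
      return false; // "1 ", "2 ", "u ", "? " or "! " entry
    }
    return true;
  }

  /** Returns the value of the "# branch.head <name>" header, or null when it is absent. */
  static String branchHead(List<String> lines) {
    String prefix = "# branch.head ";
    for (String line : lines) {
      if (line.startsWith(prefix)) {
        return line.substring(prefix.length());
      }
    }
    return null;
  }
}
```

Because the input is just the text printed by the native tool, whatever local configuration or extensions git applies is already reflected in the lines being parsed.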
Re: [I] Revert back to jgit for collecting git status [lucene]
dweiss closed issue #14785: Revert back to jgit for collecting git status URL: https://github.com/apache/lucene/issues/14785
Re: [PR] Make `pack` methods public for `BigIntegerPoint` and `HalfFloatPoint` [lucene]
prudhvigodithi commented on PR #14784: URL: https://github.com/apache/lucene/pull/14784#issuecomment-2982201768 Just pushed a commit to fix the conflicts. @jpountz a gentle follow up to see if we are ok to merge this change. Thanks
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2981931299 Thanks @mikemccand, I've made the suggested changes + rebased + improved some documentation!

> One could maybe use ulimit so the kernel will return null if the process tries to allocate too much RAM, and subtract the JVM heap from that total.

Interesting, I'll try to do this for HNSW (but RAM usage may be wildly different for other types of indexes / transforms / etc.)

> Anyway, I don't think this is a blocker for merging to sandbox -- we can learn over time the RAM usage.

+1 -- hopefully we'll discover / estimate its usage better over time..

> I don't think I have karma for it

I see, I've switched back to [conda-incubator/setup-miniconda](https://github.com/conda-incubator/setup-miniconda) so that we can test the codec regularly.. It does approximately the same thing as [mamba-org/setup-micromamba](https://github.com/mamba-org/setup-micromamba), but slower / with more overhead -- however this GH action is already allowed by Lucene for some reason :)
Re: [I] A multi-tenant ConcurrentMergeScheduler [lucene]
vigyasharma commented on issue #13883: URL: https://github.com/apache/lucene/issues/13883#issuecomment-2982081708 Thanks @yaser-aj, happy to see progress on this project. You're on the right track with understanding the problem. We want CMS to be aware of merge demands across IndexWriters, so that it can allocate/throttle merge threads more optimally.

> There has to be one MultiTenantConcurrentMergeScheduler object that organizes how all ConcurrentMergeScheduler objects operate and divide resources wisely across them. It should handle addition and deletion of ConcurrentMergeScheduler objects on the go, optimally without the need to restart all ConcurrentMergeScheduler objects every time the number of ConcurrentMergeScheduler objects changes.

This sounds like a CMS "Manager" that manages multiple other merge schedulers. It might be more effective to change the merge scheduler itself to be multi-tenant. This `MultiTenantConcurrentMergeScheduler` will consider all the active merges across all index writers whenever merges need to be throttled (see `ConcurrentMergeScheduler#maybeStall`, `ConcurrentMergeScheduler#updateIOThrottle`, `ConcurrentMergeScheduler#updateMergeThreads` etc). Internally, it would maintain some mapping of index writers -> merges to cleanly handle close for a single index writer without affecting merges for other writers. This would give you more direct control over scheduling merges from across index writers. We could then maintain a singleton for this "multiTenantCMS". Index writers would acquire an instance of that singleton and register themselves with it.
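The writers -> merges bookkeeping described above could be sketched roughly as follows. This is only an illustration of the idea, not a proposed Lucene API: `MultiTenantMergeRegistry`, its methods and the single global merge limit are all hypothetical, writers are represented by plain `Object` keys, and no real merge threads or I/O throttling are involved.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: one shared registry tracking active merges per index writer. */
final class MultiTenantMergeRegistry {
  // Assumed default limit for the shared instance; a real scheduler would derive this from the hardware.
  private static final MultiTenantMergeRegistry INSTANCE = new MultiTenantMergeRegistry(4);

  private final int maxActiveMerges;
  private final Map<Object, Set<String>> mergesByWriter = new HashMap<>();

  MultiTenantMergeRegistry(int maxActiveMerges) {
    this.maxActiveMerges = maxActiveMerges;
  }

  /** Writers would acquire the shared instance and register themselves with it. */
  static MultiTenantMergeRegistry getInstance() {
    return INSTANCE;
  }

  synchronized void register(Object writer) {
    mergesByWriter.putIfAbsent(writer, new HashSet<>());
  }

  /** Returns false when the global limit across all writers is reached (the caller would stall). */
  synchronized boolean tryStartMerge(Object writer, String mergeName) {
    Set<String> merges = mergesByWriter.get(writer);
    if (merges == null) {
      throw new IllegalStateException("writer is not registered");
    }
    if (activeMerges() >= maxActiveMerges) {
      return false;
    }
    merges.add(mergeName);
    return true;
  }

  synchronized void finishMerge(Object writer, String mergeName) {
    Set<String> merges = mergesByWriter.get(writer);
    if (merges != null) {
      merges.remove(mergeName);
    }
  }

  /** Closing one writer hands back only its own merges; other writers are untouched. */
  synchronized Set<String> unregister(Object writer) {
    Set<String> merges = mergesByWriter.remove(writer);
    return merges == null ? new HashSet<>() : merges;
  }

  synchronized int activeMerges() {
    int total = 0;
    for (Set<String> merges : mergesByWriter.values()) {
      total += merges.size();
    }
    return total;
  }
}
```

Keeping the merges keyed by writer is what lets a single writer close cleanly: `unregister` returns the merges that writer still owns so they can be aborted or awaited, while the global count used for throttling keeps reflecting everyone else.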
Re: [PR] Add a DoubleValuesSource for scoring full precision vector similarity [lucene]
vigyasharma merged PR #14708: URL: https://github.com/apache/lucene/pull/14708
[PR] deps(java): bump org.owasp.dependencycheck from 12.1.2 to 12.1.3 [lucene]
dependabot[bot] opened a new pull request, #14805: URL: https://github.com/apache/lucene/pull/14805 Bumps org.owasp.dependencycheck from 12.1.2 to 12.1.3. Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Re: [I] UnsupportedOperation when merging `Lucene90BlockTreeTermsWriter` [lucene]
benwtrent commented on issue #14429: URL: https://github.com/apache/lucene/issues/14429#issuecomment-2981454239 Working more on this, we have run multiple diagnostics on the machines; no hardware issues seem to arise. This issue arises not only on merge; I have also seen it on flush.

```
Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: unsupported_operation_exception: null
	at org.apache.lucene.util.fst.Outputs.merge(Outputs.java:95) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.util.fst.FSTCompiler.add(FSTCompiler.java:936) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingBlock.append(Lucene90BlockTreeTermsWriter.java:593) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingBlock.compileIndex(Lucene90BlockTreeTermsWriter.java:562) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlocks(Lucene90BlockTreeTermsWriter.java:776) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.finish(Lucene90BlockTreeTermsWriter.java:1163) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:402) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:172) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:134) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.IndexingChain.flush(IndexingChain.java:333) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:445) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:496) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.DocumentsWriter.maybeFlush(DocumentsWriter.java:450) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.DocumentsWriter.preUpdate(DocumentsWriter.java:391) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:413) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1561) ~[lucene-core-9.11.1.jar:?]
	at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1519) ~[lucene-core-9.11.1.jar:?]
```

What's even weirder, I have seen it happen during document replication, meaning the primary index seems to have accepted the doc without issue :( and it only failed on replica. I am still trying to get information about the field contents, but this is proving difficult.
Re: [PR] Add a Faiss codec for KNN searches [lucene]
mikemccand commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r2152136427 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/package-info.java: ## @@ -0,0 +1,48 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +/** + * Provides a Faiss-based vector format via {@link Review Comment: Link out to Faiss' source code? ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsWriter.java: ## @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.sandbox.codecs.faiss; + +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_CODEC_NAME; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_EXTENSION; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_CODEC_NAME; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_EXTENSION; +import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.VERSION_CURRENT; +import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.createIndex; +import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.indexWrite; + +import java.io.IOException; +import java.lang.foreign.Arena; +import java.lang.foreign.MemorySegment; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import org.apache.lucene.codecs.CodecUtil; +import org.apache.lucene.codecs.KnnFieldVectorsWriter; +import org.apache.lucene.codecs.KnnVectorsWriter; +import org.apache.lucene.codecs.hnsw.FlatFieldVectorsWriter; +import org.apache.lucene.codecs.hnsw.FlatVectorsWriter; +import org.apache.lucene.index.FieldInfo; +import org.apache.lucene.index.FloatVectorValues; +import org.apache.lucene.index.IndexFileNames; +import org.apache.lucene.index.MergeState; +import org.apache.lucene.index.SegmentWriteState; +import org.apache.lucene.index.Sorter; +import org.apache.lucene.index.VectorSimilarityFunction; +import org.apache.lucene.search.DocIdSet; +import org.apache.lucene.store.IndexOutput; +import org.apache.lucene.util.IOUtils; +import org.apache.lucene.util.hnsw.IntToIntFunction; + +/** + * Write per-segment Faiss indexes and associated metadata. 
+ * + * @lucene.experimental + */ +final class FaissKnnVectorsWriter extends KnnVectorsWriter { + private final String description, indexParams; + private final FlatVectorsWriter rawVectorsWriter; + private final IndexOutput meta, data; + private final Map> rawFields; + private boolean closed, finished; + + public FaissKnnVectorsWriter( + String description, + String indexParams, + SegmentWriteState state, + FlatVectorsWriter rawVectorsWriter) + throws IOException { + +this.description = description; +this.indexParams = indexParams; +this.rawVectorsWriter = rawVectorsWriter; +this.rawFields = new HashMap<>(); +this.closed = false; +this.finished = false; + +boolean failure = true; +try { + this.meta = openOutput(state, META_EXTENSION, META_CODEC_NAME); + this.data = openOutput(state, DATA_EXTENSION, DATA_CODEC_NAME); + failure = false; +} finally { + if (failure) { +IOUtils.closeWhileHandlingException(this); + } +} + } + + private IndexOutput openOutput(SegmentWriteState state, String extension, String codecName) + throws IOException { +Strin
Re: [I] Support for DocIdSetBuilder with (min,max) docId [lucene]
prudhvigodithi commented on issue #14485: URL: https://github.com/apache/lucene/issues/14485#issuecomment-2980630462 Filtering out out-of-range docs prevents giant bit-sets: we only add docs to the `DocIdSetBuilder` that are within the range of the `LeafReaderContextPartition`, which prevents each thread from storing doc IDs that belong to other partitions. In addition, should we also replace `maxDoc` (`this.threshold = maxDoc >>> 7;`) with the `LeafReaderContextPartition` size so that the threshold scales correctly? This way the resulting bit-sets are sized to the partition itself.
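A small sketch of the two ideas in this comment: dropping doc IDs outside the partition, and deriving the threshold from the partition size instead of the segment's `maxDoc`. Only the `>>> 7` expression comes from the comment; the class, the method names and the half-open `[minDocId, maxDocId)` range are assumptions made for illustration.

```java
/** Hypothetical sketch, not DocIdSetBuilder itself. */
final class PartitionThreshold {
  private PartitionThreshold() {}

  /** Same shape as the quoted `maxDoc >>> 7`: 1/128th of the doc-ID space being covered. */
  static int threshold(int docCount) {
    return docCount >>> 7;
  }

  /** Number of doc IDs in the half-open range [minDocId, maxDocId). */
  static int partitionSize(int minDocId, int maxDocId) {
    if (minDocId < 0 || maxDocId < minDocId) {
      throw new IllegalArgumentException("invalid range: " + minDocId + ".." + maxDocId);
    }
    return maxDocId - minDocId;
  }

  /** Only docs inside the partition would be handed to the builder. */
  static boolean inPartition(int doc, int minDocId, int maxDocId) {
    return doc >= minDocId && doc < maxDocId;
  }
}
```

For a segment with 1,000,000 docs split into four equal partitions, the per-segment threshold would be 7812 while the per-partition one would be 1953, so each thread's builder would scale with the quarter of the segment it actually covers.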
Re: [PR] Correct python release scripts for the new location of base version [lucene]
dweiss merged PR #14798: URL: https://github.com/apache/lucene/pull/14798
Re: [PR] Introduce getQuantizedVectorValues method in LeafReader to access QuantizedByteVectorValues [lucene]
msokolov commented on PR #14792: URL: https://github.com/apache/lucene/pull/14792#issuecomment-2981195460 For the return values use case, another choice is to disable it in the case the original vectors were not "stored" in the searchable index. Otherwise, I agree with Ben that we could support "rehydration" in the codec. For example, suppose we see that we have zero full-precision vectors, but nonzero quantized vectors; then we could fall back to "rehydration". For the counting case (get total number of vectors), should we always use the quantized count where today we use the full-precision count?
Re: [PR] detect and ban wildcard imports in Java [lucene]
rmuir commented on PR #14804: URL: https://github.com/apache/lucene/pull/14804#issuecomment-2981726602 Can we consider ast-grep for this? It is really fast, doesn't require regular expressions, and has plugins for editors. I wrote a rule for this in less than a minute:

```yaml
id: wildcard-import-not-allowed
language: java
rule:
  kind: asterisk
  inside:
    kind: import_declaration
severity: error
message: don't use wildcard imports
note: please use full import instead
url: https://whatever/explanation
```
Re: [I] Make HNSW merges cheaper on heap [lucene]
ChrisHegarty commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2981080469 The on-heap memory used for the per-node neighbour array during building the HNSW graph has been significantly reduced, by approximately 3-4x, see https://github.com/apache/lucene/pull/14527.
[I] Compression cache of numeric docvalues [lucene]
gf2121 opened a new issue, #14803: URL: https://github.com/apache/lucene/issues/14803 ### Description When benchmarking recently with some OLAP engines (no indexes, no stored fields, only column data), the results showed that they only occupy 50-70% of the storage of `NumericDocvalues`, with comparable performance, which is surprising. I looked into their implementation and it turns out they simply use BitShuffle and LZ4 to compress data blocks on the write side, and use a global cache on the read side to cache decompressed data. So in Lucene, we have non-compressed data (MMap) on both disk and in memory, but they have compressed data on disk and decompressed data in memory, which sounds quite reasonable to me. I believe that things like global cache can be easily done in a service (like ES) through a custom codec, but I still wonder if we can do something on our default codec?
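For readers unfamiliar with the bit-shuffle step mentioned above, the sketch below shows the general idea on a block of 64 `long` values: the bits are transposed so that bits of the same significance end up together, which turns small values into long runs of zeros that a general-purpose compressor such as LZ4 handles well. This is a simplified illustration with hypothetical names; it does not reproduce the layout of any particular BitShuffle library, and the LZ4 step and the read-side cache are left out.

```java
/** Hypothetical sketch of a bit-plane transpose over a block of 64 longs. */
final class BitPlaneShuffle {
  private BitPlaneShuffle() {}

  /** Bit b of values[i] becomes bit i of the returned planes[b]. Applying it twice restores the input. */
  static long[] transpose(long[] values) {
    if (values.length != 64) {
      throw new IllegalArgumentException("expected a block of exactly 64 values");
    }
    long[] planes = new long[64];
    for (int i = 0; i < 64; i++) {
      long v = values[i];
      for (int b = 0; b < 64; b++) {
        planes[b] |= ((v >>> b) & 1L) << i;
      }
    }
    return planes;
  }

  /** Counts planes that carry any set bit; all-zero planes compress to almost nothing. */
  static int nonZeroPlanes(long[] planes) {
    int count = 0;
    for (long plane : planes) {
      if (plane != 0L) {
        count++;
      }
    }
    return count;
  }
}
```

With values that fit in a few bits, most of the 64 output words are zero, which is where the storage saving would come from once a compressor runs over the shuffled block.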
[PR] detect and ban wildcard imports in Java [lucene]
dweiss opened a new pull request, #14804: URL: https://github.com/apache/lucene/pull/14804 Fixes #14553. I'm not completely happy with this. For some reason, the custom formatting step always triggers full spotless run - incremental mode doesn't work.

```
> ./gradlew -p lucene/grouping/ spotlessCheck --info
> ./gradlew -p lucene/grouping/ spotlessCheck --info
...
> Task :lucene:grouping:spotlessJava
Caching disabled for task ':lucene:grouping:spotlessJava' because:
  Build cache is disabled
Task ':lucene:grouping:spotlessJava' is not up-to-date because:
  Value of input property 'stepsInternalEquality' has changed for task ':lucene:grouping:spotlessJava'
The input changes require a full rebuild for incremental task ':lucene:grouping:spotlessJava'.
Not incremental: removing prior outputs
Resolve mutations for :lucene:grouping:spotlessJavaCheck (Thread[#9108,Execution worker Thread 4,5,main]) started.
:lucene:grouping:spotlessJavaCheck (Thread[#9108,Execution worker Thread 4,5,main]) started.
```

The detection is also costly (regexp over the entire codebase); this could be probably simplified to line-by-line scanning and a heuristic to short-circuit early when import statements are no longer possible...
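The line-by-line alternative mentioned at the end could look something like the sketch below. It is not the check implemented in the PR; the class name and the heuristics (skip comments, blank lines, annotations and the `package` line, stop at the first other statement) are assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a line-by-line wildcard import check with an early exit. */
final class WildcardImportScanner {
  private WildcardImportScanner() {}

  /** Returns the 1-based line numbers of wildcard imports found before the first type declaration. */
  static List<Integer> findWildcardImports(List<String> lines) {
    List<Integer> hits = new ArrayList<>();
    boolean inBlockComment = false;
    for (int i = 0; i < lines.size(); i++) {
      String line = lines.get(i).strip();
      if (inBlockComment) {
        if (line.contains("*/")) {
          inBlockComment = false;
        }
        continue;
      }
      if (line.isEmpty() || line.startsWith("//") || line.startsWith("@")) {
        continue; // blank line, line comment, or annotation (e.g. in package-info.java)
      }
      if (line.startsWith("/*")) {
        inBlockComment = !line.contains("*/");
        continue;
      }
      if (line.startsWith("package ")) {
        continue;
      }
      if (line.startsWith("import ")) {
        if (line.endsWith(".*;")) {
          hits.add(i + 1);
        }
        continue;
      }
      break; // anything else: import statements are no longer possible, stop scanning
    }
    return hits;
  }
}
```

The early `break` is what keeps this cheap: for a typical source file only the license header and import section are read. The heuristic is deliberately rough; for instance, an import followed by a trailing comment on the same line would be missed.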
Re: [PR] detect and ban wildcard imports in Java [lucene]
github-actions[bot] commented on PR #14804: URL: https://github.com/apache/lucene/pull/14804#issuecomment-2981718232 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
Re: [I] Compression cache of numeric docvalues [lucene]
rmuir commented on issue #14803: URL: https://github.com/apache/lucene/issues/14803#issuecomment-2981632167 IMO: just use a filesystem with this feature such as zfs.
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on code in PR #14178:
URL: https://github.com/apache/lucene/pull/14178#discussion_r2153208743

## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/package-info.java: ##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+/**
+ * Provides a Faiss-based vector format via {@link

Review Comment:
Done!
Re: [I] build and push release regression [lucene]
dweiss closed issue #14786: build and push release regression URL: https://github.com/apache/lucene/issues/14786
[PR] Remove all security manager and java security references [lucene]
dweiss opened a new pull request, #14801: URL: https://github.com/apache/lucene/pull/14801 these are no-ops in JDK24+.
Re: [PR] Remove all security manager and java security references [lucene]
dweiss commented on code in PR #14801:
URL: https://github.com/apache/lucene/pull/14801#discussion_r2151947884

## build-tools/build-infra/src/main/groovy/lucene.validation.ecj-lint.gradle: ##
@@ -74,6 +76,8 @@ def lintTasks = sourceSets.collect { SourceSet sourceSet ->
 dependsOn sourceSet.compileClasspath
 dependsOn ecjConfiguration
+mustRunAfter tasks.withType(SpotlessApply)

Review Comment:
piggybacking this so that 'gradlew tidy check' works properly...
Re: [PR] Adjust base knn format assert assertOffHeapByteSize [lucene]
benwtrent merged PR #14797: URL: https://github.com/apache/lucene/pull/14797
[PR] Add a linter flag to suppress warning about incubating vector module. [lucene]
dweiss opened a new pull request, #14802: URL: https://github.com/apache/lucene/pull/14802 (no comment)
Re: [PR] Add a linter flag to suppress warning about incubating vector module. [lucene]
github-actions[bot] commented on PR #14802: URL: https://github.com/apache/lucene/pull/14802#issuecomment-2979902324 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
Re: [PR] .editorconfig [lucene]
dsmiley commented on PR #14740: URL: https://github.com/apache/lucene/pull/14740#issuecomment-2979904591 OMG that's ironic! @rmuir, you added it (in March), and it only configures Python :-) LOL Okay... well, I think that file should be removed and its Python section integrated into the top-level file coming with this PR. As to the very specifics of what it configures, I don't care (whatever you guys say), as Python isn't my thing.
Re: [PR] Add a Faiss codec for KNN searches [lucene]
mikemccand commented on PR #14178:
URL: https://github.com/apache/lucene/pull/14178#issuecomment-2980130792

> As a follow up, could you allow the [`mamba-org/setup-micromamba`](https://github.com/mamba-org/setup-micromamba) GH action to run on the Lucene repository -- so that the Faiss codec can be tested regularly? (we need `micromamba` to pull Faiss libraries from Conda, as a faster alternative to `miniconda`, `miniforge`, etc). It can be done from `Settings > Code and automation > Actions > General > Actions permissions`

+1 for this, but I don't think I have karma for it (I don't see the Settings tab for `apache/lucene` repo) -- I'm not sure who does? @dweiss maybe? Or it might be we need ASF Infra help?