Re: [PR] deps(java): bump org.owasp.dependencycheck from 12.1.2 to 12.1.3 [lucene]

2025-06-17 Thread via GitHub


dweiss merged PR #14805:
URL: https://github.com/apache/lucene/pull/14805


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add a linter flag to suppress warning about incubating vector module. [lucene]

2025-06-17 Thread via GitHub


dweiss merged PR #14802:
URL: https://github.com/apache/lucene/pull/14802





Re: [PR] Convert more PriorityQueues to use Comparator [lucene]

2025-06-17 Thread via GitHub


dweiss merged PR #14761:
URL: https://github.com/apache/lucene/pull/14761





Re: [PR] Convert more PriorityQueues to use Comparator [lucene]

2025-06-17 Thread via GitHub


dweiss commented on PR #14761:
URL: https://github.com/apache/lucene/pull/14761#issuecomment-2982845136

   I've merged this into main. Perhaps we should add a marker to benchmarks, 
@mikemccand ?





Re: [PR] Remove all security manager and java security references [lucene]

2025-06-17 Thread via GitHub


dweiss merged PR #14801:
URL: https://github.com/apache/lucene/pull/14801





Re: [I] Compression cache of numeric docvalues [lucene]

2025-06-17 Thread via GitHub


gf2121 commented on issue #14803:
URL: https://github.com/apache/lucene/issues/14803#issuecomment-2982794733

   Thanks for feedback! 
   
   I agree that a transparent compression filesystem is pretty straightforward 
and helpful. But I suspect it is hard for users to know when Lucene can take 
charge of compression and when it should be delegated to the filesystem. So what 
I wondered about was the "default behavior".
   
   To be honest, many of our users are moving away to save storage costs. We are 
implementing our custom codec, but it would be great if Lucene could be improved 
as well, though I understand it is not easy to introduce this in Lucene :)





Re: [I] Expand TieredMergePolicy deletePctAllowed limits [lucene]

2025-06-17 Thread via GitHub


jpountz commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-2980907072

   I think I'd be ok with any lower bound that is strictly greater than 0. 
However, I am curious if the improvement that you are seeing is actually due to 
reducing the number of deletions, or if it is due to something else - for 
instance the index having fewer segments on average as a result of expunging 
deletes more aggressively.
   
   In theory, going from 5% deletes to 2% deletes should bring an improvement 
of _at most_ 1-(100-2)/(100-5) ~=3%. The fact that you are seeing more may 
suggest that this speedup may be due to having fewer/bigger segments? If this 
is the case, then you should be able to get similar results by bumping the 
floor and max merged segment sizes?
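
The bound quoted above can be worked through under one stated assumption: query cost scales linearly with the total number of documents visited, live plus deleted. With L live documents and a deleted fraction d, the index holds L/(1-d) documents in total, so the cost ratio between the two settings is as sketched below. This is an illustration of the arithmetic, not a measurement.

```latex
\[
\frac{\text{cost}_{2\%}}{\text{cost}_{5\%}}
  = \frac{L/(1-0.02)}{L/(1-0.05)}
  = \frac{95}{98} \approx 0.969,
\qquad
1 - \frac{95}{98} = \frac{3}{98} \approx 3.1\%.
\]
```

Expressed as a throughput gain instead of a cost reduction, the same ratio gives 98/95 - 1 = 3/95, or about 3.2%. Either way the ceiling is roughly 3%, so a larger observed speedup would have to come from something else.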





Re: [PR] Fix assemble source release [lucene]

2025-06-17 Thread via GitHub


github-actions[bot] commented on PR #14800:
URL: https://github.com/apache/lucene/pull/14800#issuecomment-2979172131

   This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. 
If the PR doesn't need a changelog entry, then add the skip-changelog label to 
it and you will stop receiving this reminder on future updates to the PR.





Re: [I] Fix regression in assembleSourceTgz [lucene]

2025-06-17 Thread via GitHub


dweiss closed issue #14796: Fix regression in assembleSourceTgz
URL: https://github.com/apache/lucene/issues/14796





Re: [PR] Fix assemble source release [lucene]

2025-06-17 Thread via GitHub


dweiss merged PR #14800:
URL: https://github.com/apache/lucene/pull/14800





Re: [I] org.apache.lucene.search.TestPatienceFloatVectorQuery.testFindAll failed [lucene]

2025-06-17 Thread via GitHub


tteofili commented on issue #14694:
URL: https://github.com/apache/lucene/issues/14694#issuecomment-2979300604

   @benwtrent yeah, exactly, I think that's what we're seeing here.





Re: [I] Revert back to jgit for collecting git status [lucene]

2025-06-17 Thread via GitHub


dweiss commented on issue #14785:
URL: https://github.com/apache/lucene/issues/14785#issuecomment-2979190779

   Thanks, Uwe.
   
   > The "working copy clean" check was faster and better implemented with jgit
   
   It's not that bad, really - the format of the git tool's status may be a bit 
odd but it's ok once you read through the docs [1]. We just parse the output of 
native git [2]. Should work with all sorts of git extensions, respect local 
user settings, etc.
   
   [1] https://git-scm.com/docs/git-status#_porcelain_format_version_2
   [2] 
https://github.com/apache/lucene/blob/main/build-tools/build-infra/src/main/java/org/apache/lucene/gradle/plugins/gitinfo/GitInfoValueSource.java#L63-L91
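
As a rough illustration of what parsing the porcelain v2 format involves, the sketch below classifies lines of `git status --porcelain=v2` output. It is not the code in `GitInfoValueSource`; the class name `PorcelainV2Status` and the decision to treat ignored entries as clean are assumptions made for this example. The line prefixes (`#` headers, `1` ordinary changes, `2` renames/copies, `u` unmerged, `?` untracked, `!` ignored) follow the git-status documentation.

```java
import java.util.List;

/** Hypothetical helper: classifies lines of `git status --porcelain=v2` output. */
final class PorcelainV2Status {
  /** Returns true if no line reports a tracked change, an unmerged path, or an untracked file. */
  static boolean isClean(List<String> lines) {
    for (String line : lines) {
      if (line.isEmpty() || line.startsWith("# ") || line.startsWith("! ")) {
        // Headers (emitted with --branch) and ignored files (emitted with --ignored)
        // are assumed not to make the working copy dirty.
        continue;
      }
      // Anything else is "1 " (ordinary change), "2 " (rename/copy),
      // "u " (unmerged) or "? " (untracked): the checkout is dirty.
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    List<String> clean = List.of("# branch.oid 0123abcd", "# branch.head main");
    List<String> dirty = List.of("# branch.head main", "? notes.txt");
    System.out.println(isClean(clean)); // prints true
    System.out.println(isClean(dirty)); // prints false
  }
}
```

In a build, the lines would come from running the native `git` binary, which is what lets the check respect local user settings and git extensions.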





Re: [I] Revert back to jgit for collecting git status [lucene]

2025-06-17 Thread via GitHub


dweiss closed issue #14785: Revert back to jgit for collecting git status
URL: https://github.com/apache/lucene/issues/14785





Re: [PR] Make `pack` methods public for `BigIntegerPoint` and `HalfFloatPoint` [lucene]

2025-06-17 Thread via GitHub


prudhvigodithi commented on PR #14784:
URL: https://github.com/apache/lucene/pull/14784#issuecomment-2982201768

   Just pushed a commit to fix the conflicts. @jpountz a gentle follow up to 
see if we are ok to merge this change.
   Thanks 
   





Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-06-17 Thread via GitHub


kaivalnp commented on PR #14178:
URL: https://github.com/apache/lucene/pull/14178#issuecomment-2981931299

   Thanks @mikemccand, I've made the suggested changes + rebased + improved 
some documentation!
   
   > One could maybe use ulimit so the kernel will return null if the process 
tries to allocate too much RAM, and subtract the JVM heap from that total.
   
   Interesting, I'll try to do this for HNSW (but RAM usage may be wildly 
different for other types of indexes / transforms / etc.)
   
   > Anyway, I don't think this is a blocker for merging to sandbox -- we can 
learn over time the RAM usage.
   
   +1 -- hopefully we'll discover / estimate its usage better over time..
   
   > I don't think I have karma for it
   
   I see, I've switched back to 
[conda-incubator/setup-miniconda](https://github.com/conda-incubator/setup-miniconda)
 so that we can test the codec regularly..
   It does approximately the same thing as 
[mamba-org/setup-micromamba](https://github.com/mamba-org/setup-micromamba), 
but slower / with more overhead -- however this GH action is already allowed by 
Lucene for some reason :)





Re: [I] A multi-tenant ConcurrentMergeScheduler [lucene]

2025-06-17 Thread via GitHub


vigyasharma commented on issue #13883:
URL: https://github.com/apache/lucene/issues/13883#issuecomment-2982081708

   Thanks @yaser-aj , happy to see progress on this project.
   
   You're on the right track with understanding the problem. We want CMS to be 
aware of merge demands across IndexWriters, so that it can allocate/throttle 
merge threads more optimally.
   
   __
   
   > There has to be one MultiTenantConcurrentMergeScheduler object that 
organizes how all ConcurrentMergeScheduler objects operate and divide resources 
wisely across them. It should handle addition and deletion of 
ConcurrentMergeScheduler objects on the go, optimally without the need to 
restart all ConcurrentMergeScheduler objects every time the number of 
ConcurrentMergeScheduler objects changes.
   
   This sounds like a CMS "Manager" that manages multiple other merge 
schedulers. It might be more effective to change the merge scheduler itself to 
be multi-tenant. This `MultiTenantConcurrentMergeScheduler` will consider all 
the active merges across all index writers whenever merges need to be throttled 
(see `ConcurrentMergeScheduler#maybeStall`, 
`ConcurrentMergeScheduler#updateIOThrottle`, 
`ConcurrentMergeScheduler#updateMergeThreads` etc). Internally, it would 
maintain some mapping of index writers -> merges to cleanly handle close for a 
single index writer without affecting merges for other writers. This would give 
you more direct control over scheduling merges from across index writers.
   
   We could then maintain a singleton for this "multiTenantCMS". Index writers 
would acquire an instance of that singleton and register themselves with it.
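
The bookkeeping described above (one shared scheduler, a mapping of index writers to their merges, a clean close for a single writer) might look roughly like the sketch below. Everything in it is hypothetical: `MultiTenantMergeRegistry`, its methods, and the string ids are invented for illustration and are not part of Lucene's `ConcurrentMergeScheduler` API. Real throttling would involve threads and IO rates; here a single global merge budget stands in for it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of writer -> merges bookkeeping; not Lucene's API. */
final class MultiTenantMergeRegistry {
  // Single shared instance that index writers would register with (budget of 4 is arbitrary).
  private static final MultiTenantMergeRegistry INSTANCE = new MultiTenantMergeRegistry(4);

  private final int maxConcurrentMerges;
  private final Map<String, Set<String>> mergesByWriter = new HashMap<>();

  MultiTenantMergeRegistry(int maxConcurrentMerges) {
    this.maxConcurrentMerges = maxConcurrentMerges;
  }

  static MultiTenantMergeRegistry getInstance() {
    return INSTANCE;
  }

  synchronized void register(String writerId) {
    mergesByWriter.putIfAbsent(writerId, new HashSet<>());
  }

  /** Returns false when the global budget is exhausted, i.e. the merge should stall. */
  synchronized boolean tryStartMerge(String writerId, String mergeId) {
    Set<String> merges = mergesByWriter.get(writerId);
    if (merges == null) {
      throw new IllegalStateException("writer not registered: " + writerId);
    }
    if (activeMergeCount() >= maxConcurrentMerges) {
      return false; // budget is shared across all writers
    }
    return merges.add(mergeId);
  }

  synchronized void finishMerge(String writerId, String mergeId) {
    Set<String> merges = mergesByWriter.get(writerId);
    if (merges != null) {
      merges.remove(mergeId);
    }
  }

  /** Drops one writer and its merges without touching other writers; returns merges dropped. */
  synchronized int close(String writerId) {
    Set<String> removed = mergesByWriter.remove(writerId);
    return removed == null ? 0 : removed.size();
  }

  /** Total merges in flight across every registered writer. */
  synchronized int activeMergeCount() {
    int total = 0;
    for (Set<String> merges : mergesByWriter.values()) {
      total += merges.size();
    }
    return total;
  }
}
```

Keeping the mapping inside the scheduler, as opposed to a separate manager object, means the throttling decision always sees the global merge count directly.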
   
   





Re: [PR] Add a DoubleValuesSource for scoring full precision vector similarity [lucene]

2025-06-17 Thread via GitHub


vigyasharma merged PR #14708:
URL: https://github.com/apache/lucene/pull/14708





[PR] deps(java): bump org.owasp.dependencycheck from 12.1.2 to 12.1.3 [lucene]

2025-06-17 Thread via GitHub


dependabot[bot] opened a new pull request, #14805:
URL: https://github.com/apache/lucene/pull/14805

   Bumps org.owasp.dependencycheck from 12.1.2 to 12.1.3.
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.owasp.dependencycheck&package-manager=gradle&previous-version=12.1.2&new-version=12.1.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   





Re: [I] UnsupportedOperation when merging `Lucene90BlockTreeTermsWriter` [lucene]

2025-06-17 Thread via GitHub


benwtrent commented on issue #14429:
URL: https://github.com/apache/lucene/issues/14429#issuecomment-2981454239

   Working more on this: we have run multiple diagnostics on the machines, and 
no hardware issues seem to arise.
   
   This issue arises not only on merge; I have also seen it on flush. 
   
   ```
   Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: unsupported_operation_exception: null
       at org.apache.lucene.util.fst.Outputs.merge(Outputs.java:95) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.util.fst.FSTCompiler.add(FSTCompiler.java:936) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingBlock.append(Lucene90BlockTreeTermsWriter.java:593) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingBlock.compileIndex(Lucene90BlockTreeTermsWriter.java:562) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlocks(Lucene90BlockTreeTermsWriter.java:776) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.finish(Lucene90BlockTreeTermsWriter.java:1163) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:402) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:172) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:134) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.index.IndexingChain.flush(IndexingChain.java:333) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:445) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:496) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.index.DocumentsWriter.maybeFlush(DocumentsWriter.java:450) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.index.DocumentsWriter.preUpdate(DocumentsWriter.java:391) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:413) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1561) ~[lucene-core-9.11.1.jar:?]
       at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1519) ~[lucene-core-9.11.1.jar:?]
   ```
   
   What's even weirder, I have seen it happen during document replication, 
meaning the primary index seems to have accepted the doc without issue :( and 
it only failed on replica.
   
   I am still trying to get information about the field contents, but this is 
proving difficult.





Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-06-17 Thread via GitHub


mikemccand commented on code in PR #14178:
URL: https://github.com/apache/lucene/pull/14178#discussion_r2152136427


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/package-info.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+/**
+ * Provides a Faiss-based vector format via {@link

Review Comment:
   Link out to Faiss' source code?



##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsWriter.java:
##
@@ -0,0 +1,240 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.codecs.faiss;
+
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_CODEC_NAME;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.DATA_EXTENSION;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_CODEC_NAME;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.META_EXTENSION;
+import static org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat.VERSION_CURRENT;
+import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.createIndex;
+import static org.apache.lucene.sandbox.codecs.faiss.LibFaissC.indexWrite;
+
+import java.io.IOException;
+import java.lang.foreign.Arena;
+import java.lang.foreign.MemorySegment;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.KnnFieldVectorsWriter;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatFieldVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatVectorsWriter;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.Sorter;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.search.DocIdSet;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.hnsw.IntToIntFunction;
+
+/**
+ * Write per-segment Faiss indexes and associated metadata.
+ *
+ * @lucene.experimental
+ */
+final class FaissKnnVectorsWriter extends KnnVectorsWriter {
+  private final String description, indexParams;
+  private final FlatVectorsWriter rawVectorsWriter;
+  private final IndexOutput meta, data;
+  private final Map> rawFields;
+  private boolean closed, finished;
+
+  public FaissKnnVectorsWriter(
+      String description,
+      String indexParams,
+      SegmentWriteState state,
+      FlatVectorsWriter rawVectorsWriter)
+      throws IOException {
+
+    this.description = description;
+    this.indexParams = indexParams;
+    this.rawVectorsWriter = rawVectorsWriter;
+    this.rawFields = new HashMap<>();
+    this.closed = false;
+    this.finished = false;
+
+    boolean failure = true;
+    try {
+      this.meta = openOutput(state, META_EXTENSION, META_CODEC_NAME);
+      this.data = openOutput(state, DATA_EXTENSION, DATA_CODEC_NAME);
+      failure = false;
+    } finally {
+      if (failure) {
+        IOUtils.closeWhileHandlingException(this);
+      }
+    }
+  }
+
+  private IndexOutput openOutput(SegmentWriteState state, String extension, String codecName)
+      throws IOException {
+Strin

Re: [I] Support for DocIdSetBuilder with (min,max) docId [lucene]

2025-06-17 Thread via GitHub


prudhvigodithi commented on issue #14485:
URL: https://github.com/apache/lucene/issues/14485#issuecomment-2980630462

   Filtering out out-of-range docs prevents giant bit sets: we only add docs to 
the `DocIdSetBuilder` that are within the range of the 
`LeafReaderContextPartition`, which prevents each thread from storing doc IDs 
that belong to other partitions. 
   
   In addition, should we also replace `maxDoc` in the threshold computation 
(`this.threshold = maxDoc >>> 7;`) with the `LeafReaderContextPartition` size so 
that the threshold scales correctly? That way the resulting bit sets are sized 
to the partition itself.
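
The two ideas in this comment (drop doc IDs outside the partition, and size both the bit set and the threshold to the partition) can be sketched as follows. This is not Lucene's `DocIdSetBuilder`: the class `PartitionDocIdCollector`, its fields, and the use of `java.util.BitSet` are assumptions made to keep the example self-contained. The `>>> 7` shift mirrors the expression quoted above.

```java
import java.util.BitSet;

/** Hypothetical sketch of a partition-aware doc ID collector; not Lucene's DocIdSetBuilder. */
final class PartitionDocIdCollector {
  private final int minDocId; // inclusive lower bound of the partition
  private final int maxDocId; // exclusive upper bound of the partition
  private final int threshold; // scaled to the partition, not to the segment's maxDoc
  private final BitSet bits; // bit i represents doc ID (minDocId + i)

  PartitionDocIdCollector(int minDocId, int maxDocId) {
    if (minDocId < 0 || maxDocId <= minDocId) {
      throw new IllegalArgumentException("invalid range [" + minDocId + ", " + maxDocId + ")");
    }
    this.minDocId = minDocId;
    this.maxDocId = maxDocId;
    int partitionSize = maxDocId - minDocId;
    this.threshold = partitionSize >>> 7;
    this.bits = new BitSet(partitionSize);
  }

  /** Records docId if it falls inside the partition; returns false if it was filtered out. */
  boolean add(int docId) {
    if (docId < minDocId || docId >= maxDocId) {
      return false; // belongs to another partition
    }
    bits.set(docId - minDocId);
    return true;
  }

  boolean contains(int docId) {
    return docId >= minDocId && docId < maxDocId && bits.get(docId - minDocId);
  }

  int cardinality() {
    return bits.cardinality();
  }

  int threshold() {
    return threshold;
  }
}
```

For a partition covering doc IDs 1000 to 2279, the bit set holds 1280 bits and the threshold is 1280 >>> 7 = 10, regardless of how large the whole segment is.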





Re: [PR] Correct python release scripts for the new location of base version [lucene]

2025-06-17 Thread via GitHub


dweiss merged PR #14798:
URL: https://github.com/apache/lucene/pull/14798





Re: [PR] Introduce getQuantizedVectorValues method in LeafReader to access QuantizedByteVectorValues [lucene]

2025-06-17 Thread via GitHub


msokolov commented on PR #14792:
URL: https://github.com/apache/lucene/pull/14792#issuecomment-2981195460

   For the return values use case, another choice is to disable it in the case 
the original vectors were not "stored" in the searchable index. Otherwise, I 
agree with Ben that we could support "rehydration" in the codec. For example, 
suppose we see that we have zero full-precision vectors, but nonzero quantized 
vectors; then we could fall back to "rehydration". 
   
   For the counting case (get total number of vectors), should we always use 
the quantized count where today we use the full-precision count?





Re: [PR] detect and ban wildcard imports in Java [lucene]

2025-06-17 Thread via GitHub


rmuir commented on PR #14804:
URL: https://github.com/apache/lucene/pull/14804#issuecomment-2981726602

   Can we consider ast-grep for this? It is really fast, doesn't require 
regular expressions, and has plugins for editors. I wrote a rule for this in 
less than a minute:
   ```yaml
   id: wildcard-import-not-allowed
   language: java
   rule:
     kind: asterisk
     inside:
       kind: import_declaration
   severity: error
   message: don't use wildcard imports
   note: please use full import instead
   url: https://whatever/explanation
   ```





Re: [I] Make HNSW merges cheaper on heap [lucene]

2025-06-17 Thread via GitHub


ChrisHegarty commented on issue #14208:
URL: https://github.com/apache/lucene/issues/14208#issuecomment-2981080469

   The on-heap memory used for the per-node neighbour array during building the 
HNSW graph has been significantly reduced, by approximately 3-4x, see 
https://github.com/apache/lucene/pull/14527.





[I] Compression cache of numeric docvalues [lucene]

2025-06-17 Thread via GitHub


gf2121 opened a new issue, #14803:
URL: https://github.com/apache/lucene/issues/14803

   ### Description
   
   When benchmarking recently with some OLAP engines (no indexes, no stored 
fields, only column data), the results showed that they only occupy 50-70% of 
the storage of `NumericDocvalues`, with comparable performance, which is 
surprising. I looked into their implementation and it turns out they simply use 
BitShuffle and LZ4 to compress data blocks on the write side, and use a global 
cache on the read side to cache decompressed data.
   
   So in Lucene, we have non-compressed data (MMap) on both disk and in memory, 
but they have compressed data on disk and decompressed data in memory, which 
sounds quite reasonable to me. I believe that things like global cache can be 
easily done in a service (like ES) through a custom codec, but I still wonder 
if we can do something on our default codec?
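
   For illustration only, a minimal Java sketch of the write side described 
above: bit-shuffle a block of values, then compress it. `BitShuffleBlock` and 
its methods are hypothetical names, not Lucene APIs, and `java.util.zip.Deflater` 
stands in for LZ4 only because the JDK does not ship an LZ4 codec. The read-side 
cache of decompressed blocks is left out.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch, not Lucene code: bit-shuffle a block of longs, then compress it. */
final class BitShuffleBlock {
  private BitShuffleBlock() {}

  /** Transposes bits: bit 0 of every value comes first, then bit 1 of every value, and so on. */
  static byte[] shuffle(long[] values) {
    int n = values.length;
    byte[] out = new byte[n * 8]; // 64 bits per value
    for (int bit = 0; bit < 64; bit++) {
      for (int i = 0; i < n; i++) {
        if (((values[i] >>> bit) & 1L) != 0) {
          int pos = bit * n + i;
          out[pos >>> 3] |= (byte) (1 << (pos & 7));
        }
      }
    }
    return out;
  }

  /** Inverse of {@link #shuffle}; n is the number of values in the block. */
  static long[] unshuffle(byte[] shuffled, int n) {
    long[] values = new long[n];
    for (int bit = 0; bit < 64; bit++) {
      for (int i = 0; i < n; i++) {
        int pos = bit * n + i;
        if ((shuffled[pos >>> 3] & (1 << (pos & 7))) != 0) {
          values[i] |= 1L << bit;
        }
      }
    }
    return values;
  }

  /** Shuffles and compresses one block. Deflater is an assumption standing in for LZ4. */
  static byte[] compress(long[] values) {
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    try {
      deflater.setInput(shuffle(values));
      deflater.finish();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      while (!deflater.finished()) {
        out.write(buf, 0, deflater.deflate(buf));
      }
      return out.toByteArray();
    } finally {
      deflater.end();
    }
  }

  /** Decompresses and unshuffles a block that holds exactly n values. */
  static long[] decompress(byte[] compressed, int n) {
    Inflater inflater = new Inflater();
    try {
      inflater.setInput(compressed);
      byte[] shuffled = new byte[n * 8];
      int off = 0;
      while (off < shuffled.length) {
        int read = inflater.inflate(shuffled, off, shuffled.length - off);
        if (read == 0 && (inflater.finished() || inflater.needsInput())) {
          throw new IllegalStateException("truncated block");
        }
        off += read;
      }
      return unshuffle(shuffled, n);
    } catch (DataFormatException e) {
      throw new IllegalStateException("corrupt block", e);
    } finally {
      inflater.end();
    }
  }
}
```

   Bit-shuffling groups the same bit of every value together, so a block whose 
values share most of their high bits turns into long constant runs that a 
general-purpose compressor handles well.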





[PR] detect and ban wildcard imports in Java [lucene]

2025-06-17 Thread via GitHub


dweiss opened a new pull request, #14804:
URL: https://github.com/apache/lucene/pull/14804

   Fixes #14553.
   
   I'm not completely happy with this. For some reason, the custom formatting 
step always triggers a full spotless run; incremental mode doesn't work.
   
   ```
   > ./gradlew -p lucene/grouping/ spotlessCheck  --info
   > ./gradlew -p lucene/grouping/ spotlessCheck  --info
   ...
   > Task :lucene:grouping:spotlessJava
   Caching disabled for task ':lucene:grouping:spotlessJava' because:
     Build cache is disabled
   Task ':lucene:grouping:spotlessJava' is not up-to-date because:
     Value of input property 'stepsInternalEquality' has changed for task 
':lucene:grouping:spotlessJava'
   The input changes require a full rebuild for incremental task 
':lucene:grouping:spotlessJava'.
   Not incremental: removing prior outputs
   Resolve mutations for :lucene:grouping:spotlessJavaCheck 
(Thread[#9108,Execution worker Thread 4,5,main]) started.
   :lucene:grouping:spotlessJavaCheck (Thread[#9108,Execution worker Thread 
4,5,main]) started.
   ```
   
   The detection is also costly (regexp over the entire codebase); this could 
probably be simplified to line-by-line scanning and a heuristic to 
short-circuit early when import statements are no longer possible...
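
   A rough sketch of that line-by-line idea, assuming imports only appear 
before the first type declaration and that comments start at the beginning of 
a line. `WildcardImportScanner` is a hypothetical name and is not the code in 
this PR.

```java
import java.util.List;

/** Hypothetical helper, not the PR's implementation: line-by-line wildcard import detection. */
final class WildcardImportScanner {
  private WildcardImportScanner() {}

  /** Returns the 1-based line number of the first wildcard import, or -1 if none is found. */
  static int firstWildcardImport(List<String> lines) {
    boolean inBlockComment = false;
    for (int i = 0; i < lines.size(); i++) {
      String line = lines.get(i).strip();
      if (inBlockComment) {
        // Heuristic: nothing of interest follows a comment terminator on the same line.
        if (line.contains("*/")) {
          inBlockComment = false;
        }
        continue;
      }
      if (line.isEmpty() || line.startsWith("//")) {
        continue;
      }
      if (line.startsWith("/*")) {
        inBlockComment = !line.contains("*/");
        continue;
      }
      if (line.startsWith("package ")) {
        continue;
      }
      if (line.startsWith("import ")) {
        // Known limitation: a trailing comment after the semicolon hides the wildcard.
        if (line.endsWith(".*;")) {
          return i + 1;
        }
        continue;
      }
      // First line that is not blank, a comment, a package or an import:
      // stop here and skip the rest of the file.
      return -1;
    }
    return -1;
  }
}
```

   The early return is what makes this cheap: for a typical source file only 
the license header and the import section are read, and string literals in the 
class body that happen to look like imports are never reached.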





Re: [PR] detect and ban wildcard imports in Java [lucene]

2025-06-17 Thread via GitHub


github-actions[bot] commented on PR #14804:
URL: https://github.com/apache/lucene/pull/14804#issuecomment-2981718232

   This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. 
If the PR doesn't need a changelog entry, then add the skip-changelog label to 
it and you will stop receiving this reminder on future updates to the PR.





Re: [I] Compression cache of numeric docvalues [lucene]

2025-06-17 Thread via GitHub


rmuir commented on issue #14803:
URL: https://github.com/apache/lucene/issues/14803#issuecomment-2981632167

   IMO: just use a filesystem with this feature such as zfs.





Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-06-17 Thread via GitHub


kaivalnp commented on code in PR #14178:
URL: https://github.com/apache/lucene/pull/14178#discussion_r2153208743


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/package-info.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+/**
+ * Provides a Faiss-based vector format via {@link

Review Comment:
   Done!






Re: [I] build and push release regression [lucene]

2025-06-17 Thread via GitHub


dweiss closed issue #14786: build and push release regression
URL: https://github.com/apache/lucene/issues/14786





[PR] Remove all security manager and java security references [lucene]

2025-06-17 Thread via GitHub


dweiss opened a new pull request, #14801:
URL: https://github.com/apache/lucene/pull/14801

   these are no-ops in JDK24+.





Re: [PR] Remove all security manager and java security references [lucene]

2025-06-17 Thread via GitHub


dweiss commented on code in PR #14801:
URL: https://github.com/apache/lucene/pull/14801#discussion_r2151947884


##
build-tools/build-infra/src/main/groovy/lucene.validation.ecj-lint.gradle:
##
@@ -74,6 +76,8 @@ def lintTasks = sourceSets.collect { SourceSet sourceSet ->
 dependsOn sourceSet.compileClasspath
 dependsOn ecjConfiguration
 
+mustRunAfter tasks.withType(SpotlessApply)

Review Comment:
   piggybacking this so that 'gradlew tidy check' works properly...






Re: [PR] Adjust base knn format assert assertOffHeapByteSize [lucene]

2025-06-17 Thread via GitHub


benwtrent merged PR #14797:
URL: https://github.com/apache/lucene/pull/14797





[PR] Add a linter flag to suppress warning about incubating vector module. [lucene]

2025-06-17 Thread via GitHub


dweiss opened a new pull request, #14802:
URL: https://github.com/apache/lucene/pull/14802

   (no comment)





Re: [PR] Add a linter flag to suppress warning about incubating vector module. [lucene]

2025-06-17 Thread via GitHub


github-actions[bot] commented on PR #14802:
URL: https://github.com/apache/lucene/pull/14802#issuecomment-2979902324

   This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. 
If the PR doesn't need a changelog entry, then add the skip-changelog label to 
it and you will stop receiving this reminder on future updates to the PR.





Re: [PR] .editorconfig [lucene]

2025-06-17 Thread via GitHub


dsmiley commented on PR #14740:
URL: https://github.com/apache/lucene/pull/14740#issuecomment-2979904591

   OMG that's ironic!  @rmuir, you added it (in March), and it only configures 
Python :-)   LOL
   
   Okay... well I think that file should be removed and its Python section 
integrated into the top-level file coming with this PR.  As to the very 
specifics of what it configures, I don't care (whatever you guys say) as Python 
isn't my thing.





Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-06-17 Thread via GitHub


mikemccand commented on PR #14178:
URL: https://github.com/apache/lucene/pull/14178#issuecomment-2980130792

   > As a follow up, could you allow the 
[`mamba-org/setup-micromamba`](https://github.com/mamba-org/setup-micromamba) 
GH action to run on the Lucene repository -- so that the Faiss codec can be 
tested regularly? (we need `micromamba` to pull Faiss libraries from Conda, as 
a faster alternative to `miniconda`, `miniforge`, etc). It can be done from 
`Settings > Code and automation > Actions > General > Actions permissions`
   
   +1 for this, but I don't think I have karma for it (I don't see the Settings 
tab for `apache/lucene` repo) -- I'm not sure who does?  @dweiss maybe?  Or it 
might be we need ASF Infra help?

