[GitHub] [lucene] javanna merged pull request #12325: Parallelize knn query rewrite across slices rather than segments
javanna merged PR #12325: URL: https://github.com/apache/lucene/pull/12325 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on pull request #12325: Parallelize knn query rewrite across slices rather than segments
javanna commented on PR #12325: URL: https://github.com/apache/lucene/pull/12325#issuecomment-1563919463 Thanks @zhaih for the review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna opened a new pull request, #12335: Don't generate stacktrace for TimeExceededException
javanna opened a new pull request, #12335: URL: https://github.com/apache/lucene/pull/12335 The exception is package private and never rethrown, we can avoid generating a stacktrace for it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on pull request #12270: Don't generate stacktrace in CollectionTerminatedException
javanna commented on PR #12270: URL: https://github.com/apache/lucene/pull/12270#issuecomment-1564135251 I opened #12335 for TimeExceededException. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on a diff in pull request #12328: Optimize ConjunctionDISI.createConjunction
javanna commented on code in PR #12328: URL: https://github.com/apache/lucene/pull/12328#discussion_r1206530279 ## lucene/CHANGES.txt: ## @@ -76,6 +76,10 @@ Optimizations * GITHUB#11857, GITHUB#11859, GITHUB#11893, GITHUB#11909: Hunspell: improved suggestion performance (Peter Gromov) +* GITHUB#12160: Concurrent rewrite for AbstractKnnVectorQuery. (Kaival Parikh) Review Comment: I just pushed a fix for this to main (https://github.com/apache/lucene/commit/0ce6b9a67b0cca9338e21234fc40e70cf63fcecc), the entry it was among the Lucene 10 changes from before it was backported. You'll have a merge conflict to address around it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #12314: Multi-value support for KnnVectorField
alessandrobenedetti commented on code in PR #12314: URL: https://github.com/apache/lucene/pull/12314#discussion_r1206532595 ## lucene/core/src/java/org/apache/lucene/index/DocsWithFieldSet.java: ## @@ -22,6 +22,8 @@ import org.apache.lucene.util.FixedBitSet; import org.apache.lucene.util.RamUsageEstimator; +import java.util.Stack; Review Comment: I agree, for the time being I reverted the class and extended it. Feel free to have a look and apply/propose any change you see fit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #12314: Multi-value support for KnnVectorField
alessandrobenedetti commented on code in PR #12314: URL: https://github.com/apache/lucene/pull/12314#discussion_r1206533493 ## lucene/core/src/java/org/apache/lucene/index/DocsWithFieldSet.java: ## @@ -32,8 +34,14 @@ public final class DocsWithFieldSet extends DocIdSet { RamUsageEstimator.shallowSizeOfInstance(DocsWithFieldSet.class); private FixedBitSet set; - private int cardinality = 0; - private int lastDocId = -1; + private int docsCount = 0; + private int lastDocId = 0; // at a certain point in time this was changed to 0? why? + + private Stack valuesPerDocuments; + private int currentDocVectorsCount; + private int vectorsCount; + + private boolean multiValued = false; Review Comment: I don't have a strong opinion about this, it's a simple change though if necessary -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #12314: Multi-value support for KnnVectorField
alessandrobenedetti commented on code in PR #12314: URL: https://github.com/apache/lucene/pull/12314#discussion_r1206534495 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java: ## @@ -762,13 +776,18 @@ private void writeMeta( meta.writeVInt(DIRECT_MONOTONIC_BLOCK_SHIFT); // dense case and empty case do not need to store ordToMap mapping final DirectMonotonicWriter ordToDocWriter = - DirectMonotonicWriter.getInstance(meta, vectorData, count, DIRECT_MONOTONIC_BLOCK_SHIFT); + DirectMonotonicWriter.getInstance(meta, vectorData, vectorsCount, DIRECT_MONOTONIC_BLOCK_SHIFT); DocIdSetIterator iterator = docsWithField.iterator(); + int[] valuesPerDocument = docsWithField.getValuesPerDocument(); Review Comment: I'll check later on, happy to use a different implementation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] original-brownbear commented on a diff in pull request #12328: Optimize ConjunctionDISI.createConjunction
original-brownbear commented on code in PR #12328: URL: https://github.com/apache/lucene/pull/12328#discussion_r1206536107 ## lucene/CHANGES.txt: ## @@ -76,6 +76,10 @@ Optimizations * GITHUB#11857, GITHUB#11859, GITHUB#11893, GITHUB#11909: Hunspell: improved suggestion performance (Peter Gromov) +* GITHUB#12160: Concurrent rewrite for AbstractKnnVectorQuery. (Kaival Parikh) Review Comment: Ah thanks Luca, I pushed [3ce79c1](https://github.com/apache/lucene/pull/12328/commits/3ce79c14ba6d283342fcc5576fe7b29e0454d9af) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on a diff in pull request #12328: Optimize ConjunctionDISI.createConjunction
javanna commented on code in PR #12328: URL: https://github.com/apache/lucene/pull/12328#discussion_r1206530279 ## lucene/CHANGES.txt: ## @@ -76,6 +76,10 @@ Optimizations * GITHUB#11857, GITHUB#11859, GITHUB#11893, GITHUB#11909: Hunspell: improved suggestion performance (Peter Gromov) +* GITHUB#12160: Concurrent rewrite for AbstractKnnVectorQuery. (Kaival Parikh) Review Comment: I just pushed a fix for this to main (https://github.com/apache/lucene/commit/0ce6b9a67b0cca9338e21234fc40e70cf63fcecc), the entry was among the Lucene 10 changes from before it was backported. You'll have a merge conflict to address around it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on pull request #12314: Multi-value support for KnnVectorField
alessandrobenedetti commented on PR #12314: URL: https://github.com/apache/lucene/pull/12314#issuecomment-1564148395 I proceeded with some additional refactoring and refinements that find in the latest commits. The diff is down to 25 classes, query time has been simplified, and explicit multi-valued has been moved to transparent multi-valued. I am close to exhausting the funds (fully self-funded at the moment) my company has allocated to the project, but I'll be happy to continue with occasional discussions and reviews. If we get any external funds to continue the project, I'll let you know here. I'll follow up to a response to @jimczi in a next comment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on pull request #12314: Multi-value support for KnnVectorField
alessandrobenedetti commented on PR #12314: URL: https://github.com/apache/lucene/pull/12314#issuecomment-1564165188 > Thanks for sharing and working on a prototype @alessandrobenedetti ! > > I have additional questions and comments ;) Starting with the devil advocate but I'd like to understand once more what is the use case? One possible approach today if the use case is passage retrieval is to index one document per passage/sentence. It works fine and you have access to the exact context of the match directly. What are the benefits of moving all passages to a single document? I am not saying there are none but they must be compelling if that requires all these changes. Ale: I guess the use case for multi valued vector-fields is not much different from any other multi valued fields: you may want to avoid normalisation and the necessity to build additional complexity with multiple duplicated documents that only have the vector field as a difference. If you have simple documents with just the vector field, splitting them in passages and then aggregating them in the end of your search pipeline is not going to be that annoying, but imaging more complex situations where you have aggregations already, nested documents or just documents with many more metadata along. On top of that we got curiosity from some customers about the functionality, for this reason I felt it was a nice addition. > > Regarding the change itself, I wonder if the approach could be more explicit. Instead of hiding the new feature I'd argue that we need to follow the doc values model where single and multi-valued are separated. In practice that would mean adding this to the KnnVectorsReader: > > ``` > public abstract MultiFloatVectorValues getMultiFloatVectorValues(String field) throws IOException; > public abstract MultiByteVectorValues getMultiByteVectorValues(String field) throws IOException; > public abstract TopDocs searchMulti(String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException; > ``` > > The writer side is a bit more tricky but you see what I mean. It's a big change that impacts the implementation and expectations of many existing functions if the change is directly inserted in the existing knn field. I know that it's easy to detect if a field is single or multi-valued under the hood so we could handle this transparently. We can do that with the explicit approach too. `searchMulti` can use `search` under the hood if we detect that the field contains exactly one vector per document. So my point is that we can try to share code as much as possible internally but we don't need to hide the difference externally. That was the initial approach, it was explicit at index and query time, and they were separate code paths from the single valued use case. So it was not affecting the single valued scenario at all. Following some of the feedback here I moved to transparent approach, that prooved extremely beneficial in reducing the complexity of the pull request (at the cost of impacting potentially some single valued case scenario). I iterated a bit, so I should have reduced the impact already on the single valued case, but not to 100% I guess. > > Another challenge for this change is the possible overflow in the graph representation. If we change the ordinal value to be the offset of the multivalued position, the maximum allowed is not per doc anymore but per vector. We use int32 in the graph to represents these ordinals, which matches with the max doc limit of Lucene. Maybe not a big deal for small scale but that's something to keep in mind when implementing things like merge or even multi-segments search. Yes, with one vector per document, the maximum amount of supported vectors was in line with the docs, this is still the case but I agree that right now we potentially support less docs, happy to change that. > > Anyway, quite exciting discussions, Ben and Mayya are reviewing too so just adding my 2cents and happy to help if/when we reach a consensus. Thanks for the feedback! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on pull request #12311: Integrate the Incubating Panama Vector API
alessandrobenedetti commented on PR #12311: URL: https://github.com/apache/lucene/pull/12311#issuecomment-1564190226 Just out of curiosity, do we tolerate this sort of class in Lucene? Are some of them auto-generated? (for example lucene/core/src/java20/org/apache/lucene/util/VectorUtilPanamaProvider.java) What's the standard approach in these scenarios? Not a polemic, I am genuinely curious because they seem far from being maintainable, but I guess they are useful as they bring low level implementations goodies? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jimczi commented on pull request #12314: Multi-value support for KnnVectorField
jimczi commented on PR #12314: URL: https://github.com/apache/lucene/pull/12314#issuecomment-1564208967 > That was the initial approach, it was explicit at index and query time, and they were separate code paths from the single valued use case. So it was not affecting the single valued scenario at all. Following some of the feedback here I moved to transparent approach, that prooved extremely beneficial in reducing the complexity of the pull request (at the cost of impacting potentially some single valued case scenario). I iterated a bit, so I should have reduced the impact already on the single valued case, but not to 100% I guess. My main worry is the change to `FloatVectorValue`, moving to a multivalued iterator changes the access pattern so I don't find it right to change the interface and the meaning of the ordinals that are returned based on multivalued or not. If only `search` was exposed in the format that would be ok I think but we're exposing direct access to the document's vector so the parallel with doc values is important imo. > Yes, with one vector per document, the maximum amount of supported vectors was in line with the docs, this is still the case but I agree that right now we potentially support less docs, happy to change that. Well I don't have a good idea to change that if we expect to have an ordinal per value. Switching the ordinal in the hnsw to int64 is not really practical. Adding a sub-id to each ordinal seems also quite invasive. Again not saying that it should block the change but we need to be clear on the downside of this approach and also ensures that we have the right guards for overflows. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] original-brownbear commented on pull request #12328: Optimize ConjunctionDISI.createConjunction
original-brownbear commented on PR #12328: URL: https://github.com/apache/lucene/pull/12328#issuecomment-1564256307 npnp + thanks Luca! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna merged pull request #12328: Optimize ConjunctionDISI.createConjunction
javanna merged PR #12328: URL: https://github.com/apache/lucene/pull/12328 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on PR #12311: URL: https://github.com/apache/lucene/pull/12311#issuecomment-1564310637 Hi @alessandrobenedetti, the code shown here is indeed crazy to read, but this is more a problem of the APIs in general. The Java Vector API is very low level and you have to exactly know how lanes, species and so on work. The code written by Robert is 100% according to the javadocs guidelines. A lot of low level code (in codecs like BlockTermsReader is done like that). Also MMapDirectory indexinput look like that. They are not beatiful but optimized for performance. To me the variable names are perfectly fine vor vector code (`ab`, `a1b`,...). This is typical in that area. It won't get better with other names. The code on official JDK docs looks identical: https://docs.oracle.com/en/java/javase/20/docs/api/jdk.incubator.vector/jdk/incubator/vector/package-summary.html The arbitrary if/else constructs are a problem of underlying hardware infratstructure. It is NOT autogenerated, but follows low-level hardware specs, so there are arbitrary looking constants and if/else in it. This can be improved by moving numbers like 128 as constants, be free to make PRs! For performance reasons you should NOT split that up into too many different methods, as the code relies on escape analysis of the VM. We may split it later, but that's more a cleanup approach. An additional problem in the whole code is that it is Java version specific, so there will me multiple versions of the same code staying in different directories (java20, java21,...). Same for MMapDirectory. The extraction code for the Java APIs is special and a hack, but it is not part of Lucene's public library; it is a build tool only. Sorry for it, there's a new version now because a rewrite was needed to allow backporting and fix incomplete extraction: #12329 (this version looks much better, also technically bettrer organized). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] joegallo commented on pull request #12328: Optimize ConjunctionDISI.createConjunction
joegallo commented on PR #12328: URL: https://github.com/apache/lucene/pull/12328#issuecomment-1564371710 Does it make sense to backport this to 9.x? (or, perhaps, what is the process for doing that?) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] javanna commented on pull request #12328: Optimize ConjunctionDISI.createConjunction
javanna commented on PR #12328: URL: https://github.com/apache/lucene/pull/12328#issuecomment-1564372833 heya @joegallo that was already the plan, it's done now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] joegallo commented on pull request #12328: Optimize ConjunctionDISI.createConjunction
joegallo commented on PR #12328: URL: https://github.com/apache/lucene/pull/12328#issuecomment-1564374279 Ah, outstanding! Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on issue #12317: Option for disabling term dictionary compression
gsmiller commented on issue #12317: URL: https://github.com/apache/lucene/issues/12317#issuecomment-1564397702 @jainankitk thanks! To clarify my question a little bit, my understanding is that you'd like to explore the idea of making this compression optional based on memory usage profiling. I guess what I'm wondering is if that would ever really be an overall benefit in your system (or for our users more generally). A smaller index has a number of benefits, one of which can be improved query-time performance due to data locality benefits, such as more index remaining hot in the page cache. I'd personally rather optimize for better query-time performance than memory consumption while indexing (within reason of course), but I acknowledge different users have different needs of course. I'm just wondering if disabling this compression is something that users would actually be interested in, as I question how it might impact the query performance. (Note I'm only responding to the aspect of making this configurable, not your other points about maybe making it more efficient in some cases) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on a diff in pull request #12334: Fix searchafter query high latency when after value is out of range for segment
gsmiller commented on code in PR #12334: URL: https://github.com/apache/lucene/pull/12334#discussion_r1206813319 ## lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java: ## @@ -204,13 +200,21 @@ private void updateCompetitiveIterator() throws IOException { return; } if (reverse == false) { -encodeBottom(maxValueAsBytes); +if (queueFull) { // bottom is avilable only when queue is full + maxValueAsBytes = maxValueAsBytes == null ? new byte[bytesCount] : maxValueAsBytes; Review Comment: Do we need the lazy initialization? I thought `topValueSet` would already be set before the `NumericLeafComparator` gets constructed. Maybe I'm misunderstanding that? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit
mikemccand commented on PR #12320: URL: https://github.com/apache/lucene/pull/12320#issuecomment-1564480240 > Resolving the class naming conflicts from `main` was a bit of a hassle with an incremental git history. Woops, sorry! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on pull request #12311: Integrate the Incubating Panama Vector API
msokolov commented on PR #12311: URL: https://github.com/apache/lucene/pull/12311#issuecomment-1564492407 hm I looked more closely at the test I ran and it seems I managed to create a file full of identical vectors -- so this is going to lead to crazy results. WIll follow up once I've managed to fix the vector creation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit
mikemccand commented on code in PR #12320: URL: https://github.com/apache/lucene/pull/12320#discussion_r1206892745 ## lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java: ## @@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final int[] ints) { int utf8Upto = utf8.offset; final byte[] bytes = utf8.bytes; final int utf8Limit = utf8.offset + utf8.length; +UTF8CodePoint reuse = null; while (utf8Upto < utf8Limit) { - final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF]; - int v = 0; - switch (numBytes) { -case 1: - ints[utf32Count++] = bytes[utf8Upto++]; - continue; -case 2: - // 5 useful bits - v = bytes[utf8Upto++] & 31; - break; -case 3: - // 4 useful bits - v = bytes[utf8Upto++] & 15; - break; -case 4: - // 3 useful bits - v = bytes[utf8Upto++] & 7; - break; -default: - throw new IllegalArgumentException("invalid utf8"); - } + reuse = codePointAt(bytes, utf8Upto, reuse); + ints[utf32Count++] = reuse.codePoint; + utf8Upto += reuse.codePointBytes; +} - // TODO: this may read past utf8's limit. - final int limit = utf8Upto + numBytes - 1; - while (utf8Upto < limit) { -v = v << 6 | bytes[utf8Upto++] & 63; +return utf32Count; + } + + /** + * Computes the codepoint and codepoint length (in bytes) of the specified {@code offset} in the + * provided {@code utf8} byte array, assuming UTF8 encoding. As with other related methods in this + * class, this assumes valid UTF8 input and does not perform full UTF8 + * validation. + * + * @throws IllegalArgumentException If invalid codepoint header byte occurs or the content is Review Comment: I think we may also throw `ArrayIndexOutOfBoundException` on really badly not-UTF-8 `byte[]`? The `utf8CodeLength` array is I think length 248 (256 - 8). Also, it has a bunch of `v` in it, which I think are invalid UTF-8 first bytes, which should throw the `IllegalArgumentException`. Maybe either catch the AIOOBE and rethrow as IAE, or, soften the statement to say "throws various exceptions on invalid UTF-8, or, if the provided pos is NOT the start of a Unicode character". I don't think we want to promise we will always detect invalid UTF-8 and throw a clean exception. ## lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java: ## @@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final int[] ints) { int utf8Upto = utf8.offset; final byte[] bytes = utf8.bytes; final int utf8Limit = utf8.offset + utf8.length; +UTF8CodePoint reuse = null; while (utf8Upto < utf8Limit) { - final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF]; - int v = 0; - switch (numBytes) { -case 1: - ints[utf32Count++] = bytes[utf8Upto++]; - continue; -case 2: - // 5 useful bits - v = bytes[utf8Upto++] & 31; - break; -case 3: - // 4 useful bits - v = bytes[utf8Upto++] & 15; - break; -case 4: - // 3 useful bits - v = bytes[utf8Upto++] & 7; - break; -default: - throw new IllegalArgumentException("invalid utf8"); - } + reuse = codePointAt(bytes, utf8Upto, reuse); + ints[utf32Count++] = reuse.codePoint; + utf8Upto += reuse.codePointBytes; +} - // TODO: this may read past utf8's limit. - final int limit = utf8Upto + numBytes - 1; - while (utf8Upto < limit) { -v = v << 6 | bytes[utf8Upto++] & 63; +return utf32Count; + } + + /** + * Computes the codepoint and codepoint length (in bytes) of the specified {@code offset} in the + * provided {@code utf8} byte array, assuming UTF8 encoding. As with other related methods in this + * class, this assumes valid UTF8 input and does not perform full UTF8 + * validation. + * + * @throws IllegalArgumentException If invalid codepoint header byte occurs or the content is + * prematurely truncated. + */ + public static UTF8CodePoint codePointAt(byte[] utf8, int pos, UTF8CodePoint reuse) { +if (reuse == null) { + reuse = new UTF8CodePoint(); +} + +int leadByte = utf8[pos] & 0xFF; +int numBytes = utf8CodeLength[leadByte]; +reuse.codePointBytes = numBytes; +int v; +switch (numBytes) { + case 1 -> { +reuse.codePoint = leadByte; +return reuse; } - ints[utf32Count++] = v; + case 2 -> v = leadByte & 31; // 5 useful bits + case 3 -> v = leadByte & 15; // 4 useful bits + case 4 -> v = leadByte & 7; // 3 useful bits + default -> throw new IllegalArgumentException("invalid utf8"); Review Comment: Maybe include the `Arrays.toString(utf8)` and `pos` in the exception m
[GitHub] [lucene] uschindler commented on pull request #12311: Integrate the Incubating Panama Vector API
uschindler commented on PR #12311: URL: https://github.com/apache/lucene/pull/12311#issuecomment-1564518097 Hi, I changed te CHANGES.txt entry in main and 9.x to correctly refer to ARM's chipset feature (NEON). @rmuir asked me to correct it. See: https://en.wikipedia.org/wiki/ARM_architecture_family#Advanced_SIMD_(Neon) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12268: add BitSet.clear()
uschindler commented on PR #12268: URL: https://github.com/apache/lucene/pull/12268#issuecomment-1564522169 Hi, sorry this went out of my view. Could you please add a CHANGES.txt entry in the 9.7 part of the file? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12268: add BitSet.clear()
uschindler commented on PR #12268: URL: https://github.com/apache/lucene/pull/12268#issuecomment-1564523245 I will then press the merge button and cherry-pick it in 9.x branch for next release 9.7. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12293: Capture build scans on ge.apache.org to benefit from deep build insights
uschindler commented on PR #12293: URL: https://github.com/apache/lucene/pull/12293#issuecomment-1564543093 I am fine with the changes (mostly), but I still don't understand why this needs to be on top-top level and can't be inside the `gradle/` subfolder. Also please make it conditionally when running on 3rd party build Servers. An idea would be no use the JENKINS_URL with a regex on "apache.org". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] clayburn commented on pull request #12293: Capture build scans on ge.apache.org to benefit from deep build insights
clayburn commented on PR #12293: URL: https://github.com/apache/lucene/pull/12293#issuecomment-1564550262 > Is there a solution for 3rd party build Servers not having any CI secret. > I am fine with the changes (mostly), except. > Also please make it conditionally when running on 3rd party build Servers. An idea would be no use the JENKINS_URL with a regex on "apache.org". @uschindler - This is handled by the `publishIfAuthenticated` setting. For a 3rd party CI server (or an unathenticated local user), no publishing occurs. You can see what this looks like by just running a build on this branch without authenticating to https://ge.apache.org. The behavior when unauthenticated essentially becomes a no-op. I have another change incoming to add the license header to `ge.gradle` to fix that check. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gashutos commented on a diff in pull request #12334: Fix searchafter query high latency when after value is out of range for segment
gashutos commented on code in PR #12334: URL: https://github.com/apache/lucene/pull/12334#discussion_r1206958937 ## lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java: ## @@ -204,13 +200,21 @@ private void updateCompetitiveIterator() throws IOException { return; } if (reverse == false) { -encodeBottom(maxValueAsBytes); +if (queueFull) { // bottom is avilable only when queue is full + maxValueAsBytes = maxValueAsBytes == null ? new byte[bytesCount] : maxValueAsBytes; Review Comment: We dont know upfront at the time of construction (where currently initialization is done) that would we be needing maxValueAsBytes & minValuesAsBytes both. Like about the case, where we dont have any competitive hit collected in queue hence no bottom but has ```after``` value so the topValue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit
gsmiller commented on code in PR #12320: URL: https://github.com/apache/lucene/pull/12320#discussion_r1206966623 ## lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java: ## @@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final int[] ints) { int utf8Upto = utf8.offset; final byte[] bytes = utf8.bytes; final int utf8Limit = utf8.offset + utf8.length; +UTF8CodePoint reuse = null; while (utf8Upto < utf8Limit) { - final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF]; - int v = 0; - switch (numBytes) { -case 1: - ints[utf32Count++] = bytes[utf8Upto++]; - continue; -case 2: - // 5 useful bits - v = bytes[utf8Upto++] & 31; - break; -case 3: - // 4 useful bits - v = bytes[utf8Upto++] & 15; - break; -case 4: - // 3 useful bits - v = bytes[utf8Upto++] & 7; - break; -default: - throw new IllegalArgumentException("invalid utf8"); - } + reuse = codePointAt(bytes, utf8Upto, reuse); + ints[utf32Count++] = reuse.codePoint; + utf8Upto += reuse.codePointBytes; +} - // TODO: this may read past utf8's limit. - final int limit = utf8Upto + numBytes - 1; - while (utf8Upto < limit) { -v = v << 6 | bytes[utf8Upto++] & 63; +return utf32Count; + } + + /** + * Computes the codepoint and codepoint length (in bytes) of the specified {@code offset} in the + * provided {@code utf8} byte array, assuming UTF8 encoding. As with other related methods in this + * class, this assumes valid UTF8 input and does not perform full UTF8 + * validation. + * + * @throws IllegalArgumentException If invalid codepoint header byte occurs or the content is Review Comment: You're correct that it could AIOOBE on a particularly malformed header byte. I think the `v` business is OK since the default switch case translates that to IAE, but I agree with your suggestion to make a more general statement that this method may do all sort of terrible and unexpected things if you feed it invalid utf8 (or reference an invalid start position) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on pull request #12311: Integrate the Incubating Panama Vector API
alessandrobenedetti commented on PR #12311: URL: https://github.com/apache/lucene/pull/12311#issuecomment-1564575142 thanks @uschindler for the explanation, I appreciate the work you are doing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit
gsmiller commented on code in PR #12320: URL: https://github.com/apache/lucene/pull/12320#discussion_r1206972753 ## lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java: ## @@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final int[] ints) { int utf8Upto = utf8.offset; final byte[] bytes = utf8.bytes; final int utf8Limit = utf8.offset + utf8.length; +UTF8CodePoint reuse = null; while (utf8Upto < utf8Limit) { - final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF]; - int v = 0; - switch (numBytes) { -case 1: - ints[utf32Count++] = bytes[utf8Upto++]; - continue; -case 2: - // 5 useful bits - v = bytes[utf8Upto++] & 31; - break; -case 3: - // 4 useful bits - v = bytes[utf8Upto++] & 15; - break; -case 4: - // 3 useful bits - v = bytes[utf8Upto++] & 7; - break; -default: - throw new IllegalArgumentException("invalid utf8"); - } + reuse = codePointAt(bytes, utf8Upto, reuse); + ints[utf32Count++] = reuse.codePoint; + utf8Upto += reuse.codePointBytes; +} - // TODO: this may read past utf8's limit. - final int limit = utf8Upto + numBytes - 1; - while (utf8Upto < limit) { -v = v << 6 | bytes[utf8Upto++] & 63; +return utf32Count; + } + + /** + * Computes the codepoint and codepoint length (in bytes) of the specified {@code offset} in the + * provided {@code utf8} byte array, assuming UTF8 encoding. As with other related methods in this + * class, this assumes valid UTF8 input and does not perform full UTF8 + * validation. + * + * @throws IllegalArgumentException If invalid codepoint header byte occurs or the content is + * prematurely truncated. + */ + public static UTF8CodePoint codePointAt(byte[] utf8, int pos, UTF8CodePoint reuse) { +if (reuse == null) { + reuse = new UTF8CodePoint(); +} + +int leadByte = utf8[pos] & 0xFF; +int numBytes = utf8CodeLength[leadByte]; +reuse.codePointBytes = numBytes; +int v; +switch (numBytes) { + case 1 -> { +reuse.codePoint = leadByte; +return reuse; } - ints[utf32Count++] = v; + case 2 -> v = leadByte & 31; // 5 useful bits + case 3 -> v = leadByte & 15; // 4 useful bits + case 4 -> v = leadByte & 7; // 3 useful bits + default -> throw new IllegalArgumentException("invalid utf8"); Review Comment: How about the header byte that resulted in an illegal parse? I'm a little nervous of including the whole substring of bytes as it has unbounded length and could be a bit unwieldy? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] alessandrobenedetti commented on pull request #12314: Multi-value support for KnnVectorField
alessandrobenedetti commented on PR #12314: URL: https://github.com/apache/lucene/pull/12314#issuecomment-1564579302 > My main worry is the change to `FloatVectorValue`, moving to a multivalued iterator changes the access pattern so I don't find it right to change the interface and the meaning of the ordinals that are returned based on multivalued or not. > If only `search` was exposed in the format that would be ok I think but we're exposing direct access to the document's vector so the parallel with doc values is important Hi @jimczi, nothing in this PR is final nor I have any strong opinion about it. My main intention is to keep the PR as small and as valuable as possible, to build a common ground (and tests) to build the functionality (if nice to have, if not, it was a cool exercise and that's equally fine). In regards to your main worry, can you point me to the areas of code you don't like specifically and I can have a thought in how to modify them! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jbellis commented on pull request #12268: add BitSet.clear()
jbellis commented on PR #12268: URL: https://github.com/apache/lucene/pull/12268#issuecomment-1564588827 done! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit
gsmiller commented on PR #12320: URL: https://github.com/apache/lucene/pull/12320#issuecomment-1564595369 Thanks @mikemccand! Did a pass to address your comments. Much appreciated! I also added some testing around the minimization aspect of the automaton building. I think all feedback has been addressed at this point, but no rush on having another look. Thanks again! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller merged pull request #12331: GH#12321: Reduce visibility of StringsToAutomaton
gsmiller merged PR #12331: URL: https://github.com/apache/lucene/pull/12331 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller merged pull request #12332: GH#12321: Marked DaciukMihovAutomatonBuilder as deprecated
gsmiller merged PR #12332: URL: https://github.com/apache/lucene/pull/12332 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler merged pull request #12268: add BitSet.clear()
uschindler merged PR #12268: URL: https://github.com/apache/lucene/pull/12268 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12332: GH#12321: Marked DaciukMihovAutomatonBuilder as deprecated
uschindler commented on PR #12332: URL: https://github.com/apache/lucene/pull/12332#issuecomment-1564636151 Thanks. Was also backported to 9.x and will be released with 9.7. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #12268: add BitSet.clear()
uschindler commented on PR #12268: URL: https://github.com/apache/lucene/pull/12268#issuecomment-1564636408 Thanks. Was also backported to 9.x and will be released with 9.7. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jimczi commented on pull request #12314: Multi-value support for KnnVectorField
jimczi commented on PR #12314: URL: https://github.com/apache/lucene/pull/12314#issuecomment-1564665498 > nothing in this PR is final nor I have any strong opinion about it. Sure, we're just discussing the approach, no worries. > In regards to your main worry, can you point me to the areas of code you don't like specifically and I can have a thought in how to modify them! This change in [Float|Byte]VectorValues: ``` /** The maximum length of a vector */ public static final int MAX_DIMENSIONS = 1024; protected boolean multiValued = false; public boolean isMultiValued() { return multiValued; } /** Sole constructor */ protected FloatVectorValues() {} @@ -57,4 +63,13 @@ public final long cost() { * @return the vector value */ public abstract float[] vectorValue() throws IOException; /** * Return the document ID corresponding to the input vector id(ordinal) * @param ord vector ID(ordinal) * @return the document ID */ public int ordToDoc(int ord){ return ord; } } ``` Today the expectation is that it iterates over doc ids. This change adds an indirection that needs to be checked (`isMultivalued`) and if it's the case then the doc id is an ordinal id that needs to be transformed with `ordToDoc` . That's trappy, I am not even sure how you can advance to a doc rather than an ordinal and the code would mean different things based on whether you're working on multivalued or not. Considering how the original APIs was made for single valued I don't think we should try to sneak multi-valued into the model. That's why I propose that we add it as separated like you originally did. It's a doc values + search APIs so it needs to be usable for these two purposes in a more predictive way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on issue #12321: Can we make `DaciukMihovAutomatonBuilder` pkg-private?
gsmiller commented on issue #12321: URL: https://github.com/apache/lucene/issues/12321#issuecomment-1564712751 Merged on `main` (#12331) and also added some deprecation notices on 9.x (#12332). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller closed issue #12321: Can we make `DaciukMihovAutomatonBuilder` pkg-private?
gsmiller closed issue #12321: Can we make `DaciukMihovAutomatonBuilder` pkg-private? URL: https://github.com/apache/lucene/issues/12321 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dsmiley commented on pull request #12306: Make MAX_DIMENSIONS configurable via a system property.
dsmiley commented on PR #12306: URL: https://github.com/apache/lucene/pull/12306#issuecomment-1565223689 The PR has evolved from the first iteration. Are there remaining concerns with the PR as it is today? It shows that 2048 dimensions is tested, works, and thus is *supportable*. It would make a lot of our users happy (Lucene exists to serve them) and lure more users, and I don't see what harm it would bring. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org