date:20230526

[GitHub] [lucene] javanna merged pull request #12325: Parallelize knn query rewrite across slices rather than segments

2023-05-26 Thread via GitHub



javanna merged PR #12325:
URL: https://github.com/apache/lucene/pull/12325


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] javanna commented on pull request #12325: Parallelize knn query rewrite across slices rather than segments

2023-05-26 Thread via GitHub



javanna commented on PR #12325:
URL: https://github.com/apache/lucene/pull/12325#issuecomment-1563919463

   Thanks @zhaih for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] javanna opened a new pull request, #12335: Don't generate stacktrace for TimeExceededException

2023-05-26 Thread via GitHub



javanna opened a new pull request, #12335:
URL: https://github.com/apache/lucene/pull/12335

   The exception is package private and never rethrown, we can avoid generating 
a stacktrace for it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] javanna commented on pull request #12270: Don't generate stacktrace in CollectionTerminatedException

2023-05-26 Thread via GitHub



javanna commented on PR #12270:
URL: https://github.com/apache/lucene/pull/12270#issuecomment-1564135251

   I opened #12335 for TimeExceededException.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] javanna commented on a diff in pull request #12328: Optimize ConjunctionDISI.createConjunction

2023-05-26 Thread via GitHub



javanna commented on code in PR #12328:
URL: https://github.com/apache/lucene/pull/12328#discussion_r1206530279


##
lucene/CHANGES.txt:
##
@@ -76,6 +76,10 @@ Optimizations
 
 * GITHUB#11857, GITHUB#11859, GITHUB#11893, GITHUB#11909: Hunspell: improved 
suggestion performance (Peter Gromov)
 
+* GITHUB#12160: Concurrent rewrite for AbstractKnnVectorQuery. (Kaival Parikh)

Review Comment:
   I just pushed a fix for this to main 
(https://github.com/apache/lucene/commit/0ce6b9a67b0cca9338e21234fc40e70cf63fcecc),
 the entry it was among the Lucene 10 changes from before it was backported. 
You'll have a merge conflict to address around it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #12314: Multi-value support for KnnVectorField

2023-05-26 Thread via GitHub



alessandrobenedetti commented on code in PR #12314:
URL: https://github.com/apache/lucene/pull/12314#discussion_r1206532595


##
lucene/core/src/java/org/apache/lucene/index/DocsWithFieldSet.java:
##
@@ -22,6 +22,8 @@
 import org.apache.lucene.util.FixedBitSet;
 import org.apache.lucene.util.RamUsageEstimator;
 
+import java.util.Stack;

Review Comment:
   I agree, for the time being I reverted the class and extended it. Feel free 
to have a look and apply/propose any change you see fit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #12314: Multi-value support for KnnVectorField

2023-05-26 Thread via GitHub



alessandrobenedetti commented on code in PR #12314:
URL: https://github.com/apache/lucene/pull/12314#discussion_r1206533493


##
lucene/core/src/java/org/apache/lucene/index/DocsWithFieldSet.java:
##
@@ -32,8 +34,14 @@ public final class DocsWithFieldSet extends DocIdSet {
   RamUsageEstimator.shallowSizeOfInstance(DocsWithFieldSet.class);
 
   private FixedBitSet set;
-  private int cardinality = 0;
-  private int lastDocId = -1;
+  private int docsCount = 0;
+  private int lastDocId = 0; // at a certain point in time this was changed to 
0? why?
+  
+  private Stack valuesPerDocuments;
+  private int currentDocVectorsCount;
+  private int vectorsCount;
+  
+  private boolean multiValued = false;

Review Comment:
   I don't have a strong opinion about this, it's a simple change though if 
necessary



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on a diff in pull request #12314: Multi-value support for KnnVectorField

2023-05-26 Thread via GitHub



alessandrobenedetti commented on code in PR #12314:
URL: https://github.com/apache/lucene/pull/12314#discussion_r1206534495


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java:
##
@@ -762,13 +776,18 @@ private void writeMeta(
   meta.writeVInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
   // dense case and empty case do not need to store ordToMap mapping
   final DirectMonotonicWriter ordToDocWriter =
-  DirectMonotonicWriter.getInstance(meta, vectorData, count, 
DIRECT_MONOTONIC_BLOCK_SHIFT);
+  DirectMonotonicWriter.getInstance(meta, vectorData, vectorsCount, 
DIRECT_MONOTONIC_BLOCK_SHIFT);
   DocIdSetIterator iterator = docsWithField.iterator();
+  int[] valuesPerDocument = docsWithField.getValuesPerDocument();

Review Comment:
   I'll check later on, happy to use a different implementation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] original-brownbear commented on a diff in pull request #12328: Optimize ConjunctionDISI.createConjunction

2023-05-26 Thread via GitHub



original-brownbear commented on code in PR #12328:
URL: https://github.com/apache/lucene/pull/12328#discussion_r1206536107


##
lucene/CHANGES.txt:
##
@@ -76,6 +76,10 @@ Optimizations
 
 * GITHUB#11857, GITHUB#11859, GITHUB#11893, GITHUB#11909: Hunspell: improved 
suggestion performance (Peter Gromov)
 
+* GITHUB#12160: Concurrent rewrite for AbstractKnnVectorQuery. (Kaival Parikh)

Review Comment:
   Ah thanks Luca, I pushed 
[3ce79c1](https://github.com/apache/lucene/pull/12328/commits/3ce79c14ba6d283342fcc5576fe7b29e0454d9af)
 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] javanna commented on a diff in pull request #12328: Optimize ConjunctionDISI.createConjunction

2023-05-26 Thread via GitHub



javanna commented on code in PR #12328:
URL: https://github.com/apache/lucene/pull/12328#discussion_r1206530279


##
lucene/CHANGES.txt:
##
@@ -76,6 +76,10 @@ Optimizations
 
 * GITHUB#11857, GITHUB#11859, GITHUB#11893, GITHUB#11909: Hunspell: improved 
suggestion performance (Peter Gromov)
 
+* GITHUB#12160: Concurrent rewrite for AbstractKnnVectorQuery. (Kaival Parikh)

Review Comment:
   I just pushed a fix for this to main 
(https://github.com/apache/lucene/commit/0ce6b9a67b0cca9338e21234fc40e70cf63fcecc),
 the entry was among the Lucene 10 changes from before it was backported. 
You'll have a merge conflict to address around it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on pull request #12314: Multi-value support for KnnVectorField

2023-05-26 Thread via GitHub



alessandrobenedetti commented on PR #12314:
URL: https://github.com/apache/lucene/pull/12314#issuecomment-1564148395

   I proceeded with some additional refactoring and refinements that find in 
the latest commits.
   The diff is down to 25 classes, query time has been simplified, and explicit 
multi-valued has been moved to transparent multi-valued.
   I am close to exhausting the funds (fully self-funded at the moment) my 
company has allocated to the project, but I'll be happy to continue with 
occasional discussions and reviews.
   If we get any external funds to continue the project, I'll let you know here.
   I'll follow up to a response to @jimczi in a next comment.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on pull request #12314: Multi-value support for KnnVectorField

2023-05-26 Thread via GitHub



alessandrobenedetti commented on PR #12314:
URL: https://github.com/apache/lucene/pull/12314#issuecomment-1564165188

   > Thanks for sharing and working on a prototype @alessandrobenedetti !
   > 
   > I have additional questions and comments ;) Starting with the devil 
advocate but I'd like to understand once more what is the use case? One 
possible approach today if the use case is passage retrieval is to index one 
document per passage/sentence. It works fine and you have access to the exact 
context of the match directly. What are the benefits of moving all passages to 
a single document? I am not saying there are none but they must be compelling 
if that requires all these changes.
   
   Ale: I guess the use case for multi valued vector-fields is not much 
different from any other multi valued fields:
   you may want to avoid normalisation and the necessity to build additional 
complexity with multiple duplicated documents that only have the vector field 
as a difference.
   If you have simple documents with just the vector field, splitting them in 
passages and then aggregating them in the end of your search pipeline is not 
going to be that annoying, but imaging more complex situations where you have 
aggregations already, nested documents or just documents with many more 
metadata along.
   On top of that we got curiosity from some customers about the functionality, 
for this reason I felt it was a nice addition.
   
   > 
   > Regarding the change itself, I wonder if the approach could be more 
explicit. Instead of hiding the new feature I'd argue that we need to follow 
the doc values model where single and multi-valued are separated. In practice 
that would mean adding this to the KnnVectorsReader:
   > 
   > ```
   >   public abstract MultiFloatVectorValues getMultiFloatVectorValues(String 
field) throws IOException;
   >   public abstract MultiByteVectorValues getMultiByteVectorValues(String 
field) throws IOException;
   >   public abstract TopDocs searchMulti(String field, float[] target, int k, 
Bits acceptDocs, int visitedLimit) throws IOException;
   > ```
   > 
   > The writer side is a bit more tricky but you see what I mean. It's a big 
change that impacts the implementation and expectations of many existing 
functions if the change is directly inserted in the existing knn field. I know 
that it's easy to detect if a field is single or multi-valued under the hood so 
we could handle this transparently. We can do that with the explicit approach 
too. `searchMulti` can use `search` under the hood if we detect that the field 
contains exactly one vector per document. So my point is that we can try to 
share code as much as possible internally but we don't need to hide the 
difference externally.
   
   That was the initial approach, it was explicit at index and query time, and 
they were separate code paths from the single valued use case. So it was not 
affecting the single valued scenario at all.
   Following some of the feedback here I moved to transparent approach, that 
prooved extremely beneficial in reducing the complexity of the pull request (at 
the cost of impacting potentially some single valued case scenario).
   I iterated a bit, so I should have reduced the impact already on the single 
valued case, but not to 100% I guess.
   
   > 
   > Another challenge for this change is the possible overflow in the graph 
representation. If we change the ordinal value to be the offset of the 
multivalued position, the maximum allowed is not per doc anymore but per 
vector. We use int32 in the graph to represents these ordinals, which matches 
with the max doc limit of Lucene. Maybe not a big deal for small scale but 
that's something to keep in mind when implementing things like merge or even 
multi-segments search.
   
   Yes, with one vector per document, the maximum amount of supported vectors 
was in line with the docs, this is still the case but I agree that right now we 
potentially support less docs, happy to change that.
   > 
   > Anyway, quite exciting discussions, Ben and Mayya are reviewing too so 
just adding my 2cents and happy to help if/when we reach a consensus.
   
   Thanks for the feedback!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on pull request #12311: Integrate the Incubating Panama Vector API

2023-05-26 Thread via GitHub



alessandrobenedetti commented on PR #12311:
URL: https://github.com/apache/lucene/pull/12311#issuecomment-1564190226

   Just out of curiosity, do we tolerate this sort of class in Lucene?
   Are some of them auto-generated? (for example 
lucene/core/src/java20/org/apache/lucene/util/VectorUtilPanamaProvider.java)
   What's the standard approach in these scenarios?
   
   Not a polemic, I am genuinely curious because they seem far from being 
maintainable, but I guess they are useful as they bring low level 
implementations goodies?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jimczi commented on pull request #12314: Multi-value support for KnnVectorField

2023-05-26 Thread via GitHub



jimczi commented on PR #12314:
URL: https://github.com/apache/lucene/pull/12314#issuecomment-1564208967

   > That was the initial approach, it was explicit at index and query time, 
and they were separate code paths from the single valued use case. So it was 
not affecting the single valued scenario at all.
   Following some of the feedback here I moved to transparent approach, that 
prooved extremely beneficial in reducing the complexity of the pull request (at 
the cost of impacting potentially some single valued case scenario).
   I iterated a bit, so I should have reduced the impact already on the single 
valued case, but not to 100% I guess.
   
   My main worry is the change to `FloatVectorValue`, moving to a multivalued 
iterator changes the access pattern so I don't find it right to change the 
interface and the meaning of the ordinals that are returned based on 
multivalued or not.
   If only `search` was exposed in the format that would be ok I think but 
we're exposing direct access to the document's vector so the parallel with doc 
values is important imo. 
   
   > Yes, with one vector per document, the maximum amount of supported vectors 
was in line with the docs, this is still the case but I agree that right now we 
potentially support less docs, happy to change that.
   
   Well I don't have a good idea to change that if we expect to have an ordinal 
per value. Switching the ordinal in the hnsw to int64 is not really practical. 
Adding a sub-id to each ordinal seems also quite invasive. Again not saying 
that it should block the change but we need to be clear on the downside of this 
approach and also ensures that we have the right guards for overflows.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] original-brownbear commented on pull request #12328: Optimize ConjunctionDISI.createConjunction

2023-05-26 Thread via GitHub



original-brownbear commented on PR #12328:
URL: https://github.com/apache/lucene/pull/12328#issuecomment-1564256307

   npnp + thanks Luca!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] javanna merged pull request #12328: Optimize ConjunctionDISI.createConjunction

2023-05-26 Thread via GitHub



javanna merged PR #12328:
URL: https://github.com/apache/lucene/pull/12328


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] uschindler commented on pull request #12311: Integrate the Incubating Panama Vector API

2023-05-26 Thread via GitHub

uschindler commented on PR #12311:
URL: https://github.com/apache/lucene/pull/12311#issuecomment-1564310637

Hi @alessandrobenedetti, the code shown here is indeed crazy to read, but
this is more a problem of the APIs in general. The Java Vector API is very low
level and you have to exactly know how lanes, species and so on work. The code
written by Robert is 100% according to the javadocs guidelines.
A lot of low level code (in codecs like BlockTermsReader is done like that).
Also MMapDirectory indexinput look like that. They are not beatiful but
optimized for performance. To me the variable names are perfectly fine vor
vector code (`ab`, `a1b`,...). This is typical in that area. It won't get
better with other names.

The code on official JDK docs looks identical:
https://docs.oracle.com/en/java/javase/20/docs/api/jdk.incubator.vector/jdk/incubator/vector/package-summary.html

The arbitrary if/else constructs are a problem of underlying hardware
infratstructure. It is NOT autogenerated, but follows low-level hardware specs,
so there are arbitrary looking constants and if/else in it. This can be
improved by moving numbers like 128 as constants, be free to make PRs! For
performance reasons you should NOT split that up into too many different
methods, as the code relies on escape analysis of the VM. We may split it
later, but that's more a cleanup approach.

An additional problem in the whole code is that it is Java version specific,
so there will me multiple versions of the same code staying in different
directories (java20, java21,...). Same for MMapDirectory.

The extraction code for the Java APIs is special and a hack, but it is not
part of Lucene's public library; it is a build tool only. Sorry for it, there's
a new version now because a rewrite was needed to allow backporting and fix
incomplete extraction: #12329 (this version looks much better, also technically
bettrer organized).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] joegallo commented on pull request #12328: Optimize ConjunctionDISI.createConjunction

2023-05-26 Thread via GitHub



joegallo commented on PR #12328:
URL: https://github.com/apache/lucene/pull/12328#issuecomment-1564371710

   Does it make sense to backport this to 9.x? (or, perhaps, what is the 
process for doing that?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] javanna commented on pull request #12328: Optimize ConjunctionDISI.createConjunction

2023-05-26 Thread via GitHub



javanna commented on PR #12328:
URL: https://github.com/apache/lucene/pull/12328#issuecomment-1564372833

   heya @joegallo that was already the plan, it's done now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] joegallo commented on pull request #12328: Optimize ConjunctionDISI.createConjunction

2023-05-26 Thread via GitHub



joegallo commented on PR #12328:
URL: https://github.com/apache/lucene/pull/12328#issuecomment-1564374279

   Ah, outstanding! Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on issue #12317: Option for disabling term dictionary compression

2023-05-26 Thread via GitHub



gsmiller commented on issue #12317:
URL: https://github.com/apache/lucene/issues/12317#issuecomment-1564397702

   @jainankitk thanks! To clarify my question a little bit, my understanding is 
that you'd like to explore the idea of making this compression optional based 
on memory usage profiling. I guess what I'm wondering is if that would ever 
really be an overall benefit in your system (or for our users more generally). 
A smaller index has a number of benefits, one of which can be improved 
query-time performance due to data locality benefits, such as more index 
remaining hot in the page cache. I'd personally rather optimize for better 
query-time performance than memory consumption while indexing (within reason of 
course), but I acknowledge different users have different needs of course. I'm 
just wondering if disabling this compression is something that users would 
actually be interested in, as I question how it might impact the query 
performance.
   
   (Note I'm only responding to the aspect of making this configurable, not 
your other points about maybe making it more efficient in some cases)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on a diff in pull request #12334: Fix searchafter query high latency when after value is out of range for segment

2023-05-26 Thread via GitHub



gsmiller commented on code in PR #12334:
URL: https://github.com/apache/lucene/pull/12334#discussion_r1206813319


##
lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java:
##
@@ -204,13 +200,21 @@ private void updateCompetitiveIterator() throws 
IOException {
 return;
   }
   if (reverse == false) {
-encodeBottom(maxValueAsBytes);
+if (queueFull) { // bottom is avilable only when queue is full
+  maxValueAsBytes = maxValueAsBytes == null ? new byte[bytesCount] : 
maxValueAsBytes;

Review Comment:
   Do we need the lazy initialization? I thought `topValueSet` would already be 
set before the `NumericLeafComparator` gets constructed. Maybe I'm 
misunderstanding that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mikemccand commented on pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

2023-05-26 Thread via GitHub



mikemccand commented on PR #12320:
URL: https://github.com/apache/lucene/pull/12320#issuecomment-1564480240

   > Resolving the class naming conflicts from `main` was a bit of a hassle 
with an incremental git history.
   
   Woops, sorry!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] msokolov commented on pull request #12311: Integrate the Incubating Panama Vector API

2023-05-26 Thread via GitHub



msokolov commented on PR #12311:
URL: https://github.com/apache/lucene/pull/12311#issuecomment-1564492407

   hm I looked more closely at the test I ran and it seems I managed to create 
a file full of identical vectors -- so this is going to lead to crazy results. 
WIll follow up once I've managed to fix the vector creation


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mikemccand commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

2023-05-26 Thread via GitHub



mikemccand commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1206892745


##
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##
@@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final 
int[] ints) {
 int utf8Upto = utf8.offset;
 final byte[] bytes = utf8.bytes;
 final int utf8Limit = utf8.offset + utf8.length;
+UTF8CodePoint reuse = null;
 while (utf8Upto < utf8Limit) {
-  final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF];
-  int v = 0;
-  switch (numBytes) {
-case 1:
-  ints[utf32Count++] = bytes[utf8Upto++];
-  continue;
-case 2:
-  // 5 useful bits
-  v = bytes[utf8Upto++] & 31;
-  break;
-case 3:
-  // 4 useful bits
-  v = bytes[utf8Upto++] & 15;
-  break;
-case 4:
-  // 3 useful bits
-  v = bytes[utf8Upto++] & 7;
-  break;
-default:
-  throw new IllegalArgumentException("invalid utf8");
-  }
+  reuse = codePointAt(bytes, utf8Upto, reuse);
+  ints[utf32Count++] = reuse.codePoint;
+  utf8Upto += reuse.codePointBytes;
+}
 
-  // TODO: this may read past utf8's limit.
-  final int limit = utf8Upto + numBytes - 1;
-  while (utf8Upto < limit) {
-v = v << 6 | bytes[utf8Upto++] & 63;
+return utf32Count;
+  }
+
+  /**
+   * Computes the codepoint and codepoint length (in bytes) of the specified 
{@code offset} in the
+   * provided {@code utf8} byte array, assuming UTF8 encoding. As with other 
related methods in this
+   * class, this assumes valid UTF8 input and does not 
perform full UTF8
+   * validation.
+   *
+   * @throws IllegalArgumentException If invalid codepoint header byte occurs 
or the content is

Review Comment:
   I think we may also throw `ArrayIndexOutOfBoundException` on really badly 
not-UTF-8 `byte[]`?  The `utf8CodeLength` array is I think length 248 (256 - 
8).  Also, it has a bunch of `v` in it, which I think are invalid UTF-8 first 
bytes, which should throw the `IllegalArgumentException`.
   
   Maybe either catch the AIOOBE and rethrow as IAE, or, soften the statement 
to say "throws various exceptions on invalid UTF-8, or, if the provided pos is 
NOT the start of a Unicode character".  I don't think we want to promise we 
will always detect invalid UTF-8 and throw a clean exception.



##
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##
@@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final 
int[] ints) {
 int utf8Upto = utf8.offset;
 final byte[] bytes = utf8.bytes;
 final int utf8Limit = utf8.offset + utf8.length;
+UTF8CodePoint reuse = null;
 while (utf8Upto < utf8Limit) {
-  final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF];
-  int v = 0;
-  switch (numBytes) {
-case 1:
-  ints[utf32Count++] = bytes[utf8Upto++];
-  continue;
-case 2:
-  // 5 useful bits
-  v = bytes[utf8Upto++] & 31;
-  break;
-case 3:
-  // 4 useful bits
-  v = bytes[utf8Upto++] & 15;
-  break;
-case 4:
-  // 3 useful bits
-  v = bytes[utf8Upto++] & 7;
-  break;
-default:
-  throw new IllegalArgumentException("invalid utf8");
-  }
+  reuse = codePointAt(bytes, utf8Upto, reuse);
+  ints[utf32Count++] = reuse.codePoint;
+  utf8Upto += reuse.codePointBytes;
+}
 
-  // TODO: this may read past utf8's limit.
-  final int limit = utf8Upto + numBytes - 1;
-  while (utf8Upto < limit) {
-v = v << 6 | bytes[utf8Upto++] & 63;
+return utf32Count;
+  }
+
+  /**
+   * Computes the codepoint and codepoint length (in bytes) of the specified 
{@code offset} in the
+   * provided {@code utf8} byte array, assuming UTF8 encoding. As with other 
related methods in this
+   * class, this assumes valid UTF8 input and does not 
perform full UTF8
+   * validation.
+   *
+   * @throws IllegalArgumentException If invalid codepoint header byte occurs 
or the content is
+   * prematurely truncated.
+   */
+  public static UTF8CodePoint codePointAt(byte[] utf8, int pos, UTF8CodePoint 
reuse) {
+if (reuse == null) {
+  reuse = new UTF8CodePoint();
+}
+
+int leadByte = utf8[pos] & 0xFF;
+int numBytes = utf8CodeLength[leadByte];
+reuse.codePointBytes = numBytes;
+int v;
+switch (numBytes) {
+  case 1 -> {
+reuse.codePoint = leadByte;
+return reuse;
   }
-  ints[utf32Count++] = v;
+  case 2 -> v = leadByte & 31; // 5 useful bits
+  case 3 -> v = leadByte & 15; // 4 useful bits
+  case 4 -> v = leadByte & 7; // 3 useful bits
+  default -> throw new IllegalArgumentException("invalid utf8");

Review Comment:
   Maybe include the `Arrays.toString(utf8)` and `pos` in the exception 
m

[GitHub] [lucene] uschindler commented on pull request #12311: Integrate the Incubating Panama Vector API

2023-05-26 Thread via GitHub



uschindler commented on PR #12311:
URL: https://github.com/apache/lucene/pull/12311#issuecomment-1564518097

   Hi, I changed te CHANGES.txt entry in main and 9.x to correctly refer to 
ARM's chipset feature (NEON). @rmuir asked me to correct it. See: 
https://en.wikipedia.org/wiki/ARM_architecture_family#Advanced_SIMD_(Neon)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] uschindler commented on pull request #12268: add BitSet.clear()

2023-05-26 Thread via GitHub



uschindler commented on PR #12268:
URL: https://github.com/apache/lucene/pull/12268#issuecomment-1564522169

   Hi,
   sorry this went out of my view. Could you please add a CHANGES.txt entry in 
the 9.7 part of the file?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] uschindler commented on pull request #12268: add BitSet.clear()

2023-05-26 Thread via GitHub



uschindler commented on PR #12268:
URL: https://github.com/apache/lucene/pull/12268#issuecomment-1564523245

   I will then press the merge button and cherry-pick it in 9.x branch for next 
release 9.7.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] uschindler commented on pull request #12293: Capture build scans on ge.apache.org to benefit from deep build insights

2023-05-26 Thread via GitHub



uschindler commented on PR #12293:
URL: https://github.com/apache/lucene/pull/12293#issuecomment-1564543093

   I am fine with the changes (mostly), but I still don't understand why this 
needs to be on top-top level and can't be inside the `gradle/` subfolder.
   
   Also please make it conditionally when running on 3rd party build Servers. 
An idea would be no use the JENKINS_URL with a regex on "apache.org".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] clayburn commented on pull request #12293: Capture build scans on ge.apache.org to benefit from deep build insights

2023-05-26 Thread via GitHub



clayburn commented on PR #12293:
URL: https://github.com/apache/lucene/pull/12293#issuecomment-1564550262

   > Is there a solution for 3rd party build Servers not having any CI secret.
   
   > I am fine with the changes (mostly), except.
   
   > Also please make it conditionally when running on 3rd party build Servers. 
An idea would be no use the JENKINS_URL with a regex on "apache.org".
   
   @uschindler - This is handled by the `publishIfAuthenticated` setting. For a 
3rd party CI server (or an unathenticated local user), no publishing occurs. 
You can see what this looks like by just running a build on this branch without 
authenticating to https://ge.apache.org. The behavior when unauthenticated 
essentially becomes a no-op.
   
   I have another change incoming to add the license header to `ge.gradle` to 
fix that check.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gashutos commented on a diff in pull request #12334: Fix searchafter query high latency when after value is out of range for segment

2023-05-26 Thread via GitHub



gashutos commented on code in PR #12334:
URL: https://github.com/apache/lucene/pull/12334#discussion_r1206958937


##
lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java:
##
@@ -204,13 +200,21 @@ private void updateCompetitiveIterator() throws 
IOException {
 return;
   }
   if (reverse == false) {
-encodeBottom(maxValueAsBytes);
+if (queueFull) { // bottom is avilable only when queue is full
+  maxValueAsBytes = maxValueAsBytes == null ? new byte[bytesCount] : 
maxValueAsBytes;

Review Comment:
   We dont know upfront at the time of construction (where currently 
initialization is done) that would we be needing maxValueAsBytes & 
minValuesAsBytes both. Like about the case, where we dont have any competitive 
hit collected in queue hence no bottom but has ```after``` value so the 
topValue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

2023-05-26 Thread via GitHub



gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1206966623


##
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##
@@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final 
int[] ints) {
 int utf8Upto = utf8.offset;
 final byte[] bytes = utf8.bytes;
 final int utf8Limit = utf8.offset + utf8.length;
+UTF8CodePoint reuse = null;
 while (utf8Upto < utf8Limit) {
-  final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF];
-  int v = 0;
-  switch (numBytes) {
-case 1:
-  ints[utf32Count++] = bytes[utf8Upto++];
-  continue;
-case 2:
-  // 5 useful bits
-  v = bytes[utf8Upto++] & 31;
-  break;
-case 3:
-  // 4 useful bits
-  v = bytes[utf8Upto++] & 15;
-  break;
-case 4:
-  // 3 useful bits
-  v = bytes[utf8Upto++] & 7;
-  break;
-default:
-  throw new IllegalArgumentException("invalid utf8");
-  }
+  reuse = codePointAt(bytes, utf8Upto, reuse);
+  ints[utf32Count++] = reuse.codePoint;
+  utf8Upto += reuse.codePointBytes;
+}
 
-  // TODO: this may read past utf8's limit.
-  final int limit = utf8Upto + numBytes - 1;
-  while (utf8Upto < limit) {
-v = v << 6 | bytes[utf8Upto++] & 63;
+return utf32Count;
+  }
+
+  /**
+   * Computes the codepoint and codepoint length (in bytes) of the specified 
{@code offset} in the
+   * provided {@code utf8} byte array, assuming UTF8 encoding. As with other 
related methods in this
+   * class, this assumes valid UTF8 input and does not 
perform full UTF8
+   * validation.
+   *
+   * @throws IllegalArgumentException If invalid codepoint header byte occurs 
or the content is

Review Comment:
   You're correct that it could AIOOBE on a particularly malformed header byte. 
I think the `v` business is OK since the default switch case translates that to 
IAE, but I agree with your suggestion to make a more general statement that 
this method may do all sort of terrible and unexpected things if you feed it 
invalid utf8 (or reference an invalid start position)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on pull request #12311: Integrate the Incubating Panama Vector API

2023-05-26 Thread via GitHub



alessandrobenedetti commented on PR #12311:
URL: https://github.com/apache/lucene/pull/12311#issuecomment-1564575142

   thanks @uschindler for the explanation, I appreciate the work you are doing! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

2023-05-26 Thread via GitHub



gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1206972753


##
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##
@@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final 
int[] ints) {
 int utf8Upto = utf8.offset;
 final byte[] bytes = utf8.bytes;
 final int utf8Limit = utf8.offset + utf8.length;
+UTF8CodePoint reuse = null;
 while (utf8Upto < utf8Limit) {
-  final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF];
-  int v = 0;
-  switch (numBytes) {
-case 1:
-  ints[utf32Count++] = bytes[utf8Upto++];
-  continue;
-case 2:
-  // 5 useful bits
-  v = bytes[utf8Upto++] & 31;
-  break;
-case 3:
-  // 4 useful bits
-  v = bytes[utf8Upto++] & 15;
-  break;
-case 4:
-  // 3 useful bits
-  v = bytes[utf8Upto++] & 7;
-  break;
-default:
-  throw new IllegalArgumentException("invalid utf8");
-  }
+  reuse = codePointAt(bytes, utf8Upto, reuse);
+  ints[utf32Count++] = reuse.codePoint;
+  utf8Upto += reuse.codePointBytes;
+}
 
-  // TODO: this may read past utf8's limit.
-  final int limit = utf8Upto + numBytes - 1;
-  while (utf8Upto < limit) {
-v = v << 6 | bytes[utf8Upto++] & 63;
+return utf32Count;
+  }
+
+  /**
+   * Computes the codepoint and codepoint length (in bytes) of the specified 
{@code offset} in the
+   * provided {@code utf8} byte array, assuming UTF8 encoding. As with other 
related methods in this
+   * class, this assumes valid UTF8 input and does not 
perform full UTF8
+   * validation.
+   *
+   * @throws IllegalArgumentException If invalid codepoint header byte occurs 
or the content is
+   * prematurely truncated.
+   */
+  public static UTF8CodePoint codePointAt(byte[] utf8, int pos, UTF8CodePoint 
reuse) {
+if (reuse == null) {
+  reuse = new UTF8CodePoint();
+}
+
+int leadByte = utf8[pos] & 0xFF;
+int numBytes = utf8CodeLength[leadByte];
+reuse.codePointBytes = numBytes;
+int v;
+switch (numBytes) {
+  case 1 -> {
+reuse.codePoint = leadByte;
+return reuse;
   }
-  ints[utf32Count++] = v;
+  case 2 -> v = leadByte & 31; // 5 useful bits
+  case 3 -> v = leadByte & 15; // 4 useful bits
+  case 4 -> v = leadByte & 7; // 3 useful bits
+  default -> throw new IllegalArgumentException("invalid utf8");

Review Comment:
   How about the header byte that resulted in an illegal parse? I'm a little 
nervous of including the whole substring of bytes as it has unbounded length 
and could be a bit unwieldy?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] alessandrobenedetti commented on pull request #12314: Multi-value support for KnnVectorField

2023-05-26 Thread via GitHub



alessandrobenedetti commented on PR #12314:
URL: https://github.com/apache/lucene/pull/12314#issuecomment-1564579302

   > My main worry is the change to `FloatVectorValue`, moving to a multivalued 
iterator changes the access pattern so I don't find it right to change the 
interface and the meaning of the ordinals that are returned based on 
multivalued or not.
   > If only `search` was exposed in the format that would be ok I think but 
we're exposing direct access to the document's vector so the parallel with doc 
values is important
   
   Hi @jimczi, nothing in this PR is final nor I have any strong opinion about 
it.
   My main intention is to keep the PR as small and as valuable as possible, to 
build a common ground (and tests) to build the functionality (if nice to have, 
if not, it was a cool exercise and that's equally fine).
   
   In regards to your main worry, can you point me to the areas of code you 
don't like specifically and I can have a thought in how to modify them!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jbellis commented on pull request #12268: add BitSet.clear()

2023-05-26 Thread via GitHub



jbellis commented on PR #12268:
URL: https://github.com/apache/lucene/pull/12268#issuecomment-1564588827

   done!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

2023-05-26 Thread via GitHub



gsmiller commented on PR #12320:
URL: https://github.com/apache/lucene/pull/12320#issuecomment-1564595369

   Thanks @mikemccand! Did a pass to address your comments. Much appreciated! I 
also added some testing around the minimization aspect of the automaton 
building. I think all feedback has been addressed at this point, but no rush on 
having another look. Thanks again!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller merged pull request #12331: GH#12321: Reduce visibility of StringsToAutomaton

2023-05-26 Thread via GitHub



gsmiller merged PR #12331:
URL: https://github.com/apache/lucene/pull/12331


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller merged pull request #12332: GH#12321: Marked DaciukMihovAutomatonBuilder as deprecated

2023-05-26 Thread via GitHub



gsmiller merged PR #12332:
URL: https://github.com/apache/lucene/pull/12332


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] uschindler merged pull request #12268: add BitSet.clear()

2023-05-26 Thread via GitHub



uschindler merged PR #12268:
URL: https://github.com/apache/lucene/pull/12268


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] uschindler commented on pull request #12332: GH#12321: Marked DaciukMihovAutomatonBuilder as deprecated

2023-05-26 Thread via GitHub



uschindler commented on PR #12332:
URL: https://github.com/apache/lucene/pull/12332#issuecomment-1564636151

   Thanks. Was also backported to 9.x and will be released with 9.7.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] uschindler commented on pull request #12268: add BitSet.clear()

2023-05-26 Thread via GitHub



uschindler commented on PR #12268:
URL: https://github.com/apache/lucene/pull/12268#issuecomment-1564636408

   Thanks. Was also backported to 9.x and will be released with 9.7.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jimczi commented on pull request #12314: Multi-value support for KnnVectorField

2023-05-26 Thread via GitHub



jimczi commented on PR #12314:
URL: https://github.com/apache/lucene/pull/12314#issuecomment-1564665498

   > nothing in this PR is final nor I have any strong opinion about it.
   
   Sure, we're just discussing the approach, no worries.
   
   > In regards to your main worry, can you point me to the areas of code you 
don't like specifically and I can have a thought in how to modify them!
   
   This change in [Float|Byte]VectorValues:
   
   ```
/** The maximum length of a vector */
 public static final int MAX_DIMENSIONS = 1024;
   
 protected boolean multiValued = false;
   
 public boolean isMultiValued() {
   return multiValued;
 }
   
 /** Sole constructor */
 protected FloatVectorValues() {}
   
   @@ -57,4 +63,13 @@ public final long cost() {
  * @return the vector value
  */
 public abstract float[] vectorValue() throws IOException;
   
 /**
  * Return the document ID corresponding to the input vector id(ordinal)
  * @param ord vector ID(ordinal)
  * @return the document ID
  */
 public int ordToDoc(int ord){
   return ord;
 }
   }
   ```
   
   Today the expectation is that it iterates over doc ids. This change adds an 
indirection that needs to be checked (`isMultivalued`) and if it's the case 
then the doc id is an ordinal id that needs to be transformed with `ordToDoc` .
   That's trappy, I am not even sure how you can advance to a doc rather than 
an ordinal and the code would mean different things based on whether you're 
working on multivalued or not. Considering how the original APIs was made for 
single valued I don't think we should try to sneak multi-valued into the model. 
That's why I propose that we add it as separated like you originally did. It's 
a doc values + search APIs so it needs to be usable for these two purposes in a 
more predictive way.  
   
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on issue #12321: Can we make `DaciukMihovAutomatonBuilder` pkg-private?

2023-05-26 Thread via GitHub



gsmiller commented on issue #12321:
URL: https://github.com/apache/lucene/issues/12321#issuecomment-1564712751

   Merged on `main` (#12331) and also added some deprecation notices on 9.x 
(#12332).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller closed issue #12321: Can we make `DaciukMihovAutomatonBuilder` pkg-private?

2023-05-26 Thread via GitHub



gsmiller closed issue #12321: Can we make `DaciukMihovAutomatonBuilder` 
pkg-private?
URL: https://github.com/apache/lucene/issues/12321


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] dsmiley commented on pull request #12306: Make MAX_DIMENSIONS configurable via a system property.

2023-05-26 Thread via GitHub



dsmiley commented on PR #12306:
URL: https://github.com/apache/lucene/pull/12306#issuecomment-1565223689

   The PR has evolved from the first iteration.  Are there remaining concerns 
with the PR as it is today?  It shows that 2048 dimensions is tested, works, 
and thus is *supportable*.  It would make a lot of our users happy (Lucene 
exists to serve them) and lure more users, and I don't see what harm it would 
bring.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

46 matches

Mail list logo