[GitHub] [lucene] jpountz commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-16 Thread via GitHub


jpountz commented on PR #12139:
URL: https://github.com/apache/lucene/pull/12139#issuecomment-1433116951

   I removed type guessing by adding a new `IndexableField#invertableType` that 
can be either `TERM` or `TOKEN_STREAM`. The type guessing is now contained in 
`Field.java`. Initially I wanted to contain everything through something that 
would look more like a value type, like `StoredValue`, but fields must be able 
to customize the way that they produce their token stream, and I didn't like 
requiring `IndexableField` implementations to provide both an implementation 
for the `IndexableField` and for this abstraction that produces terms or token 
streams. I'm curious if you have thoughts on ways to make the API better @rmuir.
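
   A rough sketch of the shape described above, for illustration only and not 
the code from the PR: the field declares how it wants to be inverted, and the 
indexing side branches on that declaration instead of guessing. Only 
`InvertableType`, `TERM` and `TOKEN_STREAM` come from the comment; every other 
name (`SketchField`, `termBytes`, `tokens`, `InvertSketch`) is hypothetical, 
and the types are simplified stand-ins rather than Lucene's real classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified stand-ins for illustration; these are not Lucene's real types.
enum InvertableType { TERM, TOKEN_STREAM }

interface SketchField {
  InvertableType invertableType();

  // Consulted only when invertableType() is TERM.
  byte[] termBytes();

  // Stand-in for a token stream; consulted only for TOKEN_STREAM.
  List<String> tokens();
}

final class KeywordSketchField implements SketchField {
  private final String value;

  KeywordSketchField(String value) { this.value = value; }

  @Override public InvertableType invertableType() { return InvertableType.TERM; }
  @Override public byte[] termBytes() { return value.getBytes(StandardCharsets.UTF_8); }
  @Override public List<String> tokens() { throw new UnsupportedOperationException(); }
}

final class TextSketchField implements SketchField {
  private final String value;

  TextSketchField(String value) { this.value = value; }

  @Override public InvertableType invertableType() { return InvertableType.TOKEN_STREAM; }
  @Override public byte[] termBytes() { throw new UnsupportedOperationException(); }
  // Whitespace splitting stands in for whatever analysis the field customizes.
  @Override public List<String> tokens() { return Arrays.asList(value.split("\\s+")); }
}

final class InvertSketch {
  // The indexing side switches on the declared type; no instanceof checks
  // or value sniffing are needed.
  static List<String> invert(SketchField field) {
    switch (field.invertableType()) {
      case TERM:
        // The whole value is indexed as one term, skipping the token stream.
        return Collections.singletonList(
            new String(field.termBytes(), StandardCharsets.UTF_8));
      case TOKEN_STREAM:
        return field.tokens();
      default:
        throw new AssertionError("unknown type: " + field.invertableType());
    }
  }
}
```

   In this sketch a keyword field holding "hello world" is inverted as a 
single term, while a text field with the same value yields two tokens.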


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-16 Thread via GitHub


rmuir commented on PR #12139:
URL: https://github.com/apache/lucene/pull/12139#issuecomment-1433128283

   I'm lost, I see type guessing and an InvertableType class that does nothing. 
Maybe you forgot to 'git add' or something? 





[GitHub] [lucene] jpountz commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-16 Thread via GitHub


jpountz commented on PR #12139:
URL: https://github.com/apache/lucene/pull/12139#issuecomment-1433141266

   Yes! Sorry about that.





[GitHub] [lucene] jpountz commented on issue #11915: Make Lucene smarter about long runs of matches

2023-02-16 Thread via GitHub


jpountz commented on issue #11915:
URL: https://github.com/apache/lucene/issues/11915#issuecomment-1433171306

   Thanks for looking!
   
   > peekNextNonMatchingDocID() - 1 is guaranteed to not be a match.
   
   `peekNextNonMatchingDocID() - 1` would either be the current doc ID, or a 
match. (did you make a typo when writing that it's guaranteed *not* to be a 
match?)
   
   > But I'm wondering if it will be better for the API to just return the 
next, furthest out doc ID that we know is not going to be a match?
   
   Ideally our queries that can compute this information cheaply would do this. 
I wanted to make it an optional API so that queries such as doc-values-based 
queries wouldn't have to linearly scan until they find a non-match, which could 
often be more costly than asking other clauses to advance.
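
   A minimal sketch of what such an optional API could look like, under stated 
assumptions: only the method name `peekNextNonMatchingDocID()` comes from this 
discussion. The classes `SketchIterator` and `RangeSketchIterator`, and the 
choice of `docID() + 1` as the default, are hypothetical and are not Lucene's 
`DocIdSetIterator`.

```java
// Hypothetical sketch; not Lucene's DocIdSetIterator.
abstract class SketchIterator {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  abstract int docID();

  abstract int nextDoc();

  // Optional hint: when positioned on a document, every doc ID in
  // [docID(), returned value) is a match. The default gives no extra
  // information and costs nothing to compute, so iterators that cannot
  // answer cheaply never have to scan ahead.
  int peekNextNonMatchingDocID() {
    int doc = docID();
    return doc == NO_MORE_DOCS ? NO_MORE_DOCS : doc + 1;
  }
}

// Matches every doc ID in [min, max), so the end of the run is known for free.
final class RangeSketchIterator extends SketchIterator {
  private final int min;
  private final int max;
  private int doc = -1;

  RangeSketchIterator(int min, int max) {
    this.min = min;
    this.max = max;
  }

  @Override
  int docID() {
    return doc;
  }

  @Override
  int nextDoc() {
    if (doc == NO_MORE_DOCS) {
      return doc;
    }
    int next = Math.max(doc + 1, min);
    doc = next < max ? next : NO_MORE_DOCS;
    return doc;
  }

  @Override
  int peekNextNonMatchingDocID() {
    if (doc >= min && doc < max) {
      return max; // the whole remaining run [doc, max) matches
    }
    return super.peekNextNonMatchingDocID();
  }
}
```

   With a range of [5, 10), the iterator positioned on doc 5 reports 10, and 
`peekNextNonMatchingDocID() - 1` (doc 9) is a match, consistent with the 
reading above.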





[GitHub] [lucene] rmuir commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-16 Thread via GitHub


rmuir commented on PR #12139:
URL: https://github.com/apache/lucene/pull/12139#issuecomment-1433180423

   It's better, I'm only sad about a naming issue:
   
   * InvertableType: OK
   * InvertableType.TERM: Terrible, it isn't a Term at all, it's a BytesRef.
   * InvertableType.TOKEN_STREAM: OK
   





[GitHub] [lucene] jpountz commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-16 Thread via GitHub


jpountz commented on PR #12139:
URL: https://github.com/apache/lucene/pull/12139#issuecomment-1433184781

   Fair point, I renamed `TERM` to `BINARY`, which is consistent with 
`StoredValue` and the fact that the API on `IndexableField` is called 
`#binaryValue()`?





[GitHub] [lucene] rmuir commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-16 Thread via GitHub


rmuir commented on PR #12139:
URL: https://github.com/apache/lucene/pull/12139#issuecomment-1433193923

   Yes, better, thanks! The only thing good about the "Term" was that it did 
capture the singleton nature. I'd just suggest a small improvement to the 
javadocs for BINARY to mention that it's "a single value" or similar? 
   
   We don't want someone to pass a large UTF-8 encoded document in this way :)





[GitHub] [lucene] tylerbertrand commented on a diff in pull request #12150: Gradle optimizations

2023-02-16 Thread via GitHub


tylerbertrand commented on code in PR #12150:
URL: https://github.com/apache/lucene/pull/12150#discussion_r1108598872


##
gradle/validation/jar-checks.gradle:
##
@@ -231,7 +238,8 @@ subprojects {
   }
 }
   }
-
+  def f = new File(project.buildDir.path + "/" + outputFileName)
+  f.text = errors

Review Comment:
   Dawid is correct: for tasks to take advantage of incremental building and 
the build cache, they're required to have an output.






[GitHub] [lucene] tylerbertrand commented on a diff in pull request #12150: Gradle optimizations

2023-02-16 Thread via GitHub


tylerbertrand commented on code in PR #12150:
URL: https://github.com/apache/lucene/pull/12150#discussion_r1108635838


##
gradle/validation/jar-checks.gradle:
##
@@ -231,7 +238,8 @@ subprojects {
   }
 }
   }
-
+  def f = new File(project.buildDir.path + "/" + outputFileName)
+  f.text = errors

Review Comment:
   Fixed






[GitHub] [lucene] dnhatn commented on pull request #12147: Ensure caching all leaves from the upper tier

2023-02-16 Thread via GitHub


dnhatn commented on PR #12147:
URL: https://github.com/apache/lucene/pull/12147#issuecomment-1433551735

   @jpountz Thank you!





[GitHub] [lucene] dnhatn merged pull request #12147: Ensure caching all leaves from the upper tier

2023-02-16 Thread via GitHub


dnhatn merged PR #12147:
URL: https://github.com/apache/lucene/pull/12147





[GitHub] [lucene] dnhatn closed issue #12140: LRUQueryCache disabled for indices with more than 33 segments

2023-02-16 Thread via GitHub


dnhatn closed issue #12140: LRUQueryCache disabled for indices with more than 
33 segments
URL: https://github.com/apache/lucene/issues/12140





[GitHub] [lucene] jtibshirani merged pull request #12146: Simplify max score for kNN vector queries

2023-02-16 Thread via GitHub


jtibshirani merged PR #12146:
URL: https://github.com/apache/lucene/pull/12146





[GitHub] [lucene] jtibshirani commented on pull request #12146: Simplify max score for kNN vector queries

2023-02-16 Thread via GitHub


jtibshirani commented on PR #12146:
URL: https://github.com/apache/lucene/pull/12146#issuecomment-1433647800

   Thanks for the review!





[GitHub] [lucene] benwtrent opened a new pull request, #12152: Fix vector search doc score query bugs

2023-02-16 Thread via GitHub


benwtrent opened a new pull request, #12152:
URL: https://github.com/apache/lucene/pull/12152

   This commit fixes one major bug and has two minor performance improvements.
   
   In a pure disjunction case within the `BoolQuery` (and probably in other 
cases), the maximum score up to `NO_MORE_DOCS` is calculated. 
   
   `AbstractKnnVectorQuery.DocAndScoreQuery` was consistently adding the 
current leaf-context's docBase to the passed-in parameter. This would cause the 
`int` to roll over, and `DocAndScoreQuery` would return `0` for its highest 
score in the segment when it obviously wasn't. 
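
   A small sketch of the rollover, for illustration only: `NO_MORE_DOCS` is 
`Integer.MAX_VALUE`, so adding any positive docBase to it wraps to a negative 
number. The class and method names below are hypothetical, and the clamped 
variant is just one possible guard, not necessarily how this was fixed.

```java
final class DocBaseOverflowSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  // Mirrors the described bug: int addition wraps around silently.
  static int naiveGlobalUpTo(int docBase, int upTo) {
    return docBase + upTo;
  }

  // One possible guard (an assumption, not necessarily the actual fix):
  // add in long arithmetic, then clamp to NO_MORE_DOCS.
  static int clampedGlobalUpTo(int docBase, int upTo) {
    return (int) Math.min((long) docBase + upTo, (long) NO_MORE_DOCS);
  }

  public static void main(String[] args) {
    // The first segment (docBase 0) hides the problem; later segments expose it.
    System.out.println(naiveGlobalUpTo(0, NO_MORE_DOCS));
    System.out.println(naiveGlobalUpTo(100, NO_MORE_DOCS));
    System.out.println(clampedGlobalUpTo(100, NO_MORE_DOCS));
  }
}
```

   Because the wrapped value is negative, no scored doc ID falls below it, 
which would explain a maximum score of `0` for any segment with a non-zero 
docBase.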
   
   The two minor performance improvements are around `count` and 
`Weight#scorer`. 
   `segmentStarts` holds monotonically increasing start offsets into the scored 
documents, indexed by leaf-segment ordinal. Consequently, if the upper and 
lower `segmentStarts` values for a segment are equal, no docs match in that 
segment.
   
   Count is similarly calculated as the difference between the upper and lower 
`segmentStarts` values for the segment ordinal.
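
   A minimal sketch of both shortcuts, assuming `segmentStarts` has one entry 
per leaf plus a trailing entry equal to the total number of scored docs (that 
layout is an assumption here, and `SegmentStartsSketch` is a hypothetical 
class, not the code in this PR):

```java
final class SegmentStartsSketch {
  // Number of scored docs that fall in the segment with the given ordinal.
  static int count(int[] segmentStarts, int ord) {
    return segmentStarts[ord + 1] - segmentStarts[ord];
  }

  // Equal bounds mean an empty slice, so no scorer is needed for the segment.
  static boolean hasMatches(int[] segmentStarts, int ord) {
    return segmentStarts[ord] != segmentStarts[ord + 1];
  }

  public static void main(String[] args) {
    // Five scored docs over three segments: 3 in the first, 0 in the second,
    // 2 in the third.
    int[] segmentStarts = {0, 3, 3, 5};
    for (int ord = 0; ord < segmentStarts.length - 1; ord++) {
      System.out.println(
          "segment " + ord + ": count=" + count(segmentStarts, ord)
              + " hasMatches=" + hasMatches(segmentStarts, ord));
    }
  }
}
```

   Both checks are constant-time array lookups, so neither needs to iterate 
over the scored docs.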
   





[GitHub] [lucene] benwtrent commented on pull request #12152: Fix vector search doc score query bugs

2023-02-16 Thread via GitHub


benwtrent commented on PR #12152:
URL: https://github.com/apache/lucene/pull/12152#issuecomment-1433673847

   I see that the maxScore was fixed within: 
https://github.com/apache/lucene/pull/12146
   
   Will revert that part and simply add the tests && minor optimizations :)





[GitHub] [lucene] zhaih commented on a diff in pull request #12152: Minor vector search matching doc optimizations

2023-02-16 Thread via GitHub


zhaih commented on code in PR #12152:
URL: https://github.com/apache/lucene/pull/12152#discussion_r1109161735


##
lucene/core/src/test/org/apache/lucene/search/TestDocAndScoreQuery.java:
##
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+import static com.carrotsearch.randomizedtesting.RandomizedTest.randomFloat;
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.document.StringField;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.tests.index.RandomIndexWriter;
+import org.apache.lucene.tests.util.LuceneTestCase;
+
+public class TestDocAndScoreQuery extends LuceneTestCase {

Review Comment:
   Should we move these tests into one of the KNN query tests, since this query 
is only used by KNN queries?


