[GitHub] [lucene] jpountz commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.
jpountz commented on PR #12139: URL: https://github.com/apache/lucene/pull/12139#issuecomment-1433116951 I removed type guessing by adding a new `IndexableField#invertableType` that can be either `TERM` or `TOKEN_STREAM`. The type guessing is now contained in `Field.java`. Initially I wanted to contain everything in something that would look more like a value type, like `StoredValue`, but fields must be able to customize the way they produce their token stream, and I didn't like requiring `IndexableField` implementations to provide both an implementation of `IndexableField` and an implementation of this abstraction that produces terms or token streams. I'm curious if you have thoughts on ways to make the API better @rmuir. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.
rmuir commented on PR #12139: URL: https://github.com/apache/lucene/pull/12139#issuecomment-1433128283 I'm lost, i see type guessing and an InvertableType class that does nothing. Maybe you forgot to 'git add' or something?
[GitHub] [lucene] jpountz commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.
jpountz commented on PR #12139: URL: https://github.com/apache/lucene/pull/12139#issuecomment-1433141266 Yes! Sorry about that.
[GitHub] [lucene] jpountz commented on issue #11915: Make Lucene smarter about long runs of matches
jpountz commented on issue #11915: URL: https://github.com/apache/lucene/issues/11915#issuecomment-1433171306
Thanks for looking!
> peekNextNonMatchingDocID() - 1 is guaranteed to not be a match.
`peekNextNonMatchingDocID() - 1` would either be the current doc ID, or a match. (Did you make a typo when writing that it's guaranteed *not* to be a match?)
> But I'm wondering if it will be better for the API to just return the next, furthest out doc ID that we know is not going to be a match?
Ideally our queries that can compute this information cheaply would do this. I wanted to make it an optional API so that all queries like doc-values-based queries wouldn't have to linearly scan until they find a non-match, which could often be more costly than asking other clauses to advance.
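To make the "optional API" idea above concrete, here is a minimal sketch. It is not Lucene code: `SketchDocIterator` and `RangeSketchIterator` are hypothetical classes, and only the method name `peekNextNonMatchingDocID` comes from the discussion. The assumption is that the default returns `docID() + 1` (so `peek - 1` is the current doc ID), while an iterator that knows its run of matches cheaply can return a farther doc ID.

```java
/** Hypothetical iterator base class; names mirror the discussion but this is not Lucene code. */
abstract class SketchDocIterator {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  abstract int docID();

  abstract int advance(int target);

  /**
   * Optional hint: when positioned on a match, every doc ID in
   * [docID(), returned value) is a match. The default claims nothing
   * beyond the current doc, so implementations never have to scan.
   */
  int peekNextNonMatchingDocID() {
    int doc = docID();
    return doc == NO_MORE_DOCS ? NO_MORE_DOCS : doc + 1;
  }
}

/** Matches every doc ID in [min, maxExclusive), so the end of the run is known for free. */
final class RangeSketchIterator extends SketchDocIterator {
  private final int min;
  private final int maxExclusive;
  private int doc = -1;

  RangeSketchIterator(int min, int maxExclusive) {
    this.min = min;
    this.maxExclusive = maxExclusive;
  }

  @Override
  int docID() {
    return doc;
  }

  @Override
  int advance(int target) {
    doc = target >= maxExclusive ? NO_MORE_DOCS : Math.max(target, min);
    return doc;
  }

  @Override
  int peekNextNonMatchingDocID() {
    // Only report the end of the run while positioned inside it.
    return (doc >= min && doc < maxExclusive) ? maxExclusive : super.peekNextNonMatchingDocID();
  }
}

final class PeekSketchDemo {
  public static void main(String[] args) {
    SketchDocIterator it = new RangeSketchIterator(10, 20);
    it.advance(12);
    System.out.println(it.peekNextNonMatchingDocID()); // prints 20
  }
}
```

With this shape, a doc-values-based iterator could keep the default and avoid a linear scan, which is the trade-off described in the comment.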
[GitHub] [lucene] rmuir commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.
rmuir commented on PR #12139: URL: https://github.com/apache/lucene/pull/12139#issuecomment-1433180423
It's better, I'm only sad about a naming issue:
* InvertableType: OK
* InvertableType.TERM: Terrible, it isn't a Term at all, it's a BytesRef.
* InvertableType.TOKEN_STREAM: OK
[GitHub] [lucene] jpountz commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.
jpountz commented on PR #12139: URL: https://github.com/apache/lucene/pull/12139#issuecomment-1433184781 Fair point, I renamed `TERM` to `BINARY`, which is consistent with `StoredValue` and the fact that the API on `IndexableField` is called `#binaryValue()`?
[GitHub] [lucene] rmuir commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.
rmuir commented on PR #12139: URL: https://github.com/apache/lucene/pull/12139#issuecomment-1433193923 Yes, better, thanks! The only good thing about "Term" was that it captured the singleton nature. I'd just suggest a small improvement to the javadocs for BINARY to mention that it's "a single value" or similar. We don't want someone to pass a large UTF-8 encoded document in this way :)
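The design settled on in this thread (a field declares how it should be inverted, instead of the indexer guessing) can be sketched as follows. Everything here is a hypothetical stand-in, not Lucene's source: `InvertableTypeSketch`, `FieldSketch`, and `InverterSketch` are invented names, and a `List<String>` substitutes for a real token stream. Only the `BINARY` / `TOKEN_STREAM` split, the `invertableType` and `binaryValue` method names, and the "single value" caveat come from the comments above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical mirror of the enum discussed in the PR. */
enum InvertableTypeSketch {
  /** A single binary value indexed as-is; not meant for a large encoded document. */
  BINARY,
  /** Content that has to be analyzed into a stream of tokens. */
  TOKEN_STREAM
}

/** Hypothetical, simplified field abstraction for illustration only. */
interface FieldSketch {
  InvertableTypeSketch invertableType();

  byte[] binaryValue(); // consulted only for BINARY fields

  List<String> tokens(); // stand-in for a token stream
}

final class KeywordFieldSketch implements FieldSketch {
  private final byte[] value;

  KeywordFieldSketch(String value) {
    this.value = value.getBytes(StandardCharsets.UTF_8);
  }

  public InvertableTypeSketch invertableType() {
    return InvertableTypeSketch.BINARY;
  }

  public byte[] binaryValue() {
    return value;
  }

  public List<String> tokens() {
    throw new UnsupportedOperationException("keyword fields have no token stream");
  }
}

final class TextFieldSketch implements FieldSketch {
  private final String text;

  TextFieldSketch(String text) {
    this.text = text;
  }

  public InvertableTypeSketch invertableType() {
    return InvertableTypeSketch.TOKEN_STREAM;
  }

  public byte[] binaryValue() {
    throw new UnsupportedOperationException("text fields are analyzed");
  }

  public List<String> tokens() {
    return Arrays.asList(text.split("\\s+")); // naive whitespace "analysis"
  }
}

final class InverterSketch {
  /** Dispatches on the declared type, so no guessing is needed here. */
  static List<String> invert(FieldSketch field) {
    switch (field.invertableType()) {
      case BINARY:
        return Collections.singletonList(new String(field.binaryValue(), StandardCharsets.UTF_8));
      case TOKEN_STREAM:
        return field.tokens();
      default:
        throw new AssertionError(field.invertableType());
    }
  }

  public static void main(String[] args) {
    System.out.println(invert(new KeywordFieldSketch("lucene core"))); // [lucene core]
    System.out.println(invert(new TextFieldSketch("quick brown fox"))); // [quick, brown, fox]
  }
}
```

The keyword path never builds a token stream, which is the overhead the PR title refers to; how the real indexing chain consumes the value is not shown in this thread.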
[GitHub] [lucene] tylerbertrand commented on a diff in pull request #12150: Gradle optimizations
tylerbertrand commented on code in PR #12150: URL: https://github.com/apache/lucene/pull/12150#discussion_r1108598872
## gradle/validation/jar-checks.gradle: ##
@@ -231,7 +238,8 @@ subprojects {
 }
 }
 }
-
+ def f = new File(project.buildDir.path + "/" + outputFileName)
+ f.text = errors
Review Comment: Dawid is correct, for tasks to take advantage of incremental building and the build cache, they're required to have an output.
[GitHub] [lucene] tylerbertrand commented on a diff in pull request #12150: Gradle optimizations
tylerbertrand commented on code in PR #12150: URL: https://github.com/apache/lucene/pull/12150#discussion_r1108635838
## gradle/validation/jar-checks.gradle: ##
@@ -231,7 +238,8 @@ subprojects {
 }
 }
 }
-
+ def f = new File(project.buildDir.path + "/" + outputFileName)
+ f.text = errors
Review Comment: Fixed
[GitHub] [lucene] dnhatn commented on pull request #12147: Ensure caching all leaves from the upper tier
dnhatn commented on PR #12147: URL: https://github.com/apache/lucene/pull/12147#issuecomment-1433551735 @jpountz Thank you!
[GitHub] [lucene] dnhatn merged pull request #12147: Ensure caching all leaves from the upper tier
dnhatn merged PR #12147: URL: https://github.com/apache/lucene/pull/12147
[GitHub] [lucene] dnhatn closed issue #12140: LRUQueryCache disabled for indices with more than 33 segments
dnhatn closed issue #12140: LRUQueryCache disabled for indices with more than 33 segments URL: https://github.com/apache/lucene/issues/12140
[GitHub] [lucene] jtibshirani merged pull request #12146: Simplify max score for kNN vector queries
jtibshirani merged PR #12146: URL: https://github.com/apache/lucene/pull/12146
[GitHub] [lucene] jtibshirani commented on pull request #12146: Simplify max score for kNN vector queries
jtibshirani commented on PR #12146: URL: https://github.com/apache/lucene/pull/12146#issuecomment-1433647800 Thanks for the review!
[GitHub] [lucene] benwtrent opened a new pull request, #12152: Fix vector search doc score query bugs
benwtrent opened a new pull request, #12152: URL: https://github.com/apache/lucene/pull/12152
This commit fixes one major bug and makes two minor performance improvements.
In a pure disjunction case within the `BoolQuery` (and probably in other cases), the maximum score up to `NO_MORE_DOCS` is calculated. `AbstractKnnVectorQuery.DocAndScoreQuery` was consistently adding the current leaf context's docBase to the passed-in parameter. This would cause the `int` to overflow, and `DocAndScoreQuery` would return `0` as its highest score in the segment when it obviously wasn't.
The two minor performance improvements are around `count` and `Weight#scorer`. `segmentStarts` holds, for each leaf-segment ordinal, the monotonically increasing start offset of that segment's scored documents. Consequently, if the upper and lower `segmentStarts` values for a segment are equal, no docs match in that segment. Count is similarly calculated as the difference between the upper and lower `segmentStarts` values for the segment ordinal.
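The arithmetic in this description can be illustrated with a short sketch. It is not the code from the PR: `SegmentStartsSketch` and its methods are hypothetical, the saturating guard is one possible way to avoid the wrap-around (not necessarily the fix that was merged), and the array layout (one entry per segment plus a trailing end offset) is an assumption drawn from the "upper and lower" wording.

```java
/** Hypothetical helpers illustrating the overflow and the segmentStarts arithmetic. */
final class SegmentStartsSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Adding docBase blindly wraps around when upTo is NO_MORE_DOCS. */
  static int buggyGlobalUpTo(int upTo, int docBase) {
    return upTo + docBase;
  }

  /** One possible guard (an assumption): add in long arithmetic and saturate. */
  static int safeGlobalUpTo(int upTo, int docBase) {
    return (int) Math.min((long) upTo + docBase, NO_MORE_DOCS);
  }

  /**
   * Number of scored docs in segment ord, assuming segmentStarts has one
   * entry per segment plus a final end offset.
   */
  static int count(int[] segmentStarts, int ord) {
    return segmentStarts[ord + 1] - segmentStarts[ord];
  }

  /** Equal lower and upper bounds mean the segment has no scored docs. */
  static boolean hasMatches(int[] segmentStarts, int ord) {
    return segmentStarts[ord] != segmentStarts[ord + 1];
  }

  public static void main(String[] args) {
    System.out.println(buggyGlobalUpTo(NO_MORE_DOCS, 5)); // prints -2147483644
    System.out.println(safeGlobalUpTo(NO_MORE_DOCS, 5)); // prints 2147483647

    int[] segmentStarts = {0, 3, 3, 7}; // three segments: 3, 0 and 4 scored docs
    System.out.println(count(segmentStarts, 1)); // prints 0
    System.out.println(hasMatches(segmentStarts, 2)); // prints true
  }
}
```

The negative upper bound from the unguarded sum is the kind of value that would make a max-score lookup find no documents in range.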
[GitHub] [lucene] benwtrent commented on pull request #12152: Fix vector search doc score query bugs
benwtrent commented on PR #12152: URL: https://github.com/apache/lucene/pull/12152#issuecomment-1433673847 I see that the maxScore was fixed in https://github.com/apache/lucene/pull/12146. I will revert that part and simply add the tests and minor optimizations :)
[GitHub] [lucene] zhaih commented on a diff in pull request #12152: Minor vector search matching doc optimizations
zhaih commented on code in PR #12152: URL: https://github.com/apache/lucene/pull/12152#discussion_r1109161735
## lucene/core/src/test/org/apache/lucene/search/TestDocAndScoreQuery.java: ##
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+import static com.carrotsearch.randomizedtesting.RandomizedTest.randomFloat;
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.document.StringField;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.tests.index.RandomIndexWriter;
+import org.apache.lucene.tests.util.LuceneTestCase;
+
+public class TestDocAndScoreQuery extends LuceneTestCase {
Review Comment: Should we move these tests into one of the KNN query test classes, since this query is only used by KNN queries?