[GitHub] [lucene] LuXugang commented on a diff in pull request #12017: Aggressive `count` in BooleanWeight
LuXugang commented on code in PR #12017: URL: https://github.com/apache/lucene/pull/12017#discussion_r1049162986 ## lucene/core/src/test/org/apache/lucene/search/TestBooleanQuery.java: ## @@ -1015,6 +1015,80 @@ public void testDisjunctionRandomClausesMatchesCount() throws Exception { } } + public void testAggressiveMatchCount() throws IOException { Review Comment: addressed in https://github.com/apache/lucene/pull/12017/commits/272dec54a59dedc666cf8311c09830c94b3c5369 and https://github.com/apache/lucene/pull/12017/commits/ae3f8d67ed75724ac2662c24025c85ae7008612a -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] craigtaverner opened a new issue, #12020: Very flat polygons give incorrect 'contains' result
craigtaverner opened a new issue, #12020: URL: https://github.com/apache/lucene/issues/12020

### Description

When performing a search using a shape geometry query of relation type `QueryRelation.CONTAINS`, it is possible to get a false positive when two geometries intersect, but neither actually contains the other. This happens if the indexed geometry is a polygon that is so flat that one of its triangles is simplified to a single line segment. The bug is that the line segment currently records whether it is part of the external polygon by taking that knowledge from the first line segment of the triangle, not necessarily the part of the triangle being retained. This first line segment could be an internal line (part of the triangle, but not the polygon). The simplification code should instead take this knowledge from a line segment that is not being collapsed by the simplification.

The consequences of this bug occur later, during the contains search, when the search query deduces that the geometry is contained because the polygon is not closed. The search does not realise it intersects an outer line of the polygon because that line is not correctly marked as outer.

### Version and environment details

_No response_
[GitHub] [lucene] nosvalds opened a new issue, #12021: Large fields with large="true" can be truncated in v9+
nosvalds opened a new issue, #12021: URL: https://github.com/apache/lucene/issues/12021

### Description

## Issue

For fields using `large="true"`, large fields (which is what they are intended for) can be truncated in v9+ of Lucene. Example fieldtype definition:

```
```

## Cause

Looks like this is a bug introduced along with [LUCENE-8805](https://issues.apache.org/jira/browse/LUCENE-8805) / https://github.com/apache/lucene/issues/9849: [https://github.com/apache/lucene/blob/5a694ea26ff862ecc874ca798135073d300c2234/sol[…]r/core/src/java/org/apache/solr/search/SolrDocumentFetcher.java](https://github.com/apache/lucene/blob/5a694ea26ff862ecc874ca798135073d300c2234/solr/core/src/java/org/apache/solr/search/SolrDocumentFetcher.java#L462-L465), specifically with respect to "large" fields handling. The length in UTF-8 bytes will often be longer than the string length `value.length()`, hence the truncation.

## Fix

The fix would be: `bytesRef.length = bytesRef.bytes.length`

### Version and environment details

- Solr v9.1.0
- Lucene v9.3.0
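The length mismatch described in the issue above can be reproduced without any Solr or Lucene code. The following is a minimal standalone sketch, not the `SolrDocumentFetcher` implementation: the class and method names are hypothetical, and it only shows that using the character count as a byte length cuts off the tail of a string containing non-ASCII characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use or mirror any Solr/Lucene API.
public class Utf8LengthDemo {

    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Length(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * Simulates the bug pattern: the encoded bytes are cut at value.length()
     * (a count of UTF-16 chars) instead of at the real byte length.
     */
    static String truncateToCharCount(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        int wrongLength = Math.min(value.length(), bytes.length);
        return new String(bytes, 0, wrongLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two characters here need two bytes each in UTF-8.
        String value = "na\u00EFve caf\u00E9";
        System.out.println("chars = " + value.length());
        System.out.println("utf8 bytes = " + utf8Length(value));
        System.out.println("chars after truncation = " + truncateToCharCount(value).length());
    }
}
```

For pure ASCII input the two lengths coincide, which may be why the truncation only shows up on some documents.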
[GitHub] [lucene] craigtaverner opened a new pull request, #12022: Fix flat polygons incorrectly containing intersecting geometries
craigtaverner opened a new pull request, #12022: URL: https://github.com/apache/lucene/pull/12022

Fixes https://github.com/apache/lucene/issues/12020

### Description

When performing a search using a shape geometry query of relation type `QueryRelation.CONTAINS`, it is possible to get a false positive when two geometries intersect, but neither actually contains the other. This happens if the indexed geometry is a polygon that is so flat that one of its triangles is simplified to a single line segment. The bug is that the line segment currently records whether it is part of the external polygon by taking that knowledge from the first line segment of the triangle, not necessarily the part of the triangle being retained. This first line segment could be an internal line (part of the triangle, but not the polygon). The simplification code should instead take this knowledge from a line segment that is not being collapsed by the simplification.

The consequences of this bug occur later, during the contains search, when the search query deduces that the geometry is contained because the polygon is not closed. The search does not realise it intersects an outer line of the polygon because that line is not correctly marked as outer.

The fix is to select the correct flag from the surviving line segments instead of the first line segment.
[GitHub] [lucene] iverase commented on a diff in pull request #12022: Fix flat polygons incorrectly containing intersecting geometries
iverase commented on code in PR #12022: URL: https://github.com/apache/lucene/pull/12022#discussion_r1049521314 ## lucene/CHANGES.txt: ## @@ -68,6 +68,8 @@ Bug Fixes * LUCENE-10599: LogMergePolicy is more likely to keep merging segments until they reach the maximum merge size. (Adrien Grand) +* GITHUB#12020: Fixes bug whereby very flat polygons can incorrectly contain intersecting geometries. (Craig Taverner) + Review Comment: Could you move this entry to the Lucene 9.5 section? I am planning on backporting it, so it will make life easier.
[GitHub] [lucene] rmuir commented on issue #12021: Large fields with large="true" can be truncated in v9+
rmuir commented on issue #12021: URL: https://github.com/apache/lucene/issues/12021#issuecomment-1352977757 This looks like a bug in solr code (SolrDocumentFetcher) so I'd recommend opening a bug over at https://github.com/apache/solr
[GitHub] [lucene] Bukhtawar opened a new issue, #12023: Mechanism to interrupt long-running/resource intensive queries
Bukhtawar opened a new issue, #12023: URL: https://github.com/apache/lucene/issues/12023

### Description

As a part of https://github.com/opensearch-project/OpenSearch/issues/687 we detected that regex queries can run into tight loops for a long time. Below is the stack trace of a request for a wildcard query which consumed 100% CPU for an hour (although this was addressed in https://issues.apache.org/jira/browse/LUCENE-9981). OpenSearch has a mechanism to cancel a task on timeout, and another option we were deliberating was to send interrupts to the long-running thread if the request is consuming more memory/CPU or time, independent of the query timeout. The problem is that not all costly executions in Lucene check for interrupts the way `ExitableDirectoryReader#checkAndThrow` does:

```
private void checkAndThrow() {
  if (queryTimeout.shouldExit()) {
    throw new ExitingReaderException(
        "The request took too long to iterate over point values. Timeout: "
            + queryTimeout.toString()
            + ", PointValues="
            + in);
  } else if (Thread.interrupted()) {
    throw new ExitingReaderException(
        "Interrupted while iterating over point values. PointValues=" + in);
  }
}
```

Sharing the stack trace for the request:

```
100.2% (500.8ms out of 500ms) cpu usage by thread 'opensearch[917451917ca5731579187db45dd52853][search][T#5]'
  4/10 snapshots sharing following 36 elements
    app//org.apache.lucene.util.automaton.Operations.determinize(Operations.java:780)
    app//org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
    app//org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:245)
    app//org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:110)
    app//org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:87)
    app//org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:71)
    app//org.apache.lucene.search.WildcardQuery.<init>(WildcardQuery.java:56)
    app//org.opensearch.index.mapper.StringFieldType.wildcardQuery(StringFieldType.java:158)
    app//org.opensearch.index.query.WildcardQueryBuilder.doToQuery(WildcardQueryBuilder.java:259)
    app//org.opensearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:116)
    app//org.opensearch.index.query.BoolQueryBuilder.addBooleanClauses(BoolQueryBuilder.java:337)
    app//org.opensearch.index.query.BoolQueryBuilder.doToQuery(BoolQueryBuilder.java:321)
    app//org.opensearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:116)
    app//org.opensearch.index.query.QueryShardContext.lambda$toQuery$3(QueryShardContext.java:386)
    app//org.opensearch.index.query.QueryShardContext$$Lambda$5010/0x000801d4d840.apply(Unknown Source)
    app//org.opensearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:398)
    app//org.opensearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:385)
    app//org.opensearch.search.SearchService.parseSource(SearchService.java:903)
    app//org.opensearch.search.SearchService.createContext(SearchService.java:740)
    app//org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:442)
    app//org.opensearch.search.SearchService.access$500(SearchService.java:155)
    app//org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:415)
    app//org.opensearch.search.SearchService$2$$Lambda$4889/0x000801ce9840.get(Unknown Source)
    app//org.opensearch.search.SearchService$$Lambda$4891/0x000801ce9c40.get(Unknown Source)
    app//org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:71)
    app//org.opensearch.action.ActionRunnable$$Lambda$3741/0x0008011fa440.accept(Unknown Source)
    app//org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:86)
    app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
    app//org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78)
    app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
    app//org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57)
    app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:774)
    app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
    java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    java.base@11.0.16/java.lang.Th
```
[GitHub] [lucene] rmuir closed issue #12023: Mechanism to interrupt long-running/resource intensive queries
rmuir closed issue #12023: Mechanism to interrupt long-running/resource intensive queries URL: https://github.com/apache/lucene/issues/12023
[GitHub] [lucene] rmuir commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries
rmuir commented on issue #12023: URL: https://github.com/apache/lucene/issues/12023#issuecomment-1353025015 determinization has already been removed here. that is the problem.
[GitHub] [lucene] craigtaverner commented on a diff in pull request #12022: Fix flat polygons incorrectly containing intersecting geometries
craigtaverner commented on code in PR #12022: URL: https://github.com/apache/lucene/pull/12022#discussion_r1049613431 ## lucene/CHANGES.txt: ## @@ -68,6 +68,8 @@ Bug Fixes * LUCENE-10599: LogMergePolicy is more likely to keep merging segments until they reach the maximum merge size. (Adrien Grand) +* GITHUB#12020: Fixes bug whereby very flat polygons can incorrectly contain intersecting geometries. (Craig Taverner) + Review Comment: Done in a3f640950180d5047d4fe8204416bdcc4f46e713
[GitHub] [lucene] Bukhtawar commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries
Bukhtawar commented on issue #12023: URL: https://github.com/apache/lucene/issues/12023#issuecomment-1353034054 Thanks @rmuir, I am aware this has been addressed. This issue was primarily to gather thoughts on other possible queries similar to this one that might be expensive or run tight loops but lack interrupts or a good way to short-circuit, and on ways to cancel these based on certain heuristics like CPU/memory or time.
[GitHub] [lucene] benwtrent opened a new pull request, #12024: Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004
benwtrent opened a new pull request, #12024: URL: https://github.com/apache/lucene/pull/12024 `SimpleTextKnnVectorsReader` is used for recreation and testing. It needs to handle the new way of searching with BytesRef directly instead of always searching with `float`.
[GitHub] [lucene] rmuir commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries
rmuir commented on issue #12023: URL: https://github.com/apache/lucene/issues/12023#issuecomment-1353034922 not going to support Thread.interrupt or any nonsense like that. you already have the exitable reader: use that
[GitHub] [lucene] rmuir commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries
rmuir commented on issue #12023: URL: https://github.com/apache/lucene/issues/12023#issuecomment-1353044756 And the reason i am short with you, again, is because you still implement a garbage security model (no authentication required by default). Stop shipping insecure apps and you'll stop needing to worry about situations like this. I'm not going to workaround the insecurity of your application with a ton of complexity to lucene.
[GitHub] [lucene] jpountz commented on a diff in pull request #12024: Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004
jpountz commented on code in PR #12024: URL: https://github.com/apache/lucene/pull/12024#discussion_r1049643447 ## lucene/core/src/test/org/apache/lucene/search/TestVectorScorer.java: ## @@ -36,13 +36,26 @@ public class TestVectorScorer extends LuceneTestCase { public void testFindAll() throws IOException { +VectorEncoding encoding = + VectorEncoding.values()[random().nextInt(VectorEncoding.values().length)]; Review Comment: Nit: use RandomPicks#randomFrom?
[GitHub] [lucene] nosvalds commented on issue #12021: Large fields with large="true" can be truncated in v9+
nosvalds commented on issue #12021: URL: https://github.com/apache/lucene/issues/12021#issuecomment-1353077520 Sorry about that, looks like the code link I had was from before the split. Moved this issue here: https://issues.apache.org/jira/browse/SOLR-16589
[GitHub] [lucene] nosvalds closed issue #12021: Large fields with large="true" can be truncated in v9+
nosvalds closed issue #12021: Large fields with large="true" can be truncated in v9+ URL: https://github.com/apache/lucene/issues/12021
[GitHub] [lucene] benwtrent commented on a diff in pull request #12024: Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004
benwtrent commented on code in PR #12024: URL: https://github.com/apache/lucene/pull/12024#discussion_r1049648525 ## lucene/core/src/test/org/apache/lucene/search/TestVectorScorer.java: ## @@ -36,13 +36,26 @@ public class TestVectorScorer extends LuceneTestCase { public void testFindAll() throws IOException { +VectorEncoding encoding = + VectorEncoding.values()[random().nextInt(VectorEncoding.values().length)]; Review Comment: pushed a commit addressing this
[GitHub] [lucene] jpountz merged pull request #12024: Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004
jpountz merged PR #12024: URL: https://github.com/apache/lucene/pull/12024
[GitHub] [lucene] iverase merged pull request #12022: Fix flat polygons incorrectly containing intersecting geometries
iverase merged PR #12022: URL: https://github.com/apache/lucene/pull/12022
[GitHub] [lucene] iverase closed issue #12020: Very flat polygons give incorrect 'contains' result
iverase closed issue #12020: Very flat polygons give incorrect 'contains' result URL: https://github.com/apache/lucene/issues/12020
[GitHub] [lucene] iverase commented on pull request #12022: Fix flat polygons incorrectly containing intersecting geometries
iverase commented on PR #12022: URL: https://github.com/apache/lucene/pull/12022#issuecomment-1353129113 Thanks @craigtaverner!
[GitHub] [lucene] Bukhtawar commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries
Bukhtawar commented on issue #12023: URL: https://github.com/apache/lucene/issues/12023#issuecomment-1353148073 Maybe we can discuss the security part separately, but I agree: one idea is to detect such queries and prevent them from running in the first place. In this case (not the original issue) it was a bad query from an authenticated user. Since this specific case, and others like it where looping over terms isn't involved, cannot be addressed by `ExitableDirectoryReader` alone, we need alternatives to cancel runaway queries if there are other requests which could exhibit the same behaviour.
[GitHub] [lucene] uschindler commented on pull request #12016: Upgrade ANTLR to version 4.11.1
uschindler commented on PR #12016: URL: https://github.com/apache/lucene/pull/12016#issuecomment-1353249709 Cool, thanks for the "huge whitespace" test!
[GitHub] [lucene] reta commented on pull request #12016: Upgrade ANTLR to version 4.11.1
reta commented on PR #12016: URL: https://github.com/apache/lucene/pull/12016#issuecomment-1353254656 @rmuir @uschindler thanks a lot for HUGE help here guys!
[GitHub] [lucene] msokolov commented on a diff in pull request #11946: add similarity threshold for hnsw
msokolov commented on code in PR #11946: URL: https://github.com/apache/lucene/pull/11946#discussion_r1049804451 ## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ## @@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) { * @throws IllegalArgumentException if k is less than 1 */ public KnnVectorQuery(String field, float[] target, int k, Query filter) { +this(field, target, k, Float.NEGATIVE_INFINITY, filter); + } + + /** + * Find the k nearest documents to the target vector according to the vectors in the + * given field. target vector. + * + * @param field a field that has been indexed as a {@link KnnVectorField}. + * @param target the target of the search + * @param k the number of documents to find (the upper bound) + * @param similarityThreshold the minimum acceptable value of similarity Review Comment:

OK, with the current CR, orthogonal vectors will have a DOT_PRODUCT "score" of 0.5, which could be surprising. However, this is similar to how result scores are treated elsewhere in Lucene: their value ranges are not well-defined; the only guarantee is that higher scores are "more relevant". Practically speaking, as a user, I think I am going to have to do empirical work to know what threshold to use; these are not likely to be motivated by some a priori knowledge of what a "good" dot-product is, and given that, I'd like to just be able to work with some kind of abstracted score in a known range (0 = worst, 1 = best).

Conversely, if we were to switch to using vector similarities that correspond more directly to the underlying functions, we would have to clearly define them (today we don't actually explain this anywhere; I guess we'd need to document it) and maybe provide methods for computing them. Also they would be weird too, just in a different way. For example, how would we explain 8-bit dot-product? Would it be the 8-bit dot-product score normalized by 2^15?
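The 0.5 score for orthogonal vectors mentioned in the comment above is what one would get if a dot product `d` of unit-length vectors, which lies in [-1, 1], is mapped to [0, 1] as `(1 + d) / 2`. The sketch below assumes that mapping purely for illustration; the class and method names are hypothetical and are not Lucene API.

```java
// Hypothetical demo class; the (1 + d) / 2 scaling is an assumption made
// to illustrate the 0.5 figure discussed in the review comment.
public class ScaledDotProductDemo {

    /** Plain dot product of two equal-length vectors. */
    static float dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Maps a dot product of unit vectors from [-1, 1] to a score in [0, 1]. */
    static float scaledScore(float[] a, float[] b) {
        return (1 + dot(a, b)) / 2;
    }

    public static void main(String[] args) {
        float[] x = {1f, 0f};
        float[] y = {0f, 1f};
        float[] minusX = {-1f, 0f};
        System.out.println("identical:  " + scaledScore(x, x));      // 1.0
        System.out.println("orthogonal: " + scaledScore(x, y));      // 0.5
        System.out.println("opposite:   " + scaledScore(x, minusX)); // 0.0
    }
}
```

Under this assumption, a threshold expressed on the scaled score (say 0.75) corresponds to a raw dot product of 0.5, which is the kind of translation a user would need to do if the threshold were defined on the score rather than on the raw similarity.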
[GitHub] [lucene] msokolov commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries
msokolov commented on issue #12023: URL: https://github.com/apache/lucene/issues/12023#issuecomment-1353313592

Q: are you aware of https://github.com/apache/lucene/issues/11188?

It's a fair question whether `ExitableDirectoryReader` is adequate for catching all runaway queries. There can be cases where there is some tight loop that never visits the index, so it slips through the net, like the regex one. With sledgehammers like `Thread.interrupt` off the table (it has always been questionable and is going away from JDK altogether soonish I think) we have to address these as they arise. A useful contribution here would be some example queries that have unbounded runtimes and are not handled by the DirectoryReader approach. The example here has already been addressed, but do we have anything else?

Note: one thing we have done is to use a custom Query whose Scorer checks for timeout. That way if we somehow do a lot of work that advances a docid iterator without reading from the index we can catch and throw an EarlyTerminationException. This isn't really going to be useful in most scenarios but I mention it in case you have some wonky Query execution that might benefit.
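The idea in the note above, checking for a timeout while advancing an iterator rather than while reading from the index, can be sketched without any Lucene types. Everything below is hypothetical: a plain `PrimitiveIterator.OfInt` stands in for a docid iterator, `QueryTimedOutException` is an invented name, and the check interval is an arbitrary parameter used to keep the cost of the check low.

```java
import java.util.PrimitiveIterator;
import java.util.function.BooleanSupplier;
import java.util.stream.IntStream;

// Hypothetical sketch: wraps an int iterator and periodically polls a timeout check.
public class TimeoutCheckingIterator implements PrimitiveIterator.OfInt {

    /** Hypothetical unchecked exception signalling that the deadline passed. */
    public static class QueryTimedOutException extends RuntimeException {
        public QueryTimedOutException(String message) {
            super(message);
        }
    }

    private final PrimitiveIterator.OfInt in;
    private final BooleanSupplier shouldExit;
    private final int checkInterval;
    private int calls;

    public TimeoutCheckingIterator(
            PrimitiveIterator.OfInt in, BooleanSupplier shouldExit, int checkInterval) {
        if (checkInterval < 1) {
            throw new IllegalArgumentException("checkInterval must be >= 1");
        }
        this.in = in;
        this.shouldExit = shouldExit;
        this.checkInterval = checkInterval;
    }

    @Override
    public boolean hasNext() {
        return in.hasNext();
    }

    @Override
    public int nextInt() {
        // Poll the timeout only on every checkInterval-th call.
        if (++calls % checkInterval == 0 && shouldExit.getAsBoolean()) {
            throw new QueryTimedOutException("timed out after " + calls + " advances");
        }
        return in.nextInt();
    }

    public static void main(String[] args) {
        long deadline = System.nanoTime() + 1_000_000L; // 1 ms budget
        PrimitiveIterator.OfInt it = new TimeoutCheckingIterator(
                IntStream.iterate(0, i -> i + 1).iterator(), // unbounded work
                () -> System.nanoTime() > deadline,
                1024);
        try {
            while (it.hasNext()) {
                it.nextInt();
            }
        } catch (QueryTimedOutException e) {
            System.out.println("stopped: " + e.getMessage());
        }
    }
}
```

Polling only every N advances is a design choice to amortize the cost of the clock read; it does nothing for loops that never advance the wrapped iterator, which is the limitation the thread is discussing.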
[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw
benwtrent commented on code in PR #11946: URL: https://github.com/apache/lucene/pull/11946#discussion_r1049904819
## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ##
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
    * @throws IllegalArgumentException if k is less than 1
    */
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+    this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the k nearest documents to the target vector according to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity
Review Comment:
Tl;dr Thank you for bearing with me! I think this is a good change. I would be happy with the JavaDocs, etc. clearly indicating that this threshold relates to the un-boosted vector score, not the raw similarity calculation. Dot-product, cosine, and euclidean are well-defined concepts outside of Lucene. Lucene mangles (for undoubtedly good reasons) the output of these similarities in undocumented ways to fit within boundaries.
> with the current CR,
I don't know what `CR` means. Change request?
> However, this is similar to how result scores are treated elsewhere in Lucene - their value ranges are not well-defined;
Agreed, ranges are usually predicated on term statistics, etc. and can potentially be considered "unbounded" as the corpus changes. However, does Lucene require that all unboosted BM25 scores are between 0 and 1? It does seem like an "arbitrary" decision (to me; I don't know the full breadth of Lucene optimizations, etc. when it comes to scores) to restrict vector similarity in this way. But that is a broader conversation. I have some learning to do.
> I guess practically speaking, as a user, I think I am going to have to do empirical work to know what threshold to use; these are not likely going to be motivated by some a priori knowledge of what a "good" dot-product is
I would argue that a user could have a priori knowledge here. Think of the use case where the user knows the model used to make the vectors. At that point, they 100% know what is considered relevant based on their loss function and training + test data. They could choose a dot-product or cosine threshold that fits within the 90th percentile or something, given their test data results. I agree that this would be different if users were using an "off the shelf" model. In that case, they would probably require hybrid search, combining with BM25 to get anything like relevant results (boosting various queries accordingly), and would thus learn what settings are required in an unfiltered case.
> if we were to switch to using vector similarities that would correspond more directly to the underlying functions, we would have to clearly define them
Cosine, dot-product, and euclidean are all already well defined. The functions to calculate them are universally recognized. Where Lucene separates itself is the manipulation of the similarity output to fit into the range [0, 1]. I guess this is the cost of doing business in Lucene. I am not suggesting that all scoring of vector document searches change. Simply that "similarity" and "score" are related, but are different things.
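One way to read the "90th percentile" suggestion above is sketched below: given raw similarities measured on query/document pairs that the test data marks as relevant, pick the largest threshold that still admits a chosen fraction of them. This is an interpretation rather than anything stated in the comment, and the class name `ThresholdSketch`, the method `thresholdKeeping`, and the sample values are all hypothetical.

```java
import java.util.Arrays;

final class ThresholdSketch {

  /**
   * Returns the largest threshold t such that at least keepFraction of the given
   * similarities are >= t. Assumes higher similarity means more relevant.
   */
  static double thresholdKeeping(double[] relevantSimilarities, double keepFraction) {
    if (relevantSimilarities.length == 0) {
      throw new IllegalArgumentException("need at least one similarity");
    }
    if (!(keepFraction > 0 && keepFraction <= 1)) {
      throw new IllegalArgumentException("keepFraction must be in (0, 1]");
    }
    double[] sorted = relevantSimilarities.clone();
    Arrays.sort(sorted); // ascending
    // Number of values that must remain at or above the threshold.
    int mustKeep = (int) Math.ceil(keepFraction * sorted.length);
    return sorted[sorted.length - mustKeep];
  }

  public static void main(String[] args) {
    // Made-up cosine similarities for pairs labelled relevant in a test set.
    double[] sims = {0.91, 0.85, 0.72, 0.95, 0.88, 0.79, 0.93, 0.81, 0.90, 0.87};
    System.out.println("threshold keeping 90%: " + thresholdKeeping(sims, 0.9));
  }
}
```

The result is a raw similarity, so whether it can be passed directly as `similarityThreshold` depends on whether that parameter is defined on the raw value or on the scaled score, which is the question the thread is debating.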
[GitHub] [lucene] msokolov commented on a diff in pull request #11946: add similarity threshold for hnsw
msokolov commented on code in PR #11946: URL: https://github.com/apache/lucene/pull/11946#discussion_r1050171394
## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ##
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
    * @throws IllegalArgumentException if k is less than 1
    */
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+    this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the k nearest documents to the target vector according to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity
Review Comment:
> I don't know what CR means. Change request?
Sorry, yes, like a PR but from a parallel universe (code review, actually).
So .. theoretical considerations aside, what's the alternative here -- we would treat the threshold as a "vector similarity" and internally convert it to a score. I mean, that seems to make sense -- all the conversions are invertible, right? I think we'd want to add a normalize method to VectorSimilarity for this internal use.
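A rough sketch of what such a conversion could look like. The formulas are an assumption about how Lucene 9.x scales raw values into scores, namely `(1 + x) / 2` for dot product and cosine and `1 / (1 + d)` for squared Euclidean distance `d`; they should be checked against `VectorSimilarityFunction` before being relied on. The `Scaling` enum and the `normalize`/`denormalize` names are hypothetical and are not existing Lucene methods.

```java
/** Hypothetical sketch of invertible score scaling; not Lucene's VectorSimilarityFunction. */
final class SimilarityScalingSketch {

  enum Scaling {
    DOT_PRODUCT {
      @Override
      float normalize(float raw) {
        return (1 + raw) / 2;
      }

      @Override
      float denormalize(float score) {
        return 2 * score - 1;
      }
    },
    COSINE {
      @Override
      float normalize(float raw) {
        return (1 + raw) / 2;
      }

      @Override
      float denormalize(float score) {
        return 2 * score - 1;
      }
    },
    EUCLIDEAN {
      // Here "raw" is the squared distance, so a larger raw value means a lower score.
      @Override
      float normalize(float raw) {
        return 1 / (1 + raw);
      }

      @Override
      float denormalize(float score) {
        return 1 / score - 1;
      }
    };

    /** Maps a raw similarity (or squared distance) to the scaled score. */
    abstract float normalize(float raw);

    /** Inverse of normalize. */
    abstract float denormalize(float score);
  }

  public static void main(String[] args) {
    float raw = 0.8f;
    for (Scaling s : Scaling.values()) {
      float score = s.normalize(raw);
      System.out.println(s + ": raw=" + raw + " score=" + score + " back=" + s.denormalize(score));
    }
  }
}
```

Under these assumed formulas each mapping is strictly monotonic and therefore invertible, but the Euclidean one reverses the ordering: a minimum score corresponds to a maximum squared distance, which any user-facing threshold would need to spell out.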
[GitHub] [lucene] rmuir commented on a diff in pull request #11946: add similarity threshold for hnsw
rmuir commented on code in PR #11946: URL: https://github.com/apache/lucene/pull/11946#discussion_r1050228933
## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ##
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
    * @throws IllegalArgumentException if k is less than 1
    */
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+    this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the k nearest documents to the target vector according to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity
Review Comment:
Ben, for the normal scoring, you can look at the tests for the similarities package. None of these have any 0 to 1 range or anything like that. Instead, the requirements are that the score increases semimonotonically as term frequency increases, decreases wrt document length, etc. These guarantees allow optimizations such as block-max WAND to be applied safely. But there's no defined range at all. Instead, there are lots of crazy floating point hacks so that we can safely get really good performance.
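The properties described above can be illustrated on a textbook BM25 term score. This is a sketch of the classic formula with assumed parameters `k1 = 1.2` and `b = 0.75`; it is not Lucene's `BM25Similarity` code, and the class `Bm25Sketch` and its method names are hypothetical. It shows a score that has no fixed upper bound such as 1, yet grows with term frequency and shrinks as the document gets longer.

```java
/** Textbook BM25 for a single term; a sketch, not Lucene's BM25Similarity. */
final class Bm25Sketch {
  static final double K1 = 1.2; // assumed term-frequency saturation parameter
  static final double B = 0.75; // assumed length-normalization parameter

  /** Inverse document frequency; grows without bound as the corpus grows. */
  static double idf(long docCount, long docFreq) {
    return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  }

  /** Score of one term in one document. */
  static double score(double tf, double docLen, double avgDocLen, long docCount, long docFreq) {
    double lengthNorm = K1 * (1 - B + B * docLen / avgDocLen);
    return idf(docCount, docFreq) * tf * (K1 + 1) / (tf + lengthNorm);
  }

  public static void main(String[] args) {
    long docCount = 1_000_000;
    long docFreq = 10;
    System.out.println("tf=1, len=100: " + score(1, 100, 100, docCount, docFreq));
    System.out.println("tf=2, len=100: " + score(2, 100, 100, docCount, docFreq));
    System.out.println("tf=1, len=200: " + score(1, 200, 100, docCount, docFreq));
  }
}
```

Because the idf factor depends on corpus statistics, the absolute value of the score shifts as documents are added; only the ordering guarantees stay fixed.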
[GitHub] [lucene] rmuir commented on pull request #12016: Upgrade ANTLR to version 4.11.1
rmuir commented on PR #12016: URL: https://github.com/apache/lucene/pull/12016#issuecomment-1354141954 I forced regeneration with `./gradlew -p lucene/expressions regenerate --rerun-tasks` just to ensure there were no source code changes and regeneration is idempotent / reproducible.
[GitHub] [lucene] rmuir merged pull request #12016: Upgrade ANTLR to version 4.11.1
rmuir merged PR #12016: URL: https://github.com/apache/lucene/pull/12016
[GitHub] [lucene] rmuir closed issue #11788: Upgrade ANTLR to version 4.11.1
rmuir closed issue #11788: Upgrade ANTLR to version 4.11.1 URL: https://github.com/apache/lucene/issues/11788
[GitHub] [lucene] rmuir commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries
rmuir commented on issue #12023: URL: https://github.com/apache/lucene/issues/12023#issuecomment-1354168057
> Maybe will discuss the security part separately, but agree, one idea is to detect such queries and prevent running these queries in the first place; in this case (not the original issue) it was a bad query from an authenticated user.
> Since this specific case and the likes of these cannot be addressed by `ExitableDirectoryReaders` alone where looping over terms isn't involved, we need alternatives to cancel runaway queries if there are other requests which could exhibit a similar behaviour (tight loops consuming resources)
It isn't separate. Look at the actual regexes: these are not normal user queries, they are malicious, constructed purposefully to cause problems. That's why the issue is improper security (e.g. authentication, audit logging etc). With these in place, if someone tries to run slow searches you will be able to attribute the malicious action to that human, hang, draw and quarter them, or whatever it is you want to do.
[GitHub] [lucene] Bukhtawar commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries
Bukhtawar commented on issue #12023: URL: https://github.com/apache/lucene/issues/12023#issuecomment-1354235380 This specific case, although the stack trace might appear the same, wasn't a regex query but a wildcard and fuzzy query. I will share the redacted request.