[GitHub] [lucene] LuXugang commented on a diff in pull request #12017: Aggressive `count` in BooleanWeight

2022-12-15 Thread GitBox


LuXugang commented on code in PR #12017:
URL: https://github.com/apache/lucene/pull/12017#discussion_r1049162986


##
lucene/core/src/test/org/apache/lucene/search/TestBooleanQuery.java:
##
@@ -1015,6 +1015,80 @@ public void testDisjunctionRandomClausesMatchesCount() 
throws Exception {
 }
   }
 
+  public void testAggressiveMatchCount() throws IOException {

Review Comment:
   addressed in 
https://github.com/apache/lucene/pull/12017/commits/272dec54a59dedc666cf8311c09830c94b3c5369
 and 
https://github.com/apache/lucene/pull/12017/commits/ae3f8d67ed75724ac2662c24025c85ae7008612a



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] craigtaverner opened a new issue, #12020: Very flat polygons give incorrect 'contains' result

2022-12-15 Thread GitBox


craigtaverner opened a new issue, #12020:
URL: https://github.com/apache/lucene/issues/12020

   ### Description
   
   When performing a search using a shape geometry query of relation type 
`QueryRelation.CONTAINS`, it is possible to get a false positive when two 
geometries intersect, but neither actually contains the other. This happens if 
the indexed geometry is a polygon that is so flat that one of its triangles is 
simplified to a single line segment. The bug is that the line segment currently 
records whether it is part of the external polygon by taking that knowledge 
from the first line segment of the triangle, not necessarily the part of the 
triangle being retained. This first line segment could be an internal line 
(part of the triangle, but not the polygon). The simplification code should 
instead take this knowledge from a line segment that is not being collapsed by 
the simplification. The consequences of this bug occur later, during the 
contains search, when the search query deduces that the geometry is contained 
because the polygon is not closed. The search does not realise it intersects an 
outer line of the polygon because that line is not correctly marked as outer.
   
   ### Version and environment details
   
   _No response_





[GitHub] [lucene] nosvalds opened a new issue, #12021: Large fields with large="true" can be truncated in v9+

2022-12-15 Thread GitBox


nosvalds opened a new issue, #12021:
URL: https://github.com/apache/lucene/issues/12021

   ### Description
   
   ## Issue
   
   For fields using `large="true"`, large fields (which is what they are 
intended for) can be truncated in v9+ of Lucene.
   
   Example fieldtype definition:
   ```
   
   ```
   
   ## Cause
   Looks like this is a bug introduced along with 
[LUCENE-8805](https://issues.apache.org/jira/browse/LUCENE-8805) / 
https://github.com/apache/lucene/issues/9849:
   
   
[https://github.com/apache/lucene/blob/5a694ea26ff862ecc874ca798135073d300c2234/sol[…]r/core/src/java/org/apache/solr/search/SolrDocumentFetcher.java](https://github.com/apache/lucene/blob/5a694ea26ff862ecc874ca798135073d300c2234/solr/core/src/java/org/apache/solr/search/SolrDocumentFetcher.java#L462-L465)
   
   Specifically with respect to "large" fields handling.
   
   The length in utf8 bytes will often be longer than the string length 
`value.length()`, hence the truncation.
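
The mismatch described above can be illustrated with a minimal, self-contained sketch in plain Java. This is not the Solr code: `Utf8LengthDemo` and its methods are hypothetical names, and `truncateToCharCount` only simulates what happens when the character count is used as the byte length of a UTF-8 buffer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene or Solr.
final class Utf8LengthDemo {

  /** Number of bytes needed to encode {@code value} as UTF-8. */
  static int utf8Length(String value) {
    return value.getBytes(StandardCharsets.UTF_8).length;
  }

  /**
   * Simulates the reported truncation: only value.length() bytes of the
   * UTF-8 encoding are kept, although the encoding may be longer.
   */
  static String truncateToCharCount(String value) {
    byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
    int wrongLength = Math.min(value.length(), utf8.length);
    return new String(utf8, 0, wrongLength, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String value = "caf\u00e9s"; // 5 chars; the accented letter takes 2 bytes in UTF-8
    System.out.println(value.length());
    System.out.println(utf8Length(value));
    System.out.println(truncateToCharCount(value));
  }
}
```

For pure ASCII content the two lengths agree, which may be why the problem only shows up on fields containing non-ASCII text.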
   
   ## Fix
   
   The fix would be:
   
   `bytesRef.length = bytesRef.bytes.length`
   
   ### Version and environment details
   
   - Solr v9.1.0
   - Lucene v9.3.0





[GitHub] [lucene] craigtaverner opened a new pull request, #12022: Fix flat polygons incorrectly containing intersecting geometries

2022-12-15 Thread GitBox


craigtaverner opened a new pull request, #12022:
URL: https://github.com/apache/lucene/pull/12022

   Fixes https://github.com/apache/lucene/issues/12020
   
   ### Description
   
   When performing a search using a shape geometry query of relation type 
`QueryRelation.CONTAINS`, it is possible to get a false positive when two 
geometries intersect, but neither actually contains the other. This happens if 
the indexed geometry is a polygon that is so flat that one of its triangles is 
simplified to a single line segment. The bug is that the line segment currently 
records whether it is part of the external polygon by taking that knowledge 
from the first line segment of the triangle, not necessarily the part of the 
triangle being retained. This first line segment could be an internal line 
(part of the triangle, but not the polygon). The simplification code should 
instead take this knowledge from a line segment that is not being collapsed by 
the simplification. The consequences of this bug occur later, during the 
contains search, when the search query deduces that the geometry is contained 
because the polygon is not closed. The search does not realise it intersects an 
outer line of the polygon because that line is not correctly marked as outer.
   
   The fix is to select the correct flag from the surviving line segments 
instead of the first line segment.
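
The idea in the description above can be sketched in plain Java. This is a hypothetical illustration, not the code changed in the PR: the class and method names are invented, the triangle is assumed to collapse because two of its (already quantized) vertices coincide, and combining the two surviving flags with a logical OR is an assumption made only for this sketch.

```java
// Hypothetical illustration; names and the OR-combination are assumptions.
final class FlatTriangleFlags {

  static boolean same(int[] p, int[] q) {
    return p[0] == q[0] && p[1] == q[1];
  }

  /** The behaviour described as buggy: always trust the first edge (a-b). */
  static boolean firstEdgeFlag(boolean ab, boolean bc, boolean ca) {
    return ab;
  }

  /**
   * Takes the "belongs to the polygon outline" flag from the edges that
   * survive the collapse, ignoring the zero-length edge.
   */
  static boolean survivingEdgeFlag(
      int[] a, int[] b, int[] c, boolean ab, boolean bc, boolean ca) {
    if (same(a, b)) {
      return bc || ca; // a-b collapsed, b-c and c-a survive
    }
    if (same(b, c)) {
      return ca || ab; // b-c collapsed
    }
    if (same(c, a)) {
      return ab || bc; // c-a collapsed
    }
    throw new IllegalArgumentException("triangle does not collapse to a line");
  }

  public static void main(String[] args) {
    int[] a = {0, 0};
    int[] b = {0, 0}; // coincides with a after quantization
    int[] c = {5, 0};
    // a-b is an internal edge, b-c lies on the polygon outline.
    System.out.println(firstEdgeFlag(false, true, false));
    System.out.println(survivingEdgeFlag(a, b, c, false, true, false));
  }
}
```

In this example the first-edge rule reports the line as internal, while the surviving-edge rule marks it as part of the outline, which is the distinction the contains check relies on.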





[GitHub] [lucene] iverase commented on a diff in pull request #12022: Fix flat polygons incorrectly containing intersecting geometries

2022-12-15 Thread GitBox


iverase commented on code in PR #12022:
URL: https://github.com/apache/lucene/pull/12022#discussion_r1049521314


##
lucene/CHANGES.txt:
##
@@ -68,6 +68,8 @@ Bug Fixes
 * LUCENE-10599: LogMergePolicy is more likely to keep merging segments until
   they reach the maximum merge size. (Adrien Grand)
 
+* GITHUB#12020: Fixes bug whereby very flat polygons can incorrectly contain 
intersecting geometries. (Craig Taverner)
+

Review Comment:
   Could you move this entry to Lucene 9.5? I am planning on backporting it, 
so it will make life easier






[GitHub] [lucene] iverase commented on a diff in pull request #12022: Fix flat polygons incorrectly containing intersecting geometries

2022-12-15 Thread GitBox


iverase commented on code in PR #12022:
URL: https://github.com/apache/lucene/pull/12022#discussion_r1049521314


##
lucene/CHANGES.txt:
##
@@ -68,6 +68,8 @@ Bug Fixes
 * LUCENE-10599: LogMergePolicy is more likely to keep merging segments until
   they reach the maximum merge size. (Adrien Grand)
 
+* GITHUB#12020: Fixes bug whereby very flat polygons can incorrectly contain 
intersecting geometries. (Craig Taverner)
+

Review Comment:
   Could you move this entry to the Lucene 9.5 section? I am planning on 
backporting it, so it will make life easier






[GitHub] [lucene] rmuir commented on issue #12021: Large fields with large="true" can be truncated in v9+

2022-12-15 Thread GitBox


rmuir commented on issue #12021:
URL: https://github.com/apache/lucene/issues/12021#issuecomment-1352977757

   This looks like a bug in solr code (SolrDocumentFetcher) so I'd recommend 
opening a bug over at https://github.com/apache/solr





[GitHub] [lucene] Bukhtawar opened a new issue, #12023: Mechanism to interrupt long-running/resource intensive queries

2022-12-15 Thread GitBox


Bukhtawar opened a new issue, #12023:
URL: https://github.com/apache/lucene/issues/12023

   ### Description
   
   As part of https://github.com/opensearch-project/OpenSearch/issues/687 we 
detected that regex queries can run in tight loops for a long time. Below is 
the stack trace of a request for a wildcard query which consumed 100% CPU for 
an hour (although this case was addressed in 
https://issues.apache.org/jira/browse/LUCENE-9981). 
   OpenSearch has a mechanism to cancel a task on timeout, and one other option 
that we were deliberating was to send interrupts to the long-running thread 
if the request is consuming too much memory, CPU, or time, independent of the 
query timeout. 
   The problem is that not all costly executions in Lucene check for interrupts 
the way ExitableDirectoryReader#checkAndThrow does:
   
   ```
   private void checkAndThrow() {
     if (queryTimeout.shouldExit()) {
       throw new ExitingReaderException(
           "The request took too long to iterate over point values. Timeout: "
               + queryTimeout.toString()
               + ", PointValues=" + in);
     } else if (Thread.interrupted()) {
       throw new ExitingReaderException(
           "Interrupted while iterating over point values. PointValues=" + in);
     }
   }
   ```
   
   Sharing the stack trace for the request
   ```
   100.2% (500.8ms out of 500ms) cpu usage by thread 'opensearch[917451917ca5731579187db45dd52853][search][T#5]'
     4/10 snapshots sharing following 36 elements
       app//org.apache.lucene.util.automaton.Operations.determinize(Operations.java:780)
       app//org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
       app//org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:245)
       app//org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:110)
       app//org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:87)
       app//org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:71)
       app//org.apache.lucene.search.WildcardQuery.<init>(WildcardQuery.java:56)
       app//org.opensearch.index.mapper.StringFieldType.wildcardQuery(StringFieldType.java:158)
       app//org.opensearch.index.query.WildcardQueryBuilder.doToQuery(WildcardQueryBuilder.java:259)
       app//org.opensearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:116)
       app//org.opensearch.index.query.BoolQueryBuilder.addBooleanClauses(BoolQueryBuilder.java:337)
       app//org.opensearch.index.query.BoolQueryBuilder.doToQuery(BoolQueryBuilder.java:321)
       app//org.opensearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:116)
       app//org.opensearch.index.query.QueryShardContext.lambda$toQuery$3(QueryShardContext.java:386)
       app//org.opensearch.index.query.QueryShardContext$$Lambda$5010/0x000801d4d840.apply(Unknown Source)
       app//org.opensearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:398)
       app//org.opensearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:385)
       app//org.opensearch.search.SearchService.parseSource(SearchService.java:903)
       app//org.opensearch.search.SearchService.createContext(SearchService.java:740)
       app//org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:442)
       app//org.opensearch.search.SearchService.access$500(SearchService.java:155)
       app//org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:415)
       app//org.opensearch.search.SearchService$2$$Lambda$4889/0x000801ce9840.get(Unknown Source)
       app//org.opensearch.search.SearchService$$Lambda$4891/0x000801ce9c40.get(Unknown Source)
       app//org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:71)
       app//org.opensearch.action.ActionRunnable$$Lambda$3741/0x0008011fa440.accept(Unknown Source)
       app//org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:86)
       app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
       app//org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78)
       app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
       app//org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57)
       app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:774)
       app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
       java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@11.0.16/java.lang.Th

[GitHub] [lucene] rmuir closed issue #12023: Mechanism to interrupt long-running/resource intensive queries

2022-12-15 Thread GitBox


rmuir closed issue #12023: Mechanism to interrupt long-running/resource 
intensive queries
URL: https://github.com/apache/lucene/issues/12023





[GitHub] [lucene] rmuir commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries

2022-12-15 Thread GitBox


rmuir commented on issue #12023:
URL: https://github.com/apache/lucene/issues/12023#issuecomment-1353025015

   determinization has already been removed here. that is the problem.





[GitHub] [lucene] craigtaverner commented on a diff in pull request #12022: Fix flat polygons incorrectly containing intersecting geometries

2022-12-15 Thread GitBox


craigtaverner commented on code in PR #12022:
URL: https://github.com/apache/lucene/pull/12022#discussion_r1049613431


##
lucene/CHANGES.txt:
##
@@ -68,6 +68,8 @@ Bug Fixes
 * LUCENE-10599: LogMergePolicy is more likely to keep merging segments until
   they reach the maximum merge size. (Adrien Grand)
 
+* GITHUB#12020: Fixes bug whereby very flat polygons can incorrectly contain 
intersecting geometries. (Craig Taverner)
+

Review Comment:
   Done in a3f640950180d5047d4fe8204416bdcc4f46e713






[GitHub] [lucene] Bukhtawar commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries

2022-12-15 Thread GitBox


Bukhtawar commented on issue #12023:
URL: https://github.com/apache/lucene/issues/12023#issuecomment-1353034054

   Thanks @rmuir, I am aware this has been addressed. This issue was primarily 
meant to gather thoughts on other queries similar to this one that might be 
expensive or run tight loops but lack interrupts or a good way to 
short-circuit, and on ways to cancel them based on certain heuristics like 
CPU/memory or time.





[GitHub] [lucene] benwtrent opened a new pull request, #12024: Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004

2022-12-15 Thread GitBox


benwtrent opened a new pull request, #12024:
URL: https://github.com/apache/lucene/pull/12024

   `SimpleTextKnnVectorsReader` is used for recreation and testing. It needs to 
handle the new way of searching with BytesRef directly instead of always 
searching with `float`.





[GitHub] [lucene] rmuir commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries

2022-12-15 Thread GitBox


rmuir commented on issue #12023:
URL: https://github.com/apache/lucene/issues/12023#issuecomment-1353034922

   not going to support Thread.interrupt or any nonsense like that. you already 
have the exitable reader: use that





[GitHub] [lucene] rmuir commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries

2022-12-15 Thread GitBox


rmuir commented on issue #12023:
URL: https://github.com/apache/lucene/issues/12023#issuecomment-1353044756

   And the reason i am short with you, again, is because you still implement a 
garbage security model (no authentication required by default).
   
   Stop shipping insecure apps and you'll stop needing to worry about 
situations like this. I'm not going to workaround the insecurity of your 
application with a ton of complexity to lucene.





[GitHub] [lucene] jpountz commented on a diff in pull request #12024: Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004

2022-12-15 Thread GitBox


jpountz commented on code in PR #12024:
URL: https://github.com/apache/lucene/pull/12024#discussion_r1049643447


##
lucene/core/src/test/org/apache/lucene/search/TestVectorScorer.java:
##
@@ -36,13 +36,26 @@
 public class TestVectorScorer extends LuceneTestCase {
 
   public void testFindAll() throws IOException {
+VectorEncoding encoding =
+
VectorEncoding.values()[random().nextInt(VectorEncoding.values().length)];

Review Comment:
   Nit: use RandomPicks#randomFrom?
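
For readers unfamiliar with the suggestion, the following plain-Java sketch shows the shape of such a helper. It is only a stand-in: `RandomFromSketch`, its `randomFrom` method, and the `Encoding` enum are hypothetical names used here so the example compiles without the test framework; the helper referred to in the review comment comes from the randomized testing library and is not reproduced here.

```java
import java.util.Random;

// Hypothetical stand-in; not the test-framework helper itself.
final class RandomFromSketch {

  /** Hypothetical enum standing in for the encoding values picked in the test. */
  enum Encoding {
    BYTE,
    FLOAT32
  }

  /** Picks one element of {@code values} uniformly at random. */
  static <T> T randomFrom(Random random, T[] values) {
    return values[random.nextInt(values.length)];
  }

  public static void main(String[] args) {
    Random random = new Random(42L); // fixed seed keeps the pick reproducible
    Encoding encoding = randomFrom(random, Encoding.values());
    System.out.println(encoding);
  }
}
```

Compared with indexing `values()` by hand, a helper like this states the intent in one call and avoids repeating the array expression.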






[GitHub] [lucene] nosvalds commented on issue #12021: Large fields with large="true" can be truncated in v9+

2022-12-15 Thread GitBox


nosvalds commented on issue #12021:
URL: https://github.com/apache/lucene/issues/12021#issuecomment-1353077520

   Sorry about that, looks like the code link I had was from before the split. 
Moved this issue here: https://issues.apache.org/jira/browse/SOLR-16589





[GitHub] [lucene] nosvalds closed issue #12021: Large fields with large="true" can be truncated in v9+

2022-12-15 Thread GitBox


nosvalds closed issue #12021: Large fields with large="true" can be truncated 
in v9+
URL: https://github.com/apache/lucene/issues/12021





[GitHub] [lucene] benwtrent commented on a diff in pull request #12024: Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004

2022-12-15 Thread GitBox


benwtrent commented on code in PR #12024:
URL: https://github.com/apache/lucene/pull/12024#discussion_r1049648525


##
lucene/core/src/test/org/apache/lucene/search/TestVectorScorer.java:
##
@@ -36,13 +36,26 @@
 public class TestVectorScorer extends LuceneTestCase {
 
   public void testFindAll() throws IOException {
+VectorEncoding encoding =
+
VectorEncoding.values()[random().nextInt(VectorEncoding.values().length)];

Review Comment:
   pushed a commit addressing this






[GitHub] [lucene] jpountz merged pull request #12024: Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004

2022-12-15 Thread GitBox


jpountz merged PR #12024:
URL: https://github.com/apache/lucene/pull/12024





[GitHub] [lucene] iverase merged pull request #12022: Fix flat polygons incorrectly containing intersecting geometries

2022-12-15 Thread GitBox


iverase merged PR #12022:
URL: https://github.com/apache/lucene/pull/12022





[GitHub] [lucene] iverase closed issue #12020: Very flat polygons give incorrect 'contains' result

2022-12-15 Thread GitBox


iverase closed issue #12020: Very flat polygons give incorrect 'contains' result
URL: https://github.com/apache/lucene/issues/12020





[GitHub] [lucene] iverase commented on pull request #12022: Fix flat polygons incorrectly containing intersecting geometries

2022-12-15 Thread GitBox


iverase commented on PR #12022:
URL: https://github.com/apache/lucene/pull/12022#issuecomment-1353129113

   Thanks @craigtaverner!





[GitHub] [lucene] Bukhtawar commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries

2022-12-15 Thread GitBox


Bukhtawar commented on issue #12023:
URL: https://github.com/apache/lucene/issues/12023#issuecomment-1353148073

   Maybe we can discuss the security part separately, but I agree: one idea is 
to detect such queries and prevent them from running in the first place. In 
this case (not the original issue) it was a bad query from an authenticated 
user. 
   Since this specific case and others like it, where looping over terms isn't 
involved, cannot be addressed by `ExitableDirectoryReader` alone, we need 
alternatives to cancel runaway queries if there are other requests which 
could exhibit the same behaviour.





[GitHub] [lucene] uschindler commented on pull request #12016: Upgrade ANTLR to version 4.11.1

2022-12-15 Thread GitBox


uschindler commented on PR #12016:
URL: https://github.com/apache/lucene/pull/12016#issuecomment-1353249709

   Cool, thanks for the "huge whitespace" test!





[GitHub] [lucene] reta commented on pull request #12016: Upgrade ANTLR to version 4.11.1

2022-12-15 Thread GitBox


reta commented on PR #12016:
URL: https://github.com/apache/lucene/pull/12016#issuecomment-1353254656

   @rmuir @uschindler thanks a lot for HUGE help here guys!





[GitHub] [lucene] msokolov commented on a diff in pull request #11946: add similarity threshold for hnsw

2022-12-15 Thread GitBox


msokolov commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1049804451


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
* @throws IllegalArgumentException if k is less than 1
*/
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the k nearest documents to the target vector according 
to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   OK, with the current CR, orthogonal vectors will have a DOT_PRODUCT 
"score" of 0.5, which could be surprising. However, this is similar to how 
result scores are treated elsewhere in Lucene: their value ranges are not 
well-defined; the only guarantee is that higher scores are "more relevant".
   
   Practically speaking, as a user, I think I am going to have to do 
empirical work to know what threshold to use; thresholds are not likely to be 
motivated by some a priori knowledge of what a "good" dot-product is. Given 
that, I'd like to just be able to work with some kind of abstracted score in a 
known range (0 = worst, 1 = best).
   
   Conversely, if we were to switch to using vector similarities that 
correspond more directly to the underlying functions, we would have to clearly 
define them (today we don't actually explain this anywhere, so we'd need to 
document it) and maybe provide methods for computing them. They would also be 
weird, just in a different way. For example, how would we explain 8-bit 
dot-product? Would it be the 8-bit dot-product score normalized by 2^15? 
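
To make the 0.5 figure concrete, here is a minimal sketch of the kind of mapping being discussed. It assumes that the float dot product of unit vectors is shifted from [-1, 1] into [0, 1], and that the 8-bit variant is offset by 0.5 and divided by the dimension times 2^15. The class and method names are hypothetical and are not Lucene's API; the exact formulas should be checked against the Lucene version in use.

```java
// Sketch only: hypothetical helpers that mirror the score mappings discussed
// in this thread. They are not Lucene's API.
final class VectorScoreSketch {

  /** Plain dot product of two equal-length float vectors. */
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Assumed mapping for unit-length float vectors: [-1, 1] becomes [0, 1]. */
  static float dotProductScore(float[] a, float[] b) {
    return (1f + dot(a, b)) / 2f;
  }

  /** Assumed mapping for 8-bit vectors: offset by 0.5, divide by dim * 2^15. */
  static float byteDotProductScore(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return 0.5f + sum / (float) (a.length * (1 << 15));
  }

  public static void main(String[] args) {
    float[] x = {1f, 0f};
    float[] y = {0f, 1f};
    System.out.println(dotProductScore(x, y)); // orthogonal vectors
    System.out.println(dotProductScore(x, x)); // identical unit vectors
  }
}
```

Under these assumptions orthogonal vectors land in the middle of the range rather than at zero, which is the surprise mentioned above.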






[GitHub] [lucene] msokolov commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries

2022-12-15 Thread GitBox


msokolov commented on issue #12023:
URL: https://github.com/apache/lucene/issues/12023#issuecomment-1353313592

   Q: are you aware of https://github.com/apache/lucene/issues/11188? It's a 
fair question whether `ExitableDirectoryReader` is adequate for catching all 
runaway queries. There can be cases where there is some tight loop that never 
visits the index, so it slips through the net, like the regex one. With 
sledgehammers like `Thread.interrupt` off the table (it has always been 
questionable and is going away from JDK altogether soonish I think) we have to 
address these as they arise.
   
   A useful contribution here would be some example queries that have unbounded 
runtimes and are not handled by the DirectoryReader approach. The example here 
has already been addressed, but do we have anything else? 
   
   Note: one thing we have done is to use a custom Query whose Scorer checks 
for timeout. That way if we somehow do a lot of work that advances a docid 
iterator without reading from the index we can catch and throw an 
EarlyTerminationException. This isn't really going to be useful in most 
scenarios but I mention it in case you have some wonky Query execution that 
might benefit.
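
The timeout-checking idea in the note above can be sketched without any Lucene types. In this sketch `DocIterator` is a hypothetical stand-in for a doc ID iterator, `QueryTimeoutException` is a made-up exception name, and the clock is injected so the behavior is easy to test. It is an illustration of the approach, not the custom Query described in the comment.

```java
import java.util.function.LongSupplier;

// Sketch only: DocIterator is a hypothetical stand-in for a doc ID iterator,
// and QueryTimeoutException is a made-up exception name.
final class TimeoutSketch {

  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  interface DocIterator {
    int nextDoc();
  }

  static final class QueryTimeoutException extends RuntimeException {
    QueryTimeoutException(String message) {
      super(message);
    }
  }

  /** Wraps an iterator and checks a deadline every checkInterval calls to nextDoc(). */
  static final class TimeoutCheckingIterator implements DocIterator {
    private final DocIterator in;
    private final LongSupplier nanoClock;
    private final long deadlineNanos;
    private final int checkInterval;
    private int calls;

    TimeoutCheckingIterator(
        DocIterator in, LongSupplier nanoClock, long timeoutNanos, int checkInterval) {
      this.in = in;
      this.nanoClock = nanoClock;
      this.deadlineNanos = nanoClock.getAsLong() + timeoutNanos;
      this.checkInterval = checkInterval;
    }

    @Override
    public int nextDoc() {
      // Only consult the clock every checkInterval calls to keep the overhead low.
      if (++calls % checkInterval == 0 && nanoClock.getAsLong() - deadlineNanos > 0) {
        throw new QueryTimeoutException("query exceeded its time budget");
      }
      return in.nextDoc();
    }
  }

  /** An effectively endless iterator, standing in for a loop that never reads the index. */
  static DocIterator endless() {
    int[] doc = {-1};
    return () -> ++doc[0];
  }
}
```

A caller would wrap the iterator with something like `new TimeoutSketch.TimeoutCheckingIterator(inner, System::nanoTime, 50_000_000L, 1024)` and catch the exception where the search is driven.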





[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

2022-12-15 Thread GitBox


benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1049904819


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
* @throws IllegalArgumentException if k is less than 1
*/
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the k nearest documents to the target vector according to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   Tl;dr
   
   Thank you for bearing with me! I think this is a good change.
   
   I would be happy with the JavaDocs, etc. clearly indicating that this 
threshold relates to the un-boosted vector score, not the raw similarity 
calculation. Dot-product, cosine, and euclidean are well-defined concepts 
outside of Lucene. Lucene mangles (for undoubtedly good reasons) the output of 
these similarities in undocumented ways to fit within boundaries.
   
   > with the current CR,
   
   I don't know what `CR` means. Change request?
   
   > However, this is similar to how result scores are treated elsewhere in 
Lucene - their value ranges are not well-defined;
   
   Agreed, ranges are usually predicated on term statistics, etc. and can 
potentially be considered "unbounded" as the corpus changes. 
   
   However, does Lucene require that all unboosted BM25 scores are between 0 and 1? 
It does seem like an "arbitrary" decision (to me; I don't know the full breadth 
of Lucene optimizations, etc. when it comes to scores) to restrict vector 
similarity in this way. But that is a broader conversation. I have some 
learning to do.
   
   >  I guess practically speaking, as a user, I think I am going to have to do 
empirical work to know what threshold to use; these are not likely going to be 
motivated by some a priori knowledge of what a "good" dot-product is
   
   I would argue that a user could have a priori knowledge here. Think of the 
use case where the user knows the model used to make the vectors. At that 
point, they 100% know what is considered relevant based on their loss function 
and training + test data. They could choose a dot-product or cosine threshold 
that fits within the 90th percentile or so, given their test data results.
   
   I agree that this would be different if users were using an "off the shelf" 
model. In that case, they would probably require hybrid search, combining 
with BM25 to get anything like relevant results (boosting various queries 
accordingly), and would thus learn what settings are required in an unfiltered case.
   
   > if we were to switch to using vector similarities that would correspond 
more directly to the underlying functions, we would have to clearly define them
   
   Cosine, dot-product, and euclidean are all already well defined. The functions 
to calculate them are universally recognized. Where Lucene separates itself is 
the manipulation of the similarity output to fit into a range [0, 1]. I guess 
this is the cost of doing business in Lucene.
   
   I am not suggesting that all scoring of vector document searches change, 
simply that "similarity" and "score" are related but different things. 
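
One way to read the suggestion above about picking a threshold from held-out data is sketched here: compute the raw cosine similarity for query/document pairs known to be relevant, then pick the value that keeps a chosen fraction of them. The class and method names are hypothetical, and the choice of cosine and of a "keep fraction" is an assumption about what was meant, not something stated in the thread.

```java
import java.util.Arrays;

// Sketch only: hypothetical helper for choosing a raw similarity threshold
// from similarities measured on known-relevant pairs.
final class ThresholdSketch {

  /** Cosine similarity of two non-zero vectors of equal length. */
  static double cosine(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / Math.sqrt(normA * normB);
  }

  /**
   * Returns an observed similarity t such that at least keepFraction of the
   * given similarities are greater than or equal to t.
   */
  static double thresholdKeeping(double[] similarities, double keepFraction) {
    double[] sorted = similarities.clone();
    Arrays.sort(sorted); // ascending order
    int keep = (int) Math.ceil(keepFraction * sorted.length);
    keep = Math.max(1, Math.min(sorted.length, keep));
    return sorted[sorted.length - keep];
  }
}
```

A threshold found this way is expressed in raw similarity terms, which is why it matters whether the query parameter expects a similarity or a normalized score.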
   






[GitHub] [lucene] msokolov commented on a diff in pull request #11946: add similarity threshold for hnsw

2022-12-15 Thread GitBox


msokolov commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1050171394


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
* @throws IllegalArgumentException if k is less than 1
*/
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the k nearest documents to the target vector according to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   > I don't know what CR means. Change request?
   
   Sorry, yes: like a PR but from a parallel universe ("code review", actually).
   
   So, theoretical considerations aside, what's the alternative here? We 
would treat the threshold as a "vector similarity" and internally convert it to 
a score. That seems to make sense, since all the conversions are invertible, 
right? I think we'd want to add a normalize method to VectorSimilarity for this 
internal use.
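
A minimal sketch of what such an invertible conversion could look like follows. The enum, the `normalize` and `denormalize` names, and the formulas are assumptions for illustration only (shift and halve for dot product, reciprocal of one plus squared distance for euclidean); they are not an existing Lucene API.

```java
// Sketch only: a hypothetical enum, not Lucene's VectorSimilarityFunction.
enum SimilaritySketch {
  DOT_PRODUCT {
    @Override
    float normalize(float similarity) {
      return (1f + similarity) / 2f;
    }

    @Override
    float denormalize(float score) {
      return 2f * score - 1f;
    }
  },
  EUCLIDEAN {
    // The raw value here is a squared distance, so a smaller raw value
    // maps to a higher score.
    @Override
    float normalize(float squaredDistance) {
      return 1f / (1f + squaredDistance);
    }

    @Override
    float denormalize(float score) {
      return 1f / score - 1f;
    }
  };

  /** Converts a raw similarity (or squared distance) into a score. */
  abstract float normalize(float raw);

  /** Inverse of normalize: recovers the raw value from a score. */
  abstract float denormalize(float score);
}
```

Under these assumptions, a user-supplied dot-product threshold of 0 would become a score threshold of `SimilaritySketch.DOT_PRODUCT.normalize(0f)`, and a `Float.NEGATIVE_INFINITY` default stays negative infinity, so it still filters nothing. Note that for euclidean distance the user's threshold would be a maximum distance rather than a minimum similarity.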






[GitHub] [lucene] rmuir commented on a diff in pull request #11946: add similarity threshold for hnsw

2022-12-15 Thread GitBox


rmuir commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1050228933


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
* @throws IllegalArgumentException if k is less than 1
*/
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the k nearest documents to the target vector according to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   Ben, for the normal scoring, you can look at the tests for the similarities 
package. None of these have a 0 to 1 range or anything like that. Instead, the 
requirements are that the score increases semi-monotonically as term frequency 
increases, decreases with respect to document length, etc. These guarantees 
allow optimizations such as block-max WAND to be applied safely. But there's no 
defined range at all, just lots of crazy floating point hacks so that we can 
safely get really good performance.
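
To illustrate the properties described above, here is a sketch of a common textbook BM25 formulation with assumed defaults k1 = 1.2 and b = 0.75. It is not Lucene's `BM25Similarity`, which differs in details such as how document length is encoded, but it shows the same shape: the score grows with term frequency, shrinks with document length, and has no fixed upper bound because it scales with the IDF of the term.

```java
// Sketch only: textbook-style BM25 with assumed defaults k1 = 1.2, b = 0.75.
// Lucene's BM25Similarity differs in details such as norm encoding.
final class Bm25Sketch {
  static final double K1 = 1.2;
  static final double B = 0.75;

  /** Inverse document frequency: grows as the term gets rarer. */
  static double idf(long docCount, long docFreq) {
    return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  }

  /** Score of a single term in a single document. */
  static double score(double tf, double docLen, double avgDocLen, long docCount, long docFreq) {
    // Longer-than-average documents get a larger norm, which lowers the score.
    double norm = K1 * (1 - B + B * docLen / avgDocLen);
    return idf(docCount, docFreq) * tf * (K1 + 1) / (tf + norm);
  }
}
```

For a term that appears in one document out of a million, this formulation gives a score well above 1, so a [0, 1] range is not something that falls out of BM25-style scoring.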






[GitHub] [lucene] rmuir commented on pull request #12016: Upgrade ANTLR to version 4.11.1

2022-12-15 Thread GitBox


rmuir commented on PR #12016:
URL: https://github.com/apache/lucene/pull/12016#issuecomment-1354141954

   I forced regeneration with `./gradlew -p lucene/expressions regenerate 
--rerun-tasks` just to ensure there were no source code changes and 
regeneration is idempotent / reproducible.





[GitHub] [lucene] rmuir merged pull request #12016: Upgrade ANTLR to version 4.11.1

2022-12-15 Thread GitBox


rmuir merged PR #12016:
URL: https://github.com/apache/lucene/pull/12016





[GitHub] [lucene] rmuir closed issue #11788: Upgrade ANTLR to version 4.11.1

2022-12-15 Thread GitBox


rmuir closed issue #11788: Upgrade ANTLR to version 4.11.1
URL: https://github.com/apache/lucene/issues/11788





[GitHub] [lucene] rmuir commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries

2022-12-15 Thread GitBox


rmuir commented on issue #12023:
URL: https://github.com/apache/lucene/issues/12023#issuecomment-1354168057

   > Maybe will discuss the security part separately, but agree, one idea is to 
detect such queries and prevent running these queries in the first place, in 
this case (not the original issue) it was a bad query from an authenticated user.
   > Since this specific case and the likes of it cannot be addressed by 
`ExitableDirectoryReaders` alone where looping over terms isn't involved, we 
need alternatives to cancel runaway queries if there are other requests which 
could exhibit a similar behaviour (tight loops consuming resources)
   
   It isn't separate. Look at the actual regexes: these are not normal user 
queries, they are malicious, constructed purposefully to cause problems.
   
   That's why the issue is improper security (e.g. authentication, audit 
logging etc). With these in place, if someone tries to run slow searches you 
will be able to attribute the malicious action to that human, hang, draw and 
quarter them, or whatever it is you want to do.
   
   





[GitHub] [lucene] Bukhtawar commented on issue #12023: Mechanism to interrupt long-running/resource intensive queries

2022-12-15 Thread GitBox


Bukhtawar commented on issue #12023:
URL: https://github.com/apache/lucene/issues/12023#issuecomment-1354235380

   This specific case, although the stack trace might appear the same, wasn't a 
regex query but a wildcard and fuzzy query. I will share the redacted request.

