[PR] [Minor] Document operation costs for stale workflow [lucene]

2024-01-09 Thread via GitHub


stefanvodita opened a new pull request, #13000:
URL: https://github.com/apache/lucene/pull/13000

   Documenting the operation costs from the [latest stale workflow 
run](https://github.com/apache/lucene/actions/runs/7454785611/job/20282760199) 
for posterity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2024-01-09 Thread via GitHub


daixque commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1882633384

   @mikemccand @dungba88 Let me ping. Do I still have anything to do for this 
PR? If not, could you merge it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Get rid of deprecated assertThat() usages [lucene]

2024-01-09 Thread via GitHub


sabi0 commented on code in PR #12982:
URL: https://github.com/apache/lucene/pull/12982#discussion_r1445816935


##
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java:
##
@@ -485,7 +470,7 @@ public void testNRTDeletedDocFiltering() throws Exception {
 PrefixCompletionQuery query =
 new PrefixCompletionQuery(analyzer, new Term("suggest_field", "abc_"));
 TopSuggestDocs suggest = indexSearcher.suggest(query, numLive, false);
-assertSuggestions(suggest, expectedEntries.toArray(new 
Entry[expectedEntries.size()]));
+assertSuggestions(suggest, expectedEntries.toArray(new Entry[0]));

Review Comment:
   I had a closer look at the `assertSuggestions()` method. It has 56 usages in 
6 different classes.
   So here are the options I see:
   1. return pre-calculated array sizes and keep this part as is
   2. overload the method to accept either `Entry...` or `Collection` 
and update the 13 usages in the `TestSuggestField` (leaving the other classes 
for later)
   3. proceed with the refactoring to `Collection` and update all 6 
classes (in a separate PR)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Get rid of deprecated assertThat() usages [lucene]

2024-01-09 Thread via GitHub


sabi0 commented on code in PR #12982:
URL: https://github.com/apache/lucene/pull/12982#discussion_r1445839349


##
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java:
##
@@ -766,12 +749,12 @@ public void testScoring() throws Exception {
   assertTrue(topScore >= scoreDoc.score);
 }
 topScore = scoreDoc.score;
-assertThat((float) mappings.get(scoreDoc.key.toString()), 
equalTo(scoreDoc.score));
+assertEquals(scoreDoc.score, (float) 
mappings.get(scoreDoc.key.toString()), 1.0e-8);

Review Comment:
   There are 33 usages of `assertEquals(String, double, double, double)` and 
653 usages of `assertEquals(String, float, float, float)` across the project.
   The deltas vary considerably, from 0 to 0.0001 to 0.001 to 1e-10.
   
   Maybe I should just use 0 delta in this PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Get rid of deprecated assertThat() usages [lucene]

2024-01-09 Thread via GitHub


sabi0 commented on code in PR #12982:
URL: https://github.com/apache/lucene/pull/12982#discussion_r1445839349


##
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java:
##
@@ -766,12 +749,12 @@ public void testScoring() throws Exception {
   assertTrue(topScore >= scoreDoc.score);
 }
 topScore = scoreDoc.score;
-assertThat((float) mappings.get(scoreDoc.key.toString()), 
equalTo(scoreDoc.score));
+assertEquals(scoreDoc.score, (float) 
mappings.get(scoreDoc.key.toString()), 1.0e-8);

Review Comment:
   There are 33 usages of `assertEquals(String, double, double, double)` and 
653 usages of `assertEquals(float, float, float)` across the project.
   The deltas vary considerably, from 0 to 0.0001 to 0.001 to 1e-10.
   
   Maybe I should just use 0 delta in this PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Get rid of deprecated assertThat() usages [lucene]

2024-01-09 Thread via GitHub


sabi0 commented on code in PR #12982:
URL: https://github.com/apache/lucene/pull/12982#discussion_r1445816935


##
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java:
##
@@ -485,7 +470,7 @@ public void testNRTDeletedDocFiltering() throws Exception {
 PrefixCompletionQuery query =
 new PrefixCompletionQuery(analyzer, new Term("suggest_field", "abc_"));
 TopSuggestDocs suggest = indexSearcher.suggest(query, numLive, false);
-assertSuggestions(suggest, expectedEntries.toArray(new 
Entry[expectedEntries.size()]));
+assertSuggestions(suggest, expectedEntries.toArray(new Entry[0]));

Review Comment:
   I had a closer look at the `assertSuggestions()` method. It has 56 usages in 
6 different classes.
   So here are the options I see:
   1. revert pre-calculated array sizes and keep this part as is
   2. overload the method to accept either `Entry...` or `Collection` 
and update the 13 usages in the `TestSuggestField` (leaving the other classes 
for later)
   3. proceed with the refactoring to `Collection` and update all 6 
classes (in a separate PR)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix broken testAllVersionHaveCfsAndNocfs() [lucene]

2024-01-09 Thread via GitHub


sabi0 commented on code in PR #12969:
URL: https://github.com/apache/lucene/pull/12969#discussion_r1445850623


##
lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java:
##
@@ -731,6 +732,7 @@ public void testAllVersionHaveCfsAndNocfs() {
 String prefix = prevFile.replace("-cfs", "");
 assertEquals("Missing -nocfs for backcompat index " + prefix, prefix + 
"-nocfs", file);
   }
+  prevFile = file;

Review Comment:
   The assert in the loop was never executed because this line was missing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix broken testAllVersionHaveCfsAndNocfs() [lucene]

2024-01-09 Thread via GitHub


dweiss merged PR #12969:
URL: https://github.com/apache/lucene/pull/12969


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Get rid of deprecated assertThat() usages [lucene]

2024-01-09 Thread via GitHub


dweiss commented on code in PR #12982:
URL: https://github.com/apache/lucene/pull/12982#discussion_r1445882248


##
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java:
##
@@ -485,7 +470,7 @@ public void testNRTDeletedDocFiltering() throws Exception {
 PrefixCompletionQuery query =
 new PrefixCompletionQuery(analyzer, new Term("suggest_field", "abc_"));
 TopSuggestDocs suggest = indexSearcher.suggest(query, numLive, false);
-assertSuggestions(suggest, expectedEntries.toArray(new 
Entry[expectedEntries.size()]));
+assertSuggestions(suggest, expectedEntries.toArray(new Entry[0]));

Review Comment:
   Given the number of use cases, I'd go with option 3 or 1 - maybe I'm overly 
sensitive about that new Foo[0]...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Get rid of deprecated assertThat() usages [lucene]

2024-01-09 Thread via GitHub


dweiss commented on code in PR #12982:
URL: https://github.com/apache/lucene/pull/12982#discussion_r1445883098


##
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java:
##
@@ -766,12 +749,12 @@ public void testScoring() throws Exception {
   assertTrue(topScore >= scoreDoc.score);
 }
 topScore = scoreDoc.score;
-assertThat((float) mappings.get(scoreDoc.key.toString()), 
equalTo(scoreDoc.score));
+assertEquals(scoreDoc.score, (float) 
mappings.get(scoreDoc.key.toString()), 1.0e-8);

Review Comment:
   Yep, please do. Seems like this is what it was before (without any delta). 
Let's see if CI passes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Get rid of deprecated assertThat() usages [lucene]

2024-01-09 Thread via GitHub


sabi0 commented on code in PR #12982:
URL: https://github.com/apache/lucene/pull/12982#discussion_r1445888566


##
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java:
##
@@ -485,7 +470,7 @@ public void testNRTDeletedDocFiltering() throws Exception {
 PrefixCompletionQuery query =
 new PrefixCompletionQuery(analyzer, new Term("suggest_field", "abc_"));
 TopSuggestDocs suggest = indexSearcher.suggest(query, numLive, false);
-assertSuggestions(suggest, expectedEntries.toArray(new 
Entry[expectedEntries.size()]));
+assertSuggestions(suggest, expectedEntries.toArray(new Entry[0]));

Review Comment:
   There are 135 matches for `toArray\(new \w+\[0]` ... and 138 occurrences 
with pre-calculated array size.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2024-01-09 Thread via GitHub


dungba88 commented on PR #12915:
URL: https://github.com/apache/lucene/pull/12915#issuecomment-1882788058

   I think it's good to go, but I don't have merge permission. Mike should be 
able to help you, otherwise you can try notify the dev mailing list as 
suggested by the bot


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove stale BWC tests [lucene]

2024-01-09 Thread via GitHub


s1monw merged PR #12874:
URL: https://github.com/apache/lucene/pull/12874


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Get rid of deprecated assertThat() usages [lucene]

2024-01-09 Thread via GitHub


dweiss commented on code in PR #12982:
URL: https://github.com/apache/lucene/pull/12982#discussion_r1445940966


##
lucene/core/src/test/org/apache/lucene/geo/TestTessellator.java:
##
@@ -841,20 +839,19 @@ public void testComplexPolygon50() throws Exception {
   public void testComplexPolygon50_WithMonitor() throws Exception {
 String geoJson = GeoTestUtil.readShape("lucene-10563-1.geojson.gz");
 Polygon[] polygons = Polygon.fromGeoJSON(geoJson);
-assertThat("Only one polygon", polygons.length, equalTo(1));
+assertEquals("Only one polygon", 1, polygons.length);
 Polygon polygon = polygons[0];
 TestCountingMonitor monitor = new TestCountingMonitor();
 Tessellator.tessellate(polygon, true, monitor);
-assertThat("Expected many monitor calls", monitor.count, greaterThan(390));
-assertThat("Expected specific number of splits", monitor.splitsStarted, 
equalTo(3));
-assertThat(
-"Expected splits to start and end", monitor.splitsStarted, 
equalTo(monitor.splitsEnded));
+assertTrue("Expected many monitor calls", monitor.count > 390);

Review Comment:
   The downside of converting to an explicit comparison is that you won't know 
what the variable (monitor.count, etc.) was. Maybe these should be converted to 
hamcrest MatcherAssert.assertThat(...), just as JUnit deprecation message 
suggests?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Forebidden Thread.sleep API [lucene]

2024-01-09 Thread via GitHub


shubhamvishu opened a new pull request, #13001:
URL: https://github.com/apache/lucene/pull/13001

   ### Description
   
   This API mark `Thread.sleep` API as forebidden for futures uses in the 
codebase and suppresses the existing usages.
   
   Closes #12946
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Can we ban `Thread.sleep`? [lucene]

2024-01-09 Thread via GitHub


shubhamvishu commented on issue #12946:
URL: https://github.com/apache/lucene/issues/12946#issuecomment-1882877659

   Opened a PR  #13001 to address this (I liked the PR id, first one in the 13k 
range :)). Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for index sorting with document blocks [lucene]

2024-01-09 Thread via GitHub


s1monw commented on PR #12829:
URL: https://github.com/apache/lucene/pull/12829#issuecomment-1882919819

   @mikemccand I did another pass on it and change the wording. I think I know 
why you are confused and I tired to adress it. We use the exact same wording in 
the soft deletes case. I will work on this in a sep. issue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix a bug in ShapeTestUtil [lucene]

2024-01-09 Thread via GitHub


stefanvodita commented on PR #12287:
URL: https://github.com/apache/lucene/pull/12287#issuecomment-1883060775

   We would still have a few more issues to solve after #12757, which are 
documented in #12596, but overall I agree that it makes sense to merge this as 
is.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [BROKEN, for reference only] concurrent hnsw [lucene]

2024-01-09 Thread via GitHub


msokolov closed pull request #12683: [BROKEN, for reference only] concurrent 
hnsw 
URL: https://github.com/apache/lucene/pull/12683


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] NeighborArray is now fixed size [lucene]

2024-01-09 Thread via GitHub


msokolov closed pull request #11784: NeighborArray is now fixed size
URL: https://github.com/apache/lucene/pull/11784


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Forebidden Thread.sleep API [lucene]

2024-01-09 Thread via GitHub


uschindler commented on code in PR #13001:
URL: https://github.com/apache/lucene/pull/13001#discussion_r1446133887


##
gradle/validation/forbidden-apis/defaults.all.txt:
##
@@ -74,3 +74,8 @@ javax.sql.rowset.spi.SyncFactory
 @defaultMessage Math.fma is insanely slow (2500x) in many environments (e.g. 
VMs). Use multiply/add and suffer the extra rounding
 java.lang.Math#fma(float,float,float)
 java.lang.Math#fma(double,double,double)
+
+@defaultMessage We should not use Thread.sleep in lucene code
+java.lang.Thread#sleep(long)

Review Comment:
   I'd use `java.lang.Thread#sleep(**)`.
   
   This prevents any sleep variant coming soon. E.g., the Duration one not yet 
available in JDK 17.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Forebidden Thread.sleep API [lucene]

2024-01-09 Thread via GitHub


uschindler commented on code in PR #13001:
URL: https://github.com/apache/lucene/pull/13001#discussion_r1446135710


##
gradle/validation/forbidden-apis/defaults.all.txt:
##
@@ -74,3 +74,8 @@ javax.sql.rowset.spi.SyncFactory
 @defaultMessage Math.fma is insanely slow (2500x) in many environments (e.g. 
VMs). Use multiply/add and suffer the extra rounding
 java.lang.Math#fma(float,float,float)
 java.lang.Math#fma(double,double,double)
+
+@defaultMessage We should not use Thread.sleep in lucene code
+java.lang.Thread#sleep(long)

Review Comment:
   you should also move the message into the signature. I'd also prefer a more 
meaningful message.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix a bug in ShapeTestUtil [lucene]

2024-01-09 Thread via GitHub


nknize commented on PR #12287:
URL: https://github.com/apache/lucene/pull/12287#issuecomment-1883224420

   > There are existing tests which should fails. 
https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/document/TestShapeDocValues.java#L67
 
   > However because we are catching exception and try until we get the correct 
polygon, it never fails.
   
   That try-catch is intentional. The evil, esoteric / invalid test geometry 
generation is also intentional so we can maximize our test coverage across the 
tessellator and Shape indexing.  This is why we have [Tessellator 
logic](https://github.com/apache/lucene/blob/4d916a754be608e7ffb4e5626e4f99b21b8513f1/lucene/core/src/java/org/apache/lucene/geo/Tessellator.java#L1293)
 to filter colinear and coplanar points so we can best effort index shapes that 
may not be well formed or may not have been pre-validated or cleaned. While we 
expect well formed data, that's not the real world. Users (especially geo 
users) often throw dirty data at the API. And to pre-validate geometries is an 
expensive operation. Not all users want to pay that price. This is why we left 
much of that diligence out of Lucene and in the hands of the integrating API. 
For example, see Elasticsearch [`ignore_malformed` 
parameter](https://github.com/elastic/elasticsearch/pull/24654). For the 
explicit shape doc value boundi
 ng box test referenced, we expect a valid polygon bounding box for test 
purposes, but (as just pointed out) that reused polygon generator could create 
a malformed one and an invalid bounding box. Rather than write a new polygon 
generator I opted to reuse the existing one and just retry until a valid 
polygon is created.  What you could do is just write a new polygon generator 
that always creates a valid well-formed polygon. I'd be concerned, though, 
about new contributors always using that to create perfect data, which is not 
real world.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Avoid reset BlockDocsEnum#freqBuffer when indexHasFreq is false [lucene]

2024-01-09 Thread via GitHub


rmuir commented on PR #12997:
URL: https://github.com/apache/lucene/pull/12997#issuecomment-1883250347

   > I tried to add checks to AssertingLeafReader to fail when reading 
freqs/positions/offsets/payloads if they have not been requested in the flags, 
and there are many test failures. It's not entirely clear to me how intentional 
that is, @rmuir do you have context by any chance?
   
   @jpountz The current behavior is documented: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/PostingsEnum.java#L73-L99
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Avoid reset BlockDocsEnum#freqBuffer when indexHasFreq is false [lucene]

2024-01-09 Thread via GitHub


rmuir commented on PR #12997:
URL: https://github.com/apache/lucene/pull/12997#issuecomment-1883254450

   as far as failing the methods, i'm not really sure if it is a good idea for 
frequencies. Treating the value as 1 seems reasonable, and e.g. i'm pretty sure 
TermScorer works correctly because of it, the `omitTF` has behaved like this 
for a really long time as far as i know, as the typical use case for this was 
super-short documents where frequency doesn't matter.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Forebidden Thread.sleep API [lucene]

2024-01-09 Thread via GitHub


shubhamvishu commented on code in PR #13001:
URL: https://github.com/apache/lucene/pull/13001#discussion_r1446269754


##
gradle/validation/forbidden-apis/defaults.all.txt:
##
@@ -74,3 +74,8 @@ javax.sql.rowset.spi.SyncFactory
 @defaultMessage Math.fma is insanely slow (2500x) in many environments (e.g. 
VMs). Use multiply/add and suffer the extra rounding
 java.lang.Math#fma(float,float,float)
 java.lang.Math#fma(double,double,double)
+
+@defaultMessage We should not use Thread.sleep in lucene code
+java.lang.Thread#sleep(long)

Review Comment:
   Makes sense. I addressed these in the new revision



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Forebidden Thread.sleep API [lucene]

2024-01-09 Thread via GitHub


shubhamvishu commented on code in PR #13001:
URL: https://github.com/apache/lucene/pull/13001#discussion_r1446269754


##
gradle/validation/forbidden-apis/defaults.all.txt:
##
@@ -74,3 +74,8 @@ javax.sql.rowset.spi.SyncFactory
 @defaultMessage Math.fma is insanely slow (2500x) in many environments (e.g. 
VMs). Use multiply/add and suffer the extra rounding
 java.lang.Math#fma(float,float,float)
 java.lang.Math#fma(double,double,double)
+
+@defaultMessage We should not use Thread.sleep in lucene code
+java.lang.Thread#sleep(long)

Review Comment:
   Makes sense. I updated these in the new revision



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] HnwsGraph creates disconnected components [lucene]

2024-01-09 Thread via GitHub


angadp commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-1883327952

   I +1 this issue and would like to try the gains in Amazon codebase. I think 
with lowered max-conn this can give some latency gains.
   
   I also feel that there is a use case for indexes with less number of updates 
where the disconnected nodes might never be connected again. Looking forward to 
the results from the latest benchmark run.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Explore moving HNSW's NeighborQueue to a radix heap [LUCENE-10383] [lucene]

2024-01-09 Thread via GitHub


angadp commented on issue #11419:
URL: https://github.com/apache/lucene/issues/11419#issuecomment-1883368469

   Given the comment by @benwtrent is this issue still relevant?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Forebidden Thread.sleep API [lucene]

2024-01-09 Thread via GitHub


uschindler commented on PR #13001:
URL: https://github.com/apache/lucene/pull/13001#issuecomment-1883368845

   P.S.: You don't need to force-push, it is better to keep track what was 
changed in a PR. The PR will get squashed anyways.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Forebidden Thread.sleep API [lucene]

2024-01-09 Thread via GitHub


shubhamvishu commented on PR #13001:
URL: https://github.com/apache/lucene/pull/13001#issuecomment-1883422583

   > In general this looks fine, although I think we should also work on remove 
some of those sleeps. For all others there should be an explanation, why the 
test needs sleeping.
   
   +1 to remove some to these usages. I didn't pursue since there are just too 
many of them. I think we could scope this PR to only add the api to forbidden 
ones and suppress all the usages and then follow up with a second pass or PR to 
remove those usages as much as possible. If that sounds good I could create a 
separate issue to reduce the existing usages(or we could use the same issue 
too). Let me know what do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix a bug in ShapeTestUtil [lucene]

2024-01-09 Thread via GitHub


heemin32 commented on PR #12287:
URL: https://github.com/apache/lucene/pull/12287#issuecomment-1883449137

   >That try-catch is intentional. 
   
   However, the implementation between latlon and xy are different then. 
https://github.com/apache/lucene/blob/7b8aece125aabff2823626d5b939abf4747f63a7/lucene/core/src/test/org/apache/lucene/document/TestShapeDocValues.java#L77-L95
   
   For latlon, it does not have try-catch statement wrapping 
`GeoTestUtil.nextPolygon()`. I believe `GeoTestUtil.nextPolygon()` returns 
valid polygon always but `ShapeTestUtil.nextPolygon()` does not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix a bug in ShapeTestUtil [lucene]

2024-01-09 Thread via GitHub


nknize commented on PR #12287:
URL: https://github.com/apache/lucene/pull/12287#issuecomment-1883477836

   > ... I believe `GeoTestUtil.nextPolygon()` returns valid polygon always but 
`ShapeTestUtil.nextPolygon()` does not.
   
   A quick glance looks like the difference is from 
[LUCENE-9192](https://github.com/apache/lucene/issues/10232) where 
ShapeTestUtil#nextPolygon will only consider executing the try-catch path on 
Nightly runs. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Split taxonomy arrays across chunks [lucene]

2024-01-09 Thread via GitHub


stefanvodita commented on PR #12995:
URL: https://github.com/apache/lucene/pull/12995#issuecomment-1883476554

   The results are in. I don't see any significant p-values (< 0.05).
   `python3 src/python/localrun.py -source wikimediumall -r`
   
   ```
   TaskQPS baseline  StdDevQPS 
my_modified_version  StdDevPct diff p-value
  MedPhrase   27.55  (7.2%)   27.08  
(5.8%)   -1.7% ( -13% -   12%) 0.408
 IntNRQ   19.20 (10.6%)   18.89  
(7.1%)   -1.6% ( -17% -   18%) 0.566
  BrowseMonthTaxoFacets9.67  (8.9%)9.52 
(10.7%)   -1.6% ( -19% -   19%) 0.613
BrowseRandomLabelTaxoFacets4.01 (25.1%)3.97 
(26.0%)   -1.0% ( -41% -   66%) 0.902
   HighSloppyPhrase   28.64  (3.5%)   28.41  
(2.4%)   -0.8% (  -6% -5%) 0.406
 AndHighLow  325.06  (3.8%)  322.67  
(7.0%)   -0.7% ( -11% -   10%) 0.680
BrowseRandomLabelSSDVFacets2.92  (1.8%)2.90  
(2.6%)   -0.6% (  -4% -3%) 0.403
  BrowseDayOfYearSSDVFacets4.08  (3.8%)4.06  
(3.7%)   -0.6% (  -7% -7%) 0.625
Prefix3  142.86  (5.0%)  142.32  
(5.4%)   -0.4% ( -10% -   10%) 0.819
   HighTermTitleBDVSort7.40  (6.3%)7.38  
(6.4%)   -0.3% ( -12% -   13%) 0.862
LowSloppyPhrase   11.47  (1.9%)   11.44  
(1.7%)   -0.3% (  -3% -3%) 0.657
LowSpanNear6.86  (2.0%)6.85  
(1.8%)   -0.2% (  -3% -3%) 0.754
 HighPhrase   67.07  (3.1%)   66.94  
(3.1%)   -0.2% (  -6% -6%) 0.850
  LowPhrase7.77  (3.2%)7.76  
(3.4%)   -0.1% (  -6% -6%) 0.888
   HighSpanNear3.81  (3.0%)3.80  
(2.8%)   -0.1% (  -5% -5%) 0.875
   HighIntervalsOrdered1.61  (2.5%)1.61  
(2.4%)   -0.1% (  -4% -4%) 0.853
 AndHighMed   76.92  (6.7%)   76.83  
(8.0%)   -0.1% ( -13% -   15%) 0.959
MedSloppyPhrase   15.30  (1.9%)   15.29  
(1.7%)   -0.1% (  -3% -3%) 0.850
 OrHighHigh   20.33  (7.3%)   20.31  
(6.4%)   -0.1% ( -12% -   14%) 0.961
LowIntervalsOrdered7.12  (2.5%)7.12  
(2.4%)   -0.0% (  -4% -4%) 0.955
AndHighMedDayTaxoFacets   11.03  (2.2%)   11.04  
(2.2%)0.0% (  -4% -4%) 0.959
   OrNotHighMed  216.20  (2.6%)  216.28  
(3.0%)0.0% (  -5% -5%) 0.967
MedIntervalsOrdered   23.70  (1.8%)   23.71  
(1.6%)0.0% (  -3% -3%) 0.944
MedSpanNear   13.33  (1.9%)   13.34  
(1.7%)0.1% (  -3% -3%) 0.910
   AndHighHighDayTaxoFacets7.73  (3.5%)7.74  
(3.2%)0.1% (  -6% -7%) 0.906
   MedTermDayTaxoFacets   15.49  (2.2%)   15.54  
(1.6%)0.3% (  -3% -4%) 0.590
  HighTermTitleSort  121.92  (1.8%)  122.33  
(1.6%)0.3% (  -2% -3%) 0.533
 TermDTSort  142.74  (2.1%)  143.26  
(3.4%)0.4% (  -5% -5%) 0.683
   PKLookup  172.65  (3.0%)  173.29  
(2.8%)0.4% (  -5% -6%) 0.687
Respell   43.69  (2.3%)   43.87  
(2.7%)0.4% (  -4% -5%) 0.613
  HighTermDayOfYearSort  210.29  (2.8%)  211.25  
(3.1%)0.5% (  -5% -6%) 0.626
AndHighHigh   23.87 (11.4%)   23.99 
(11.7%)0.5% ( -20% -   26%) 0.888
   BrowseDateSSDVFacets1.19  (5.2%)1.19  
(4.2%)0.6% (  -8% -   10%) 0.698
   Wildcard  186.59  (2.6%)  187.68  
(2.9%)0.6% (  -4% -6%) 0.505
  OrHighMed   90.56  (5.3%)   91.09  
(4.9%)0.6% (  -9% -   11%) 0.715
  OrNotHighHigh  354.86  (3.6%)  356.98  
(3.9%)0.6% (  -6% -8%) 0.611
  OrHighNotHigh  306.95  (3.8%)  308.88  
(4.2%)0.6% (  -7% -9%) 0.620
   OrHighNotMed  363.76  (3.9%)  366.46  
(4.0%)0.7% (  -6% -8%) 0.548
   BrowseDateTaxoFacets4.97 (27.3%)5.00 
(27.3%)0.8% ( -42% -   76%) 0.929
  BrowseDayOfYearTaxoFacets5.06 (26.8%)5.09 
(26.9%)0.8% ( -41% -   74%) 0.927

Re: [PR] Fix a bug in ShapeTestUtil [lucene]

2024-01-09 Thread via GitHub


heemin32 commented on PR #12287:
URL: https://github.com/apache/lucene/pull/12287#issuecomment-1883491755

   Yes. But also what I meant is `testLatLonPolygonCentroid()` test calls 
`Polygon p = GeoTestUtil.nextPolygon();` directly, but 
`testXYPolygonCentroid()` test calls `XYPolygon p = (XYPolygon) 
BaseXYShapeTestCase.ShapeType.POLYGON.nextShape();` which calls 
`ShapeTestUtil.nextPolygon();` inside try-catch. We need to call 
`ShapeTestUtil.nextPolygon();` directly from `testXYPolygonCentroid()`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Explore moving HNSW's NeighborQueue to a radix heap [LUCENE-10383] [lucene]

2024-01-09 Thread via GitHub


benwtrent commented on issue #11419:
URL: https://github.com/apache/lucene/issues/11419#issuecomment-1883500270

   @angadp I think it's more relevant given recent refactors. A radix heap may 
be possible now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Explore moving HNSW's NeighborQueue to a radix heap [LUCENE-10383] [lucene]

2024-01-09 Thread via GitHub


angadp commented on issue #11419:
URL: https://github.com/apache/lucene/issues/11419#issuecomment-1883522255

   Thanks! I was looking for an open issue and will check this out.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] migrate OpenNLP 'ant train-test-models' to Gradle [lucene]

2024-01-09 Thread via GitHub


cpoerschke opened a new issue, #13002:
URL: https://github.com/apache/lucene/issues/13002

   ### Description
   
   ref: 
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/analysis/opennlp/build.xml#L52-L84


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] upgrade to OpenNLP 2.3.1 [lucene]

2024-01-09 Thread via GitHub


cpoerschke commented on PR #12674:
URL: https://github.com/apache/lucene/pull/12674#issuecomment-1883525174

   > It would be good migrate the model regeneration to gradle - probably a 
good follow-up issue on its own though.
   
   Makes sense, opened #13002 for that.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] upgrade to OpenNLP 2.3.1 [lucene]

2024-01-09 Thread via GitHub


cpoerschke commented on code in PR #12674:
URL: https://github.com/apache/lucene/pull/12674#discussion_r1446425261


##
lucene/analysis/opennlp/src/tools/test-model-data/README.txt:
##
@@ -3,4 +3,4 @@ Training data derived from Reuters corpus in very unscientific 
way.
 Tagging done with CCG Urbana-Champaign online demos:
http://cogcomp.cs.illinois.edu/page/demos
 
-Run 'ant train-test-models' to generate models from training data here.

Review Comment:
   #13002 is for migration to Gradle.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] upgrade to OpenNLP 2.3.1 [lucene]

2024-01-09 Thread via GitHub


cpoerschke commented on PR #12674:
URL: https://github.com/apache/lucene/pull/12674#issuecomment-1883534556

   > What's probably missing is ... a note on migration?
   
   Yes, indeed. It being a 1.x to 2.x dependency upgrade, I'm not yet familiar 
enough with OpenNLP to gauge what's involved in migrating. Perhaps @jzonthemtn 
and/or @epugh would have insights to share on that? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix a bug in ShapeTestUtil [lucene]

2024-01-09 Thread via GitHub


nknize commented on PR #12287:
URL: https://github.com/apache/lucene/pull/12287#issuecomment-1883537742

   > We need to call `ShapeTestUtil.nextPolygon();` directly from 
`testXYPolygonCentroid()`.
   
   Doing that could create a polygon that can't be tessellated. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix a bug in ShapeTestUtil [lucene]

2024-01-09 Thread via GitHub


heemin32 commented on PR #12287:
URL: https://github.com/apache/lucene/pull/12287#issuecomment-1883551295

   > Doing that could create a polygon that can't be tessellated.
   
   But `testLatLonPolygonCentroid()` does call `GeoTestUtils.nextPolygon()` 
directly and tessellation works fine.
   
   
https://github.com/apache/lucene/blob/7b8aece125aabff2823626d5b939abf4747f63a7/lucene/core/src/test/org/apache/lucene/document/TestShapeDocValues.java#L77-L86


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Exploring GPU based kNN vector search [lucene]

2024-01-09 Thread via GitHub


chatman opened a new issue, #13003:
URL: https://github.com/apache/lucene/issues/13003

   ### Description
   
   Through this issue, I wish to explore integrating NVIDIA's kNN indexing and 
search support, https://github.com/rapidsai/raft. Through our initial 
benchmarks/prototypes, we found it a lot faster than our HNSW based search.
   Along the way, we shall add more details of our experiments through this 
issue. And will open a PR as soon as something takes shape (right now, things 
are in extremely early proof-of-concept state).
   
   One of the potential results of this work, if explorations prove worthwhile, 
can be a Lucene module based on a JNI wrapper around the Raft library.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Add post-filter capability to `SynonymGraphFilter` [lucene]

2024-01-09 Thread via GitHub


mikemccand opened a new issue, #13004:
URL: https://github.com/apache/lucene/issues/13004

   ### Description
   
   [I'm not sure how general this is but figured I'd open this to see if there 
is interest / other use cases:]
   
   At Amazon product search team we have synonyms that are sometimes 
conditionally applied depending on some context about the document or query.  
For example, `apple` might be a fruit in the grocery subset of Amazon's 
catalog, or a computer in the electronics catalog.
   
   Today we implement this very inefficiently: we compile N massive synonym 
maps, mostly (wastefully) with the same synonyms except for a few that are 
catalog / query context specific.  This is wasteful and takes gobbs of heap.  
(Hmm, separately: `SynonymGraphFilter` should be fixed to use off-heap FSTs -- 
I'll open a crab spinoff issue).
   
   Maybe instead we could allow each synonym rule to optionally have some 
metadata that may be used, at matching time, to post-filter, only applying the 
synonym based on the context of the current document/query?  I'm not sure how 
this would work -- maybe N labels that are compiled to an int/long bitset 
recorded into each FST rule?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] `SynonymGraphFilter` should read FSTs off-heap? [lucene]

2024-01-09 Thread via GitHub


mikemccand opened a new issue, #13005:
URL: https://github.com/apache/lucene/issues/13005

   ### Description
   
   [Spinoff from #13004]
   
   Recently we added off-heap FST reading, but only switched to it in limited 
cases, starting with the terms index in `BlockTree` terms dictionary.  Should 
we switch to off-heap for `SynonymGraphFilter` too?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Forbidden Thread.sleep API [lucene]

2024-01-09 Thread via GitHub


mikemccand commented on PR #13001:
URL: https://github.com/apache/lucene/pull/13001#issuecomment-1883719867

   +1 to merge this, first, so we stop the bleeding (no more `Thread.sleep` 
added to the code base), and in follow-on PR we can start whittling down the 
existing grandfather'd `Thread.sleep` calls.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2024-01-09 Thread via GitHub


MarcusSorealheis commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1883963445

   Great to finally see you in the Lucene repo @kevindrosendahl after all these 
years. 🍰 The work you have done here is stellar and the whole world welcomes 
the diligence. I hope we can keep this momentum. I have a few concerns that are 
not intended to block the project 
   
   I second the question, "How is segment merging implemented by Lucene 
Vamana?" I have a suspicion that extending `MergePolicy` could be in order for 
a production-ready implementation but I have such limited knowledge for such a 
thought.
   
   I have general concerns about large copy overheads and the restrictions of 
POSIX AIO from experience dealing with the failures of storing data on disk, as 
well as random disk failures.  `io_uring` certainly offers some theoretical 
promise in my mind. I know that is a newer Linux kernel module/interface, so 
that would lead me to believe actually taking this implementation forward would 
require a bit of comprehensive testing. 
   
   One particular concern I have, as you specifically might have guessed, would 
be around soak testing. Has anyone looked into this issue since November, or 
has anyone run the tests for an extended period?  
   
   The other concern I have is that I suspect the community would probably need 
to keep an eye on `io_uring` given its age. I was instructed long ago to use 
old interfaces whenever possible but the person who told me retired so I cannot 
ask why. I would suspect the newer interfaces tend to evolve more quickly than 
the ones than the ones that have been around for a while. Surely that burden 
would likely fall to the Policeman. In general, this is an interesting 
development and I look forward to continuing to explore. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] upgrade to OpenNLP 2.3.1 [lucene]

2024-01-09 Thread via GitHub


epugh commented on PR #12674:
URL: https://github.com/apache/lucene/pull/12674#issuecomment-1883984387

   I don't know about the migration side specifically, but based on 
https://opennlp.apache.org/news/release-200.html and some of the other release 
notes, here is a first stab.  
   
   ```
   OpenNLP 2x opens the door to accessing various models via the ONNX runtime.  
To migrate you will need to update any deprecated OpenNLP methods that you may 
be using and be running on Java 17.
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Lazily write the FST padding byte [lucene]

2024-01-09 Thread via GitHub


github-actions[bot] commented on PR #12981:
URL: https://github.com/apache/lucene/pull/12981#issuecomment-1883992139

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org