Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-08 Thread via GitHub


mikemccand commented on PR #12633:
URL: https://github.com/apache/lucene/pull/12633#issuecomment-1751999625

   Here are the results from running `test_all_sizes.py` then 
`results_to_md.py`:
   
   |NodeHash size|FST (mb)|RAM (mb)|FST build time (sec)|   

   
   |-||||   

   
   |0|577.4|0.0|35.2|   

   
   |4|586.5|0.0|43.2|   

   
   |8|587.0|0.0|46.4|   

   
   |16|585.2|0.0|44.8|  

   
   |32|582.0|0.0|45.9|  

   
   |64|578.8|0.0|45.4|  

   
   |128|573.0|0.0|45.9| 

   
   |256|563.6|0.0|46.1| 

   
   |512|551.2|0.0|45.4| 

   
   |1024|537.5|0.0|45.7|

   
   |2048|523.4|0.0|46.0|

   
   |4096|509.5|0.1|45.6|

   
   |8192|495.8|0.1|45.2|

   
   |16384|481.8|0.2|46.3|   

   
   |32768|461.1|0.5|45.2|   

   
   |65536|447.2|1.0|45.7|   

   
   |131072|432.4|2.0|46.3|  

   
   |262144|418.6|4.0|46.3|  

   
   |524288|402.4|8.0|46.9|  

   
   |1048576|391.0|16.0|50.0|

   
   |2097152|380.8|32.0|55.2|

   
   |4194304|371.4|64.0|58.3|

   
   |8388608|362.5|128.0|59.9|   

   
   |16777216|356.1|256.0|59.3| 

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


ChrisHegarty commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752003494

   Thanks for looking into this @rmuir, I've been thinking similar myself (just 
didn't get around to anything other than the thinking! )
   
   On my Mac M2.
   
   JDK 20.0.2.
   ```
   Benchmark   (size)   Mode  Cnt  Score   
Error   Units
   BinaryDotProductBenchmark.dotProductNew   1024  thrpt5  6.590 ± 
0.098  ops/us
   BinaryDotProductBenchmark.dotProductNewNew1024  thrpt5  7.769 ± 
0.102  ops/us
   BinaryDotProductBenchmark.dotProductOld   1024  thrpt5  3.159 ± 
0.034  ops/us
   ```
   
   JDK 21
   ```
   Benchmark   (size)   Mode  Cnt  Score   
Error   Units
   BinaryDotProductBenchmark.dotProductNew   1024  thrpt5  6.546 ± 
0.054  ops/us
   BinaryDotProductBenchmark.dotProductNewNew1024  thrpt5  7.696 ± 
0.103  ops/us
   BinaryDotProductBenchmark.dotProductOld   1024  thrpt5  2.893 ± 
0.306  ops/us
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


ChrisHegarty commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752024575

   ```
 // sum into accumulators
 Vector prod16 = prod16_1.add(prod16_2);
 acc = acc.add(prod16.convert(VectorOperators.S2I, 0));
 acc = acc.add(prod16.convert(VectorOperators.S2I, 1));
   ```
   
   What is the maximum value that we can see in the input bytes? Can they every 
hold `-128`? Do we need to handle "overflow" in the short accumulation and 
subsequent conversion to `int`?  If so then we can use 
`VectorOperators.ZERO_EXTEND_S2I` (rather than `S2I`). ( this is just a 
question, rather than a suggestion, since I only thought of this after leaving 
my keyboard) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Enable rank-unsafe optimization of top-k hit computations by quantizing scores. [lucene]

2023-10-08 Thread via GitHub


mikemccand commented on PR #12628:
URL: https://github.com/apache/lucene/pull/12628#issuecomment-1752028823

   Very cool, surprisingly impactful!
   
   > I ran the Tantivy benchmark with TOP_10 and TOP_100 commands
   
   This is the Tantivy benchmark tooling, but you are comparing  Lucene (main) 
to Lucene (your PR) right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


ChrisHegarty commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752029230

   And of course, `ZERO_EXTEND_S2I`, will work in the maximum boundary case, 
but not in others.  So the question is then just about the maximum value of the 
bytes in these input arrays. :-( 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality [lucene]

2023-10-08 Thread via GitHub


mikemccand commented on issue #12542:
URL: https://github.com/apache/lucene/issues/12542#issuecomment-1752030874

   Talking to @sokolovm at Community Over Code 2023 he suggested another idea 
here: instead of a (RAM hungry) hash table, couldn't we use the growing FST 
itself to lookup suffixes?
   
   If we added the reversed (incoming) transitions to each FST node then we 
could do such a suffix lookup in reverse of the normal forward only FST lookup. 
 Maybe we could have these additional reverse transitions in a separate RAM 
efficient structure, just used during construction?  (Because the written FST 
only needs the forwards-only lookup).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub


mikemccand commented on PR #12631:
URL: https://github.com/apache/lucene/pull/12631#issuecomment-1752031210

   > sum | 31606784 | 27188690 | -13.98%
   
   WHOA, wow!  This is a massive gain for such a tiny change :)  I'll try to 
review soon!  Nice to revisit ancient `TODO`s in the source code ... :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752033176

   > What is the maximum value that we can see in the input bytes? 
   
   All possible values is how i test
   
   > Can they every hold `-128`? 
   
   Yes!
   
   > Do we need to handle "overflow" in the short accumulation and subsequent 
conversion to `int`?
   
   No! integer fma won't overflow here. and i know that isn't obvious and 
probably needs a code comment lol. that's why it only works for these two 
methods and not the square.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


ChrisHegarty commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752035773

   Ok, cool. If there is not already one, we should add a test to the Panama / 
scalar unit test for the boundary values. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752036396

   yeah agreed: we should test the boundaries for all 3 functions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub


mikemccand commented on code in PR #12631:
URL: https://github.com/apache/lucene/pull/12631#discussion_r1349699402


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/FieldReader.java:
##
@@ -99,6 +102,26 @@ public final class FieldReader extends Terms {
 */
   }
 
+  long readLongOutput(DataInput in) throws IOException {

Review Comment:
   Maybe rename to `readVLongOutput`?  Make it clearly we are doing some form 
of vlong encoding?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java:
##
@@ -460,6 +460,19 @@ public String toString() {
   return "BLOCK: prefix=" + brToString(prefix);
 }
 
+private static void writeMSBVLong(long i, DataOutput scratchBytes) throws 
IOException {

Review Comment:
   Maybe rename `i` to `value`?  And then you can use `i` for the for loop 
below?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java:
##
@@ -81,8 +81,11 @@ public final class Lucene90BlockTreeTermsReader extends 
FieldsProducer {
   /** Initial terms format. */
   public static final int VERSION_START = 0;
 
+  /** Version that uses MSB VLong encoded output */
+  public static final int VERSION_MSB_VLONG_OUTPUT = 1;
+
   /** Current terms format. */
-  public static final int VERSION_CURRENT = VERSION_START;
+  public static final int VERSION_CURRENT = VERSION_MSB_VLONG_OUTPUT;

Review Comment:
   Add a comment explaining what changed, and link to the GitHub issue?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java:
##
@@ -460,6 +460,19 @@ public String toString() {
   return "BLOCK: prefix=" + brToString(prefix);
 }
 
+private static void writeMSBVLong(long i, DataOutput scratchBytes) throws 
IOException {
+  assert i >= 0;
+  // Keep zero bits on most significant byte to have more chance to get 
prefix bytes shared.
+  // e.g. we expect 0x7FFF stored as [0x81, 0xFF, 0x7F] but not [0xFF, 
0xFF, 0x40]
+  int LSBVLongBytes = (Long.SIZE - Long.numberOfLeadingZeros(i) - 1) / 7 + 
1;

Review Comment:
   Maybe rename to `bytesNeeded`?  (Full disclosure this was suggested by 
[Bard](https://bard.google.com)).



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java:
##
@@ -460,6 +460,19 @@ public String toString() {
   return "BLOCK: prefix=" + brToString(prefix);
 }
 
+private static void writeMSBVLong(long i, DataOutput scratchBytes) throws 
IOException {
+  assert i >= 0;
+  // Keep zero bits on most significant byte to have more chance to get 
prefix bytes shared.
+  // e.g. we expect 0x7FFF stored as [0x81, 0xFF, 0x7F] but not [0xFF, 
0xFF, 0x40]
+  int LSBVLongBytes = (Long.SIZE - Long.numberOfLeadingZeros(i) - 1) / 7 + 
1;
+  i <<= Long.SIZE - LSBVLongBytes * 7;
+  for (int j = 1; j < LSBVLongBytes; j++) {
+scratchBytes.writeByte((byte) (((i >>> 57) & 0x7FL) | 0x80));

Review Comment:
   Maybe @uschindler can help review this heavy math :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752039360

   yeah, you are right, i am wrong. the trick only works in the unsigned case, 
Byte.MIN_VALUE is a problem :(


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] add tests for vectorutils integer boundaries [lucene]

2023-10-08 Thread via GitHub


rmuir opened a new pull request, #12634:
URL: https://github.com/apache/lucene/pull/12634

   Let's improve the testing for the boundary cases and check them explicitly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752041404

   at least we can improve the testing out of this: 
https://github.com/apache/lucene/pull/12634


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub


gf2121 commented on code in PR #12631:
URL: https://github.com/apache/lucene/pull/12631#discussion_r1349705693


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java:
##
@@ -81,8 +81,11 @@ public final class Lucene90BlockTreeTermsReader extends 
FieldsProducer {
   /** Initial terms format. */
   public static final int VERSION_START = 0;
 
+  /** Version that uses MSB VLong encoded output */
+  public static final int VERSION_MSB_VLONG_OUTPUT = 1;
+
   /** Current terms format. */
-  public static final int VERSION_CURRENT = VERSION_START;
+  public static final int VERSION_CURRENT = VERSION_MSB_VLONG_OUTPUT;

Review Comment:
   I wrote some comments on `VERSION_MSB_VLONG_OUTPUT`. I prefer to leave 
comments there, as `VERSION_CURRENT` could probably link to other version in 
feature. WDYT?
   
   Issue added :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752049654

   don't worry, i have a plan B. it is just frustrating due to the nightmare of 
operating on the mac, combined with the fact this benchmark and lucene source 
is a separate repo. it makes the situation very slow and error-prone


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752050233

   see latest commit for the idea. on my mac it gives a decent boost. it uses 
"32-bit" vector by loading 64-bit vector from array but only processing half of 
it.  The tests should fail as i need to fix the other functions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub


mikemccand commented on code in PR #12631:
URL: https://github.com/apache/lucene/pull/12631#discussion_r1349711457


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java:
##
@@ -81,8 +81,11 @@ public final class Lucene90BlockTreeTermsReader extends 
FieldsProducer {
   /** Initial terms format. */
   public static final int VERSION_START = 0;
 
+  /** Version that uses MSB VLong encoded output */
+  public static final int VERSION_MSB_VLONG_OUTPUT = 1;
+
   /** Current terms format. */
-  public static final int VERSION_CURRENT = VERSION_START;
+  public static final int VERSION_CURRENT = VERSION_MSB_VLONG_OUTPUT;

Review Comment:
   Oh, sorry, yes - `VERSION_MSB_VLONG_OUTPUT` is the right place for the 
comment!  Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub


mikemccand commented on PR #12631:
URL: https://github.com/apache/lucene/pull/12631#issuecomment-1752050479

   I kicked off a `luceneutil` run ... I'll post results here soonish.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] add tests for vectorutils integer boundaries [lucene]

2023-10-08 Thread via GitHub


rmuir merged PR #12634:
URL: https://github.com/apache/lucene/pull/12634


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752063622

   ok on my mac i see:
   
   ```
   Benchmark   (size)   Mode  Cnt  Score   
Error   Units
   BinaryCosineBenchmark.cosineDistanceNew   1024  thrpt5  2.261 ± 
0.007  ops/us
   BinaryCosineBenchmark.cosineDistanceNewNew1024  thrpt5  3.708 ± 
0.034  ops/us
   BinaryDotProductBenchmark.dotProductNew   1024  thrpt5  6.138 ± 
0.021  ops/us
   BinaryDotProductBenchmark.dotProductNewNew1024  thrpt5  6.645 ± 
0.021  ops/us
   ```
   
   and it passes new boundary tests (but no sneakiness with boundary values, 
instead another type of sneakiness).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub


mikemccand commented on PR #12631:
URL: https://github.com/apache/lucene/pull/12631#issuecomment-1752064474

   `luceneutil` results on `wikimediumall` look good -- looks like all noise 
(even for `PKLookup`), or, any signal (change) is very low, making the ~15% 
reduction very much worth it.
   
   ```
   TaskQPS base  StdDevQPS msb_vlong  
StdDevPct diff p-value  
 
  BrowseMonthTaxoFacets8.34  (4.4%)8.24  
(3.6%)   -1.2% (  -8% -7%) 0.344
  
   HighIntervalsOrdered   11.37  (3.9%)   11.24  
(4.5%)   -1.2% (  -9% -7%) 0.382
  
  OrHighLow  394.28  (1.6%)  390.47  
(2.1%)   -1.0% (  -4% -2%) 0.099
  
MedSpanNear   24.41  (3.9%)   24.23  
(3.9%)   -0.8% (  -8% -7%) 0.537
  
MedIntervalsOrdered9.96  (4.0%)9.89  
(4.3%)   -0.7% (  -8% -7%) 0.602
  
LowSpanNear   13.82  (3.2%)   13.73  
(3.2%)   -0.6% (  -6% -5%) 0.526
  
AndHighHigh   97.28  (1.9%)   96.71  
(1.2%)   -0.6% (  -3% -2%) 0.259
  
BrowseRandomLabelSSDVFacets9.22  (4.5%)9.17  
(4.3%)   -0.6% (  -8% -8%) 0.688
  
LowIntervalsOrdered   28.78  (4.0%)   28.64  
(4.1%)   -0.5% (  -8% -7%) 0.720
  
   PKLookup  250.58  (0.9%)  249.44  
(1.0%)   -0.5% (  -2% -1%) 0.118
  
 IntNRQ  315.77  (2.5%)  314.47  
(2.1%)   -0.4% (  -4% -4%) 0.570
  
  BrowseDayOfYearSSDVFacets   12.75  (7.3%)   12.70 
(10.5%)   -0.4% ( -16% -   18%) 0.887   
   
 HighPhrase   55.21  (3.5%)   55.00  
(2.8%)   -0.4% (  -6% -6%) 0.712
  
  MedPhrase   77.25  (2.0%)   77.04  
(1.9%)   -0.3% (  -4% -3%) 0.658
  
   HighSloppyPhrase6.90  (3.9%)6.89  
(3.4%)   -0.2% (  -7% -7%) 0.841
  
  LowPhrase  463.60  (1.4%)  462.90  
(1.1%)   -0.2% (  -2% -2%) 0.701
  
   OrNotHighMed  424.68  (2.6%)  424.06  
(2.5%)   -0.1% (  -5% -5%) 0.857
  
 AndHighLow  680.76  (1.3%)  679.77  
(1.6%)   -0.1% (  -3% -2%) 0.757
  
   HighSpanNear   27.40  (1.5%)   27.37  
(1.3%)   -0.1% (  -2% -2%) 0.787
  
 AndHighMed  200.30  (1.3%)  200.12  
(1.3%)   -0.1% (  -2% -2%) 0.826
  
   OrNotHighLow  622.80  (1.2%)  622.27  
(1.1%)   -0.1% (  -2% -2%) 0.816
  
LowSloppyPhrase7.73  (2.9%)7.73 

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


ChrisHegarty commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752098666

   I get similar bench results, the new impl is faster.
   
   ```
   Benchmark   (size)   Mode  Cnt  Score   
Error   Units
   BinaryDotProductBenchmark.dotProductNew   1024  thrpt5  6.499 ± 
0.299  ops/us
   BinaryDotProductBenchmark.dotProductNewNew1024  thrpt5  7.017 ± 
0.165  ops/us
   BinaryDotProductBenchmark.dotProductOld   1024  thrpt5  3.135 ± 
0.060  ops/us
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


ChrisHegarty commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752099845

   My sense here is that accessing a `part` other than `0` is less performant 
that just reloading the data, which seems a little off. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752100681

   > My sense here is that accessing a `part` other than `0` is less performant 
that just reloading the data, which seems a little off.
   
   It seems to have a heavy cost no matter how i do it, like a bounds check 
that i can't trick it into eliminating?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752101786

   btw, another crazy avenue to possibly explore here another day, since we 
seem bottlenecked on integer multiply. We could try it on arm too. It is faster 
than the current binary code on my machine at least, but not by much so it 
needs tweaking. Need to also look at vector size limits and think about it as 
whole integers can only be exactly represented up to some size. Maybe 
accumulator has to be integer, needs some thought.
   
   ```
   // optimized 256/512 bit implementation, uses FPU
   int upperBound = PREFERRED_BYTE_SPECIES.loopBound(a.length - 
3*PREFERRED_BYTE_SPECIES.length());
   FloatVector acc1 = FloatVector.zero(FloatVector.SPECIES_PREFERRED);
   FloatVector acc2 = FloatVector.zero(FloatVector.SPECIES_PREFERRED);
   FloatVector acc3 = FloatVector.zero(FloatVector.SPECIES_PREFERRED);
   FloatVector acc4 = FloatVector.zero(FloatVector.SPECIES_PREFERRED);
   for (; i < upperBound; i += 4*PREFERRED_BYTE_SPECIES.length()) {
 ByteVector va8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, a, 
i);
 ByteVector vb8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, b, 
i);
 Vector va32 = va8.convertShape(VectorOperators.B2F, 
FloatVector.SPECIES_PREFERRED, 0);
 Vector vb32 = vb8.convertShape(VectorOperators.B2F, 
FloatVector.SPECIES_PREFERRED, 0);
 acc1 = acc1.add(va32.mul(vb32));
   
 ByteVector vc8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, a, i 
+ PREFERRED_BYTE_SPECIES.length());
 ByteVector vd8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, b, i 
+ PREFERRED_BYTE_SPECIES.length());
 Vector vc32 = vc8.convertShape(VectorOperators.B2F, 
FloatVector.SPECIES_PREFERRED, 0);
 Vector vd32 = vd8.convertShape(VectorOperators.B2F, 
FloatVector.SPECIES_PREFERRED, 0);
 acc2 = acc2.add(vc32.mul(vd32));
   
 ByteVector ve8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, a, i 
+ 2*PREFERRED_BYTE_SPECIES.length());
 ByteVector vf8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, b, i 
+ 2*PREFERRED_BYTE_SPECIES.length());
 Vector ve32 = ve8.convertShape(VectorOperators.B2F, 
FloatVector.SPECIES_PREFERRED, 0);
 Vector vf32 = vf8.convertShape(VectorOperators.B2F, 
FloatVector.SPECIES_PREFERRED, 0);
 acc3 = acc3.add(ve32.mul(vf32));
   
 ByteVector vg8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, a, i 
+ 3*PREFERRED_BYTE_SPECIES.length());
 ByteVector vh8 = ByteVector.fromArray(PREFERRED_BYTE_SPECIES, b, i 
+ 3*PREFERRED_BYTE_SPECIES.length());
 Vector vg32 = vg8.convertShape(VectorOperators.B2F, 
FloatVector.SPECIES_PREFERRED, 0);
 Vector vh32 = vh8.convertShape(VectorOperators.B2F, 
FloatVector.SPECIES_PREFERRED, 0);
 acc3 = acc3.add(vg32.mul(vh32));
   }
   // TODO: vector tail opto
   // reduce
   FloatVector res1 = acc1.add(acc2);
   FloatVector res2 = acc3.add(acc4);
   res += (int) res1.add(res2).reduceLanes(VectorOperators.ADD)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Should we have an interface VectorValues which would be implemented by [Byte/Float]VectorValues classes [lucene]

2023-10-08 Thread via GitHub


shubhamvishu opened a new issue, #12635:
URL: https://github.com/apache/lucene/issues/12635

   ### Description
   
   Currently, there is lot of code duplication due to 
[ByteVectorValues](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ByteVectorValues.java)
 and 
[FloatVectorValues](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/FloatVectorValues.java)
 (Example - 
[one](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java#L48-L73),
 
[two](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java#L109-L301)
 etc.) because they are handled separately as one returns byte[] and other 
float[] for `vectorValue` method. The idea is if we should have a common 
interface `VectorValues`? that provides the common functionality and we could 
eventually get rid of this duplication. As a first step maybe we could try to 
move the below methods(common in above 2 classes) into its different interface 
`VectorValues`?
   
   - int dimension()
   - int size()
   - byte[]/float[] vectorValue()


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Add interface VectorValues to be implemented by [Float/Byte]VectorValues [lucene]

2023-10-08 Thread via GitHub


shubhamvishu opened a new pull request, #12636:
URL: https://github.com/apache/lucene/pull/12636

   ### Description
   
   The classes 
[ByteVectorValues](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ByteVectorValues.java)
 and 
[FloatVectorValues](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/FloatVectorValues.java)
 share the common functionalities/methods listed below. This PR moves those 
methods to a new interface `VectorValues`.
   
   - int dimension()
   - int size()
   - byte[]/float[] vectorValue()
   
   Since the name is clashing, created two new names `vectorByteValue` and 
`vectorFloatValue`. Also retained the `vectorValue` method whose returned value 
needs to be casted appropriately. 
   
   There could be some other ideas as well like 
   1. Making `DISI` implementing an interface which `VectorValues`(in this PR) 
would extend making further room to remove duplication (or) 
   2. Making a wrapper over `ByteVectorValues` and `FloatVectorValues` etc?. 
   
   Looking for more ideas or suggestions on how to better handle this and avoid 
duplicating code.
   
   Addresses #12635
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub


rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752107370

   The other thought I had around conversion costs would be to look into 
reinterpret+shuffle/shift/mask crap ourselves, which seems really crazy but i'm 
running low on ideas.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] LUCENE-10241: Updating OpenNLP to 1.9.4. [lucene]

2023-10-08 Thread via GitHub


epugh commented on PR #448:
URL: https://github.com/apache/lucene/pull/448#issuecomment-1752112078

   It would be nice if this was updated to the awesome new OpenNLP 2.x line!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Enable rank-unsafe optimization of top-k hit computations by quantizing scores. [lucene]

2023-10-08 Thread via GitHub


jpountz commented on PR #12628:
URL: https://github.com/apache/lucene/pull/12628#issuecomment-1752152301

   I'll try to give a bit more context how I ended up here. With recent work on 
vector search and excitement around it, I can't prevent myself from thinking 
that all users who are happy to trade some effectiveness for better efficiency 
for their vector searches should be happy to do it with their term-based 
searches as well.
   
   There is some literature on rank-unsafe evaluation of top-hits, the main one 
I'm familiar with is the original [WAND 
paper](https://www.researchgate.net/profile/David-Carmel-3/publication/221613425_Efficient_query_evaluation_using_a_two-level_retrieval_process/links/02bfe50e685450015300/Efficient-query-evaluation-using-a-two-level-retrieval-process.pdf)
 where authors suggest to configure a threshold factor F and run the WAND 
operator so that it only considers matches whose sum of maximum scores across 
query terms is at least F times the score of the k-th score in the priority 
queue. When F is equal to 1 or less, the returned top hits are the same as the 
ones produced by exhaustive evaluation. Setting F greater than 1 allows trading 
effectiveness for efficiency. Authors share some experiments that show how F=2 
affects P@10 and MAP, as well as the number of evaluated hits.
   
   I'd like to expose this kind of trade-off in Lucene. One problem I have with 
the above approach is that it is very WAND-specific. A similar idea that 
wouldn't be WAND-specific would be to configure the minimum score 
(`Scorer#setMinimumCompetitiveScore`) as the k-th score in the heap times some 
factor. This feels a bit too low-level for something to be exposed as an option 
on `IndexSearcher`, so I ended up trying this idea of quantizing scores, which 
is again very similar in terms of why it helps, and that I find easier to 
explain: it computes accurate top hits on quantized scores. To get a bit of 
time thinking of how to expose it on `IndexSearcher`, I'm starting with this 
collector wrapper which doesn't require making changes to lucene-core and would 
help put it in the hands of users who would be interested in trying it out.
   
   For reference, https://github.com/apache/lucene/pull/12446 is another PR 
where I'm exploring exposing rank-unsafe optimizations in Lucene.
   
   > Would the typical usage of this scorer be as an initial collection of k 
docs, which would then get re-scored over their full floating point scores?
   
   If someone wanted to do that, they could do it in a single pass by using a 
similar collector wrapper as the one I'm adding, and only changing scores 
passed to `Scorer#setMinimumCompetitiveScore`, not the ones returned by 
`Scorer#score`. I think the idea is rather for it to be used as a drop-in 
replacement for traditional top-k collection, just trading a bit of 
effectiveness for better efficiency. It's probably especially acceptable with 
high `k` values, when hits are expected to be fed to a reranker in a later 
stage.
   
   > Are there any numbers around how this effects recall or does the Tantivy 
benchmark account for actually getting the correct document?
   
   No, there are no measures of the effect on recall or other measures of 
effectiveness.
   
   > Very cool, surprisingly impactful!
   
   For reference, I noticed that it especially helps in the following cases:
- Stop words mixed with high-scoring clauses: in that case the quantization 
helps the score contribution of stop words become a rounding error more than an 
actual score contribution, which in-turn helps the pruning logic consider 
almost only the high-scoring clauses. A good such example from the benchmark is 
`the progressive`. The query `the incredibles` is an extreme case which is due 
to the fact that `incredibles` has an extremely low term frequency, so the 
query looks like a query on `the` in practice where a few hits get boosted 
because they contain `incredibles`.
- Clauses with similar maximum scores. These queries are complicated 
because it's rare that the same document yields the maximum score in the block 
for both clauses. So actual scores produced by such queries are often quite 
behind the sum of maximum scores of each clause. They can run like conjunctions 
in practice, but they can rarely skip blocks across all clauses. Quantization 
makes this more likely. A good example from the benchmark is `american south`.
- Conjunctions. Conjunctions suffer from the same problem as above, it's 
rare that all clauses yield their maximum score on the same document. This 
makes skipping whole blocks rare in practice. Quantization makes it more likely 
by creating more ties.
   
   > This is the Tantivy benchmark tooling, but you are comparing Lucene (main) 
to Lucene (your PR) right?
   
   Actually since this only introduces a single class, I copied this class to 
the source folder of the `engine` querying logic. So the baseline is Lucene 
9.8.0 and t

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-08 Thread via GitHub


mikemccand commented on PR #12633:
URL: https://github.com/apache/lucene/pull/12633#issuecomment-1752165322

   For comparison, this is how the curve (RAM required during construction vs 
final FST size) looks on trunk, using the god-like parameters as best I could.  
I sorted the results in reverse `FST (mb)` order but not that this results in a 
confusing mix of the first three columns (the god-like parameters).
   
   I'll try to turn these two tables into a single chart comparing RAM used and 
final FST size and maybe build time:
   
   |Share suffix?|Share non-singleton?|Max tail length|FST (mb)|RAM (mb)|Build 
time (sec)| 

   
|-||||||


   |True|True|1|584.3|0.0|30.5| 

   
   |True|False|1|584.3|0.0|30.5|

   
   |True|True|0|577.4|0.0|30.1| 

   
   |True|False|0|577.4|0.0|30.2|

   
   |False|True|0|577.4|0.0|29.9|

   
   |False|False|0|577.4|0.0|29.8|   

   
   |True|False|2|576.4|0.0|31.4|

   
   |True|True|2|569.9|14.5|32.4|

   
   |True|False|3|523.7|3.6|32.2|

   
   |True|True|3|509.7|29.0|34.1|

   
   |True|False|4|486.2|7.0|32.9|

   
   |True|True|4|468.9|58.0|36.7|

   
   |True|False|5|456.1|14.0|34.1|   

   
   |True|True|5|437.6|56.0|37.8|

   
   |True|False|6|435.4|28.0|35.3|   

   
   |True|False|7|419.8|58.0|36.6|   

   
   |True|True|6|416.1|116.0|41.2|   

   
   |True|False|8|408.1|56.0|37.6|   

   
   |True|True|7|399.8|116.0|41.7|   

   
   |True|False|9|398.8|116.0|39.9|

Re: [PR] Add interface VectorValues to be implemented by [Float/Byte]VectorValues [lucene]

2023-10-08 Thread via GitHub


benwtrent commented on PR #12636:
URL: https://github.com/apache/lucene/pull/12636#issuecomment-1752194821

   It was sort of this way before but we decided to switch it as a common 
interface required either:
   
   - having to use generics
   - an API where things weren't fully implemented or required casting.
   
   If we are worried about internal duplication there may be better options 
around using a better abstraction for scoring (see RandomVectorScorer) and 
writing to a flat file (on flush and merges). Those are the only two places we 
care about reading vectors during indexing and search.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] segmentInfos.replace() doesn't set userData [lucene]

2023-10-08 Thread via GitHub


Shibi-bala opened a new issue, #12637:
URL: https://github.com/apache/lucene/issues/12637

   ### Description
   
   Found that the [replace 
method](https://github.com/qcri/solr-6/blob/master/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L875-L878)
 doesn't set `userData` with the new user data from `other`. Unsure if this is 
an oversight, but if it is, I have a PR up [here
   ](https://github.com/apache/lucene/pull/12626).
   
   Existing:
   ```
 void replace(SegmentInfos other) {
   rollbackSegmentInfos(other.asList());
   lastGeneration = other.lastGeneration;
 }
   ```
   
   Fix:
   ```
 void replace(SegmentInfos other) {
   rollbackSegmentInfos(other.asList());
   lastGeneration = other.lastGeneration;
   userData = other.userData;
 }
   ```
   
   ### Version and environment details
   
   Lucene version 9.7


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Avoid NPEx if the end of the stream has been reached without reading any characters [lucene]

2023-10-08 Thread via GitHub


pzygielo commented on PR #12611:
URL: https://github.com/apache/lucene/pull/12611#issuecomment-1752377046

   Thanks for checking.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Explicitly return needStats flag in TermStates [lucene]

2023-10-08 Thread via GitHub


yugushihuang opened a new pull request, #12638:
URL: https://github.com/apache/lucene/pull/12638

   ### Description
   
   A simple API in TermStates to expose the `needStats` flag. 
   
   Addresses #12617 #
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Avoid NPEx if the end of the stream has been reached without reading any characters [lucene]

2023-10-08 Thread via GitHub


dweiss merged PR #12611:
URL: https://github.com/apache/lucene/pull/12611


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Avoid NPEx if the end of the stream has been reached without reading any characters [lucene]

2023-10-08 Thread via GitHub


dweiss commented on PR #12611:
URL: https://github.com/apache/lucene/pull/12611#issuecomment-1752397871

   I've applied this to main and branch_9x (9.9). Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Explicitly return needStats flag in TermStates [lucene]

2023-10-08 Thread via GitHub


jpountz commented on PR #12638:
URL: https://github.com/apache/lucene/pull/12638#issuecomment-1752414836

   Can you explain how/when you plan to use this new API?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-08 Thread via GitHub


dweiss commented on PR #12633:
URL: https://github.com/apache/lucene/pull/12633#issuecomment-1752416032

   I didn't get into all the details but I think this looks good. Your 
questions are indeed intriguing - I can't provide any explanation off the top 
of my head, really.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org