[PR] Fix test failures for TestCoreParser#testSpanNearQueryWithoutSlopXML [lucene]

2023-10-27 Thread via GitHub


vigyasharma opened a new pull request, #12724:
URL: https://github.com/apache/lucene/pull/12724

   Addresses #12708 
   
   `xml.TestCoreParser#testSpanNearQueryWithoutSlopXML` fails because of a 
changed exception message in Java 22 EA. This change removes the test's 
dependency on Java exception message strings.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] xml.TestCoreParser#testSpanNearQueryWithoutSlopXML fails because of changed exception message [lucene]

2023-10-27 Thread via GitHub


vigyasharma commented on issue #12708:
URL: https://github.com/apache/lucene/issues/12708#issuecomment-1782426261

   Made a small change to assert on the exception type instead of checking the 
exception message string. 
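The pattern can be illustrated with a minimal standalone sketch (plain Java, not the actual Lucene test code): assert on the exception type, never on the message text, since JDKs are free to reword messages between releases.

```java
// Illustrative sketch: asserting on the exception type keeps a test
// stable across JDK versions that reword exception messages.
public class ExceptionTypeAssertDemo {
    static int parseSlop(String s) {
        return Integer.parseInt(s); // throws NumberFormatException on bad input
    }

    public static void main(String[] args) {
        Throwable thrown = null;
        try {
            parseSlop("not-a-number");
        } catch (Throwable t) {
            thrown = t;
        }
        // Brittle: comparing thrown.getMessage() against a literal string.
        // Robust: check only the type.
        if (!(thrown instanceof NumberFormatException)) {
            throw new AssertionError("expected NumberFormatException, got " + thrown);
        }
        System.out.println("ok");
    }
}
```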





Re: [I] `FSTCompiler.Builder` should have an option to stream the FST bytes directly to Directory [lucene]

2023-10-27 Thread via GitHub


dungba88 commented on issue #12543:
URL: https://github.com/apache/lucene/issues/12543#issuecomment-1782469977

   I pushed a new revision with support for DataOutput and FileChannel.
   
   When using DataOutput, if suffix sharing is enabled one also needs to pass a 
RandomAccessInput for reading; otherwise it can be left null. So one can pass an 
IndexOutput, and the RandomAccessInput can be created from an IndexInput.
   
   When using FileChannel, one only needs to pass the FileChannel, as it 
already allows both reading and writing at the same time. This FileChannel 
implementation is just a demonstration of feasibility.
   
   Some things I'd like to discuss:
   - Should we write the rootNode + numBytes at the end of the FST instead of 
the front? We only have them after constructing the FST, and we can't prepend to 
a DataOutput (that would be costly). Otherwise we would need to save the metadata 
separately from the main body. That's why I added a new method `saveMetadata()`.
   - Should we move to a value-based LRU cache? It has pros and cons:
 - Pros: it makes NodeHash completely independent of the FST. It would allow 
suffix sharing without the need for a RandomAccessInput, and thus without the 
IndexOutput & IndexInput having to be open at the same time. Also, accessing 
RAM is much faster than accessing disk.
 - Cons: more RAM is required than with the address-based cache. For a truly 
minimal FST it would require the same RAM (or more) as the entire FST.
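The API shape under discussion can be sketched in standalone Java (interface and method names here are illustrative stand-ins, not the actual Lucene `DataOutput`/`RandomAccessInput` signatures): the writer streams bytes to an append-only sink, and needs a random-access reader only when suffix sharing is enabled.

```java
// Hypothetical sketch of a streaming FST writer: the random-access
// reader is optional and only required when suffix sharing (NodeHash)
// is enabled. Names are illustrative, not real Lucene APIs.
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class StreamingFstSketch {
    // Stand-ins for the DataOutput / RandomAccessInput roles.
    interface AppendOnlySink { void writeByte(byte b) throws IOException; }
    interface RandomReader { byte readByte(long pos) throws IOException; }

    private final AppendOnlySink out;
    private final RandomReader reader; // null => suffix sharing disabled

    StreamingFstSketch(AppendOnlySink out, RandomReader reader) {
        this.out = out;
        this.reader = reader;
    }

    boolean suffixSharingEnabled() {
        return reader != null;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Write-only usage: no reader passed, so suffix sharing is off.
        StreamingFstSketch writer = new StreamingFstSketch(bytes::write, null);
        if (writer.suffixSharingEnabled()) throw new AssertionError();
        writer.out.writeByte((byte) 42);
        System.out.println("wrote " + bytes.size() + " byte(s)");
    }
}
```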





Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-27 Thread via GitHub


bruno-roustant commented on PR #12716:
URL: https://github.com/apache/lucene/pull/12716#issuecomment-1782623703

   Thank you @shubhamvishu for these experiments. The table answers the 
questions exactly.
   It means there is actually no point in changing the bit mixing, since it 
does not bring any improvement.
   
   Setting the rehash threshold to 1/2 would mean the hash structure occupies 
more memory on average, since it would allocate an array and use it at between 
25% and 50% occupancy. With the 2/3 threshold, it runs between 33% and 66% 
occupancy. So staying at 2/3 is better.
   
   Sometimes an appealing idea turns out to be a false good idea. And it's great 
to have these experiments to avoid the pitfall! 
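The occupancy argument can be checked numerically: with rehash threshold t and array doubling on rehash, the load factor oscillates between t/2 (just after doubling) and t (just before the next rehash).

```java
// Numeric sketch of the rehash-threshold occupancy argument: for each
// threshold t, the table's load factor stays in [t/2, t].
public class RehashOccupancy {
    public static void main(String[] args) {
        double[] thresholds = {1.0 / 2.0, 2.0 / 3.0};
        for (double t : thresholds) {
            double low = t / 2, high = t, avg = (low + high) / 2;
            System.out.printf("threshold %.3f -> occupancy %.1f%% .. %.1f%% (avg %.1f%%)%n",
                    t, 100 * low, 100 * high, 100 * avg);
        }
    }
}
```

This reproduces the ranges in the comment: 25%..50% (average 37.5%) for a 1/2 threshold versus 33.3%..66.7% (average 50%) for 2/3.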





Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-10-27 Thread via GitHub


shubhamvishu commented on PR #12716:
URL: https://github.com/apache/lucene/pull/12716#issuecomment-1782810001

   @bruno-roustant Absolutely! I totally agree with your point. 





Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

2023-10-27 Thread via GitHub


jpountz commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1782814872

   FWIW I could reproduce the speedup from disabling patching locally on 
wikibigall:
   
   ```
                  Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
      HighSloppyPhrase          1.58  (4.9%)                     1.55  (9.5%)     -1.9% ( -15% -   13%)  0.441
             CountTerm      13139.34  (3.0%)                 12937.65  (3.1%)     -1.5% (  -7% -    4%)  0.111
     HighTermMonthSort       4393.07  (1.5%)                  4371.13  (2.1%)     -0.5% (  -4% -    3%)  0.396
               Prefix3        270.78  (3.4%)                   271.17  (3.2%)      0.1% (  -6% -    7%)  0.889
                Fuzzy1        100.16  (1.1%)                   100.42  (0.9%)      0.3% (  -1% -    2%)  0.420
                Fuzzy2         70.62  (1.1%)                    70.84  (1.0%)      0.3% (  -1% -    2%)  0.340
            HighPhrase         16.41  (3.8%)                    16.49  (5.3%)      0.5% (  -8% -    9%)  0.738
               Respell         53.35  (1.8%)                    53.62  (1.6%)      0.5% (  -2% -    4%)  0.351
              HighTerm        418.58  (9.3%)                   421.91  (7.7%)      0.8% ( -14% -   19%)  0.770
       LowSloppyPhrase          9.87  (2.5%)                     9.96  (5.9%)      0.9% (  -7% -    9%)  0.544
              Wildcard         94.17  (2.8%)                    95.04  (3.3%)      0.9% (  -5% -    7%)  0.341
               MedTerm        553.18  (8.3%)                   559.16  (6.9%)      1.1% ( -13% -   17%)  0.656
               LowTerm        784.38  (7.0%)                   793.23  (5.7%)      1.1% ( -10% -   14%)  0.575
              PKLookup        264.32  (2.8%)                   267.44  (2.2%)      1.2% (  -3% -    6%)  0.138
          HighSpanNear          5.37  (3.2%)                     5.44  (3.3%)      1.3% (  -5% -    8%)  0.213
             OrHighLow        590.78  (3.0%)                   598.60  (2.6%)      1.3% (  -4% -    7%)  0.132
             LowPhrase         27.97  (3.7%)                    28.42  (4.9%)      1.6% (  -6% -   10%)  0.245
           LowSpanNear         11.09  (2.1%)                    11.30  (2.2%)      1.8% (  -2% -    6%)  0.007
           MedSpanNear          7.39  (3.3%)                     7.53  (3.5%)      1.9% (  -4% -    8%)  0.079
            AndHighLow        819.00  (2.9%)                   835.98  (2.4%)      2.1% (  -3% -    7%)  0.015
             MedPhrase         91.27  (3.4%)                    93.33  (4.6%)      2.3% (  -5% -   10%)  0.078
 HighTermDayOfYearSort        447.47  (1.8%)                   457.76  (1.8%)      2.3% (  -1% -    5%)  0.000
       MedSloppyPhrase         16.79  (2.4%)                    17.25  (4.6%)      2.7% (  -4% -    9%)  0.017
             OrHighMed        157.99  (2.2%)                   162.38  (2.3%)      2.8% (  -1% -    7%)  0.000
            OrHighHigh         67.77  (1.7%)                    69.71  (1.9%)      2.9% (   0% -    6%)  0.000
           AndHighHigh         48.92  (1.8%)                    50.67  (2.1%)      3.6% (   0% -    7%)  0.000
            AndHighMed        174.71  (2.3%)                   181.03  (2.6%)      3.6% (  -1% -    8%)  0.000
      CountAndHighHigh         38.11  (4.4%)                    39.65  (5.1%)      4.1% (  -5% -   14%)  0.007
       CountOrHighHigh         56.07 (16.2%)                    58.48 (18.5%)      4.3% ( -26% -   46%)  0.435
        CountOrHighMed         86.95 (16.0%)                    90.94 (18.3%)      4.6% ( -25% -   46%)  0.398
       CountAndHighMed        116.35  (3.4%)                   121.78  (4.8%)      4.7% (  -3% -   13%)  0.000
           CountPhrase          3.26 (11.2%)                     3.43 (13.1%)      5.1% ( -17% -   33%)  0.187
                IntNRQ        146.09 (35.1%)                   171.55 (36.7%)     17.4% ( -40% -  137%)  0.124
   ```
   
   `.doc` files were 12% larger overall (2.64GB to 2.96GB), `.pos` files were 
11% larger (11.03GB to 12.24GB), and the index was 9.7% larger (15.66GB to 
17.18GB).





Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]

2023-10-27 Thread via GitHub


shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1782977506

   @gsmiller I finally got a chance to run the benchmarks for this change. 
Below are the results (all looks good to me). Let me know what you think. 
Thanks!
   
   ```
                        Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
                  AndHighMed         96.52  (6.5%)                    95.49  (6.3%)     -1.1% ( -12% -   12%)  0.597
                 LowSpanNear        133.61  (4.7%)                   132.45  (4.0%)     -0.9% (  -9% -    8%)  0.528
                   OrHighMed         67.95  (3.2%)                    67.37  (4.6%)     -0.9% (  -8% -    7%)  0.492
                 MedSpanNear          5.79  (5.2%)                     5.74  (4.2%)     -0.8% (  -9% -    8%)  0.573
                OrNotHighLow        322.30  (2.3%)                   319.70  (2.0%)     -0.8% (  -4% -    3%)  0.233
        BrowseDateSSDVFacets          0.90  (8.2%)                     0.89  (9.4%)     -0.8% ( -17% -   18%)  0.773
                  AndHighLow        355.80  (4.2%)                   352.99  (4.1%)     -0.8% (  -8% -    7%)  0.550
        MedTermDayTaxoFacets          5.85  (6.2%)                     5.81  (5.7%)     -0.7% ( -11% -   11%)  0.713
                OrNotHighMed        222.70  (3.6%)                   221.51  (4.0%)     -0.5% (  -7% -    7%)  0.657
                OrHighNotMed        194.41  (3.6%)                   193.40  (3.4%)     -0.5% (  -7% -    6%)  0.641
                   MedPhrase         44.71  (3.1%)                    44.50  (3.4%)     -0.5% (  -6% -    6%)  0.639
               OrNotHighHigh        179.60  (3.7%)                   178.85  (3.8%)     -0.4% (  -7% -    7%)  0.726
                HighSpanNear          8.33  (4.6%)                     8.30  (4.1%)     -0.4% (  -8% -    8%)  0.765
                OrHighNotLow        331.53  (3.8%)                   330.15  (3.7%)     -0.4% (  -7% -    7%)  0.728
               OrHighNotHigh        248.18  (3.5%)                   247.20  (3.4%)     -0.4% (  -7% -    6%)  0.718
         MedIntervalsOrdered          4.66  (4.6%)                     4.64  (4.3%)     -0.4% (  -8% -    8%)  0.780
      OrHighMedDayTaxoFacets          4.39  (3.9%)                     4.38  (4.2%)     -0.3% (  -8% -    8%)  0.820
                   OrHighLow        569.67  (2.6%)                   568.04  (2.6%)     -0.3% (  -5% -    5%)  0.733
         LowIntervalsOrdered          2.64  (4.6%)                     2.64  (4.6%)     -0.2% (  -8% -    9%)  0.899
        HighTermTitleBDVSort          5.20  (4.0%)                     5.19  (4.1%)     -0.2% (  -7% -    8%)  0.895
                  OrHighHigh         35.47  (4.4%)                    35.42  (5.8%)     -0.1% (  -9% -   10%)  0.929
   BrowseDayOfYearTaxoFacets          3.80 (15.9%)                     3.80 (16.0%)     -0.1% ( -27% -   37%)  0.984
        BrowseDateTaxoFacets          3.78 (15.5%)                     3.78 (15.5%)     -0.0% ( -26% -   36%)  0.995
                 AndHighHigh         16.35  (3.6%)                    16.35  (7.5%)      0.0% ( -10% -   11%)  0.996
     AndHighMedDayTaxoFacets         49.40  (1.6%)                    49.41  (1.7%)      0.0% (  -3% -    3%)  0.975
                   LowPhrase        254.05  (3.6%)                   254.16  (4.0%)      0.0% (  -7% -    7%)  0.972
                  HighPhrase         40.53  (3.1%)                    40.56  (3.6%)      0.1% (  -6% -    6%)  0.939
                    PKLookup        131.32  (2.3%)                   131.44  (2.0%)      0.1% (  -4% -    4%)  0.891
 BrowseRandomLabelTaxoFacets          3.28 (15.0%)                     3.28 (14.8%)      0.1% ( -25% -   35%)  0.983
       BrowseMonthTaxoFacets          4.13 (34.7%)                     4.13 (34.7%)      0.1% ( -51% -  106%)  0.991
                      Fuzzy1         65.72  (1.1%)                    65.81  (1.2%)      0.1% (  -2% -    2%)  0.714
                     LowTerm        513.23  (2.9%)                   513.99  (2.9%)      0.1% (  -5% -    6%)  0.870
        HighIntervalsOrdered          1.62  (5.3%)                     1.62  (5.0%)      0.2% (  -9% -   11%)  0.909
   BrowseDayOfYearSSDVFacets          3.75 (12.4%)                     3.76  (8.6%)      0.2% ( -18% -   24%)  0.956
                     MedTerm        382.67  (3.7%)                   383.49  (3.9%)      0.2% (  -7% -    8%)  0.857
             LowSloppyPhrase         12.31  (2.4%)                    12.34  (2.1%)      0.2% (  -4% -    4%)  0.754
                     Respell         36.92  (1.6%)                    37.02  (1.6%)      0.3% (  -2% -    3%)  0.617
             MedSloppyPhrase         22.87  (3.2%)                    22.94  (3.0%)      0.3% (  -5% -    6%)  0.751
                    HighTerm        251.80  (4.3%)                   252.92  (4.6%)      0.4% (  -8% -    9%)  0.752
    AndHighHighDayTaxoFacets          7.39  (2.7%)                     7.43  (2.6%)      0.5% (  -4% -    5%)  0.57
   ```

Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-27 Thread via GitHub


msokolov commented on PR #12660:
URL: https://github.com/apache/lucene/pull/12660#issuecomment-1782983831

   As for the renaming, maybe we can do it in a follow-up PR? 
IHnswGraphSearcher -> HnswGraphSearcher? Or maybe we can just use HnswSearcher 
right now?
   
   Re: controlling concurrency. Maybe eventually we figure out how to do this 
as part of ConcurrentMergeScheduler? Since it is all about merging, that seems 
like a logical place. It currently allows setting `maxMergeCount` and 
`maxThreadCount` independently, but enforces `maxThreadCount <= maxMergeCount`. 
We could potentially fiddle with this; maybe it would be as simple as relaxing 
that constraint and using "spare threads" to do concurrent graph merging.





Re: [PR] Optimize: Use DocIdSetIterator Reduce bkd docvalues iteration [lucene]

2023-10-27 Thread via GitHub


msokolov commented on code in PR #12723:
URL: https://github.com/apache/lucene/pull/12723#discussion_r1374646933


##
lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java:
##
@@ -216,7 +216,7 @@ private static class BKDPointTree implements PointTree {
 scratchMinIndexPackedValue,
 scratchMaxIndexPackedValue;
 private final int[] commonPrefixLengths;
-private final BKDReaderDocIDSetIterator scratchIterator;
+private final BKDReaderDocIDSet scratch;

Review Comment:
   Please don't rename this - it makes it more difficult to read the PR and 
find your significant contribution amidst what looks like noise. It's confusing 
that in some places you renamed it to `scratch` and in other places to 
`iterator` -- is there some difference?






Re: [I] Use max BPV encoding in postings if doc buffer size less than ForUtil.BLOCK_SIZE [lucene]

2023-10-27 Thread via GitHub


jpountz commented on issue #12717:
URL: https://github.com/apache/lucene/issues/12717#issuecomment-1783020278

   Does this actually matter for performance? My gut feeling is that either a 
value has a long postings list, and then the vast majority of blocks will be 
encoded with PFOR and should be fast, or it has a short postings list and then 
queries should be fast anyway?





Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]

2023-10-27 Thread via GitHub


ChrisHegarty commented on PR #12703:
URL: https://github.com/apache/lucene/pull/12703#issuecomment-1783135109

   I've not been able to spend all that much time on this this week, but here's 
my current thinking.
   
   The abstractions in the PR are currently not great (as discussed above), but 
putting that aside for now, we can get a sense of the potential real 
performance impact from this approach as it is - so I ran some performance 
experiments beyond the micro JMH benchmarks.
   
   It seems that this change improves the merge performance of vector data in 
segments by about 10% - not great; I was hoping for better. Is it worth 
proceeding with, or should we look elsewhere? I'm not sure. Here's how I 
determined the 10%: by hacking on KnnGraphTester from luceneutil so that 
it creates an index with more than one segment when ingesting 100k+ vectors 
of dimension 768, then timing the forceMerge. This is a very rough 
experiment, but it shows that the potential gain is much less than I expected. 
Caution - I could have goofed up several things, from the actual implementation 
to the merge experiment.





Re: [I] Use max BPV encoding in postings if doc buffer size less than ForUtil.BLOCK_SIZE [lucene]

2023-10-27 Thread via GitHub


easyice commented on issue #12717:
URL: https://github.com/apache/lucene/issues/12717#issuecomment-1783202938

   @jpountz Thanks for your explanation. I have some flame graphs showing that 
`readVIntBlock` takes up a fairly large proportion of the time. I'll try to 
reproduce it with some mocked data.





Re: [PR] Optimize: Use DocIdSetIterator Reduce bkd docvalues iteration [lucene]

2023-10-27 Thread via GitHub


luyuncheng commented on PR #12723:
URL: https://github.com/apache/lucene/pull/12723#issuecomment-1783206971

   > this seems promising. I guess the only cost is the increased memory needed 
because we create the FixedBitSet? Can you say how large this might get in the 
worst case?
   
   @msokolov 
   In 
https://github.com/apache/lucene/blob/c701a5d9be4d8dc677ea994f8ef5fdd8945760d6/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java#L66-L74
   
   AND 
   
https://github.com/apache/lucene/blob/c701a5d9be4d8dc677ea994f8ef5fdd8945760d6/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java#L131-L140
   
   So I think for the `BITSET_IDS` type with `DocBaseBitSetIterator`, the max 
range is about 8192 doc IDs (16 * 512), for which the FixedBitSet needs 128 
longs of memory; the BKDReaderDocIDSetIterator contains 512 ints.
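The arithmetic behind those numbers can be checked directly: 8192 bits fit in 8192 / 64 = 128 longs (1024 bytes), while the 512-int scratch array takes 2048 bytes.

```java
// Back-of-the-envelope check of the memory figures quoted above for the
// BKD scratch structures.
public class BkdScratchMemory {
    public static void main(String[] args) {
        int maxDocRange = 16 * 512;                 // ~8192 doc IDs per range
        int bitsetLongs = maxDocRange / Long.SIZE;  // 64 bits per long
        int bitsetBytes = bitsetLongs * Long.BYTES;
        int intArrayBytes = 512 * Integer.BYTES;    // the 512-int scratch array
        System.out.println("FixedBitSet: " + bitsetLongs + " longs (" + bitsetBytes + " bytes)");
        System.out.println("int[] scratch: " + intArrayBytes + " bytes");
    }
}
```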





Re: [PR] Optimize: Use DocIdSetIterator Reduce bkd docvalues iteration [lucene]

2023-10-27 Thread via GitHub


luyuncheng commented on code in PR #12723:
URL: https://github.com/apache/lucene/pull/12723#discussion_r1374824046


##
lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java:
##
@@ -216,7 +216,7 @@ private static class BKDPointTree implements PointTree {
 scratchMinIndexPackedValue,
 scratchMaxIndexPackedValue;
 private final int[] commonPrefixLengths;
-private final BKDReaderDocIDSetIterator scratchIterator;
+private final BKDReaderDocIDSet scratch;

Review Comment:
   Prior:
   `BKDReaderDocIDSetIterator` used an int array to implement DocIdSetIterator.
   
   Now:
   I want to use `DocBaseBitSetIterator` to reduce iteration and optimize 
`visit(DocIdSetIterator iterator)`. 
   
   Also, for the `DELTA_BPV_16` type, it uses the int array to iterate without 
reallocating the int array.
   
   So I renamed BKDReaderDocIDSetIterator to BKDReaderDocIDSet, since it is no 
longer used as an iterator.







Re: [PR] Fix test failures for TestCoreParser#testSpanNearQueryWithoutSlopXML [lucene]

2023-10-27 Thread via GitHub


uschindler commented on PR #12724:
URL: https://github.com/apache/lucene/pull/12724#issuecomment-1783287980

   Should I merge and backport it or do you want to do it?





[PR] Return the same input vector if its a unit vector in VectorUtil#l2normalize [lucene]

2023-10-27 Thread via GitHub


shubhamvishu opened a new pull request, #12726:
URL: https://github.com/apache/lucene/pull/12726

   ### Description
   
   While going through the 
[VectorUtil](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/VectorUtil.java)
 class, I noticed we don't have a unit-vector check in 
`VectorUtil#l2normalize`, so passing a unit vector goes through the whole L2 
normalization (which is not required; it should exit early). I confirmed this 
by trying the silly example 
`VectorUtil.l2normalize(VectorUtil.l2normalize(nonUnitVector))`, which 
performed every calculation twice. One could argue that users should not call 
this on a unit vector, but I believe there are cases where users simply want 
to perform L2 normalization without checking the vector first, or where there 
are some overflowing values.
   
   TL;DR: We should exit early in `VectorUtil#l2normalize`, returning the same 
input vector if it is already a unit vector.
   
   This is easily achievable if we introduce a light check on whether the 
squared sum of the input vector equals 1.0, or rather 
`Math.abs(l1norm - 1.0d) <= 1e-5` (as in this PR), because the self dot 
product of a unit vector (`v x v`) is not exactly 1.0 but something like 
`0.999403953552`. With a delta of `1e-5` we treat a vector v whose `v x v` is 
within that delta of 1.0 as a unit vector, i.e. already L2-normalized, which 
seems fine since the delta is really small and the check is cheap.
   
   I'm not sure whether a similar check existed before (I tried to find one in 
the history but didn't), so I'm looking forward to thoughts on whether this 
makes sense. Thanks!
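A minimal standalone sketch of the proposed early exit (plain Java; the method shape and `DELTA` constant are illustrative, not the actual `VectorUtil` code):

```java
// Sketch of an early-exit l2normalize: skip normalization when the
// squared L2 norm is already within a small delta of 1.0.
public class L2NormalizeSketch {
    static final double DELTA = 1e-5; // tolerance from the PR description

    static float[] l2normalize(float[] v) {
        double squareSum = 0;
        for (float x : v) squareSum += (double) x * x;
        if (Math.abs(squareSum - 1.0d) <= DELTA) {
            return v; // already (approximately) a unit vector: return as-is
        }
        double norm = Math.sqrt(squareSum);
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) out[i] = (float) (v[i] / norm);
        return out;
    }

    public static void main(String[] args) {
        float[] v = {3f, 4f};
        float[] u = l2normalize(v);  // full normalization: {0.6, 0.8}
        float[] u2 = l2normalize(u); // early exit: the same array comes back
        if (u2 != u) throw new AssertionError("expected early-exit to return input");
        System.out.println(u[0] + " " + u[1]);
    }
}
```

Normalizing an already-normalized vector then returns the input array unchanged instead of redoing every calculation.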
   
   
   
   





Re: [PR] Return the same input vector if its a unit vector in VectorUtil#l2normalize [lucene]

2023-10-27 Thread via GitHub


benwtrent commented on PR #12726:
URL: https://github.com/apache/lucene/pull/12726#issuecomment-1783318072

   @shubhamvishu could you add a "CHANGES.txt" entry under optimizations?





[PR] Ensure negative scores are not returned by vector similarity functions [lucene]

2023-10-27 Thread via GitHub


benwtrent opened a new pull request, #12727:
URL: https://github.com/apache/lucene/pull/12727

   We shouldn't ever return negative scores from vector similarity functions. 
With Panama vectorization and nearly antipodal float[] vectors, it is possible 
that cosine and (normalized) dot-product become slightly negative due to 
compounding floating point errors.
   
   Since we don't want to make the Panama vector code incredibly slow, we stick 
to float32 operations for now, and just snap the score to `0` if it is negative 
after our correction.
   
   closes: https://github.com/apache/lucene/issues/12700
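The idea can be illustrated with a standalone sketch (Lucene maps cosine into a score via `(1 + cos) / 2`; the clamp below shows the concept of this PR, not its exact code):

```java
// Sketch: snap a similarity-derived score to zero when floating-point
// error drives it slightly negative for nearly antipodal vectors.
public class ScoreClampDemo {
    // Cosine-to-score mapping for normalized vectors: (1 + cos) / 2.
    static float scaledCosine(float cos) {
        float score = (1f + cos) / 2f;
        return Math.max(score, 0f); // snap tiny negative results to 0
    }

    public static void main(String[] args) {
        // Compounded error can push cosine a hair below -1, which would
        // make the unclamped score slightly negative.
        float cos = -1.0000001f;
        float s = scaledCosine(cos);
        if (s < 0f) throw new AssertionError("score must be non-negative");
        System.out.println(s);
    }
}
```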





[PR] Add back maxConn & beamWidth HNSW codec ctor [lucene]

2023-10-27 Thread via GitHub


benwtrent opened a new pull request, #12728:
URL: https://github.com/apache/lucene/pull/12728

   follow up to https://github.com/apache/lucene/pull/12582
   
   For user convenience, I added back the two-parameter ctor for the HNSW codec.





Re: [PR] Return the same input vector if its a unit vector in VectorUtil#l2normalize [lucene]

2023-10-27 Thread via GitHub


shubhamvishu commented on PR #12726:
URL: https://github.com/apache/lucene/pull/12726#issuecomment-1783364698

   Oh nice! @benwtrent I have added a CHANGES.txt entry under optimizations 
now. Thanks!





[PR] Adding new flat vector format and refactoring HNSW [lucene]

2023-10-27 Thread via GitHub


benwtrent opened a new pull request, #12729:
URL: https://github.com/apache/lucene/pull/12729

   Currently the HNSW codec does too many things: it not only indexes vectors, 
but also stores them and determines how to store them given the vector type.
   
   This PR extracts the vector storage into a new format, 
`Lucene99FlatVectorsFormat`, and adds a "flat" version of 
`KnnVectorsFormat` called `FlatVectorsFormat`. This provides some 
additional helper functions that let an indexing codec (like HNSW) take 
advantage of the flat formats.
   
   Additionally, this PR refactors the new 
`Lucene99ScalarQuantizedVectorsFormat` to be a `FlatVectorsFormat`.
   
   Now, `Lucene99HnswVectorsFormat` can accept either a 
`Lucene99ScalarQuantizedVectorsFormat` to use int8-quantized vectors or a 
`Lucene99FlatVectorsFormat` for raw vectors.
   





Re: [PR] Clean up ByteBlockPool [lucene]

2023-10-27 Thread via GitHub


stefanvodita commented on code in PR #12506:
URL: https://github.com/apache/lucene/pull/12506#discussion_r1375044479


##
lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java:
##
@@ -170,42 +191,42 @@ public void reset(boolean zeroFillBuffers, boolean 
reuseFirst) {
   }
 
   /**
-   * Advances the pool to its next buffer. This method should be called once 
after the constructor
-   * to initialize the pool. In contrast to the constructor a {@link 
ByteBlockPool#reset()} call
-   * will advance the pool to its first buffer immediately.
+   * Allocates a new buffer and advances the pool to it. This method should be 
called once after the
+   * constructor to initialize the pool. In contrast to the constructor, a 
{@link
+   * ByteBlockPool#reset()} call will advance the pool to its first buffer 
immediately.
*/
   public void nextBuffer() {
 if (1 + bufferUpto == buffers.length) {
+  // The buffer array is full - expand it
   byte[][] newBuffers =
   new byte[ArrayUtil.oversize(buffers.length + 1, 
NUM_BYTES_OBJECT_REF)][];
   System.arraycopy(buffers, 0, newBuffers, 0, buffers.length);
   buffers = newBuffers;
 }
+// Allocate new buffer and advance the pool to it
 buffer = buffers[1 + bufferUpto] = allocator.getByteBlock();
 bufferUpto++;
-
 byteUpto = 0;
 byteOffset = Math.addExact(byteOffset, BYTE_BLOCK_SIZE);
   }
 
   /**
-   * Fill the provided {@link BytesRef} with the bytes at the specified 
offset/length slice. This
-   * will avoid copying the bytes, if the slice fits into a single block; 
otherwise, it uses the
-   * provided {@link BytesRefBuilder} to copy bytes over.
+   * Fill the provided {@link BytesRef} with the bytes at the specified offset 
and length. This will
+   * avoid copying the bytes if the slice fits into a single block; otherwise, 
it uses the provided
+   * {@link BytesRefBuilder} to copy bytes over.
*/
-  void setBytesRef(BytesRefBuilder builder, BytesRef result, long offset, int 
length) {
+  void setBytesRef(BytesRefBuilder builder, BytesRef result, int offset, int 
length) {

Review Comment:
   I missed that the purpose of the long was to increase the address space. 
I'll change this back.



##
lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java:
##
@@ -170,42 +191,42 @@ public void reset(boolean zeroFillBuffers, boolean 
reuseFirst) {
   }
 
   /**
-   * Advances the pool to its next buffer. This method should be called once 
after the constructor
-   * to initialize the pool. In contrast to the constructor a {@link 
ByteBlockPool#reset()} call
-   * will advance the pool to its first buffer immediately.
+   * Allocates a new buffer and advances the pool to it. This method should be 
called once after the
+   * constructor to initialize the pool. In contrast to the constructor, a 
{@link
+   * ByteBlockPool#reset()} call will advance the pool to its first buffer 
immediately.
*/
   public void nextBuffer() {
 if (1 + bufferUpto == buffers.length) {
+  // The buffer array is full - expand it
   byte[][] newBuffers =
   new byte[ArrayUtil.oversize(buffers.length + 1, 
NUM_BYTES_OBJECT_REF)][];
   System.arraycopy(buffers, 0, newBuffers, 0, buffers.length);
   buffers = newBuffers;
 }
+// Allocate new buffer and advance the pool to it
 buffer = buffers[1 + bufferUpto] = allocator.getByteBlock();
 bufferUpto++;
-
 byteUpto = 0;
 byteOffset = Math.addExact(byteOffset, BYTE_BLOCK_SIZE);
   }
 
   /**
-   * Fill the provided {@link BytesRef} with the bytes at the specified 
offset/length slice. This
-   * will avoid copying the bytes, if the slice fits into a single block; 
otherwise, it uses the
-   * provided {@link BytesRefBuilder} to copy bytes over.
+   * Fill the provided {@link BytesRef} with the bytes at the specified offset 
and length. This will
+   * avoid copying the bytes if the slice fits into a single block; otherwise, 
it uses the provided
+   * {@link BytesRefBuilder} to copy bytes over.
*/
-  void setBytesRef(BytesRefBuilder builder, BytesRef result, long offset, int 
length) {
+  void setBytesRef(BytesRefBuilder builder, BytesRef result, int offset, int 
length) {
 result.length = length;
 
-int bufferIndex = (int) (offset >> BYTE_BLOCK_SHIFT);

Review Comment:
   Good idea!



##
lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java:
##
@@ -259,14 +280,15 @@ public void readBytes(final long offset, final byte[] 
bytes, int bytesOffset, in
   @Override
   public long ramBytesUsed() {
 long size = BASE_RAM_BYTES;
-size += RamUsageEstimator.sizeOfObject(buffer);
 size += RamUsageEstimator.shallowSizeOf(buffers);
 for (byte[] buf : buffers) {
-  if (buf == buffer) {

Review Comment:
   I haven't figured out why `buffer` was a special case. Even if it were not a 
full block, would that make a difference if we're calling 
`RamUsageEstimator#

Re: [PR] Clean up ByteBlockPool [lucene]

2023-10-27 Thread via GitHub


stefanvodita commented on PR #12506:
URL: https://github.com/apache/lucene/pull/12506#issuecomment-1783530943

   I’ve integrated most of the suggestions. There’s just the matter of the name 
for the slice pool class and the right package for it. I don’t have a strong 
opinion on this. Maybe we move it to `org.apache.lucene.index` and keep the 
name. It goes well with `ByteSliceReader`.
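   For readers following the review above: the offset arithmetic under
   discussion (`offset >> BYTE_BLOCK_SHIFT` plus a mask) splits a global pool
   offset into a block index and an in-block offset. A minimal standalone
   sketch of that addressing scheme, assuming a power-of-two block size like
   Lucene's `BYTE_BLOCK_SIZE` (the class and method names here are
   hypothetical, not Lucene's API):

   ```java
   public class BlockPoolSketch {
     static final int BLOCK_SHIFT = 15;
     static final int BLOCK_SIZE = 1 << BLOCK_SHIFT; // 32768, like BYTE_BLOCK_SIZE
     static final int BLOCK_MASK = BLOCK_SIZE - 1;

     // Which buffer in the byte[][] holds this global offset.
     static int blockIndex(long offset) {
       return (int) (offset >> BLOCK_SHIFT);
     }

     // Position within that buffer.
     static int offsetInBlock(long offset) {
       return (int) (offset & BLOCK_MASK);
     }

     public static void main(String[] args) {
       long offset = 2L * BLOCK_SIZE + 7; // third block, byte 7
       System.out.println(blockIndex(offset) + " / " + offsetInBlock(offset));
     }
   }
   ```

   A `long` global offset is what lets the pool address more than 2 GB across
   many blocks, which is why the review kept the `long` parameter.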





Re: [PR] Fix test failures for TestCoreParser#testSpanNearQueryWithoutSlopXML [lucene]

2023-10-27 Thread via GitHub


vigyasharma merged PR #12724:
URL: https://github.com/apache/lucene/pull/12724





[PR] Remove test dependency on Java default exception message [lucene]

2023-10-27 Thread via GitHub


vigyasharma opened a new pull request, #12730:
URL: https://github.com/apache/lucene/pull/12730

   Backport of fix in #12724
   





Re: [PR] Remove test dependency on Java default exception message [lucene]

2023-10-27 Thread via GitHub


vigyasharma commented on PR #12730:
URL: https://github.com/apache/lucene/pull/12730#issuecomment-1783652392

   Backport change already approved in 
https://github.com/apache/lucene/pull/12724. Merging.
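   The fix being backported asserts on the exception type instead of its
   message string, so the test no longer breaks when a JDK release rewords its
   default messages. A plain-Java sketch of that pattern (hypothetical names,
   not the actual test code; Lucene tests typically use
   `LuceneTestCase#expectThrows` for the same effect):

   ```java
   public class ExceptionTypeCheck {
     // Stand-in for code whose exception *message* varies across JDKs.
     static int parseSlop(String s) {
       return Integer.parseInt(s); // NumberFormatException on bad input
     }

     static boolean throwsExpectedType(String input) {
       try {
         parseSlop(input);
         return false; // no exception was thrown
       } catch (NumberFormatException expected) {
         // Check only the type; expected.getMessage() is JDK-dependent
         // (its wording changed in Java 22 EA, which broke the old test).
         return true;
       }
     }

     public static void main(String[] args) {
       System.out.println(throwsExpectedType("not-a-number"));
       System.out.println(throwsExpectedType("3"));
     }
   }
   ```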





Re: [PR] Remove test dependency on Java default exception message [lucene]

2023-10-27 Thread via GitHub


vigyasharma merged PR #12730:
URL: https://github.com/apache/lucene/pull/12730





Re: [PR] Fix test failures for TestCoreParser#testSpanNearQueryWithoutSlopXML [lucene]

2023-10-27 Thread via GitHub


vigyasharma commented on PR #12724:
URL: https://github.com/apache/lucene/pull/12724#issuecomment-1783653077

   I've backported this to `branch_9x`. Do we need it in any other branches?





[PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

2023-10-27 Thread via GitHub


rmuir opened a new pull request, #12731:
URL: https://github.com/apache/lucene/pull/12731

   The Intel FMA is nice, and it's easier to reason about when looking at 
assembly. We basically reduce the error for free where it's available. Along 
with another change (reducing the unrolling for cosine, since it has 3 FMA ops 
already), we can speed up cosine from 6 -> 8 ops/us.
   
   On ARM, the FMA leads to slight slowdowns, so we don't use it. It's not 
much, just something like 10%, but it seems like the wrong tradeoff.
   
   If you run the code with `-XX:-UseFMA` there's no slowdown, but no speedup 
either. And obviously, there are no changes for ARM here.
   
   ```
   Skylake AVX-256
   
   Main:
   Benchmark                                    (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar          1024  thrpt    5   0.624 ± 0.041  ops/us
   VectorUtilBenchmark.floatCosineVector          1024  thrpt    5   5.988 ± 0.111  ops/us
   VectorUtilBenchmark.floatDotProductScalar      1024  thrpt    5   1.959 ± 0.032  ops/us
   VectorUtilBenchmark.floatDotProductVector      1024  thrpt    5  12.058 ± 0.920  ops/us
   VectorUtilBenchmark.floatSquareScalar          1024  thrpt    5   1.422 ± 0.018  ops/us
   VectorUtilBenchmark.floatSquareVector          1024  thrpt    5   9.837 ± 0.154  ops/us
   
   Patch:
   Benchmark                                    (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar          1024  thrpt    5   0.638 ± 0.006  ops/us
   VectorUtilBenchmark.floatCosineVector          1024  thrpt    5   8.164 ± 0.084  ops/us
   VectorUtilBenchmark.floatDotProductScalar      1024  thrpt    5   1.997 ± 0.027  ops/us
   VectorUtilBenchmark.floatDotProductVector      1024  thrpt    5  12.486 ± 0.163  ops/us
   VectorUtilBenchmark.floatSquareScalar          1024  thrpt    5   1.445 ± 0.014  ops/us
   VectorUtilBenchmark.floatSquareVector          1024  thrpt    5  11.682 ± 0.129  ops/us
   
   Patch (with -jvmArgsAppend '-XX:-UseFMA'):
   Benchmark                                    (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar          1024  thrpt    5   0.641 ± 0.005  ops/us
   VectorUtilBenchmark.floatCosineVector          1024  thrpt    5   6.102 ± 0.053  ops/us
   VectorUtilBenchmark.floatDotProductScalar      1024  thrpt    5   1.997 ± 0.007  ops/us
   VectorUtilBenchmark.floatDotProductVector      1024  thrpt    5  12.177 ± 0.170  ops/us
   VectorUtilBenchmark.floatSquareScalar          1024  thrpt    5   1.450 ± 0.027  ops/us
   VectorUtilBenchmark.floatSquareVector          1024  thrpt    5  10.464 ± 0.154  ops/us
   ```
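   The error/speed tradeoff described above comes down to `Math.fma`, a
   single-rounding multiply-add that the JIT compiles to the hardware FMA
   instruction where one exists. A scalar sketch of the difference, not the
   actual Panama-vectorized Lucene code (the class name is hypothetical):

   ```java
   public class FmaDotSketch {
     static float dotPlain(float[] a, float[] b) {
       float sum = 0f;
       for (int i = 0; i < a.length; i++) {
         sum += a[i] * b[i]; // two roundings per element: multiply, then add
       }
       return sum;
     }

     static float dotFma(float[] a, float[] b) {
       float sum = 0f;
       for (int i = 0; i < a.length; i++) {
         sum = Math.fma(a[i], b[i], sum); // one rounding per element
       }
       return sum;
     }

     public static void main(String[] args) {
       float[] a = new float[1024];
       float[] b = new float[1024];
       for (int i = 0; i < a.length; i++) {
         a[i] = 1e-3f * i;
         b[i] = 1e-3f * (a.length - i);
       }
       // The two results differ slightly; the fma sum carries less
       // accumulated rounding error.
       System.out.println("plain: " + dotPlain(a, b));
       System.out.println("fma:   " + dotFma(a, b));
     }
   }
   ```

   Without an FMA unit, `Math.fma` falls back to a slow exact path, which is
   why the patch only uses it where it is known to be fast.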
   





Re: [PR] Concurrent HNSW Merge [lucene]

2023-10-27 Thread via GitHub


zhaih commented on PR #12660:
URL: https://github.com/apache/lucene/pull/12660#issuecomment-1783709669

   Thanks Mike and Ben for reviewing! I'll try to merge and backport it 
tomorrow, and create an issue for future refactoring.

