[PR] Fix test failures for TestCoreParser#testSpanNearQueryWithoutSlopXML [lucene]
vigyasharma opened a new pull request, #12724: URL: https://github.com/apache/lucene/pull/12724 Addresses #12708. `xml.TestCoreParser#testSpanNearQueryWithoutSlopXML` fails because of a changed exception message in Java 22 EA. This change removes the test's dependency on Java exception message strings. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] xml.TestCoreParser#testSpanNearQueryWithoutSlopXML fails because of changed exception message [lucene]
vigyasharma commented on issue #12708: URL: https://github.com/apache/lucene/issues/12708#issuecomment-1782426261 Made a small change to assert on the exception type instead of checking the exception message string.
Re: [I] `FSTCompiler.Builder` should have an option to stream the FST bytes directly to Directory [lucene]
dungba88 commented on issue #12543: URL: https://github.com/apache/lucene/issues/12543#issuecomment-1782469977 I put up a new revision with support for DataOutput and FileChannel. When using DataOutput, if suffix sharing is enabled one also needs to pass a RandomAccessInput for reading; otherwise it can be left null. So one can pass an IndexOutput, and the RandomAccessInput can be created from an IndexInput. When using FileChannel, one only needs to pass the FileChannel, as it already allows reading and writing at the same time. The FileChannel implementation is just a demonstration of feasibility. Some things I'd like to discuss:
- Should we write the rootNode + numBytes at the end of the FST instead of the front? We only have them after constructing the FST, and we can't prepend to a DataOutput (that's costly). Otherwise we would need to save the metadata separately from the main body. That's why I added a new method `saveMetadata()`.
- Should we move to a value-based LRU cache? It has pros and cons:
  - Pros: It makes NodeHash completely independent of the FST. It would allow suffix sharing without the need for a RandomAccessInput, and thus without the need for the IndexOutput and IndexInput to be open at the same time. Also, accessing RAM is much faster than accessing disk.
  - Cons: More RAM required than with the address-based cache. For a truly minimal FST it would require the same (or more) RAM as the entire FST.
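The tail-metadata idea in the first bullet can be sketched in plain Java (a hypothetical illustration, not the actual FSTCompiler API): stream the body first, then append the metadata (root node address, byte count) at the end, since an append-only sink cannot be prepended to; a reader then finds the metadata at a fixed offset from the end.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

public class FstTailMetadataSketch {
  /** Streams the FST body, then appends {rootNode, numBytes} as tail metadata. */
  public static byte[] writeFst(byte[] body, long rootNode) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.write(body);            // main FST body, written as it is built
      out.writeLong(rootNode);    // only known after construction finishes
      out.writeLong(body.length); // numBytes, so a reader can find the body
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads {rootNode, numBytes} from the last 16 bytes of the file. */
  public static long[] readTailMetadata(byte[] file) {
    ByteBuffer buf = ByteBuffer.wrap(file, file.length - 16, 16);
    return new long[] {buf.getLong(), buf.getLong()};
  }
}
```

The alternative discussed above, `saveMetadata()`, keeps the metadata in a separate stream instead of at the tail; the trade-off is one file versus a fixed-position header.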
Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]
bruno-roustant commented on PR #12716: URL: https://github.com/apache/lucene/pull/12716#issuecomment-1782623703 Thank you @shubhamvishu for these experiments. The table answers the questions exactly. And it means there is actually no point in changing the bit mixing, since it does not bring any improvement. Lowering the rehash threshold to 1/2 would mean the hash structure occupies more memory on average, since it would allocate an array and use it at between 25% and 50% occupancy. With the 2/3 threshold, it uses it at between 33% and 66% occupancy. So staying at 2/3 is better. Sometimes an appealing idea is actually a false good idea. And it's great to have these experiments to avoid the pitfall!
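The occupancy argument above can be made concrete with a tiny sketch (names are mine, not from the PR): a table that doubles its capacity when it reaches load factor t drops back to t/2 occupancy right after the rehash, so it always runs between t/2 and t full.

```java
// For a hash table that doubles capacity at load factor `threshold`,
// occupancy oscillates between threshold/2 (just after a rehash) and
// threshold (just before the next one).
public class RehashOccupancySketch {
  /** Returns {minOccupancy, maxOccupancy} for a doubling table. */
  public static double[] occupancyRange(double threshold) {
    return new double[] {threshold / 2.0, threshold};
  }
}
```

With threshold 2/3 the range is [1/3, 2/3]; with threshold 1/2 it is [1/4, 1/2], i.e. a lower average occupancy and therefore more allocated-but-unused memory, which is the point made above.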
Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]
shubhamvishu commented on PR #12716: URL: https://github.com/apache/lucene/pull/12716#issuecomment-1782810001 @bruno-roustant Absolutely! I totally agree with your point.
Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]
jpountz commented on issue #12696: URL: https://github.com/apache/lucene/issues/12696#issuecomment-1782814872 FWIW I could reproduce the speedup from disabling patching locally on wikibigall:
```
Task                     QPS baseline      QPS my_modified_version    Pct diff               p-value
HighSloppyPhrase            1.58 (4.9%)        1.55 (9.5%)           -1.9% ( -15% -  13%)    0.441
CountTerm               13139.34 (3.0%)    12937.65 (3.1%)           -1.5% (  -7% -   4%)    0.111
HighTermMonthSort        4393.07 (1.5%)     4371.13 (2.1%)           -0.5% (  -4% -   3%)    0.396
Prefix3                   270.78 (3.4%)      271.17 (3.2%)            0.1% (  -6% -   7%)    0.889
Fuzzy1                    100.16 (1.1%)      100.42 (0.9%)            0.3% (  -1% -   2%)    0.420
Fuzzy2                     70.62 (1.1%)       70.84 (1.0%)            0.3% (  -1% -   2%)    0.340
HighPhrase                 16.41 (3.8%)       16.49 (5.3%)            0.5% (  -8% -   9%)    0.738
Respell                    53.35 (1.8%)       53.62 (1.6%)            0.5% (  -2% -   4%)    0.351
HighTerm                  418.58 (9.3%)      421.91 (7.7%)            0.8% ( -14% -  19%)    0.770
LowSloppyPhrase             9.87 (2.5%)        9.96 (5.9%)            0.9% (  -7% -   9%)    0.544
Wildcard                   94.17 (2.8%)       95.04 (3.3%)            0.9% (  -5% -   7%)    0.341
MedTerm                   553.18 (8.3%)      559.16 (6.9%)            1.1% ( -13% -  17%)    0.656
LowTerm                   784.38 (7.0%)      793.23 (5.7%)            1.1% ( -10% -  14%)    0.575
PKLookup                  264.32 (2.8%)      267.44 (2.2%)            1.2% (  -3% -   6%)    0.138
HighSpanNear                5.37 (3.2%)        5.44 (3.3%)            1.3% (  -5% -   8%)    0.213
OrHighLow                 590.78 (3.0%)      598.60 (2.6%)            1.3% (  -4% -   7%)    0.132
LowPhrase                  27.97 (3.7%)       28.42 (4.9%)            1.6% (  -6% -  10%)    0.245
LowSpanNear                11.09 (2.1%)       11.30 (2.2%)            1.8% (  -2% -   6%)    0.007
MedSpanNear                 7.39 (3.3%)        7.53 (3.5%)            1.9% (  -4% -   8%)    0.079
AndHighLow                819.00 (2.9%)      835.98 (2.4%)            2.1% (  -3% -   7%)    0.015
MedPhrase                  91.27 (3.4%)       93.33 (4.6%)            2.3% (  -5% -  10%)    0.078
HighTermDayOfYearSort     447.47 (1.8%)      457.76 (1.8%)            2.3% (  -1% -   5%)    0.000
MedSloppyPhrase            16.79 (2.4%)       17.25 (4.6%)            2.7% (  -4% -   9%)    0.017
OrHighMed                 157.99 (2.2%)      162.38 (2.3%)            2.8% (  -1% -   7%)    0.000
OrHighHigh                 67.77 (1.7%)       69.71 (1.9%)            2.9% (   0% -   6%)    0.000
AndHighHigh                48.92 (1.8%)       50.67 (2.1%)            3.6% (   0% -   7%)    0.000
AndHighMed                174.71 (2.3%)      181.03 (2.6%)            3.6% (  -1% -   8%)    0.000
CountAndHighHigh           38.11 (4.4%)       39.65 (5.1%)            4.1% (  -5% -  14%)    0.007
CountOrHighHigh            56.07 (16.2%)      58.48 (18.5%)           4.3% ( -26% -  46%)    0.435
CountOrHighMed             86.95 (16.0%)      90.94 (18.3%)           4.6% ( -25% -  46%)    0.398
CountAndHighMed           116.35 (3.4%)      121.78 (4.8%)            4.7% (  -3% -  13%)    0.000
CountPhrase                 3.26 (11.2%)       3.43 (13.1%)           5.1% ( -17% -  33%)    0.187
IntNRQ                    146.09 (35.1%)     171.55 (36.7%)          17.4% ( -40% - 137%)    0.124
```
`.doc` files were 12% larger overall (2.64GB to 2.96GB), `.pos` files were 11% larger (11.03GB to 12.24GB), and the index was 9.7% larger (15.66GB to 17.18GB).
Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]
shubhamvishu commented on PR #12427: URL: https://github.com/apache/lucene/pull/12427#issuecomment-1782977506 @gsmiller I finally got a chance to run the benchmarks for this change. Below are the results (looks all good to me). Let me know what you think. Thanks!
```
Task                         QPS baseline      QPS my_modified_version    Pct diff               p-value
AndHighMed                     96.52 (6.5%)       95.49 (6.3%)           -1.1% ( -12% -  12%)    0.597
LowSpanNear                   133.61 (4.7%)      132.45 (4.0%)           -0.9% (  -9% -   8%)    0.528
OrHighMed                      67.95 (3.2%)       67.37 (4.6%)           -0.9% (  -8% -   7%)    0.492
MedSpanNear                     5.79 (5.2%)        5.74 (4.2%)           -0.8% (  -9% -   8%)    0.573
OrNotHighLow                  322.30 (2.3%)      319.70 (2.0%)           -0.8% (  -4% -   3%)    0.233
BrowseDateSSDVFacets            0.90 (8.2%)        0.89 (9.4%)           -0.8% ( -17% -  18%)    0.773
AndHighLow                    355.80 (4.2%)      352.99 (4.1%)           -0.8% (  -8% -   7%)    0.550
MedTermDayTaxoFacets            5.85 (6.2%)        5.81 (5.7%)           -0.7% ( -11% -  11%)    0.713
OrNotHighMed                  222.70 (3.6%)      221.51 (4.0%)           -0.5% (  -7% -   7%)    0.657
OrHighNotMed                  194.41 (3.6%)      193.40 (3.4%)           -0.5% (  -7% -   6%)    0.641
MedPhrase                      44.71 (3.1%)       44.50 (3.4%)           -0.5% (  -6% -   6%)    0.639
OrNotHighHigh                 179.60 (3.7%)      178.85 (3.8%)           -0.4% (  -7% -   7%)    0.726
HighSpanNear                    8.33 (4.6%)        8.30 (4.1%)           -0.4% (  -8% -   8%)    0.765
OrHighNotLow                  331.53 (3.8%)      330.15 (3.7%)           -0.4% (  -7% -   7%)    0.728
OrHighNotHigh                 248.18 (3.5%)      247.20 (3.4%)           -0.4% (  -7% -   6%)    0.718
MedIntervalsOrdered             4.66 (4.6%)        4.64 (4.3%)           -0.4% (  -8% -   8%)    0.780
OrHighMedDayTaxoFacets          4.39 (3.9%)        4.38 (4.2%)           -0.3% (  -8% -   8%)    0.820
OrHighLow                     569.67 (2.6%)      568.04 (2.6%)           -0.3% (  -5% -   5%)    0.733
LowIntervalsOrdered             2.64 (4.6%)        2.64 (4.6%)           -0.2% (  -8% -   9%)    0.899
HighTermTitleBDVSort            5.20 (4.0%)        5.19 (4.1%)           -0.2% (  -7% -   8%)    0.895
OrHighHigh                     35.47 (4.4%)       35.42 (5.8%)           -0.1% (  -9% -  10%)    0.929
BrowseDayOfYearTaxoFacets       3.80 (15.9%)       3.80 (16.0%)          -0.1% ( -27% -  37%)    0.984
BrowseDateTaxoFacets            3.78 (15.5%)       3.78 (15.5%)          -0.0% ( -26% -  36%)    0.995
AndHighHigh                    16.35 (3.6%)       16.35 (7.5%)            0.0% ( -10% -  11%)    0.996
AndHighMedDayTaxoFacets        49.40 (1.6%)       49.41 (1.7%)            0.0% (  -3% -   3%)    0.975
LowPhrase                     254.05 (3.6%)      254.16 (4.0%)            0.0% (  -7% -   7%)    0.972
HighPhrase                     40.53 (3.1%)       40.56 (3.6%)            0.1% (  -6% -   6%)    0.939
PKLookup                      131.32 (2.3%)      131.44 (2.0%)            0.1% (  -4% -   4%)    0.891
BrowseRandomLabelTaxoFacets     3.28 (15.0%)       3.28 (14.8%)           0.1% ( -25% -  35%)    0.983
BrowseMonthTaxoFacets           4.13 (34.7%)       4.13 (34.7%)           0.1% ( -51% - 106%)    0.991
Fuzzy1                         65.72 (1.1%)       65.81 (1.2%)            0.1% (  -2% -   2%)    0.714
LowTerm                       513.23 (2.9%)      513.99 (2.9%)            0.1% (  -5% -   6%)    0.870
HighIntervalsOrdered            1.62 (5.3%)        1.62 (5.0%)            0.2% (  -9% -  11%)    0.909
BrowseDayOfYearSSDVFacets       3.75 (12.4%)       3.76 (8.6%)            0.2% ( -18% -  24%)    0.956
MedTerm                       382.67 (3.7%)      383.49 (3.9%)            0.2% (  -7% -   8%)    0.857
LowSloppyPhrase                12.31 (2.4%)       12.34 (2.1%)            0.2% (  -4% -   4%)    0.754
Respell                        36.92 (1.6%)       37.02 (1.6%)            0.3% (  -2% -   3%)    0.617
MedSloppyPhrase                22.87 (3.2%)       22.94 (3.0%)            0.3% (  -5% -   6%)    0.751
HighTerm                      251.80 (4.3%)      252.92 (4.6%)            0.4% (  -8% -   9%)    0.752
AndHighHighDayTaxoFacets        7.39 (2.7%)        7.43 (2.6%)            0.5% (  -4% -   5%)    0.57
```
Re: [PR] Concurrent HNSW Merge [lucene]
msokolov commented on PR #12660: URL: https://github.com/apache/lucene/pull/12660#issuecomment-1782983831 As for the renaming, maybe we can do it as a follow-up PR? IHnswGraphSearcher -> HnswGraphSearcher? Or maybe we can just use HnswSearcher right now? Re: controlling concurrency - maybe eventually we figure out how to do this as part of ConcurrentMergeScheduler? Since it is all about merging, that seems like a logical place. It currently allows setting `maxMergeCount` and `maxThreadCount` independently, but enforces `maxThreadCount <= maxMergeCount`. We could potentially fiddle with this; maybe it would be as simple as relaxing that constraint and using the spare threads to do concurrent graph merging.
Re: [PR] Optimize: Use DocIdSetIterator Reduce bkd docvalues iteration [lucene]
msokolov commented on code in PR #12723: URL: https://github.com/apache/lucene/pull/12723#discussion_r1374646933 ## lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java: ## @@ -216,7 +216,7 @@ private static class BKDPointTree implements PointTree { scratchMinIndexPackedValue, scratchMaxIndexPackedValue; private final int[] commonPrefixLengths; -private final BKDReaderDocIDSetIterator scratchIterator; +private final BKDReaderDocIDSet scratch; Review Comment: Please don't rename this - it makes it more difficult to read the PR and find your significant contribution amidst what looks like noise. It's confusing that in some places you renamed it to `scratch` and in other places to `iterator` -- is there some difference?
Re: [I] Use max BPV encoding in postings if doc buffer size less than ForUtil.BLOCK_SIZE [lucene]
jpountz commented on issue #12717: URL: https://github.com/apache/lucene/issues/12717#issuecomment-1783020278 Does this actually matter for performance? My gut feeling is that either a value has a long postings list, and then the vast majority of blocks will be encoded with PFOR and should be fast, or it has a short postings list and then queries should be fast anyway?
Re: [PR] [DRAFT] Load vector data directly from the memory segment [lucene]
ChrisHegarty commented on PR #12703: URL: https://github.com/apache/lucene/pull/12703#issuecomment-1783135109 I've not been able to spend all that much time on this this week, but here's my current thinking. The abstractions in the PR are currently not great (as discussed above), but putting that aside for now, since we can get a sense of the potential real performance impact from this approach as it is, I ran some performance experiments beyond the jmh microbenchmarks. It seems that this change improves the merge performance of vector data in segments by about 10% - not great; I was hoping for better. Is it worth proceeding with, or maybe looking elsewhere? I'm not sure. Here's how I determined the 10%: by hacking on KnnGraphTester from luceneutil so that it creates an index with more than one segment when ingesting 100k+ vectors with 768 dimensions, then timing the forceMerge. This is a very rough experiment, but it shows that the potential gain is much less than I expected. Caution - I could have goofed up several things, from the actual implementation to the merge experiment.
Re: [I] Use max BPV encoding in postings if doc buffer size less than ForUtil.BLOCK_SIZE [lucene]
easyice commented on issue #12717: URL: https://github.com/apache/lucene/issues/12717#issuecomment-1783202938 @jpountz Thanks for your explanation. I have some flame graphs showing that `readVIntBlock` takes up a fairly large proportion; I'll try to reproduce it with some mocked data.
Re: [PR] Optimize: Use DocIdSetIterator Reduce bkd docvalues iteration [lucene]
luyuncheng commented on PR #12723: URL: https://github.com/apache/lucene/pull/12723#issuecomment-1783206971 > this seems promising. I guess the only cost is the increased memory needed because we create the FixedBitSet? Can you say how large this might get in the worst case? @msokolov See https://github.com/apache/lucene/blob/c701a5d9be4d8dc677ea994f8ef5fdd8945760d6/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java#L66-L74 and https://github.com/apache/lucene/blob/c701a5d9be4d8dc677ea994f8ef5fdd8945760d6/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java#L131-L140 - I think the `BITSET_IDS` type with `DocBaseBitSetIterator` covers a range of at most about 8192 doc ids (16 * 512), for which the FixedBitSet needs 128 longs; the BKDReaderDocIDSetIterator contains 512 ints.
Re: [PR] Optimize: Use DocIdSetIterator Reduce bkd docvalues iteration [lucene]
luyuncheng commented on code in PR #12723: URL: https://github.com/apache/lucene/pull/12723#discussion_r1374824046 ## lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java: ## @@ -216,7 +216,7 @@ private static class BKDPointTree implements PointTree { scratchMinIndexPackedValue, scratchMaxIndexPackedValue; private final int[] commonPrefixLengths; -private final BKDReaderDocIDSetIterator scratchIterator; +private final BKDReaderDocIDSet scratch; Review Comment: Prior: `BKDReaderDocIDSetIterator` used an int array to back a DocIdSetIterator. Now: I want to use `DocBaseBitSetIterator` to reduce iteration and optimize `visit(DocIdSetIterator iterator)`, and for the `DELTA_BPV_16` type keep iterating over the int array without re-allocating it. So I renamed BKDReaderDocIDSetIterator to BKDReaderDocIDSet, since it is no longer used only as an iterator.
Re: [PR] Fix test failures for TestCoreParser#testSpanNearQueryWithoutSlopXML [lucene]
uschindler commented on PR #12724: URL: https://github.com/apache/lucene/pull/12724#issuecomment-1783287980 Should I merge and backport it or do you want to do it?
[PR] Return the same input vector if its a unit vector in VectorUtil#l2normalize [lucene]
shubhamvishu opened a new pull request, #12726: URL: https://github.com/apache/lucene/pull/12726 ### Description While going through the [VectorUtil](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/VectorUtil.java) class, I observed we don't check for a unit vector in `VectorUtil#l2normalize`, so passing a unit vector goes through the whole L2 normalization, which is not required; it should early-exit. I confirmed this by trying a silly example, `VectorUtil.l2normalize(VectorUtil.l2normalize(nonUnitVector))`, which performed every calculation twice. One could argue that the user should not call this on a unit vector, but I believe there are cases where a user simply wants to perform L2 normalization without checking the vector first, or where there are overflowing values. TL;DR: We should early-exit in `VectorUtil#l2normalize`, returning the same input vector if it is already a unit vector. This is easily done by introducing a light check on whether the squared sum of the input vector equals 1.0, or rather `Math.abs(l1norm - 1.0d) <= 1e-5` (as in this PR), because the dot product of a unit vector with itself (`v x v`) is not exactly 1.0 but something like `0.999403953552`. With a `1e-5` delta we would treat a vector whose self dot product is within that tolerance of 1.0 as already L2-normalized, which seems fine since the delta is really small, and the check is not a heavy one. I'm not sure whether a similar check existed before (I tried to check but didn't find any history), so looking forward to thoughts on whether this makes sense to add. Thanks!
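The early-exit idea in the PR above can be sketched as follows (a minimal illustration with hypothetical names, not the actual VectorUtil code, which normalizes in place): compute the squared sum, and if it is already within a small delta of 1.0, return the input unchanged.

```java
// Sketch of an early-exit L2 normalization: skip the divide when the
// vector's squared sum is already within DELTA of 1.0.
public class L2NormalizeSketch {
  private static final double DELTA = 1e-5; // tolerance from the PR description

  public static float[] l2normalize(float[] v) {
    double squareSum = 0.0;
    for (float x : v) squareSum += x * x;
    if (Math.abs(squareSum - 1.0d) <= DELTA) {
      return v; // already (approximately) a unit vector: return it as-is
    }
    double norm = Math.sqrt(squareSum);
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) out[i] = (float) (v[i] / norm);
    return out;
  }
}
```

Calling this twice on the same input, as in the `l2normalize(l2normalize(v))` example above, makes the second call a cheap pass that returns its argument unchanged.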
Re: [PR] Return the same input vector if its a unit vector in VectorUtil#l2normalize [lucene]
benwtrent commented on PR #12726: URL: https://github.com/apache/lucene/pull/12726#issuecomment-1783318072 @shubhamvishu could you add a "CHANGES.txt" entry under optimizations?
[PR] Ensure negative scores are not returned by vector similarity functions [lucene]
benwtrent opened a new pull request, #12727: URL: https://github.com/apache/lucene/pull/12727 We shouldn't ever return negative scores from vector similarity functions. With the Panama vector implementation and nearly antipodal float[] vectors, it is possible for cosine and (normalized) dot-product to become slightly negative due to compounding floating point errors. Since we don't want to make Panama vectors incredibly slow, we stick to float32 operations for now and just snap to `0` if the score is negative after our correction. closes: https://github.com/apache/lucene/issues/12700
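The guard described in the PR above can be sketched like this (a hedged illustration: the `(1 + cosine) / 2` scaling is my assumption about the score mapping, and the clamp is the part the PR adds): a cosine that drifts slightly below -1 from floating point error would otherwise map to a slightly negative score.

```java
// Sketch: map a cosine similarity in [-1, 1] to a score, snapping to 0 so
// that accumulated float error (cosine marginally below -1) can never
// produce a negative score.
public class SimilarityClampSketch {
  public static float cosineScore(float cosine) {
    return Math.max(0f, (1f + cosine) / 2f);
  }
}
```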
[PR] Add back maxConn & beamWidth HNSW codec ctor [lucene]
benwtrent opened a new pull request, #12728: URL: https://github.com/apache/lucene/pull/12728 Follow-up to https://github.com/apache/lucene/pull/12582 For user convenience, I added back the two-parameter ctor for the HNSW codec.
Re: [PR] Return the same input vector if its a unit vector in VectorUtil#l2normalize [lucene]
shubhamvishu commented on PR #12726: URL: https://github.com/apache/lucene/pull/12726#issuecomment-1783364698 Oh nice! @benwtrent I have added a CHANGES.txt entry under optimizations now. Thanks!
[PR] Adding new flat vector format and refactoring HNSW [lucene]
benwtrent opened a new pull request, #12729: URL: https://github.com/apache/lucene/pull/12729 Currently the HNSW codec does too many things: it not only indexes vectors, but also stores them and determines how to store them given the vector type. This PR extracts the vector storage into a new format, `Lucene99FlatVectorsFormat`, and adds a "flat" version of `KnnVectorsFormat` called `FlatVectorsFormat`. This allows for some additional helper functions that let an indexing codec (like HNSW) take advantage of the flat formats. Additionally, this PR refactors the new `Lucene99ScalarQuantizedVectorsFormat` to be a `FlatVectorsFormat`. Now, `Lucene99HnswVectorsFormat` can accept either a `Lucene99ScalarQuantizedVectorsFormat` to use int8-quantized vectors or a `Lucene99FlatVectorsFormat` for raw vectors.
Re: [PR] Clean up ByteBlockPool [lucene]
stefanvodita commented on code in PR #12506: URL: https://github.com/apache/lucene/pull/12506#discussion_r1375044479

## lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java: ##

```diff
@@ -170,42 +191,42 @@ public void reset(boolean zeroFillBuffers, boolean reuseFirst) {
   }

   /**
-   * Advances the pool to its next buffer. This method should be called once after the constructor
-   * to initialize the pool. In contrast to the constructor a {@link ByteBlockPool#reset()} call
-   * will advance the pool to its first buffer immediately.
+   * Allocates a new buffer and advances the pool to it. This method should be called once after the
+   * constructor to initialize the pool. In contrast to the constructor, a {@link
+   * ByteBlockPool#reset()} call will advance the pool to its first buffer immediately.
    */
   public void nextBuffer() {
     if (1 + bufferUpto == buffers.length) {
+      // The buffer array is full - expand it
       byte[][] newBuffers =
           new byte[ArrayUtil.oversize(buffers.length + 1, NUM_BYTES_OBJECT_REF)][];
       System.arraycopy(buffers, 0, newBuffers, 0, buffers.length);
       buffers = newBuffers;
     }
+    // Allocate new buffer and advance the pool to it
     buffer = buffers[1 + bufferUpto] = allocator.getByteBlock();
     bufferUpto++;
-    byteUpto = 0;
     byteOffset = Math.addExact(byteOffset, BYTE_BLOCK_SIZE);
   }

   /**
-   * Fill the provided {@link BytesRef} with the bytes at the specified offset/length slice. This
-   * will avoid copying the bytes, if the slice fits into a single block; otherwise, it uses the
-   * provided {@link BytesRefBuilder} to copy bytes over.
+   * Fill the provided {@link BytesRef} with the bytes at the specified offset and length. This will
+   * avoid copying the bytes if the slice fits into a single block; otherwise, it uses the provided
+   * {@link BytesRefBuilder} to copy bytes over.
    */
-  void setBytesRef(BytesRefBuilder builder, BytesRef result, long offset, int length) {
+  void setBytesRef(BytesRefBuilder builder, BytesRef result, int offset, int length) {
```

Review Comment: I missed that the purpose of the long was to increase the address space. I'll change this back.

## lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java: ##

```diff
-  void setBytesRef(BytesRefBuilder builder, BytesRef result, long offset, int length) {
+  void setBytesRef(BytesRefBuilder builder, BytesRef result, int offset, int length) {
     result.length = length;
-    int bufferIndex = (int) (offset >> BYTE_BLOCK_SHIFT);
```

Review Comment: Good idea!

## lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java: ##

```diff
@@ -259,14 +280,15 @@ public void readBytes(final long offset, final byte[] bytes, int bytesOffset, in
   @Override
   public long ramBytesUsed() {
     long size = BASE_RAM_BYTES;
-    size += RamUsageEstimator.sizeOfObject(buffer);
     size += RamUsageEstimator.shallowSizeOf(buffers);
     for (byte[] buf : buffers) {
-      if (buf == buffer) {
```

Review Comment: I haven't figured out why `buffer` was a special case. Even if it were not a full block, would that make a difference if we're calling `RamUsageEstimator#
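The `long` offset that the review restores exists to widen the pool's address space beyond what an `int` can index: a pool-wide offset is split into a buffer index (high bits) and a position within that buffer (low bits). A minimal sketch of that addressing, assuming Lucene's default `BYTE_BLOCK_SHIFT` of 15 (32 KiB blocks); the `BufferAddressing` class and its method names are illustrative, not Lucene API:

```java
public class BufferAddressing {
  // Assumed to match Lucene's ByteBlockPool defaults: 32 KiB blocks.
  static final int BYTE_BLOCK_SHIFT = 15;
  static final int BYTE_BLOCK_SIZE = 1 << BYTE_BLOCK_SHIFT;
  static final int BYTE_BLOCK_MASK = BYTE_BLOCK_SIZE - 1;

  // High bits of the pool-wide offset select the buffer.
  static int bufferIndex(long offset) {
    return (int) (offset >> BYTE_BLOCK_SHIFT);
  }

  // Low bits select the position inside that buffer.
  static int posInBuffer(long offset) {
    return (int) (offset & BYTE_BLOCK_MASK);
  }

  public static void main(String[] args) {
    long offset = 3L * BYTE_BLOCK_SIZE + 17;
    System.out.println(bufferIndex(offset)); // buffer 3
    System.out.println(posInBuffer(offset)); // byte 17 within it
    // With a long offset the pool can address far more than 2 GiB,
    // which an int offset cannot represent:
    long big = 5L << 30; // 5 GiB into the pool
    System.out.println(bufferIndex(big));
  }
}
```

With an `int` offset, `bufferIndex` would overflow once the pool grows past `Integer.MAX_VALUE` bytes, which is why the reviewer changes the signature back.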
Re: [PR] Clean up ByteBlockPool [lucene]
stefanvodita commented on PR #12506: URL: https://github.com/apache/lucene/pull/12506#issuecomment-1783530943 I’ve integrated most of the suggestions. There’s just the matter of the name for the slice pool class and the right package for it. I don’t have a strong opinion on this. Maybe we move it to `org.apache.lucene.index` and keep the name. It goes well with `ByteSliceReader`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix test failures for TestCoreParser#testSpanNearQueryWithoutSlopXML [lucene]
vigyasharma merged PR #12724: URL: https://github.com/apache/lucene/pull/12724
[PR] Remove test dependency on Java default exception message [lucene]
vigyasharma opened a new pull request, #12730: URL: https://github.com/apache/lucene/pull/12730 Backport of fix in #12724
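The fix being backported asserts on the exception type rather than the JDK's default message text, which changed in Java 22 EA. A self-contained sketch of that pattern, with a hand-rolled `expectThrows` helper in the spirit of Lucene's test utilities; the class and helper here are illustrative, not the actual test code:

```java
public class ExceptionTypeAssertion {
  /** Runs the action and returns the thrown exception if it matches the expected type. */
  static <T extends Throwable> T expectThrows(Class<T> expected, Runnable action) {
    try {
      action.run();
    } catch (Throwable t) {
      if (expected.isInstance(t)) {
        return expected.cast(t);
      }
      throw new AssertionError("Unexpected exception type: " + t.getClass(), t);
    }
    throw new AssertionError("Expected " + expected.getName() + " but nothing was thrown");
  }

  public static void main(String[] args) {
    // Assert only on the type. The default detail message of core JDK
    // exceptions is not a stable API and can change between releases.
    NumberFormatException e =
        expectThrows(NumberFormatException.class, () -> Integer.parseInt("not-a-number"));
    System.out.println(e.getClass().getSimpleName());
  }
}
```

Pinning tests to exception messages couples them to a specific JDK build; pinning to the type keeps them stable across releases.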
Re: [PR] Remove test dependency on Java default exception message [lucene]
vigyasharma commented on PR #12730: URL: https://github.com/apache/lucene/pull/12730#issuecomment-1783652392 Backport change already approved in https://github.com/apache/lucene/pull/12724. Merging.
Re: [PR] Remove test dependency on Java default exception message [lucene]
vigyasharma merged PR #12730: URL: https://github.com/apache/lucene/pull/12730
Re: [PR] Fix test failures for TestCoreParser#testSpanNearQueryWithoutSlopXML [lucene]
vigyasharma commented on PR #12724: URL: https://github.com/apache/lucene/pull/12724#issuecomment-1783653077 I've backported this to `branch_9x`. Do we need it in any other branches?
[PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]
rmuir opened a new pull request, #12731: URL: https://github.com/apache/lucene/pull/12731 The Intel FMA is nice, and it's easier to reason about when looking at assembly. We basically reduce the error for free where it's available. Along with another change (reducing the unrolling for cosine, since it has 3 fma ops already), we can speed up cosine from 6 -> 8 ops/us. On ARM the FMA leads to slight slowdowns, so we don't use it. It's not much, just something like 10%, but it seems like the wrong tradeoff. If you run the code with `-XX:-UseFMA` there's no slowdown, but no speedup either. And obviously, no changes for ARM here.

```
Skylake AVX-256

Main:
Benchmark                                    (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar          1024  thrpt    5   0.624 ± 0.041  ops/us
VectorUtilBenchmark.floatCosineVector          1024  thrpt    5   5.988 ± 0.111  ops/us
VectorUtilBenchmark.floatDotProductScalar      1024  thrpt    5   1.959 ± 0.032  ops/us
VectorUtilBenchmark.floatDotProductVector      1024  thrpt    5  12.058 ± 0.920  ops/us
VectorUtilBenchmark.floatSquareScalar          1024  thrpt    5   1.422 ± 0.018  ops/us
VectorUtilBenchmark.floatSquareVector          1024  thrpt    5   9.837 ± 0.154  ops/us

Patch:
Benchmark                                    (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar          1024  thrpt    5   0.638 ± 0.006  ops/us
VectorUtilBenchmark.floatCosineVector          1024  thrpt    5   8.164 ± 0.084  ops/us
VectorUtilBenchmark.floatDotProductScalar      1024  thrpt    5   1.997 ± 0.027  ops/us
VectorUtilBenchmark.floatDotProductVector      1024  thrpt    5  12.486 ± 0.163  ops/us
VectorUtilBenchmark.floatSquareScalar          1024  thrpt    5   1.445 ± 0.014  ops/us
VectorUtilBenchmark.floatSquareVector          1024  thrpt    5  11.682 ± 0.129  ops/us

Patch (with -jvmArgsAppend '-XX:-UseFMA'):
Benchmark                                    (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar          1024  thrpt    5   0.641 ± 0.005  ops/us
VectorUtilBenchmark.floatCosineVector          1024  thrpt    5   6.102 ± 0.053  ops/us
VectorUtilBenchmark.floatDotProductScalar      1024  thrpt    5   1.997 ± 0.007  ops/us
VectorUtilBenchmark.floatDotProductVector      1024  thrpt    5  12.177 ± 0.170  ops/us
VectorUtilBenchmark.floatSquareScalar          1024  thrpt    5   1.450 ± 0.027  ops/us
VectorUtilBenchmark.floatSquareVector          1024  thrpt    5  10.464 ± 0.154  ops/us
```
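The error reduction the PR mentions comes from fused multiply-add computing `a * b + c` with a single rounding step, instead of rounding the product and then the sum separately. A minimal scalar sketch of that idea using `Math.fma` (which JIT-compiles to the hardware instruction where `-XX:+UseFMA` applies); this is an illustration, not the Panama Vector API code the PR actually changes:

```java
public class FmaDemo {
  // Dot-product accumulation using fused multiply-add: one rounding per step.
  static float dotFma(float[] a, float[] b) {
    float acc = 0f;
    for (int i = 0; i < a.length; i++) {
      acc = Math.fma(a[i], b[i], acc);
    }
    return acc;
  }

  // Plain accumulation: the product and the sum are each rounded.
  static float dotPlain(float[] a, float[] b) {
    float acc = 0f;
    for (int i = 0; i < a.length; i++) {
      acc += a[i] * b[i];
    }
    return acc;
  }

  public static void main(String[] args) {
    float[] a = {0.1f, 0.2f, 0.3f};
    float[] b = {0.4f, 0.5f, 0.6f};
    // The two results can differ in the last bit: fma accumulates with
    // half as many rounding steps, so it tracks the exact sum more closely.
    System.out.println(dotFma(a, b));
    System.out.println(dotPlain(a, b));
  }
}
```

On inputs where every intermediate value is exactly representable the two loops agree; the accuracy win shows up on long vectors, where the plain loop accumulates twice as many rounding errors.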
Re: [PR] Concurrent HNSW Merge [lucene]
zhaih commented on PR #12660: URL: https://github.com/apache/lucene/pull/12660#issuecomment-1783709669 Thanks Mike and Ben for reviewing! I'll try to merge and backport it tomorrow, and create an issue for future refactoring.