Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1751999625 Here are the results from running `test_all_sizes.py` then `results_to_md.py`: |NodeHash size|FST (mb)|RAM (mb)|FST build time (sec)|

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752003494 Thanks for looking into this @rmuir, I've been thinking similar myself (just didn't get around to anything other than the thinking!) On my Mac M2. JDK 20.0.2. ```

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752024575 ``` // sum into accumulators Vector prod16 = prod16_1.add(prod16_2); acc = acc.add(prod16.convert(VectorOperators.S2I, 0)); acc = acc.add(prod16.convert(VectorOper

Re: [PR] Enable rank-unsafe optimization of top-k hit computations by quantizing scores. [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on PR #12628: URL: https://github.com/apache/lucene/pull/12628#issuecomment-1752028823 Very cool, surprisingly impactful! > I ran the Tantivy benchmark with TOP_10 and TOP_100 commands This is the Tantivy benchmark tooling, but you are comparing Lucene (mai

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752029230 And of course, `ZERO_EXTEND_S2I`, will work in the maximum boundary case, but not in others. So the question is then just about the maximum value of the bytes in these input arrays
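The boundary problem discussed in this thread can be reproduced with plain scalar arithmetic. The sketch below is not code from the PR: the class and method names are hypothetical, and it only assumes that two byte-by-byte products are summed in a 16-bit lane before being widened to `int`, as in the `prod16_1.add(prod16_2)` snippet quoted above. It shows that zero extension gives the right answer when every input is `Byte.MIN_VALUE`, but the wrong answer as soon as the 16-bit sum is legitimately negative.

```java
public class ShortSumExtensionDemo {

    // Sum of two byte*byte products, truncated to 16 bits the way a 16-bit lane would hold it.
    static short sum16(byte a1, byte b1, byte a2, byte b2) {
        return (short) (a1 * b1 + a2 * b2);
    }

    // Reference result computed in full 32-bit precision.
    static int exact(byte a1, byte b1, byte a2, byte b2) {
        return a1 * b1 + a2 * b2;
    }

    // Widening that keeps the sign (scalar analogue of a signed short-to-int conversion).
    static int signExtend(short s) {
        return s;
    }

    // Widening that treats the 16 bits as unsigned (scalar analogue of a zero-extending conversion).
    static int zeroExtend(short s) {
        return Short.toUnsignedInt(s);
    }

    public static void main(String[] args) {
        byte min = Byte.MIN_VALUE; // -128
        byte max = Byte.MAX_VALUE; //  127

        // Maximum boundary: (-128 * -128) * 2 = 32768, one more than Short.MAX_VALUE.
        short top = sum16(min, min, min, min);
        System.out.println("exact=" + exact(min, min, min, min)
            + " signExtend=" + signExtend(top)
            + " zeroExtend=" + zeroExtend(top));

        // Negative sum: (-128 * 127) * 2 = -32512, which fits in a signed short.
        short neg = sum16(min, max, min, max);
        System.out.println("exact=" + exact(min, max, min, max)
            + " signExtend=" + signExtend(neg)
            + " zeroExtend=" + zeroExtend(neg));
    }
}
```

In the first case only zero extension matches the exact value; in the second only sign extension does. Neither widening is correct for all signed byte inputs once two products have been added in 16 bits.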

Re: [I] Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on issue #12542: URL: https://github.com/apache/lucene/issues/12542#issuecomment-1752030874 Talking to @sokolovm at Community Over Code 2023 he suggested another idea here: instead of a (RAM hungry) hash table, couldn't we use the growing FST itself to lookup suffixes?

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on PR #12631: URL: https://github.com/apache/lucene/pull/12631#issuecomment-1752031210 > sum | 31606784 | 27188690 | -13.98% WHOA, wow! This is a massive gain for such a tiny change :) I'll try to review soon! Nice to revisit ancient `TODO`s in the source code
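As a rough illustration of why byte order can matter for output sharing, the sketch below encodes the same values as variable-length longs with the low-order 7-bit group first (the order used by Lucene's `DataOutput.writeVLong`) and with the high-order group first. It is not the PR's code: the class, the method names, and the sample values are hypothetical, and it assumes non-negative inputs. Nearby values tend to agree in their high-order bits, so an MSB-first encoding gives them a longer common byte prefix, which is what prefix-sharing FST outputs can exploit.

```java
public class VLongPrefixDemo {

    // Number of 7-bit groups needed for a non-negative value (at least one).
    static int groupCount(long v) {
        int n = 1;
        for (long t = v >>> 7; t != 0; t >>>= 7) {
            n++;
        }
        return n;
    }

    // Low-order group first; every byte except the last has the continuation bit set.
    static byte[] encodeLsbFirst(long v) {
        int n = groupCount(v);
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            int b = (int) (v & 0x7F);
            v >>>= 7;
            if (i < n - 1) {
                b |= 0x80;
            }
            out[i] = (byte) b;
        }
        return out;
    }

    // High-order group first; every byte except the last has the continuation bit set.
    static byte[] encodeMsbFirst(long v) {
        int n = groupCount(v);
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            int shift = 7 * (n - 1 - i);
            int b = (int) ((v >>> shift) & 0x7F);
            if (i < n - 1) {
                b |= 0x80;
            }
            out[i] = (byte) b;
        }
        return out;
    }

    // Length of the shared leading byte run of two encodings.
    static int commonPrefix(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        int i = 0;
        while (i < limit && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        long x = 1_000_000L;
        long y = 1_000_100L;
        System.out.println("LSB-first shared bytes: "
            + commonPrefix(encodeLsbFirst(x), encodeLsbFirst(y)));
        System.out.println("MSB-first shared bytes: "
            + commonPrefix(encodeMsbFirst(x), encodeMsbFirst(y)));
    }
}
```

For these two sample values the LSB-first encodings differ in their very first byte, while the MSB-first encodings share their leading byte. The actual saving depends on the distribution of the encoded values.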

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752033176 > What is the maximum value that we can see in the input bytes? All possible values is how i test > Can they ever hold `-128`? Yes! > Do we need to handle "ove

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752035773 Ok, cool. If there is not already one, we should add a test to the Panama / scalar unit test for the boundary values. -- This is an automated message from the Apache Git Service.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752036396 yeah agreed: we should test the boundaries for all 3 functions.

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on code in PR #12631: URL: https://github.com/apache/lucene/pull/12631#discussion_r1349699402 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/FieldReader.java: ## @@ -99,6 +102,26 @@ public final class FieldReader extends Terms { */

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752039360 yeah, you are right, i am wrong. the trick only works in the unsigned case, Byte.MIN_VALUE is a problem :(

[PR] add tests for vectorutils integer boundaries [lucene]

2023-10-08 Thread via GitHub
rmuir opened a new pull request, #12634: URL: https://github.com/apache/lucene/pull/12634 Let's improve the testing for the boundary cases and check them explicitly.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752041404 at least we can improve the testing out of this: https://github.com/apache/lucene/pull/12634

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub
gf2121 commented on code in PR #12631: URL: https://github.com/apache/lucene/pull/12631#discussion_r1349705693 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java: ## @@ -81,8 +81,11 @@ public final class Lucene90BlockTreeTermsRe

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752049654 don't worry, i have a plan B. it is just frustrating due to the nightmare of operating on the mac, combined with the fact this benchmark and lucene source is a separate repo. it makes the

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752050233 see latest commit for the idea. on my mac it gives a decent boost. it uses "32-bit" vector by loading 64-bit vector from array but only processing half of it. The tests should fail as i n

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on code in PR #12631: URL: https://github.com/apache/lucene/pull/12631#discussion_r1349711457 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java: ## @@ -81,8 +81,11 @@ public final class Lucene90BlockTreeTer

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on PR #12631: URL: https://github.com/apache/lucene/pull/12631#issuecomment-1752050479 I kicked off a `luceneutil` run ... I'll post results here soonish.

Re: [PR] add tests for vectorutils integer boundaries [lucene]

2023-10-08 Thread via GitHub
rmuir merged PR #12634: URL: https://github.com/apache/lucene/pull/12634

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752063622 ok on my mac i see: ``` Benchmark (size) Mode Cnt Score Error Units BinaryCosineBenchmark.cosineDistanceNew 1024 thrpt5 2.

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on PR #12631: URL: https://github.com/apache/lucene/pull/12631#issuecomment-1752064474 `luceneutil` results on `wikimediumall` look good -- looks like all noise (even for `PKLookup`), or, any signal (change) is very low, making the ~15% reduction very much worth it.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752098666 I get similar bench results, the new impl is faster. ``` Benchmark (size) Mode Cnt Score Error Units BinaryDotProductBenchmark.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752099845 My sense here is that accessing a `part` other than `0` is less performant than just reloading the data, which seems a little off.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752100681 > My sense here is that accessing a `part` other than `0` is less performant than just reloading the data, which seems a little off. It seems to have a heavy cost no matter how i do

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752101786 btw, another crazy avenue to possibly explore here another day, since we seem bottlenecked on integer multiply. We could try it on arm too. It is faster than the current binary code on my

[I] Should we have an interface VectorValues which would be implemented by [Byte/Float]VectorValues classes [lucene]

2023-10-08 Thread via GitHub
shubhamvishu opened a new issue, #12635: URL: https://github.com/apache/lucene/issues/12635 ### Description Currently, there is a lot of code duplication due to [ByteVectorValues](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ByteVectorValues.ja

[PR] Add interface VectorValues to be implemented by [Float/Byte]VectorValues [lucene]

2023-10-08 Thread via GitHub
shubhamvishu opened a new pull request, #12636: URL: https://github.com/apache/lucene/pull/12636 ### Description The classes [ByteVectorValues](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ByteVectorValues.java) and [FloatVectorValues](http

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752107370 The other thought I had around conversion costs would be to look into reinterpret+shuffle/shift/mask crap ourselves, which seems really crazy but i'm running low on ideas.

Re: [PR] LUCENE-10241: Updating OpenNLP to 1.9.4. [lucene]

2023-10-08 Thread via GitHub
epugh commented on PR #448: URL: https://github.com/apache/lucene/pull/448#issuecomment-1752112078 It would be nice if this was updated to the awesome new OpenNLP 2.x line!

Re: [PR] Enable rank-unsafe optimization of top-k hit computations by quantizing scores. [lucene]

2023-10-08 Thread via GitHub
jpountz commented on PR #12628: URL: https://github.com/apache/lucene/pull/12628#issuecomment-1752152301 I'll try to give a bit more context how I ended up here. With recent work on vector search and excitement around it, I can't prevent myself from thinking that all users who are happy to

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-08 Thread via GitHub
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1752165322 For comparison, this is how the curve (RAM required during construction vs final FST size) looks on trunk, using the god-like parameters as best I could. I sorted the results in reve

Re: [PR] Add interface VectorValues to be implemented by [Float/Byte]VectorValues [lucene]

2023-10-08 Thread via GitHub
benwtrent commented on PR #12636: URL: https://github.com/apache/lucene/pull/12636#issuecomment-1752194821 It was sort of this way before but we decided to switch it as a common interface required either: - having to use generics - an API where things weren't fully implemented or r

[I] segmentInfos.replace() doesn't set userData [lucene]

2023-10-08 Thread via GitHub
Shibi-bala opened a new issue, #12637: URL: https://github.com/apache/lucene/issues/12637 ### Description Found that the [replace method](https://github.com/qcri/solr-6/blob/master/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L875-L878) doesn't set `userData` with t

Re: [PR] Avoid NPEx if the end of the stream has been reached without reading any characters [lucene]

2023-10-08 Thread via GitHub
pzygielo commented on PR #12611: URL: https://github.com/apache/lucene/pull/12611#issuecomment-1752377046 Thanks for checking.

[PR] Explicitly return needStats flag in TermStates [lucene]

2023-10-08 Thread via GitHub
yugushihuang opened a new pull request, #12638: URL: https://github.com/apache/lucene/pull/12638 ### Description A simple API in TermStates to expose the `needStats` flag. Addresses #12617

Re: [PR] Avoid NPEx if the end of the stream has been reached without reading any characters [lucene]

2023-10-08 Thread via GitHub
dweiss merged PR #12611: URL: https://github.com/apache/lucene/pull/12611

Re: [PR] Avoid NPEx if the end of the stream has been reached without reading any characters [lucene]

2023-10-08 Thread via GitHub
dweiss commented on PR #12611: URL: https://github.com/apache/lucene/pull/12611#issuecomment-1752397871 I've applied this to main and branch_9x (9.9). Thank you.

Re: [PR] Explicitly return needStats flag in TermStates [lucene]

2023-10-08 Thread via GitHub
jpountz commented on PR #12638: URL: https://github.com/apache/lucene/pull/12638#issuecomment-1752414836 Can you explain how/when you plan to use this new API?

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-08 Thread via GitHub
dweiss commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1752416032 I didn't get into all the details but I think this looks good. Your questions are indeed intriguing - I can't provide any explanation off the top of my head, really.