Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1773038045 > Just dropping an ACK here, for now. I do get the issues, and I agree that there could be better ways to frame things at the vector API level. Let's write a proposal together i

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772982177 Just dropping an ACK here, for now. I do get the issues, and I agree that there could be better ways to frame things at the vector API level. -- This is an automated message from

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772952661 Hi, > The jvm already has these. For example a user can set max vector width and avx instruction level already. I assume that avx 512 users who are running on downclock-suscept

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772915097 > > Unfortunately this approach is slightly suboptimal for your Rocket Lake which doesn't suffer from downclocking, but it is a disaster elsewhere, so we have to play it safe. > > W

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772706786 such a method would solve 95% of my problems, if it would throw UnsupportedOperationException or return `null` if the hardware/hotspot doesn't support all the requested VectorOperators.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772702122 I would really just fix the api: instead of `IntVector.SPECIES_PREFERRED` constant which is meaningless, it should be a method taking `VectorOperation...` about how you plan to use it. it
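
A minimal sketch of the idea in the comment above, in plain Java so it runs without the incubating Vector API. Everything here is hypothetical: `Op`, `WidthProfile`, and `preferredBitsFor` are illustrative names that do not exist in `jdk.incubator.vector`, and the hardware profile is invented to mirror the throttling concern raised elsewhere in this thread (512-bit integer multiply treated as undesirable, 256-bit multiply as fine).

```java
import java.util.List;
import java.util.Set;

final class SpeciesSelectionSketch {
    // Hypothetical stand-in for the operators a caller plans to use.
    enum Op { ADD, MUL, CONVERT }

    // Hypothetical description of which operations run well at a given vector width.
    record WidthProfile(int bits, Set<Op> fastOps) {}

    // Returns the widest width at which every requested operation is marked fast.
    // Throws if no width qualifies, as the comment suggests such a method could.
    static int preferredBitsFor(List<WidthProfile> profiles, Op... ops) {
        List<Op> wanted = List.of(ops);
        int best = -1;
        for (WidthProfile p : profiles) {
            if (p.fastOps().containsAll(wanted) && p.bits() > best) {
                best = p.bits();
            }
        }
        if (best < 0) {
            throw new UnsupportedOperationException("no width supports " + wanted);
        }
        return best;
    }

    public static void main(String[] args) {
        // Invented profile for a CPU that throttles on 512-bit multiplies.
        List<WidthProfile> throttling = List.of(
            new WidthProfile(512, Set.of(Op.ADD)),
            new WidthProfile(256, Set.of(Op.ADD, Op.MUL)));
        System.out.println(preferredBitsFor(throttling, Op.ADD));
        System.out.println(preferredBitsFor(throttling, Op.ADD, Op.MUL));
    }
}
```

The point of the sketch is only that the answer depends on the requested operations: with this invented profile, an add-only loop would get the wider width while a loop that multiplies would get the narrower one.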

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772695285 Vector API should also fix its bugs. It is totally senseless to have `IntVector.SPECIES_PREFERRED` and `FloatVector.SPECIES_PREFERRED` and then always set them to '512' on every avx-512 ma

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772679957 > Unfortunately this approach is slightly suboptimal for your Rocket Lake which doesn't suffer from downclocking, but it is a disaster elsewhere, so we have to play it safe. We

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772673981 > to have any decent performance, we really need information on the CPU in question and its vector capabilities. And the idea you can write "one loop" that "runs anywhere" is an obvio

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772669239 also i think JMH is bad news when it comes to downclocking. It does not show the true performance impact of this. It slows down other things on the machine as well: the user might have oth

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772661255 to have any decent performance, we really need information on the CPU in question and its vector capabilities. And the idea you can write "one loop" that "runs anywhere" is an obvious pipe

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772656211 @ChrisHegarty there are plenty of actions we could take... but I implemented this specific same optimization in question safely in #12681 See https://en.wikipedia.org/wiki/Advanced_

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-20 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1772535368 Thanks @rmuir @gf2121. I need to spend a bit more time evaluating this. But it looks like no action is needed here?

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-19 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1770713206 Thank you @gf2121 , it is confirmed. I include just the part of the table that is relevant. It is really great that you caught this. | ID | Description

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-18 Thread via GitHub
gf2121 commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1770155713 > @gf2121 i think we could diagnose it further with https://github.com/travisdowns/avx-turbo Thanks @rmuir for the profiling guide! Sorry for the delay. It took me some time to app

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-15 Thread via GitHub
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1763463569 I backported this one to 9.x.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-14 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1763058511 @benwtrent it isn't a panama thing. these functions are 32-bit (they return `int` and `float`). There is no hope for these getting faster, I just hope you understand that.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-14 Thread via GitHub
benwtrent commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1762993804 Thank y'all so much for digging into this @rmuir @gf2121 @ChrisHegarty @uschindler ! Maybe one day Panama Vector will mature enough to allow us to do nicer things with `byte` compari

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-14 Thread via GitHub
rmuir merged PR #12632: URL: https://github.com/apache/lucene/pull/12632

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-14 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1762984112 I'm gonna merge this but we should continue to explore the intel case. Not sure what we can do there though.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-13 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1762084009 And at least the theory makes sense, this integer multiply is definitely "avx512 heavy", so if u have a cpu susceptible to throttling, better to do 256bit multiplies that we do today. I gu

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-13 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1762073195 I guess you will probably have to `modprobe msr` first. I already have the `msr` module loaded for other nefarious purposes.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-13 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1762068277 I compiled the code and ran it easily, just `git clone + make`. You do have to run it as root to get the useful output, I took a risk on my machine: ``` think:avx-turbo[master]$ sudo

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-13 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1762063378 @gf2121 i think we could diagnose it further with https://github.com/travisdowns/avx-turbo

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-13 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1762055897 @gf2121 @ChrisHegarty you can see the issue from his assembler output with the failed intel optimization: the current code does 2 x 256-bit vpmull on ymm registers, the proposed simplif

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-13 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1762051354 oh @gf2121 I missed that you added this, thank you! I am looking at it.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-11 Thread via GitHub
gf2121 commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1757126242 Got it! :) [profile.log](https://github.com/apache/lucene/files/12866842/profile.log)

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-10 Thread via GitHub
gf2121 commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1755588137 Thanks @rmuir ! I will try.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-10 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1755272993 @gf2121 maybe, if you have time, you could run benchmark with `-prof perfasm` and upload the output here? It could solve the mystery. I am curious if it is just a cpu difference,

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753372410 > In this PR is no change in square distance!? It only optimizes cosine and dotProduct. See the [first commit of this PR](132bf28ecf86f06f6a015f5797139d7dcf3d2fb0) and [the corresp

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
uschindler commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753361544 > I reran on java 21, `squareDistanceNewNew` looks faster: There is no change to square distance in this PR!? It only optimizes cosine and dotProduct.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753077559 I reran on java 21, `squareDistanceNewNew` looks faster: ``` openjdk version "21" 2023-09-19 OpenJDK Runtime Environment (build 21+35-2513) OpenJDK 64-Bit Server VM (build 21+35

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753006621 > Especially clang already makes a reasonable choice that's only sub-optimal because of CPU quirks (32x32 => 32-bit SIMD multiplication costs more on recent Intel microarchitectures than 2

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1753003897 > Oh! I didn't look back as far as the original commit on this PR, sorry. I see now that @rmuir tried exactly the same thing. > > @gf2121 Strange that we see different results. Could

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752963144 Oh! I didn't look back as far as the original commit on this PR, sorry. I see now that @rmuir tried exactly the same thing. @gf2121 Strange that we see different results. C

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
gf2121 commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752758194 > Benchmark (size) Mode Cnt Score Error Units BinaryDotProductBenchmark.dotProductNew 1024 thrpt 5 20.675 ± 0.051 ops/us Binar

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-09 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752723993 @rmuir Building on your idea, and focusing again on the x64 case, I get a bit of a boost by just converting directly to int (rather than the short dance). On my Rocket

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752107370 The other thought I had around conversion costs would be to look into reinterpret+shuffle/shift/mask crap ourselves, which seems really crazy but i'm running low on ideas.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752101786 btw, another crazy avenue to possibly explore here another day, since we seem bottlenecked on integer multiply. We could try it on arm too. It is faster than the current binary code on my

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752100681 > My sense here is that accessing a `part` other than `0` is less performant than just reloading the data, which seems a little off. It seems to have a heavy cost no matter how i do

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752099845 My sense here is that accessing a `part` other than `0` is less performant than just reloading the data, which seems a little off.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752098666 I get similar bench results, the new impl is faster. ``` Benchmark (size) Mode Cnt Score Error Units BinaryDotProductBenchmark.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752063622 ok on my mac i see: ``` Benchmark (size) Mode Cnt Score Error Units BinaryCosineBenchmark.cosineDistanceNew 1024 thrpt 5 2.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752050233 see latest commit for the idea. on my mac it gives a decent boost. it uses "32-bit" vector by loading 64-bit vector from array but only processing half of it. The tests should fail as i n

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752049654 don't worry, i have a plan B. it is just frustrating due to the nightmare of operating on the mac, combined with the fact this benchmark and lucene source is a separate repo. it makes the

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752041404 at least we can improve the testing out of this: https://github.com/apache/lucene/pull/12634

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752039360 yeah, you are right, i am wrong. the trick only works in the unsigned case, Byte.MIN_VALUE is a problem :(

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752036396 yeah agreed: we should test the boundaries for all 3 functions.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752035773 Ok, cool. If there is not already one, we should add a test to the Panama / scalar unit test for the boundary values.

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752033176 > What is the maximum value that we can see in the input bytes? All possible values is how i test > Can they ever hold `-128`? Yes! > Do we need to handle "ove

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752029230 And of course, `ZERO_EXTEND_S2I`, will work in the maximum boundary case, but not in others. So the question is then just about the maximum value of the bytes in these input arrays
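
The boundary problem discussed here can be checked with scalar arithmetic. The sketch below is not the PR's code. It assumes, as the snippet further down suggests, that two byte-by-byte products are summed in 16 bits before being widened to `int`. The helpers `signExtend` and `zeroExtend` are scalar stand-ins for what `VectorOperators.S2I` and `VectorOperators.ZERO_EXTEND_S2I` do per lane, and all method names are illustrative.

```java
final class ByteProductOverflowSketch {
    // Sum of two byte products kept in 16 bits; the final cast may wrap around.
    static short sum16(byte a1, byte b1, byte a2, byte b2) {
        short p1 = (short) (a1 * b1); // a single product always fits: at most 16384
        short p2 = (short) (a2 * b2);
        return (short) (p1 + p2);
    }

    // The reference result computed entirely in 32 bits.
    static int exact(byte a1, byte b1, byte a2, byte b2) {
        return a1 * b1 + a2 * b2;
    }

    // Scalar stand-in for a sign-extending short-to-int conversion.
    static int signExtend(short s) {
        return s;
    }

    // Scalar stand-in for a zero-extending short-to-int conversion.
    static int zeroExtend(short s) {
        return s & 0xFFFF;
    }

    public static void main(String[] args) {
        byte min = Byte.MIN_VALUE;

        // Worst case: 16384 + 16384 = 32768, one more than Short.MAX_VALUE.
        short worst = sum16(min, min, min, min);
        System.out.println(exact(min, min, min, min)); // 32768
        System.out.println(signExtend(worst));         // -32768 (wrong)
        System.out.println(zeroExtend(worst));         // 32768 (right, but only here)

        // Any negative sum breaks the zero-extending variant.
        short negative = sum16((byte) 1, (byte) -1, (byte) 0, (byte) 0);
        System.out.println(signExtend(negative));      // -1 (right)
        System.out.println(zeroExtend(negative));      // 65535 (wrong)
    }
}
```

The only input that overflows the 16-bit sum is the one where all four bytes are `Byte.MIN_VALUE`, which is why the boundary values are the ones worth testing.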

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752024575 ``` // sum into accumulators Vector prod16 = prod16_1.add(prod16_2); acc = acc.add(prod16.convert(VectorOperators.S2I, 0)); acc = acc.add(prod16.convert(VectorOper

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-08 Thread via GitHub
ChrisHegarty commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1752003494 Thanks for looking into this @rmuir, I've been thinking similar myself (just didn't get around to anything other than the thinking! ) On my Mac M2. JDK 20.0.2. ```

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-07 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1751939374 I don't know how to do the same tricks for the BinarySquare one due to the subtraction. So I'm done for now. I think given the reports from @gf2121 the 256/512-bit experiment was a