jpountz commented on code in PR #14203:
URL: https://github.com/apache/lucene/pull/14203#discussion_r1997535617
##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PointsWriter.java:
##
@@ -105,15 +107,22 @@ public Lucene90PointsWriter(
}
}
+  public Luce…
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2727214320
> Sorry for making it hard for you to move this PR forward, I was a bit annoyed that we needed something complicated to speed things up, I like the simplicity of specializedDecodeMaskInRe…
jpountz commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2726996663
Again, thanks a lot for running benchmarks.
> I can refactor the code to the specialized decoding if it makes sense to you

That would be great, thank you. Sorry for making i…
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2723963174
@jpountz Hi, do you have any idea how we should move forward on this optimization? Several thoughts:
* We can add another step32 for the hybrid-step decoding, which makes the code…
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2726739699
On the AVX-512 machine:
* Specialized read does not vectorize the remainder loop; it seems the compiler failed to inline it.
* Specialized decode vectorizes the remainder loop.
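The loop shape under discussion can be sketched as follows. This is a hedged, self-contained illustration (the class, method, and packing layout are hypothetical, not Lucene's actual code): a fixed-step main loop whose constant trip count is friendly to JIT unrolling and auto-vectorization, followed by a short scalar remainder loop of the kind that may or may not get vectorized.

```java
public class HybridStepDecode {
  static final int STEP = 32; // hypothetical inner-loop step

  // Each int in `packed` holds two 16-bit values; expand them into `out`.
  // Assumes `count` is even; layout (high halves first, then low halves)
  // is illustrative only.
  static void decode16(int[] packed, int[] out, int count) {
    int pairs = count / 2;
    int upTo = pairs - (pairs % STEP);
    int i = 0;
    // Main loop: trip count is a multiple of STEP, friendly to SIMD.
    for (; i < upTo; i++) {
      out[i] = packed[i] >>> 16;
      out[pairs + i] = packed[i] & 0xFFFF;
    }
    // Remainder loop: short, may stay scalar if not inlined/vectorized.
    for (; i < pairs; i++) {
      out[i] = packed[i] >>> 16;
      out[pairs + i] = packed[i] & 0xFFFF;
    }
  }

  public static void main(String[] args) {
    int count = 70; // 35 pairs: exercises both the main and remainder loop
    int[] packed = new int[count / 2];
    int[] out = new int[count];
    for (int i = 0; i < packed.length; i++) {
      packed[i] = (i << 16) | (i + 1000);
    }
    decode16(packed, out, count);
    if (out[0] != 0 || out[34] != 34) throw new AssertionError();
    if (out[35] != 1000 || out[69] != 1034) throw new AssertionError();
    System.out.println("ok");
  }
}
```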
jpountz commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2726015514
Thanks for running benchmarks. So it looks like the JVM doesn't think these shorter loops (with step 128) are worth unrolling? This makes me wonder how something like that performs on y…
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2725390772
> There must be something that happens with this 512 step that doesn't
happen otherwise such as using different instructions, loop unrolling, better
CPU pipelining or something else.
jpountz commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2724038977
I have some small concerns:
- The fact that the 512 step is tied to the number of points per leaf, though it's not a big deal at all, postings are similar: their encoding logic is sp…
github-actions[bot] commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2699339952
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributions.
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2664968205
Confused +1 ... but here is the comparison of step512 (baseline) and step32 (candidate):
```
Task    QPS baseline    StdDev    QPS my_modified_version    StdDev…
```
jpountz commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2663774846
Thanks for running benchmarks. I'm confused as to why running inner loops of size 512 would be so much better than inner loops of size 32. This doesn't feel right? Does luceneutil also r…
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2659556195
[perf_asm.log](https://github.com/user-attachments/files/18801016/perf_asm.log)
Profile suggests that loops get vectorized.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2659531719
Results on my machines are a bit disappointing
```
java -version
openjdk version "23.0.2" 2025-01-21
OpenJDK Runtime Environment (build 23.0.2+7-58)
OpenJDK 64-Bit Server VM…
```
jpountz commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2659510128
Yes, exactly.
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2659505128
Thanks for the feedback!
And sorry for my poor English. Do you mean something like this by `single batch size of 16 or 32`?
```
private static void readDelta16(IndexInput in, i…
```
jpountz commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2659455382
Thanks for iterating and running benchmarks. I played with the micro-benchmark and I get almost the same result if I use a single batch size of 16 or 32 (AMD Ryzen with AVX2 but no AVX-5…
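A fixed batch size such as 16 might look like the following minimal sketch. It is a hedged, self-contained stand-in, not Lucene's actual `readDelta16`: the names, the array-based input (replacing `IndexInput`), and the delta-against-a-base encoding are all assumptions for illustration. The point is the constant trip count with independent iterations, which the JIT can unroll and auto-vectorize.

```java
public class FixedBatchDecode {
  // Decode a fixed batch of 16 deltas against a common base. The
  // constant trip count and independent iterations give the JIT a
  // loop it can readily unroll and auto-vectorize.
  static void decodeDelta16(int[] deltas, int offset, int base, int[] out) {
    for (int i = 0; i < 16; i++) {
      out[offset + i] = base + deltas[offset + i];
    }
  }

  // Decode `count` values in batches of 16, with a scalar tail for
  // the remainder (the loop whose vectorization the thread discusses).
  static void decode(int[] deltas, int base, int[] out, int count) {
    int i = 0;
    for (; i + 16 <= count; i += 16) {
      decodeDelta16(deltas, i, base, out);
    }
    for (; i < count; i++) {
      out[i] = base + deltas[i];
    }
  }

  public static void main(String[] args) {
    int[] deltas = new int[37]; // 2 full batches + a 5-value tail
    for (int i = 0; i < deltas.length; i++) deltas[i] = i;
    int[] out = new int[37];
    decode(deltas, 100, out, 37);
    if (out[0] != 100 || out[36] != 136) throw new AssertionError();
    System.out.println("ok");
  }
}
```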
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2658082076
Comparison of VectorAPI (baseline) and InnerLoop (candidate):
```
Task    QPS baseline    StdDev    QPS my_modified_version    StdDev    Pct diff…
```
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2657334178
> is my understanding correct that it performs even better?
Yeah!
jpountz commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2657293043
These results look even better than the results that you had previously reported for the vector API; is my understanding correct that it performs even better?
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2653948712
I refactored the code to the inner-loop approach. Results on wikimediumall (AVX-512):
```
Task    QPS baseline    StdDev    QPS my_modified_version    StdDev    Pct d…
```
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2653001052
Inner-loop performance gets better on the newest commit.
```
Mac M2
Benchmark    (bpv)    (countVariable)    Mode    Cnt    Score    Error    Units
BKDCodec…
```
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2652803442
> applied the 0xFF mask to scratch in the shift loop

This helps generate `vpand` in the assembly, but does not help performance much.

> Sorry for pushing

Not at all, it's in…
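Applying the 0xFF-style mask inside the shift loop could look like the sketch below. This is a hedged illustration, not Lucene's actual bpv=24 decode: the class name and the exact packing layout (3 scratch ints holding 4 packed 24-bit values) are assumptions. The explicit masks in the loop body are the operations that can show up as `vpand` when the loop vectorizes.

```java
public class Decode24 {
  // Hypothetical bpv=24 layout: scratch[i], scratch[q+i], scratch[2q+i]
  // together hold 4 packed 24-bit values, written to out[i], out[q+i],
  // out[2q+i], out[3q+i]. Masks (0xFF, 0xFFFF, 0xFFFFFF) sit inside the
  // shift loop, encouraging the JIT to emit vpand and vectorize.
  static void decode24(int[] scratch, int[] out, int q) {
    for (int i = 0; i < q; i++) {
      int a = scratch[i];
      int b = scratch[q + i];
      int c = scratch[2 * q + i];
      out[i] = a >>> 8;
      out[q + i] = ((a & 0xFF) << 16) | (b >>> 16);
      out[2 * q + i] = ((b & 0xFFFF) << 8) | (c >>> 24);
      out[3 * q + i] = c & 0xFFFFFF;
    }
  }

  public static void main(String[] args) {
    int q = 2;
    int[] expected = new int[4 * q];
    for (int k = 0; k < expected.length; k++) expected[k] = (k + 1) * 0x010101;
    // Pack the expected 24-bit values into scratch, inverting decode24.
    int[] scratch = new int[3 * q];
    for (int i = 0; i < q; i++) {
      int v0 = expected[i], v1 = expected[q + i];
      int v2 = expected[2 * q + i], v3 = expected[3 * q + i];
      scratch[i] = (v0 << 8) | (v1 >>> 16);
      scratch[q + i] = ((v1 & 0xFFFF) << 16) | (v2 >>> 8);
      scratch[2 * q + i] = ((v2 & 0xFF) << 24) | v3;
    }
    int[] out = new int[4 * q];
    decode24(scratch, out, q);
    if (!java.util.Arrays.equals(out, expected)) throw new AssertionError();
    System.out.println("ok");
  }
}
```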
jpountz commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2652267868
> current bpv=24 gets vectorized on the shift loop, but not for the remainder loop.

This is an interesting observation. I wonder if a small refactoring could help it get auto-vec…
gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2651208564
Thanks for the feedback! I implemented the fixed-size inner loop and printed out the assembly for all variants.
[perf_asm.log](https://github.com/user-attachments/files/18752147/perf_asm.log)
* When pr…
jpountz commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2649210893
Thanks for looking into it. Were you able to confirm that the difference with the variable count is indeed that auto-vectorization is not getting enabled, as opposed to something else such a…