On Mon, 26 May 2025 06:51:12 GMT, Emanuel Peter <epe...@openjdk.org> wrote:
>> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs >> for X86 platforms [1]. However, the current implementation is not optimal >> for AArch64 SVE platform, which natively supports vector instructions for >> subword gather load operations using an int vector for indices (see [2][3]). >> >> Two key areas require improvement: >> 1. At the Java level, vector indices generated for range validation could be >> reused for the subsequent gather load operation on architectures with native >> vector instructions like AArch64 SVE. However, the current implementation >> prevents compiler reuse of these index vectors due to divergent control >> flow, potentially impacting performance. >> 2. At the compiler IR level, the additional `offset` input for >> `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR >> complexity and complicates backend implementation. Furthermore, generating >> `add` instructions before each memory access negatively impacts performance. >> >> This patch refactors the implementation at both the Java level and compiler >> mid-end to improve efficiency and maintainability across different >> architectures. >> >> Main changes: >> 1. Java-side API refactoring: >> - Explicitly passes generated index vectors to hotspot, eliminating >> duplicate index vectors for gather load instructions on >> architectures like AArch64. >> 2. C2 compiler IR refactoring: >> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword >> types by removing the memory offset input and incorporating it into the >> memory base `addr` at the IR level. This simplifies backend implementation, >> reduces add operations, and unifies the IR across all types. >> 3. Backend changes: >> - Streamlines X86 implementation of subword gather operations following >> the removal of the offset input from the IR level. >> >> Performance: >> The performance of the relative JMH improves up to 27% on a X86 AVX512 >> system. Please see the data below: >> >> Benchmark Mode Cnt Unit >> SIZE Before After Gain >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms >> 64 53682.012 52650.325 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms >> 256 14484.252 14255.156 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms >> 1024 3664.900 3595.615 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms >> 4096 908.31... > > src/hotspot/share/opto/vectorIntrinsics.cpp line 1203: > >> 1201: // Class<? extends Vector<Integer>> vectorIndexClass, >> int indexLength, >> 1202: // Object base, long offset, >> 1203: // W indexVector1, W index_vector2, W index_vector3, W >> index_vector4, > > Are we doing a mix of `CamelCase` and `with_underscore` on purpose? This would be a naming style issue. Thanks for pointing out. I will fix it once I update this PR. It should be aligned with definition in `VectorSupport.java`. > src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java line > 495: > >> 493: Class<? extends Vector<Integer>> vectorIndexClass, >> 494: int indexLength, Object base, long offset, >> 495: W indexVector1, W indexVector2, W indexVector3, W >> indexVector4, > > Here you went for all `camelCase`, just an observation :) Yeah, I choose to use `camelCase` style here to align with other parameter naming styles in this method. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25138#discussion_r2106693179 PR Review Comment: https://git.openjdk.org/jdk/pull/25138#discussion_r2106692107