On Mon, 26 May 2025 06:51:12 GMT, Emanuel Peter <epe...@openjdk.org> wrote:

>> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs 
>> for X86 platforms [1]. However, the current implementation is not optimal 
>> for the AArch64 SVE platform, which natively supports vector instructions 
>> for subword gather load operations using an int vector for indices (see [2][3]).
>> 
>> Two key areas require improvement:
>> 1. At the Java level, vector indices generated for range validation could be 
>> reused for the subsequent gather load operation on architectures with native 
>> vector instructions like AArch64 SVE. However, the current implementation 
>> prevents compiler reuse of these index vectors due to divergent control 
>> flow, potentially impacting performance.
>> 2. At the compiler IR level, the additional `offset` input for 
>> `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR 
>> complexity and complicates backend implementation. Furthermore, generating 
>> `add` instructions before each memory access negatively impacts performance.
>> 
>> This patch refactors the implementation at both the Java level and compiler 
>> mid-end to improve efficiency and maintainability across different 
>> architectures.
>> 
>> Main changes:
>> 1. Java-side API refactoring:
>>    - Explicitly passes the generated index vectors to hotspot, eliminating 
>>      duplicate index vectors for gather load instructions on architectures 
>>      like AArch64.
>> 2. C2 compiler IR refactoring:
>>    - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword 
>> types by removing the memory offset input and incorporating it into the 
>> memory base `addr` at the IR level. This simplifies backend implementation, 
>> reduces add operations, and unifies the IR across all types.
>> 3. Backend changes:
>>    - Streamlines the X86 implementation of subword gather operations 
>>      following the removal of the offset input at the IR level.
>> 
>> Performance:
>> The relevant JMH benchmarks improve by up to 27% on an X86 AVX512 system. 
>> Please see the data below:
>> 
>> Benchmark                                      Mode   Cnt  Unit    SIZE  Before     After      Gain
>> GatherOperationsBenchmark.microByteGather128   thrpt  30   ops/ms  64    53682.012  52650.325  0.98
>> GatherOperationsBenchmark.microByteGather128   thrpt  30   ops/ms  256   14484.252  14255.156  0.98
>> GatherOperationsBenchmark.microByteGather128   thrpt  30   ops/ms  1024   3664.900   3595.615  0.98
>> GatherOperationsBenchmark.microByteGather128   thrpt  30   ops/ms  4096    908.31...
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 1203:
> 
>> 1201: //                 Class<? extends Vector<Integer>> vectorIndexClass, int indexLength,
>> 1202: //                 Object base, long offset,
>> 1203: //                 W indexVector1, W index_vector2, W index_vector3, W index_vector4,
> 
> Are we doing a mix of `CamelCase` and `with_underscore` on purpose?

This is a naming style inconsistency, thanks for pointing it out. I will fix it 
when I update the PR so that it aligns with the definition in `VectorSupport.java`.
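
For context on the quoted change description: at the Java level the subword gather 
still goes through the public `fromArray` overload that takes an `int[]` index map. 
A rough usage sketch follows; the class and method names (`SubwordGatherSketch`, 
`gatherBytes`) are illustrative only and are not part of this patch, which changes 
what gets passed down to the intrinsic underneath this API rather than the API itself.

    import jdk.incubator.vector.ByteVector;
    import jdk.incubator.vector.VectorSpecies;

    // Illustrative example; names are not from the patch.
    public class SubwordGatherSketch {
        static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

        // Gather SPECIES.length() bytes from 'data' at the positions in 'indexMap'.
        // The index map is range-checked against data.length before the gather
        // is performed.
        static ByteVector gatherBytes(byte[] data, int[] indexMap) {
            return ByteVector.fromArray(SPECIES, data, 0, indexMap, 0);
        }

        public static void main(String[] args) {
            byte[] data = new byte[64];
            for (int i = 0; i < data.length; i++) {
                data[i] = (byte) i;
            }
            int[] indexMap = new int[SPECIES.length()];
            for (int i = 0; i < indexMap.length; i++) {
                indexMap[i] = indexMap.length - 1 - i;   // reverse the first lanes
            }
            System.out.println(gatherBytes(data, indexMap));
        }
    }

Compile and run with `--add-modules jdk.incubator.vector`, since the Vector API is 
still an incubator module.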

> src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java line 495:
> 
>> 493:                   Class<? extends Vector<Integer>> vectorIndexClass,
>> 494:                   int indexLength, Object base, long offset,
>> 495:                   W indexVector1, W indexVector2, W indexVector3, W indexVector4,
> 
> Here you went for all `camelCase`, just an observation :)

Yeah, I chose the `camelCase` style here to align with the naming of the other 
parameters in this method.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25138#discussion_r2106693179
PR Review Comment: https://git.openjdk.org/jdk/pull/25138#discussion_r2106692107
