On Mon, 23 Mar 2026 18:43:41 GMT, Andrew Haley <[email protected]> wrote:

>> Please use [this 
>> link](https://github.com/openjdk/jdk/pull/28541/changes?w=1) to view the 
>> files changed.
>> 
>> Profile counters scale very badly.
>> 
>> The overhead for profiled code isn't too bad with one thread, but as the 
>> thread count increases, things go wrong very quickly.
>> 
>> For example, here's a benchmark from the OpenJDK test suite, run at 
>> TieredLevel 3 with one thread, then three threads:
>> 
>> 
>> Benchmark (randomized) Mode Cnt Score Error Units
>> InterfaceCalls.test2ndInt5Types false avgt 4 27.468 ± 2.631 ns/op
>> InterfaceCalls.test2ndInt5Types false avgt 4 240.010 ± 6.329 ns/op
>> 
>> 
>> This slowdown is caused by high memory contention on the profile counters. 
>> Not only is this slow, but it can also lose profile counts.
>> 
>> This patch is for C1 only. It'd be easy to randomize C1 counters as well in 
>> another PR, if anyone thinks it's worth doing.
>> 
>> One other thing to note is that randomized profile counters degrade very 
>> badly with small decimation ratios. For example, using a ratio of 2 with 
>> `-XX:ProfileCaptureRatio=2` with a single thread results in
>> 
>> 
>> Benchmark                        (randomized)  Mode  Cnt   Score   Error  
>> Units
>> InterfaceCalls.test2ndInt5Types         false  avgt    4  80.147 ± 9.991  
>> ns/op
>> 
>> 
>> The problem is that the branch prediction rate drops away very badly, 
>> leading to many mispredictions. It only really makes sense to use higher 
>> decimation ratios, e.g. 64.
>
> Andrew Haley has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fix up any out-of-range offsets

Had some cycles to look at it, while the contention with other work is not high 
on Monday morning. (Clunky pun intended.) Impressive work!

make/Hsdis.gmk line 47:

> 45:     CAPSTONE_ARCH := CS_ARCH_$(CAPSTONE_ARCH_AARCH64_NAME)
> 46:     CAPSTONE_MODE := CS_MODE_ARM
> 47:   else ifeq ($(call isTargetCpuArch, arm), true)

Looks not relevant for this PR, upstream it separately?

src/hotspot/cpu/aarch64/c1_FrameMap_aarch64.cpp line 205:

> 203:   if (ProfileCaptureRatio > 1) {
> 204:     // Use the highest remaining register for r_profile_rng.
> 205:     r_profile_rng = *remaining.rbegin();

OK, so we reserve `r26` or `r27` for RNG counter, right?

Is this a good trade for level=1 C1 code that users run with 
`-XX:TieredStopAtLevel=1` for that sake of startup performance? Why can't / 
shouldn't we use the PRNG state straight in `JavaThread`? That would also 
obviate the need to save/restore this register, thus simplifying the machinery 
and avoiding subtle bugs.

It is even worse on x86 that does not have too many registers to begin with. I 
wonder if there is a way to sense on this level if we are compiling for 
tier=2,3 or tier=1, and only reserve on tier=2,3?

If we going to reserve more registers, maybe start writing up release note with 
possible caveats.

src/hotspot/cpu/aarch64/c1_FrameMap_aarch64.hpp line 150:

> 148:     }
> 149: 
> 150:     // Use r26 for randomized profile captures.

...or `r27`?

src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1235:

> 1233:   assert(masm()->is_C1_MacroAssembler(), "must be");
> 1234: 
> 1235:   Label nope;

Style: `L_nope`.

src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1271:

> 1269:   int profile_capture_ratio = ProfileCaptureRatio;
> 1270:   int ratio_shift = exact_log2(profile_capture_ratio);
> 1271:   auto threshold = (1ull << 32) >> ratio_shift;

Is this just `auto threshold = 1 << (32 - ratio_shift)`?

src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1316:

> 1314:       = __ form_address(rscratch2, mdo,
> 1315:                         md->byte_offset_of_slot(data, 
> DataLayout::flags_offset()),
> 1316:                         LogBytesPerWord);

Oh, OK, cute. So this encodes the address more efficiently, knowing the address 
is in the units of heap words? Can be upstreamed separately then?

src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1987:

> 1985: 
> 1986: void LIR_Assembler::align_call(LIR_Code code) {
> 1987:   __ save_profile_rng();

Looks misplaced. Do you want to do it in `LIR_Assembler::call`, or something 
gets in the way?

src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 2563:

> 2561:       __ adjust_mdo_address(&counter_address, dest_opr->type());
> 2562:     }
> 2563:     if (step->is_register()) {

I remember looking at this before. Do we even have the cases where step is not 
a constant? Might simplify some code...

src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.cpp line 287:

> 285: void C1_MacroAssembler::step_random(Register state, Register temp, 
> Register data) {
> 286:   if (VM_Version::supports_crc32()) {
> 287:     /* CRC used as a psuedo-random-number generator */

"pseudo"

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 2828:

> 2826:   if (value < 0)  { incrementw(reg, -value);      return; }
> 2827:   if (value == 0) {                               return; }
> 2828:   if (value < (1 << 24)) { subw(reg, reg, value); return; }

Oh, another nice micro-optimization, I see? More stuff to upstream separately?

src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 2527:

> 2525:           break;
> 2526:         }
> 2527:         // On 32-bit platforms, 64-bit profile counters are never used

Assert this and remove the commented code.

src/hotspot/cpu/arm/c1_LIRGenerator_arm.cpp line 778:

> 776:       // Call out to runtime because we don't have enough registers to
> 777:       // expand compareAndSet(long) inline.
> 778:       arm_cas_long(addr, cmp_value, new_value, result);

Um. Is this still true at the end? How does this manifest? C1 regalloc cannot 
give you enough registers when you are emitting the profiling counter update, 
or?

src/hotspot/cpu/arm/c1_LIRGenerator_arm.cpp line 1307:

> 1305:   __ cmp(lir_cond(cond), left, right);
> 1306:   profile_branch(x, cond);
> 1307:   // If we're subsampling counter updates, then profiling code kills 
> flags

And are we sure on AArch64 it does not? I don't see a similar change there. 

I see at least this in `LIR_Assembler::increment_profile_ctr` on AArch64:


        if (!step->is_constant()) {
          // If step is 0, make sure the stub check below always fails
          __ cmp(step->as_register(), (u1)0);
          __ mov(rscratch1, InvocationCounter::count_increment * 
ProfileCaptureRatio);
          __ csel(dest, dest, rscratch1, __ NE);
        }


Also ugh, doing two tests just to restore the flags. But that one seems 
unavoidable.

src/hotspot/cpu/arm/c1_MacroAssembler_arm.cpp line 256:

> 254: 
> 255: void C1_MacroAssembler::step_random(Register state, Register temp, 
> Register data) {
> 256:   // if (VM_Version::supports_crc32()) {

Can be just deleted, IMO.

src/hotspot/cpu/x86/c1_LIRAssembler_x86.cpp line 1285:

> 1283:     auto threshold = (1ull << 32) >> ratio_shift;
> 1284:     __ cmpl(r_profile_rng, threshold);
> 1285:     __ jcc(Assembler::aboveEqual, nope);

`jccb`

src/hotspot/share/c1/c1_Compilation.cpp line 366:

> 364: }
> 365: 
> 366: THREAD_LOCAL const char *compilation_name;

Debugging leftover?

src/hotspot/share/c1/c1_LIR.cpp line 1082:

> 1080:     (_step, _result, _freq_op,
> 1081:      _md_reg, _md_op, _md_offset_op, _overflow_stub);
> 1082:   if (overflow_stub()) {

`overflow_stub() != nullptr`

src/hotspot/share/compiler/compiler_globals.hpp line 391:

> 389:           "information at the bailout point")                            
>    \
> 390:                                                                          
>    \
> 391:   product(int, ProfileCaptureRatio, 1, EXPERIMENTAL,                     
>    \

Put it at `64`, if you believe this is a right tradeoff.

src/hotspot/share/runtime/javaThread.cpp line 546:

> 544:     do {
> 545:       state = os::random();
> 546:     } while (state == 0);

Curious: why should not the initial state be zero?

-------------

PR Review: https://git.openjdk.org/jdk/pull/28541#pullrequestreview-4030481858
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3009725177
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3009990535
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3009993506
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010000039
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010014074
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010038774
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010051720
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010062967
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010088991
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010116717
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010122301
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010159281
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010194360
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010200886
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010219672
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010251921
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010260753
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010283586
PR Review Comment: https://git.openjdk.org/jdk/pull/28541#discussion_r3010294250

Reply via email to