On 3/28/2025 8:28 PM, Alex Deucher wrote:
On Thu, Mar 27, 2025 at 9:50 AM Christian König
<[email protected]> wrote:
On 27.03.25 at 10:37, SRINIVASAN SHANMUGAM wrote:
On 3/27/2025 2:54 PM, Christian König wrote:
Overall, this change doesn't seem to make much sense to me.
Why exactly is isolation->spearhead not pointing to the dummy kernel job we
submit?
Does the owner check or gang_submit check in
amdgpu_device_enforce_isolation() fail to set up the spearhead?
I'm currently debugging exactly that.
Good news is that I can reproduce the problem.
I have to take that back. I've tested the cleaner shader functionality a bit
this morning and as far as I can see this works exactly as intended.
Srini, what exactly is your use case which doesn't work?
Hi Christian, Good Morning!
The use case is to trigger the cleaner shader through the "run_cleaner_shader"
sysfs entry, independent of "enforce_isolation", so that the cleaner shader
packet gets submitted to the COMP_1.0.0 ring by default, without first enabling
enforce_isolation via sysfs.
I've tested exactly that and it seems to work perfectly fine:
kworker/u96:1-209 [020] ..... 86.655999: amdgpu_isolation:
prev=0000000000000000, next=ffffffffffffffff
kworker/u96:1-209 [020] ..... 86.656190: amdgpu_cleaner_shader:
ring=gfx_0.0.0, seqno=2
<...>-11 [022] ..... 150.607688: amdgpu_isolation:
prev=ffffffffffffffff, next=0000000000000000
kworker/u96:0-11 [022] ..... 150.608228: amdgpu_cleaner_shader:
ring=comp_1.0.0, seqno=2
kworker/u96:0-11 [022] ..... 150.620597: amdgpu_isolation:
prev=0000000000000000, next=ffffffffffffffff
kworker/u96:0-11 [022] ..... 150.620624: amdgpu_cleaner_shader:
ring=gfx_0.0.0, seqno=1527
The only thing which might be confusing is that when you issue the cleaner
shader multiple times while the GPU is idle, it only runs once.
But that should be easy to change if necessary.
The problem is that it doesn't take KFD jobs into account. We need to
be able to run the cleaner shader even if there have been no KGD jobs.
Alex
Thanks a lot for pointing that out, Alex!
Yes, I think this happens because the "run_cleaner_shader" sysfs entry is
not associated with any owner, since it comes from an empty kernel job.
[With the old enforce_isolation code I typically ran "run_cleaner_shader"
via sysfs in terminal mode and expected the cleaner shader to be
triggered, and I was expecting the same with the new enforce_isolation
code, before running any live app such as IGT_COMP.]
With the new code it seems to work like this: the driver first checks for
owners, so the "run_cleaner_shader" sysfs entry only submits the cleaner
shader packet if some app (for example IGT_COMP) has already submitted
jobs. This holds even though we don't enable "enforce_isolation" via sysfs
before running the IGT_COMP app:
root@rtg-navi32:/home/rtg# ./install.sh -> Install amdgpu driver
rm ....
cp ....
unloading existing amdgpu driver ...
loading amdgpu driver ...
root@rtg-navi32:/home/rtg#
root@rtg-navi32:/home/rtg# ./run_igt_test_COMPUTE.sh
IGT-Version: 1.28-ga7ef4e2ba (x86_64) (Linux:
6.12.0-amdstaging-drm-next-lol-050225 x86_64)
Using IGT_SRANDOM=1743174485 for randomisation
Opened device: /dev/dri/card0
Initialized amdgpu, driver version 3.63
amdgpu: GFX1101 (family_id, chip_external_rev): (145, 32)
amdgpu: chip_class GFX11
Starting subtest: cs-compute-with-IP-COMPUTE
Starting dynamic subtest: cs-compute
Dynamic subtest cs-compute: SUCCESS (0.131s)
Subtest cs-compute-with-IP-COMPUTE: SUCCESS (0.131s)
root@rtg-navi32:/home/rtg#
root@rtg-navi32:/home/rtg#
root@rtg-navi32:/home/rtg#
root@rtg-navi32:/home/rtg# ./run_cleaner_shader.sh
root@rtg-navi32:/home/rtg#
Dmesg:
~$ sudo dmesg -C && sudo dmesg -w
[46759.723734] Console: switching to colour dummy device 80x25
[46759.858134] amdgpu 0000:0b:00.0: amdgpu: amdgpu: finishing device.
[46760.059772] [drm] amdgpu: ttm finalized
[46763.899566] ACPI: bus type drm_connector unregistered
[46764.223941] ACPI: bus type drm_connector registered
[46766.048868] Setting dangerous option gpu_recovery - tainting kernel
[46766.048876] Setting dangerous option no_queue_eviction_on_vm_fault -
tainting kernel
[46766.048880] Setting dangerous option halt_if_hws_hang - tainting kernel
[46766.132452] [drm] amdgpu kernel modesetting enabled.
[46766.160768] amdgpu: Virtual CRAT table created for CPU
[46766.161561] amdgpu: Topology: Add CPU node
[46766.162282] [drm] initializing kernel modesetting (IP DISCOVERY
0x1002:0x747E 0x1002:0x0E37 0xFF).
[46766.162322] [drm] register mmio base: 0xFCC00000
[46766.162325] [drm] register mmio size: 1048576
[46766.172772] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 0
<soc21_common>
[46766.172778] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 1
<gmc_v11_0>
[46766.172782] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 2
<ih_v6_0>
[46766.172786] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 3 <psp>
[46766.172789] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 4 <smu>
[46766.172793] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 5 <dm>
[46766.172796] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 6
<gfx_v11_0>
[46766.172800] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 7
<sdma_v6_0>
[46766.172803] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 8
<vcn_v4_0>
[46766.172807] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 9
<jpeg_v4_0>
[46766.172810] amdgpu 0000:0b:00.0: amdgpu: detected ip block number 10
<mes_v11_0>
[46766.186911] amdgpu 0000:0b:00.0: No more image in the PCI ROM
[46766.186937] amdgpu 0000:0b:00.0: amdgpu: Fetched VBIOS from ROM BAR
[46766.186944] amdgpu: ATOM BIOS: 113-D7120601-4
[46766.188597] amdgpu 0000:0b:00.0: amdgpu: CP RS64 enable
[46766.190411] amdgpu 0000:0b:00.0: vgaarb: deactivate vga console
[46766.190417] amdgpu 0000:0b:00.0: amdgpu: Trusted Memory Zone (TMZ)
feature not supported
[46766.190433] amdgpu 0000:0b:00.0: amdgpu: MODE1 reset
[46766.190437] amdgpu 0000:0b:00.0: amdgpu: GPU mode1 reset
[46766.190611] amdgpu 0000:0b:00.0: amdgpu: GPU smu mode1 reset
[46766.711756] amdgpu 0000:0b:00.0: amdgpu: MEM ECC is not presented.
[46766.711764] amdgpu 0000:0b:00.0: amdgpu: SRAM ECC is not presented.
[46766.711805] amdgpu 0000:0b:00.0: amdgpu: DF poison setting is
inconsistent(1:0:0:0)!
[46766.711811] amdgpu 0000:0b:00.0: amdgpu: Poison setting is
inconsistent in DF/UMC(0:1)!
[46766.711832] [drm] vm size is 262144 GB, 4 levels, block size is
9-bit, fragment size is 9-bit
[46766.711846] amdgpu 0000:0b:00.0: amdgpu: VRAM: 12272M
0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
[46766.711852] amdgpu 0000:0b:00.0: amdgpu: GART: 512M
0x00007FFF00000000 - 0x00007FFF1FFFFFFF
[46766.711868] [drm] Detected VRAM RAM=12272M, BAR=256M
[46766.711872] [drm] RAM width 192bits GDDR6
[46766.713643] [drm] amdgpu: 12272M of VRAM memory ready
[46766.713648] [drm] amdgpu: 7915M of GTT memory ready.
[46766.713770] [drm] GART: num cpu pages 131072, num gpu pages 131072
[46766.714036] [drm] PCIE GART of 512M enabled (table at
0x0000008000000000).
[46766.716878] [drm] Loading DMUB firmware via PSP: version=0x07002D00
[46766.716905] KGD Cleaner shader +++++++++++ Enabled cleaner shader in
gfx_v11_0_3
[46766.716908] KGD Cleaner shader +++++++++++ Initializing cleaner
shader software in gfx_v11_0_3
[46766.716918] KGD Cleaner shader +++++++++++ Cleaner shader size: 240
[46766.717280] amdgpu 0000:0b:00.0: amdgpu: GPU recovery disabled.
[46766.717284] amdgpu 0000:0b:00.0: amdgpu: GPU recovery disabled.
[46766.717376] KGD Cleaner shader +++++++++++ Exiting gfx_v11_0_sw_init
[46766.717447] amdgpu 0000:0b:00.0: amdgpu: GPU recovery disabled.
[46766.717454] amdgpu 0000:0b:00.0: amdgpu: Found VCN firmware Version
ENC: 1.23 DEC: 9 VEP: 0 Revision: 15
[46766.717591] amdgpu 0000:0b:00.0: amdgpu: Found VCN firmware Version
ENC: 1.23 DEC: 9 VEP: 0 Revision: 15
[46766.717723] amdgpu 0000:0b:00.0: amdgpu: GPU recovery disabled.
[46766.794246] amdgpu 0000:0b:00.0: amdgpu: reserve 0xa700000 from
0x82e0000000 for PSP TMR
[46767.038502] amdgpu 0000:0b:00.0: amdgpu: RAP: optional rap ta ucode
is not available
[46767.038512] amdgpu 0000:0b:00.0: amdgpu: SECUREDISPLAY: securedisplay
ta ucode is not available
[46767.038593] amdgpu 0000:0b:00.0: amdgpu: smu driver if version =
0x0000003d, smu fw if version = 0x00000040, smu fw program = 0, smu fw
version = 0x00505300 (80.83.0)
[46767.038598] amdgpu 0000:0b:00.0: amdgpu: SMU driver if version not
matched
[46767.148050] amdgpu 0000:0b:00.0: amdgpu: SMU is initialized successfully!
[46767.148935] [drm] Display Core v3.2.326 initialized on DCN 3.2
[46767.148941] [drm] DP-HDMI FRL PCON supported
[46767.151156] [drm] DMUB hardware initialized: version=0x07002D00
[46767.159473] snd_hda_intel 0000:0b:00.1: bound 0000:0b:00.0 (ops
amdgpu_dm_audio_component_bind_ops [amdgpu])
[46767.193422] amdgpu 0000:0b:00.0: amdgpu: Setup CP MES MSCRATCH
address : 0x80. 0x184000
[46767.195715] KGD Cleaner shader +++++++++++ Entering
gfx11_kiq_set_resources for ring: 0000000079456a04
[46767.195727] KGD Cleaner shader +++++++++++ Cleaner shader MC address
in gfx11_kiq_set_resources: 80001030 queue_mask ffffffffffffffff
[46767.195732] KGD Cleaner shader +++++++++++ Exiting
gfx11_kiq_set_resources for ring: 0000000079456a04
[46767.286986] amdgpu: HMM registered 12272MB device memory
[46767.289999] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[46767.290025] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[46767.290386] amdgpu: Virtual CRAT table created for GPU
[46767.291758] amdgpu: Topology: Add dGPU node [0x747e:0x1002]
[46767.291765] kfd kfd: amdgpu: added device 1002:747e
[46767.291791] amdgpu 0000:0b:00.0: amdgpu: SE 3, SH per SE 2, CU per SH
10, active_cu_number 54
[46767.291824] amdgpu 0000:0b:00.0: amdgpu: ring gfx_0.0.0 uses VM inv
eng 0 on hub 0
[46767.291829] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.0.0 uses VM inv
eng 1 on hub 0
[46767.291832] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.1.0 uses VM inv
eng 4 on hub 0
[46767.291836] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.2.0 uses VM inv
eng 6 on hub 0
[46767.291839] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.3.0 uses VM inv
eng 7 on hub 0
[46767.291842] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.0.1 uses VM inv
eng 8 on hub 0
[46767.291845] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.1.1 uses VM inv
eng 9 on hub 0
[46767.291848] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.2.1 uses VM inv
eng 10 on hub 0
[46767.291852] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.3.1 uses VM inv
eng 11 on hub 0
[46767.291855] amdgpu 0000:0b:00.0: amdgpu: ring sdma0 uses VM inv eng
12 on hub 0
[46767.291858] amdgpu 0000:0b:00.0: amdgpu: ring sdma1 uses VM inv eng
13 on hub 0
[46767.291861] amdgpu 0000:0b:00.0: amdgpu: ring vcn_unified_0 uses VM
inv eng 0 on hub 8
[46767.291864] amdgpu 0000:0b:00.0: amdgpu: ring vcn_unified_1 uses VM
inv eng 1 on hub 8
[46767.291868] amdgpu 0000:0b:00.0: amdgpu: ring jpeg_dec uses VM inv
eng 4 on hub 8
[46767.291871] amdgpu 0000:0b:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM
inv eng 14 on hub 0
[46767.293485] amdgpu 0000:0b:00.0: amdgpu: Using BAMACO for runtime pm
[46767.300079] [drm] Initialized amdgpu 3.63.0 for 0000:0b:00.0 on minor 0
[46767.300917] KGD Gangsubmit Enforce Isolation +++++++++++: Checking
cleaner shader conditions in amdgpu_vm_flush() before emitting cleaner
shader packet:
[46767.300925] - Enable cleaner shader: 1
[46767.300927] - Emit cleaner shader: 0000000000000000
[46767.300931] - Job base s_fence is not NULL: 0
[46767.300934] - Job base s_fence is NULL
[46767.300936] - Isolation spearhead: 00000000712ed22d
[46767.300939] - Fence is scheduled == isolation spearhead: 0
[46767.300942] KGD Gangsubmit Enforce Isolation +++++++++++: Cleaner
shader needed: 0
[46767.300945] AMDGPU VM Flush: No operations needed, exiting
[46767.300955] KGD Gangsubmit Enforce Isolation +++++++++++: Checking
cleaner shader conditions in amdgpu_vm_flush() before emitting cleaner
shader packet:
[46767.300958] - Enable cleaner shader: 1
[46767.300961] - Emit cleaner shader: 0000000000000000
[46767.300963] - Job base s_fence is not NULL: 0
[46767.300966] - Job base s_fence is NULL
[46767.300969] - Isolation spearhead: 00000000712ed22d
[46767.300972] - Fence is scheduled == isolation spearhead: 0
...
[46781.441652] KGD Gangsubmit Enforce Isolation +++++++++++: Checking
cleaner shader conditions in amdgpu_vm_flush() before emitting cleaner
shader packet:
[46781.441657] - Enable cleaner shader: 1
[46781.441661] - Emit cleaner shader: 00000000b46f6457
[46781.441665] - Job base s_fence is not NULL: 1
[46781.441669] - Job base s_fence address: 00000000246a799b
[46781.441673] - Job base s_fence scheduled address: 00000000246a799b
[46781.441677] - Isolation spearhead: 00000000a1dbd218
[46781.441681] - Fence is scheduled == isolation spearhead: 0
[46781.441685] KGD Gangsubmit Enforce Isolation +++++++++++: Cleaner
shader needed: 0
[46781.441774] KGD Gangsubmit Enforce Isolation +++++++++++: Checking
cleaner shader conditions in amdgpu_vm_flush() before emitting cleaner
shader packet:
[46781.441779] - Enable cleaner shader: 1
[46781.441783] - Emit cleaner shader: 0000000000000000
[46781.441787] - Job base s_fence is not NULL: 1
[46781.441791] - Job base s_fence address: 0000000096d4591a
[46781.441795] - Job base s_fence scheduled address: 0000000096d4591a
[46781.441799] - Isolation spearhead: 00000000a1dbd218
[46781.441803] - Fence is scheduled == isolation spearhead: 0
[46781.441808] KGD Gangsubmit Enforce Isolation +++++++++++: Cleaner
shader needed: 0
[46781.441812] AMDGPU VM Flush: No operations needed, exiting
[46781.441921] [IGT] amd_basic: finished subtest cs-compute, SUCCESS
[46781.442094] [IGT] amd_basic: finished subtest
cs-compute-with-IP-COMPUTE, SUCCESS
[46781.457577] [IGT] amd_basic: exiting, ret=0
[46781.474178] Console: switching to colour frame buffer device 160x64
**root@rtg-navi32:/home/rtg# ./run_cleaner_shader.sh**
[46791.806206] KGD Gangsubmit Enforce Isolation +++++++++++: Checking
cleaner shader conditions in amdgpu_vm_flush() before emitting cleaner
shader packet:
[46791.806215] - Enable cleaner shader: 1
[46791.806219] - Emit cleaner shader: 00000000b46f6457
[46791.806224] - Job base s_fence is not NULL: 1
[46791.806228] - Job base s_fence address: 00000000d251a46d
[46791.806232] - Job base s_fence scheduled address: 00000000d251a46d
[46791.806236] - Isolation spearhead: 00000000d251a46d
[46791.806240] - Fence is scheduled == isolation spearhead: 1
[46791.806244] KGD Gangsubmit Enforce Isolation +++++++++++: Cleaner
shader needed: 1
[46791.806249] amdgpu 0000:0b:00.0: amdgpu: KGD Cleaner shader
+++++++++++: Emitting cleaner shader in amdgpu_vm_flush() for ring:
comp_1.0.0
[46791.806255] KGD Cleaner shader +++++++++++ Entering
gfx_v11_0_ring_emit_cleaner_shader for ring: comp_1.0.0
[46791.806260] KGD Cleaner shader +++++++++++ SENDING OUT CLEANER_SHADER
PACKET3_RUN_CLEANER_SHADER onto ring: comp_1.0.0, pipe id: 0, queue id:
0 +++++++++++++++++++++
[46791.806264] KGD Cleaner shader +++++++++++ Cleaner shader completed
on ring: comp_1.0.0 in 0 ms
[46791.806269] KGD Cleaner shader +++++++++++ Exiting
gfx_v11_0_ring_emit_cleaner_shader for ring: comp_1.0.0
[46791.806274] KGD Gangsubmit Enforce Isolation +++++++++++: Executing
cleaner shader for job 0000000051e25f9f on ring comp_1.0.0
I think we somehow need to handle enforce_isolation for empty kernel
jobs going to COMP_1.0.0 before any live app (for example IGT_COMP) has
run, i.e. something like an "isolation->owner != owner" check in the
amdgpu_gfx_run_cleaner_shader -> amdgpu_gfx_run_cleaner_shader_job path,
so that the fence addresses compared in
"&job->base.s_fence->scheduled == isolation->spearhead;" are equal.
Best regards,
Srini
Regards,
Christian.
AFAIK, "isolation->spearhead" is not initialized in the
"amdgpu_gfx_run_cleaner_shader -> amdgpu_gfx_run_cleaner_shader_job" path
(i.e. when we trigger the cleaner shader through the "run_cleaner_shader"
sysfs entry). The check "&job->base.s_fence->scheduled == isolation->spearhead;"
in amdgpu_vm_flush() is where the problem shows up: the
"&job->base.s_fence->scheduled" address does not match the
"isolation->spearhead" address, so the check evaluates to zero and the
cleaner shader is not emitted.
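To make the failing condition concrete, here is a minimal standalone sketch of
the check as the debug prints in the dmesg output suggest it is evaluated. It is
not the driver code: all model_* types and cleaner_shader_needed() are
hypothetical stand-ins, and combining the four printed conditions with a logical
AND is an assumption based on the "Cleaner shader needed" values in the log.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures, not the real amdgpu types. */
struct model_fence {
	int unused;
};

struct model_sched_fence {
	struct model_fence scheduled;
};

struct model_job {
	struct model_sched_fence *s_fence;	/* NULL if the job has no scheduler fence */
};

struct model_isolation {
	struct model_fence *spearhead;		/* fence currently leading the isolation domain */
};

struct model_ring {
	bool enable_cleaner_shader;		/* "Enable cleaner shader" in the log */
	bool has_emit_cleaner_shader;		/* "Emit cleaner shader" callback is non-NULL */
};

/*
 * Assumed shape of the decision: the cleaner shader is only emitted when the
 * job's scheduled fence is the very same object the spearhead points to.
 * This is a pointer comparison, not a comparison of fence contents.
 */
static bool cleaner_shader_needed(const struct model_ring *ring,
				  const struct model_job *job,
				  const struct model_isolation *isolation)
{
	return ring->enable_cleaner_shader &&
	       ring->has_emit_cleaner_shader &&
	       job->s_fence != NULL &&
	       &job->s_fence->scheduled == isolation->spearhead;
}
```

In this model the entries at 46781.441652 correspond to a spearhead that points
at a different fence (result 0), while the entries at 46791.806206 correspond to
both addresses being equal (result 1).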
Best regards,
Srini
Regards,
Christian.
Regards,
Christian.