Table of Contents:
1. Introduction
2. History of WhateverRed
2.1. Wrapping/Redirecting kext logic with Lilu
2.2. VTables and our Reverse Engineering
2.3. Debugging with a black screen
2.4. Firmware injection and other HWLibs troubles
2.5. AMDRadeonX5000 Video Decoding/Encoding and SDMA engine mismatches
2.6. SDMA0 power on via SMC
2.7. SDMA0 Accel channel skipping memory mapping commands
3. Current issue
3.1. VM Protection Faults
3.2. Analysis of the diagnostic dump
3.3. A deeper dive into the protection fault
4. What we know so far
4.1. The VM Blocks and the PDEs/PTEs
4.2. The VM registers
4.3. The PDE/PTE flags
4.4. The translate_further mode
4.5. The VMPTConfig in AMD kexts
4.6. How the entryCount is determined on AMDGPU
4.7. The GPUVM settings of AMD kexts vs. AMDGPU
5. What we have tried
5.1. PTE/PDE flags experimentations
5.2. Experimentation with VMPTConfig and related settings
6. How you can help
6.1. Unanswered questions
6.2. Ways to contact us
-- 1. Introduction --
Hello everyone.
We are a small team of 3 people trying to get Hackintoshes (PCs running macOS)
with AMD (Vega) iGPUs (specifically Raven/Raven2/Renoir and their derivatives,
such as Picasso) to have graphics acceleration on AMD laptops.
To be precise, we fix broken and/or missing logic by patching the existing
kexts: currently AMDRadeonX5000 for GCN 5 (GFX 9), AMDRadeonX6000 (GFX 10) for
VCN, and AMDRadeonX6000Framebuffer for DCN, which we use instead of
AMD10000Controller because the latter targets DCE.
The team members are:
- Visual, the Project Owner, is a 17-year-old Greek CS student with extensive
knowledge of Operating System development. He writes most of the kext code and
provides insight on OS and Driver behaviour when possible.
- NyanCatTW1, the Automation Engineer, is a 17-year-old student who lives in
Taiwan. The NYCU CSIE admitted him last year. He also does most of the Reverse
Engineering.
- Allen Chen, the tester with a Renoir laptop, perseverance and some ideas,
helps with the effort occasionally. He is currently striving to become
NyanCatTW1's classmate again, as they were six years ago.
Our kext, WhateverRed, has successfully gotten the aforesaid kexts to
deterministically power up and start the IPs/MEs in the GPU, such as GFX and
SDMA. Attached are partial highlights of a dmesg log from the main testing
system:
[ 27.351538]: netdbg: Disabled via boot arg
[ 27.351543]: rad: patching device type table
[ 27.351558]: rad: Automagically getting VBIOS from VFCT table
...
[ 27.505319]: [3:0:0] [Accel] >>> Calling TTL::initialize()
[ 27.505331]: [AMD INFO] TTL Interface: Boot mode Normal.
...
[ 27.649777]: [3:0:0] [Accel] <<< TTL::initialize() Completed
successfully.
...
[ 27.662027]: Accelerator successfully registered with controller.
...
[ 29.346963]: rad: _SmuRaven_Initialize returned 0x1
[ 29.346967]: rad: Sending PPSMC_MSG_PowerUpSdma (0xE) to the SMC
[ 29.347052]: rad: _Raven_SendMsgToSmcWithParameter returned 0x1
...
[ 29.365343]: rad: powerUpHW: this = 0xffffff935ca3d000
[ 29.377219]: rad: powerUpHW returned 1
[ 29.377228]: [3:0:0]: Controller is enabled, finish initialization
[ 29.424252]: Adding AGDP mode validate property
[ 29.425160]: kPEDisableScreen 1
[ 29.425685]: [3:0:0] [FB:0] AmdRadeonFramebuffer::setCursorImage() !!!
Driver is offline.
[ 29.425695]: [3:0:0] [FB:1] AmdRadeonFramebuffer::setCursorImage() !!!
Driver is offline.
The project is hosted on GitHub (https://github.com/NootInc/WhateverRed) with
135 stargazers as of 2023-02-04.
Currently, everything seems to go smoothly up to the point where WindowServer
tries (and fails) to make use of the iGPU (see Chapter 3 for details).
We first ran into the issue on 2022-11-27, but as of 2023-02-04, we haven't
been able to find a way to fix it.
This is why we're asking for help on the amd-gfx mailing list. However,
considering the complexity of both the project and the issue, we think it is
necessary to first give you a brief review of the project's history, the issue
we are currently facing, everything we know about the issue, and what we have
tried.
It'll be a long ride (about 25 minutes), so feel free to skip right to Chapter
6 if you don't have the time.
-- 2. History of WhateverRed --
For your interest, we have documented a large portion of our previous work
here. But feel free to skip to the problem itself (Chapter 3) in case that's
more practical for you.
-- 2.1. Wrapping/Redirecting kext logic with Lilu --
First of all, you are probably wondering how we debug these kexts, let alone
modify them; the answer is Lilu. Lilu allows you to hook symbols and replace
them with your own logic, while saving the original to a different place so
that it can still be called. This is made possible by looking up the symbol,
saving the original logic, and replacing the instructions in the original
location with a trampoline to our hook. Here is an example log:
[ 36.334082]: rad: hwWriteReg: this = 0xffffffa02c933000 regIndex =
0x2881 regVal = 0x3B
When hooks aren't sufficient, we can apply find/replace patches, called lookup
patches. While lookup patches allow us to modify the binary code directly, we
avoid creating them whenever possible since 1) the binary may change slightly
between versions, and 2) they are more complex, as they are written in raw
machine code.
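The wrap pattern described above can be sketched as a small Python simulation.
This is not Lilu's API (Lilu is C++ and patches machine code); SymbolTable,
route, and the fake hw_write_reg are hypothetical names, and a dictionary
stands in for the kext's symbol addresses:

```python
# Simulation of the wrap pattern: save the original, then install a hook that
# logs its arguments and forwards to the original. All names are hypothetical.


class SymbolTable:
    def __init__(self):
        self.symbols = {}

    def route(self, name, make_hook):
        """Replace symbols[name] with a hook built around the original."""
        original = self.symbols[name]
        self.symbols[name] = make_hook(original)
        return original


def hw_write_reg(reg_index, reg_val):
    # Stand-in for the real driver routine.
    return (reg_index, reg_val)


log = []


def make_logging_hook(original):
    def hook(reg_index, reg_val):
        log.append(
            f"rad: hwWriteReg: regIndex = 0x{reg_index:X} regVal = 0x{reg_val:X}"
        )
        return original(reg_index, reg_val)  # the original still runs
    return hook


table = SymbolTable()
table.symbols["hwWriteReg"] = hw_write_reg
table.route("hwWriteReg", make_logging_hook)
result = table.symbols["hwWriteReg"](0x2881, 0x3B)
```

Callers keep using the same symbol name; only the code behind it changes, which
is why a wrap survives small binary changes better than a lookup patch.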
-- 2.2. VTables and our Reverse Engineering --
Some of you probably know that the kexts are written in C++, and heavily
utilise virtual methods. A rough estimate brings us to at least thousands, if
not tens of thousands, of virtual methods. Therefore, to be able to understand
the logic of the classes, we must reconstruct the VTables and add them as the
first field of the respective class. Some guides explain how to construct
VTables in Ghidra by hand, such as http://hwreblog.com/projects/ghidra.html.
NyanCatTW1 wrote a script that automatically creates structs out of VTables in
program memory and attaches them to the respective class' structs
(https://github.com/NyanCatTW1/RedMetaClassAnalyzer/blob/main/RedMetaClassAnalyzer.py).
This was done because there are at least hundreds of these structures, so
reconstructing them by hand is impractical. This also allows us to apply
VTables discovered from one kext to another, using the VTable database of the
script, which is populated during VTable analysis. This has been useful, for
instance, when we were trying to decipher usages of
AMDRadeonX5000_AMDRadeonHWLibs inside other kexts, such as the AMDRadeonX5000
kext.
We also found the VTable database handy when trying to infer a class' type from
a virtual call's offset (findEntryAtIndex.py), and when trying to discover
VTable offset differences between AMDRadeonX5000_AMDHardware and
AMDRadeonX6000_AMDHardware (VTableFindDiff.py).
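As a rough illustration of the two lookups mentioned above, here is a minimal
sketch. It is not the code of findEntryAtIndex.py or VTableFindDiff.py; the
database layout, class names, and method names are all made up, and we assume
8-byte slots counted from the first virtual method:

```python
# Hypothetical VTable database: class name -> ordered list of method names.
POINTER_SIZE = 8  # assumed slot size on x86_64

VTABLE_DB = {
    "ExampleEngine": ["init", "free", "start", "stop"],
    "ExampleChannel": ["init", "free", "submit"],
}


def find_entry_at_offset(db, offset):
    """Return {class: method} for every class that has a slot at this offset."""
    if offset % POINTER_SIZE != 0:
        raise ValueError("virtual call offsets are multiples of the slot size")
    index = offset // POINTER_SIZE
    return {
        cls: methods[index] for cls, methods in db.items() if index < len(methods)
    }


def diff_vtables(old, new):
    """List (method, old_offset, new_offset) for methods whose slot moved."""
    new_index = {name: i for i, name in enumerate(new)}
    return [
        (name, i * POINTER_SIZE, new_index[name] * POINTER_SIZE)
        for i, name in enumerate(old)
        if name in new_index and new_index[name] != i
    ]
```

A call through offset 0x18 can only belong to a class with at least four
virtual methods, which is how an offset narrows down the candidate classes.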
-- 2.3. Debugging with a black screen --
macOS has a verbose boot mode; however, the screen goes black after the
following messages:
[ 40.450667]: Accelerator successfully registered with controller.
[ 40.717950]: IOConsoleUsers: gIOScreenLockState 3, hs 0, bs 0, now 0, sm
0x0
[ 40.950416]: kPEDisableScreen 1
To continue receiving information after the screen goes black, NetDbg was born.
The machine sends debug messages to a remote server via a TCP socket (as long
as a connection to it is available).
This was useful enough on its own, but not enough to collect kernel panics. We
attempted to redirect dmesg messages and panic messages to NetDbg; sadly, due
to implementation differences of backtrace dumps between versions of XNU, and
weird SSE exceptions occurring inside the logic of wraps of some kernel
symbols, we eventually abandoned the plan. Nevertheless, we managed to improve
NetDbg's stability in the process; we reused the same socket throughout and
improved the error handling. The NetDbg server backend was written in Python
and was eventually rewritten by Visual in Rust.
NetDbg lasted us for about four months, after which the driver ceased to cause
(instant) kernel panics, which allowed us to SSH into the device and collect
the full dmesg.
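For readers curious what such a receiver involves, here is a minimal sketch of
a NetDbg-style server. It is not the actual NetDbg backend: the wire format
(newline-delimited UTF-8 text) is an assumption, and LineAssembler, LogHandler,
and serve are hypothetical names:

```python
# TCP delivers a byte stream, so the receiver has to reassemble complete log
# lines itself before printing them.
import socketserver


class LineAssembler:
    def __init__(self):
        self._pending = b""

    def feed(self, chunk):
        """Add received bytes; return the lines completed by this chunk."""
        self._pending += chunk
        *complete, self._pending = self._pending.split(b"\n")
        return [line.decode("utf-8", errors="replace") for line in complete]


class LogHandler(socketserver.StreamRequestHandler):
    def handle(self):
        assembler = LineAssembler()
        while chunk := self.request.recv(4096):
            for line in assembler.feed(chunk):
                print(line, flush=True)


def serve(host="0.0.0.0", port=0):
    # Port 0 lets the OS pick a free port; the real port is not stated here.
    with socketserver.TCPServer((host, port), LogHandler) as server:
        server.serve_forever()
```

Keeping one long-lived connection, as we ended up doing, means the assembler
state survives across messages and a half-received line is never printed early.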
-- 2.4. Firmware injection and other HWLibs troubles --
AMDRadeonX5000HWLibs requires a few PSP firmware images to be loaded to finish
the initialisation. It doesn't support PSP v10 and v12 natively, but
fortunately, we discovered that they are very similar to PSP v11. We spoofed
the version by wrapping _psp_sw_init and modifying the version inside param1
before calling the original.
All seemed fine and dandy so far. However, the SMU also needed spoofing.
We attempted spoofing the version from SMU v10 to SMU v9.0.1, but it still did
not work. This troubled us for a while. Fortunately, after reading the AMDGPU
code, we found that the System BIOS is the one that loads the SMC firmware on
APUs. So, we patched _smu_get_fw_constants and _smu_9_0_1_internal_hw_init to
do nothing but return success.
Afterwards, we injected up-to-date and correct firmware for the PSP, such as
ASD, DTM, and HDCP. However, the methods (_psp_*_load) that load the firmware
do not take a pointer and size to the firmware; the firmware is hardcoded in
the logic itself. So, we looked at the assembly and figured that we can add a
few arguments to the original method calls, along with some binary patches,
which swap the hard-coded values out with the values in the registers, to
replace the fixed pointers and sizes with our selection of firmware for each
type of ASIC.
-- 2.5. AMDRadeonX5000 Video Decoding/Encoding and SDMA engine mismatches --
Each GFX version kext in macOS has different engines. AMDRadeonX5000, which
supports GFX 9 and Vega 10 (which is GCN 5), was the closest one to our
hardware.
Everything matches, except for two things: The Video Decoding/Encoding engines,
and the SDMA engine population in the ASIC.
Let's start with the former; the GCN 5 kexts in macOS utilise VCE and UVD. But,
this is incorrect for our hardware, since, as you probably know,
Raven/Raven2/Renoir use VCN. Luckily for us, the next revision, RDNA 1.0, aka
AMDRadeonX6000, has the VCN engine in it. Nonetheless, everything else
mismatches. So, we somehow have to fool macOS to load the X6000 kext, but do
nothing and detach, while remaining loaded in memory for us to swap the VCE/UVD
engines with the VCN engine from it.
To do so, we added AMDRadeonX6000 to Info.plist, which tells the system what to
load, in what order, and made it load before AMDRadeonX5000, by increasing its
probe score from 0 to 1. Afterwards, we ensured that it doesn't attach by
making the start function always return false. This is, as far as we can tell,
sufficient to have macOS load and keep the kext in the memory, as it doesn't
seem to unload kexts automatically; however, this newer kext revision doesn't
match the VTables of the HWEngine and HWChannel class of the X5000 kext,
causing kernel panics and random unrelated methods to be called.
So, we needed to patch the VTable offsets of the calls contained therein. Our
scripts made this relatively easy; however, it was still a very tiresome and
repetitive task, as we needed to create the binary patches by hand. We went
ahead and did it anyway, and so far we have had no kernel panics related to
it, but we may still have missed a few methods that might cause problems in the
future.
Now for the SDMA engines. As you may know, the iGPUs only have one SDMA engine,
SDMA0. The kexts are not made to have only one SDMA engine; the code is
structured to have both SDMA0 and SDMA1 utilised. We began by removing the
SDMA1 engine from the allocateHWEngines method. Then, with a simple binary
patch in AMDRadeonX5000_AMDHardware::startHWEngines, we made the
HWEngine::start loop do only one iteration instead of two (starting only
SDMA0). However, createAccelChannels was still trying to get the SDMA1
HWChannel, which no longer exists. So we created a wrap for the getHWChannel
function to redirect SDMA1 queries to return SDMA0, which seems to work.
-- 2.6. SDMA0 power on via SMC --
After convincing the driver not to attempt to initialise or utilise the SDMA1
HWEngine or its respective HWChannel, we managed to get the controller to
enable itself.
[ 31.377982]: [3:0:0]: Controller is enabled, finish initialization
However, about 5 seconds after that message, we received the following error
log:
[ 36.415551]: [3:0:0]: channel 15 VMPT is hung!
(lastReadTimestamp=0x00000000) channelResetMask 0x00000000
[ 36.924786]: void IOAccelEventMachine2::restart_channel():
GPURestartBegin stampIdx=15 type=4
[ 37.325791]: [3:0:0] GPU HangState 0x00000000, HangFlags 0x00000004:
IndividualEngineHang 0, NonEngineBlockHang 0, FenceNotRetired 1, PerEngineReset
0, FullAsicReset 1
[ 37.998126]: GPU Log Version: 2
Restart Channel: 15 VMPT
---THE STATE OF THE DRIVER---
AMDRadeonX5000_AMDVega10GraphicsAccelerator state: DISABLED
PCIe Device: [3:0:0], DID=0x15d8, RID=0xdf, SSID=0x380a
TotalVideoRAMBytes: 0x0000000020000000 (536870912)
Uptime 0:00:37.326026
...
[15] Channel: VMPT (HW [05]); Priority 0; last reset at 0:00:00.000000
CompletedTS = 0x00000000, SubmittedTS = 0x00000001
Sent to HW: TS = 0x00000001 (HW TS = 0x00000001, WPTR = 0x80) at
0:00:31.372558
...
[05] HWChannel: SDMA0, Priority 2, last reset 0:00:00.000000
CompletedTS = 0x00000000, SubmittedTS = 0x00000001
PendingTS = 0x1, sent at 0:00:31.372558, AccelChannel: 15, TS = 0x1
...
The error occurred on 2022-10-08, but we only managed to fix it on 2022-11-27.
Throughout the period, we investigated the content of the command buffer,
injected the Raven version of ME/CE/PFP/MEC/MEC JT and SDMA firmware, messed
with the CAIL properties, and even set up an Arch Linux instance on the laptop
to experiment with AMDGPU's logic.
It turns out the Linux instance was what helped us solve the problem. More
precisely, we came up with the idea of messing with Linux's SDMA code until the
SDMA engine freezes, just like the error. And as an ingenious step, we started
by finding and messing with APU-specific logic.
After three days of search and experimentation, we discovered the following
code in sdma_v4_0.c:
static int sdma_v4_0_hw_init(void *handle)
{
struct amdgpu_device *adev = (struct amdgpu_device *)handle;
if (adev->flags & AMD_IS_APU)
amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_SDMA,
false);
...
}
Surprisingly, commenting the power gating line out caused the same SDMA freeze
as on macOS. After digging into what the amdgpu_dpm_set_powergating_by_smu call
does, we discovered that it eventually sends PPSMC_MSG_PowerUpSdma (0xE) to the
SMC. To replicate this behaviour, we wrapped _SmuRaven_Initialize and called
_Raven_SendMsgToSmcWithParameter with the message above after the original
call.
And at last, the SDMA finally responded, which led us to a different problem.
-- 2.7. SDMA0 Accel channel skipping memory mapping commands --
We removed SDMA1, and SDMA0 is responding. But for some reason, the memory
mapping commands sent by the kexts do not get scheduled in order; after a few
VMPT (Virtual Memory Page Table) commands get processed, the scheduler
immediately jumps to process the unmapped IBs (Indirect Buffers) from
WindowServer, which fails, of course. We tried a variety of things:
- Hacking the scheduler to wait
- Creating a fake SDMA1 engine, redirecting it to SDMA0
All in all, it took us months to fix it. Ultimately, the problem was a value in
a field which seemed to decide the order, two bits of it precisely. We still
don't know what that field does, but the value gets set differently when the
engine type is SDMA1. Visual had been suspicious of this field for two months.
We tried the SDMA1 value, and it fixed the problem, leading us to the current
issue: GPUVM page faults related to IBs (Indirect Buffers) sent by WindowServer.
-- 3. Current issue --
-- 3.1. VM Protection Faults --
About five seconds after the "controller enable" message, the following errors
occur:
[ 36.626349]: virtual IOReturn
IOAccelEventMachine2::waitForStamp(int32_t, stamp_t, stamp_t *): initial wait
for 1 second expired. Continue wait for 4 seconds. stamp 2 (gpu_stamp=0)
[ 36.717958]: virtual void IOAccelEventMachineFast2::checkGPUProgress() -
Signaling hardware error on channel 0..
[ 37.250283]: void
IOAccelEventMachine2::handleFinishChannelRestart(IOReturn, int32_t, uint32_t):
GPURestartDequeued stampIdx=0 type=2 fromWhere=1 waitingOnIdx=0
[ 37.250289]: [3:0:0]: channel 0 event timeout
[ 38.978678]: [3:0:0] GPU HangState 0x0000000e, HangFlags 0x00000005:
IndividualEngineHang 1, NonEngineBlockHang 0, FenceNotRetired 1, PerEngineReset
1, FullAsicReset 0
[ 39.661790]: GPU Log Version: 2
Restart Channel: 0 GFX
---THE STATE OF THE DRIVER---
AMDRadeonX5000_AMDVega10GraphicsAccelerator state: ENABLED
...
[00] Channel: GFX (HW [00]); Priority 2; last reset at 0:00:00.000000
CompletedTS = 0x00000000, SubmittedTS = 0x00000001
Sent to HW: TS = 0x00000001 (HW TS = 0x00000002, WPTR = 0x100) at
0:00:43.375063
FirstPendingCB: Process ID = 145, Name = WindowServer; SubmitContext =
Unknown (0)
GPUAddress = 0x0000000400480000, Size = 0x00000190, VMID = 1
...
[05] Channel: SDMA0 (HW [05]); Priority 2; last reset at 0:00:00.000000
CompletedTS = 0x00000000, SubmittedTS = 0x0000000f
Sent to HW: TS = 0x00000001 (HW TS = 0x0000001d, WPTR = 0xe80) at
0:00:43.375067
ScheduledTS = 0x00000005, enqueued at 0:00:37.954289
FirstPendingCB: Process ID = 145, Name = WindowServer; SubmitContext =
Unknown (0)
GPUAddress = 0x0000000400280000, Size = 0x00000007, VMID = 1
...
[00] HWChannel: GFX, Priority 2, last reset 0:00:00.000000
CompletedTS = 0x00000001, SubmittedTS = 0x00000002
PendingTS = 0x2, sent at 0:00:43.375063, AccelChannel: 0, TS = 0x1
...
[05] HWChannel: SDMA0, Priority 2, last reset 0:00:00.000000
CompletedTS = 0x0000001c, SubmittedTS = 0x00000020
PendingTS = 0x1d, sent at 0:00:43.375067, AccelChannel: 5, TS = 0x1
...
SDMA0: BUSY, MicroEngine: ACTIVE, LastCmd: 0x00000000
Q0: HALT, ReadPtr = 0x00000e10 (0x0000000000003840), WritePtr =
0x00001000 (0x0000000000004000)
IB: ENABLED, GPUAddress = 0x0000000400280000, ConsumedSize =
0x00000000, RemainSize=0x00000070
...
VM Protection Fault (GFX): YES
Page GPUAddress = 0x0000000400480000, VMID = 1
Failing Protection = VALID, READ, EXECUTE, NACK
Memory Client ID = 4
Memory Client R/W = READ
Page table: 0x000000040047f000 .. 0x0000000400480000
[060000006a931077]
[0000000069497077]
VM Protection Fault (MM): YES
Page GPUAddress = 0x0000000400280000, VMID = 1
Failing Protection = VALID, READ, NACK
Memory Client ID = 0
Memory Client R/W = READ
Page table: 0x000000040027f000 .. 0x0000000400280000
[000000001018c2f1]
[06000000691b8077]
...
This error occurs several times as the driver attempts to restart the GFX/SDMA
channels, but that doesn't help since the underlying issue remains. The result
is a black screen for a few minutes before the machine eventually restarts due
to a watchdogd timeout, because WindowServer fails to check in successfully.
The above dump occurs whenever a channel is frozen and is referred to as a
diagnostic dump in the code, so we follow suit.
-- 3.2. Analysis of the diagnostic dump --
We shall go through the log one part at a time and observe what might be
happening.
[05] Channel: SDMA0 (HW [05]); Priority 2; last reset at 0:00:00.000000
CompletedTS = 0x00000000, SubmittedTS = 0x0000000f
Sent to HW: TS = 0x00000001 (HW TS = 0x0000001d, WPTR = 0xe80) at
0:00:43.375067
ScheduledTS = 0x00000005, enqueued at 0:00:37.954289
FirstPendingCB: Process ID = 145, Name = WindowServer; SubmitContext =
Unknown (0)
GPUAddress = 0x0000000400280000, Size = 0x00000007, VMID = 1
The "TS" here is short for "timestamp"; it is the fence of the submitted
command buffer.
WindowServer sent a command buffer to the SDMA accel channel. We know that it
failed because CompletedTS never increased.
<!-- Iirc VMID 2~15 are also available to userspace -->
The submitted command buffer refers to VMID 1, which is the VMID used for
user-submitted command buffers (VMID 0 is the GART, aka)
[05] HWChannel: SDMA0, Priority 2, last reset 0:00:00.000000
CompletedTS = 0x0000001c, SubmittedTS = 0x00000020
PendingTS = 0x1d, sent at 0:00:43.375067, AccelChannel: 5, TS = 0x12
Here we see the HWChannel behind the SDMA0 accel channel. It doesn't provide
new info, though.
SDMA0: BUSY, MicroEngine: ACTIVE, LastCmd: 0x00000000
Q0: HALT, ReadPtr = 0x00000e10 (0x0000000000003840), WritePtr =
0x00001000 (0x0000000000004000)
IB: ENABLED, GPUAddress = 0x0000000400280000, ConsumedSize =
0x00000000, RemainSize=0x00000070
The IB (Indirect Buffer) of the previous TS never got consumed.
VM Protection Fault (MM): YES
Page GPUAddress = 0x0000000400280000, VMID = 1
Failing Protection = VALID, READ, NACK
Memory Client ID = 0
Memory Client R/W = READ
Page table: 0x000000040027f000 .. 0x0000000400280000
[000000001018c2f1]
[06000000691b8077]
A VM protection fault occurred while the SDMA0 was trying to access the
physical address behind 0x400280000, which is the IB.
From the diagnostic dump, we can see that protection faults are occurring on
VMID 1. They cause the engines to freeze when they try to access the IB, which
in turn causes both channels to time out.
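The "CompletedTS never increased" check used above can be automated. Below is
a minimal sketch that scans a diagnostic dump for channels with outstanding
work; the regular expressions are written against the excerpts quoted in this
mail only, and pending_work is a hypothetical helper, not part of any kext:

```python
import re

# Channel header, e.g. "[05] Channel: SDMA0 (HW [05]); ..." or
# "[05] HWChannel: SDMA0, Priority 2, ...".
CHANNEL_LINE = re.compile(r"\[(\d+)\] (Channel|HWChannel): (\w+)")
TS_LINE = re.compile(
    r"CompletedTS = (0x[0-9a-fA-F]+), SubmittedTS = (0x[0-9a-fA-F]+)"
)


def pending_work(dump_text):
    """Return {(kind, name): submitted - completed} for channels with work left."""
    result = {}
    current = None
    for line in dump_text.splitlines():
        header = CHANNEL_LINE.search(line)
        if header:
            current = (header.group(2), header.group(3))
            continue
        stamps = TS_LINE.search(line)
        if stamps and current is not None:
            completed, submitted = (int(x, 16) for x in stamps.groups())
            if submitted > completed:
                result[current] = submitted - completed
            current = None
    return result


SAMPLE = """\
[05] Channel: SDMA0 (HW [05]); Priority 2; last reset at 0:00:00.000000
CompletedTS = 0x00000000, SubmittedTS = 0x0000000f
[05] HWChannel: SDMA0, Priority 2, last reset 0:00:00.000000
CompletedTS = 0x0000001c, SubmittedTS = 0x00000020
"""
```

On the SDMA0 excerpt, this reports 15 unfinished submissions on the accel
channel and 4 on the HWChannel behind it.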
-- 3.3. A deeper dive into the protection fault --
(Prerequisite: Familiarity with Sections 4.1 ~ 4.4)
There are many types of failing protection. So far we have seen the following:
- VALID: Failure to find the right PDE/PTE
- READ, WRITE, EXECUTE: Permission not set (always happens when VALID is set)
- NACK: Purpose unknown, but it seems to stand for Negative-Acknowledgment.
- PDE0: Purpose unknown, but it seems to refer to the PDB0 block level.
- TRANSLATE FURTHER: This appears to occur when AMDGPU_PTE_TF is set, but
the address inside the entry is invalid.
There are also these weird hex values:
[000000001018c2f1]
[06000000691b8077]
After going through the writeMappedEntriesDiagnosisReport/getVMPT functions, we
have determined that these are the values of the PTEs that map
0x000000040027f000 ~ 0x0000000400280000. When there is a newline between them,
it means that the two PTEs are non-contiguous, i.e. not within the same VM
block.
Also, during a few experiments, the protection fault would go away, but the GPU
still didn't consume the IB properly, as seen in this dump:
IB: ENABLED, GPUAddress = 0x0000000400280000, ConsumedSize = 0x00039ee0,
RemainSize=0x00000070
We assume whenever this occurs, it means that the GPU got the wrong physical
address, instead of the one pointing to the IB.
-- 4. What we know so far --
Even though none of our attempts has successfully resolved the issue, we have
established a basic understanding of the GPUVM in the process. The following
info is provided both to introduce how the AMD kexts work with the GPUVM, and
to allow our misconceptions to be pointed out.
-- 4.1. The VM Blocks and the PDEs/PTEs --
There are four types of VM Blocks: PDB2, PDB1, PDB0, and PTB. A VM block is an
array of PDEs/PTEs, with the number of entries mostly determined by the
block_size and the VM size.
A PDE contains flags and a GPU address to a VM block of the next level. PDEs on
PDB2 lead to a PDB1, PDEs on PDB1 lead to a PDB0, and PDEs on PDB0 lead to a
PTB.
A PTE contains flags and the physical address of the mapped virtual address. We
are uncertain whether the physical address is of the CPU or the GPU, but we
think it's a CPU physical address. A PTE maps 4KB of virtual address space by
default.
The GPUVM supports down to one level (PTB) and up to four levels
(PDB2->PDB1->PDB0->PTB).
There is a root block for each VMID, whose type is determined by the number of
levels (PTB on 1-level, PDB0 on 2-level, and so on).
-- 4.2. The VM registers --
Inside mmVM_CONTEXT1_CNTL, there are three relevant fields besides the many
interrupt-related options:
ENABLE_CONTEXT determines if the context is enabled. PAGE_BLOCK_SIZE is set to
block_size - 9 when translate_further is off; when translate_further is on, it
is set to block_size. When PAGE_TABLE_DEPTH is set to x, entries on the first x
levels are considered PDEs by default. We call these PDE-default levels. The
lower levels, where entries are considered PTEs by default, are called
PTE-default levels; for instance, when PAGE_TABLE_DEPTH is set to 2, and the
VMPT has three levels, PDB1 and PDB0 will be PDE-default levels, whereas PTB
becomes a PTE-default level.
mmVM_CONTEXT1_PAGE_TABLE_START_ADDR and mmVM_CONTEXT1_PAGE_TABLE_END_ADDR
determine the range of virtual addresses mapped by the context; this also
implies the VM size.
mmVM_CONTEXT1_PAGE_TABLE_BASE_ADDR specifies the GPU physical address of the
root block.
-- 4.3. The PDE/PTE flags --
The seven lowest bits seem straightforward, so we'll skip to the less obvious
bits.
AMDGPU_PTE_FRAG seems to make a PTE on a PTB map more bytes than the default.
Assume that the value of the frag is x, the PTE will instead map 4KB * 2^x
bytes of memory.
AMDGPU_PDE_PTE makes the GPU treat an entry as PTE, even if it's on a
PDE-default level. Setting this bit to true on PTE-default levels appears to
cause a VALID fault regardless of the entry's content.
AMDGPU_PTE_TF makes the GPU treat an entry as PDE, even if it's on a
PTE-default level. Setting this bit to true on PDE-default levels appears to
cause a VALID fault regardless of the entry's content.
AMDGPU_PDE_BFS is set on PDEs on PDB1 only when translate_further is on. We are
unsure of its purpose.
We have no idea what AMDGPU_PTE_MTYPE_VG10 is, other than it might be related
to caching.
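To make the flag layout concrete, here is a minimal decoder sketch. The bit
positions are the ones we recall from AMDGPU's amdgpu_vm.h (VALID at bit 0
through WRITEABLE at bit 6, FRAG at bits 7..11, PDE_PTE at 54, TF at 56, MTYPE
at 57..58, BFS at 59 and up); please treat them as assumptions to be checked
against the header. decode_entry is a hypothetical helper:

```python
# Assumed bit positions, recalled from AMDGPU's amdgpu_vm.h (verify first).
LOW_FLAGS = {
    0: "VALID", 1: "SYSTEM", 2: "SNOOPED", 3: "TMZ",
    4: "EXECUTABLE", 5: "READABLE", 6: "WRITEABLE",
}
HIGH_FLAGS = {51: "PRT", 54: "PDE_PTE", 55: "LOG", 56: "TF"}
ADDRESS_MASK = 0x0000FFFFFFFFF000  # bits 12..47


def decode_entry(value):
    """Split a 64-bit PDE/PTE value into its address and flag fields."""
    names = [name for bit, name in LOW_FLAGS.items() if (value >> bit) & 1]
    names += [name for bit, name in HIGH_FLAGS.items() if (value >> bit) & 1]
    frag = (value >> 7) & 0x1F
    return {
        "address": value & ADDRESS_MASK,
        "flags": names,
        "frag": frag,
        "frag_bytes": 0x1000 << frag,  # 4KB * 2^frag
        "mtype": (value >> 57) & 0x3,
        "bfs": (value >> 59) & 0x1F,
    }
```

Under these assumptions, the value 06000000691b8077 from the dump in Section
3.3 decodes to address 0x691b8000 with the six low flags VALID, SYSTEM,
SNOOPED, EXECUTABLE, READABLE and WRITEABLE set and an MTYPE of 3, while
000000001018c2f1 decodes to address 0x1018c000 with a frag of 5 and no SYSTEM
bit.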
-- 4.4. The translate_further mode --
translate_further is only enabled if the GC version is 9.1.0/9.2.2 and rev_id
>= 2. When it's on, the VM is always set to three levels, but with the depth
set to one. This means that both PDB0 and PTB are now PTE-default levels,
instead of just PTB; PDB1 is the only PDE-default level.
To reflect this fact, PDEs on PDB0 have the AMDGPU_PTE_TF bit set, which tells
the GPU to go to the next level to finish address translation (hence the name
translate further.)
We are uncertain of its purpose, but it seems to help extend the VM size under
certain circumstances.
The AMDRadeonX5000 kext defaults to using 3-level VMPT with translate_further
on. The AMDGPU code also uses translate_further by default on Raven2. However,
according to Visual, AMDGPU code remained functional when he forced
translate_further off.
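The relationship between the level count, block_size, translate_further and
the two register fields can be sketched as follows. The PAGE_BLOCK_SIZE rule is
the one from Section 4.2; the depth rule (levels - 1, minus one more when
translate_further is on) is our inference from the configurations listed in
Section 4.7, and context_cntl_fields is a hypothetical helper:

```python
def context_cntl_fields(levels, block_size, translate_further):
    """Derive PAGE_TABLE_DEPTH and PAGE_BLOCK_SIZE for a VM context.

    `levels` counts every level including the PTB (3 for PDB1->PDB0->PTB).
    """
    if not 1 <= levels <= 4:
        raise ValueError("the GPUVM supports one to four levels")
    if translate_further and levels < 2:
        raise ValueError("translate_further needs a level above the PTB")
    depth = levels - 1
    if translate_further:
        depth -= 1  # PDB0 becomes a PTE-default level
        page_block_size = block_size
    else:
        page_block_size = block_size - 9
    return {"PAGE_TABLE_DEPTH": depth, "PAGE_BLOCK_SIZE": page_block_size}
```

With three levels and block_size = 9, this gives depth 1 and PAGE_BLOCK_SIZE 9
when translate_further is on, and depth 2 and PAGE_BLOCK_SIZE 0 when it is off.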
-- 4.5. The VMPTConfig in AMD kexts --
Inside AMDRadeonX5000_AMDGFX9VMM, there is a structure named VMPTConfig. The
default VMPTConfig in X5000 is:
// {incr, entryCount, vmBlockSize}
{
{0x10000000, 0x200, 0x1000}, // PDB1
{0x10000, 0x1000, 0x8000}, // PDB0
{0x1000, 0x10, 0x1000}, // PTB
}
Whereas the one in X4000 is:
{
{0x10000000, 0x200, 0x1000}, // PDB0
{0x1000, 0x10000, 0x80000}, // PTB
}
As you can see, it's an array of three-element tuples, with every tuple
representing a level on the VMPT. We will use the X5000 VMPTConfig as an
example and explain what the three fields mean to our understanding.
First of all, the second field (entryCount) is the number of entries of a block
on that level. For instance, PDB0 has an entryCount of 0x1000, which means that
a block on the PDB0 level contains 0x1000 entries, each of which is either a
PTE or a pointer to a PTB block (PDE).
Next, the third field (vmBlockSize) is the size of a block on that level in
bytes. This is easy to explain on most levels, as every entry is 8 bytes in
size, so vmBlockSize would be eight times the entryCount. However, the X5000
PTB is an exception. This is because the blocks are required to be aligned to
4KB. Therefore, it allocates 4KB for each PTB block, wasting 0x1000 - 0x80 =
3968 bytes of space.
Finally, the first field (incr) is the amount of virtual memory controlled by
an entry of a block at that level.
This is straightforward on the PTB, as every PTE maps 4KB by default. As for
PDB0, every PTB block controls 0x1000 (incr) * 0x10 (entryCount) = 0x10000
bytes of virtual memory; therefore, a PDB0 entry pointing to a PTB block also
controls 0x10000 bytes.
An important property of the config is that levels[i].incr = levels[i + 1].incr
* levels[i + 1].entryCount, except for PTB which hardcodes the incr to 4KB. We
used this property to create and experiment with different VMPTConfigs. Another
property is that the VM size is equal to levels[0].incr * levels[0].entryCount,
as the root block contains the entire VM.
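The properties above can be checked mechanically. Here is a minimal sketch of
such a checker, using only the rules stated in this section (8-byte entries,
4KB block alignment, the incr chain, and the VM size); check_config is a
hypothetical helper, not something found in the kexts:

```python
ENTRY_BYTES = 8
BLOCK_ALIGN = 0x1000  # blocks are aligned to 4KB

X5000 = [
    (0x10000000, 0x200, 0x1000),  # PDB1
    (0x10000, 0x1000, 0x8000),    # PDB0
    (0x1000, 0x10, 0x1000),       # PTB
]
X4000 = [
    (0x10000000, 0x200, 0x1000),  # PDB0
    (0x1000, 0x10000, 0x80000),   # PTB
]


def check_config(levels, vm_size):
    """Return the violated properties (an empty list means consistent)."""
    problems = []
    for i, (incr, count, block_bytes) in enumerate(levels):
        needed = count * ENTRY_BYTES
        expected = -(-needed // BLOCK_ALIGN) * BLOCK_ALIGN  # round up to 4KB
        if block_bytes != expected:
            problems.append(
                f"level {i}: vmBlockSize {block_bytes:#x} != {expected:#x}"
            )
        if i + 1 < len(levels):
            next_incr, next_count, _ = levels[i + 1]
            if incr != next_incr * next_count:
                problems.append(f"level {i}: incr does not match the level below")
    if levels[-1][0] != 0x1000:
        problems.append("PTB incr must be 4KB")
    if levels[0][0] * levels[0][1] != vm_size:
        problems.append("root level does not span the VM size")
    return problems
```

Both stock configs pass for a VM size of 0x2000000000; a hand-crafted config
that fails here is probably not worth booting.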
Both X5000 and X4000 have a VM range of 0x400000000 ~ 0x2400000000, making the
VM size 0x2000000000 = 128 GB. We are unsure why the VM size in the kexts is
smaller than that of AMDGPU, which sets the VM size to at least 128 TB.
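Given the incr values, one can work out which entry of each level a virtual
address goes through. The sketch below assumes that indices are counted from
the start of the VM range (0x400000000), which we have not confirmed;
entry_indices is a hypothetical helper:

```python
# X5000 VMPTConfig: (incr, entryCount, vmBlockSize) per level.
X5000_CONFIG = [
    (0x10000000, 0x200, 0x1000),  # PDB1
    (0x10000, 0x1000, 0x8000),    # PDB0
    (0x1000, 0x10, 0x1000),       # PTB
]
VM_START = 0x400000000  # assumed origin of the index calculation


def entry_indices(config, va, vm_start=VM_START):
    """Index of the entry covering `va` in the block visited at each level."""
    offset = va - vm_start
    if offset < 0:
        raise ValueError("address below the VM range")
    indices = []
    for incr, entry_count, _ in config:
        index = offset // incr
        if index >= entry_count:
            raise ValueError("address above the VM range")
        indices.append(index)
        offset %= incr  # remainder is resolved by the next level down
    return indices
```

Under that assumption, 0x40027f000 resolves to PDB0 entry 0x27 and PTB entry
0xf, whereas the faulting address 0x400280000 resolves to PDB0 entry 0x28 and
PTB entry 0. The two pages would therefore sit in different PTB blocks, which
would be consistent with the dump printing their PTEs on separate lines.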
-- 4.6. How the entryCount is determined on AMDGPU --
There are two main components to it. The first is this function:
/**
* amdgpu_vm_num_entries - return the number of entries in a PD/PT
*
* @adev: amdgpu_device pointer
* @level: VMPT level
*
* Returns:
* The number of entries in a page directory or page table.
*/
static unsigned amdgpu_vm_num_entries(struct amdgpu_device *adev,
unsigned level)
{
unsigned shift = amdgpu_vm_level_shift(adev,
adev->vm_manager.root_level);
if (level == adev->vm_manager.root_level)
/* For the root directory */
return round_up(adev->vm_manager.max_pfn, 1ULL << shift)
>> shift;
else if (level != AMDGPU_VM_PTB)
/* Everything in between */
return 512;
else
/* For the page tables on the leaves */
return AMDGPU_VM_PTE_COUNT(adev);
}
Then the following macro:
/* number of entries in page table */
#define AMDGPU_VM_PTE_COUNT(adev) (1 << (adev)->vm_manager.block_size)
According to our understanding, the entryCount is determined as follows:
1. The PTB has 2^block_size entries. (Notice that block_size !=
PAGE_BLOCK_SIZE in AMDGPU)
2. PDB0~PDB2 have 512 entries unless they are the root level.
3. The root level has vmSize/incr entries. This is to satisfy the property
that vmSize = levels[0].incr * levels[0].entryCount.
This is another set of key information that we used to manually craft
VMPTConfigs.
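The three rules above are enough to generate a VMPTConfig-shaped table. The
following is a sketch under our own reading of the rules, not a port of the
AMDGPU function; build_config is a hypothetical helper, and the 4KB block
alignment is taken from Section 4.5:

```python
PAGE = 0x1000
ENTRY_BYTES = 8


def build_config(num_levels, block_size, vm_size):
    """Build [(incr, entryCount, vmBlockSize), ...] from the root to the PTB."""
    # Rule 1 and 2: 2^block_size entries on the PTB, 512 everywhere else.
    counts = [512] * (num_levels - 1) + [1 << block_size]
    # incr of a level = incr * entryCount of the level below; PTB maps 4KB.
    incrs = [PAGE] * num_levels
    for i in range(num_levels - 2, -1, -1):
        incrs[i] = incrs[i + 1] * counts[i + 1]
    # Rule 3: the root has vmSize / incr entries, rounded up.
    counts[0] = -(-vm_size // incrs[0])
    return [
        (incr, count, -(-count * ENTRY_BYTES // PAGE) * PAGE)
        for incr, count in zip(incrs, counts)
    ]
```

With two levels, block_size = 16 and a 128 GB VM, this reproduces the X4000
config exactly. It cannot reproduce the X5000 default, whose PDB0 has 0x1000
entries instead of 512.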
-- 4.7. The GPUVM settings of AMD kexts vs. AMDGPU --
First of all, let's summarize some crucial settings of the GPUVM:
- The block_size
- The PAGE_BLOCK_SIZE
- The number of levels
- The PAGE_TABLE_DEPTH
- VMPTConfig, i.e. the incr, entryCount, and vmBlockSize.
With that taken care of, here are four sets of configurations that we have seen:
AMDRadeonX5000:
- block_size is unknown (Because translate_further is on)
- PAGE_BLOCK_SIZE = 7
- Three levels
- PAGE_TABLE_DEPTH = 1 (Because translate_further is on)
- VMPTConfig:
{
{0x10000000, 0x200, 0x1000}, // PDB1
{0x10000, 0x1000, 0x8000}, // PDB0
{0x1000, 0x10, 0x1000}, // PTB
}
AMDRadeonX4000:
- block_size = log2(0x10000) = 16
- PAGE_BLOCK_SIZE = 7 (which equals 16 - 9)
- Two levels
- PAGE_TABLE_DEPTH = 1
- VMPTConfig:
{
{0x10000000, 0x200, 0x1000}, // PDB0
{0x1000, 0x10000, 0x80000}, // PTB
}
AMDGPU with translate_further on:
- block_size = 9
- PAGE_BLOCK_SIZE = 9
- Three levels
- PAGE_TABLE_DEPTH = 1
- VMPTConfig is unknown
AMDGPU with translate_further off:
- block_size = 9
- PAGE_BLOCK_SIZE = 0
- Three levels
- PAGE_TABLE_DEPTH = 2
- VMPTConfig (inferred):
{
{0x40000000, Unknown, Unknown}, // PDB1
{0x200000, 0x200, 0x1000}, // PDB0
{0x1000, 0x200, 0x1000}, // PTB
}
-- 5. What we have tried --
We attempted dozens of methods to fix the issue. However, none has managed to
get even a single VMID 1 IB to work. The following is everything we have tried
so far.
-- 5.1. PTE/PDE flags experimentations --
(Prerequisite: Familiarity with Sections 4.1 ~ 4.4)
The PTE/PDE flags are determined by getPTEValue and getPDEValue, respectively.
We, therefore, wrapped them and experimented with adjusting the flags.
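The flag adjustments in the experiments below all boil down to masking the
value returned by the original function. Here is a Python stand-in for that
idea; the real wraps live in the kext, the bit positions are recalled from
AMDGPU's amdgpu_vm.h and should be verified, and every function name here is
hypothetical:

```python
# Assumed bit positions, recalled from AMDGPU's amdgpu_vm.h (verify first).
AMDGPU_PDE_PTE = 1 << 54
AMDGPU_PTE_TF = 1 << 56
BFS_SHIFT = 59
BFS_MASK = 0x1F << BFS_SHIFT


def clear_pde_pte(value):
    return value & ~AMDGPU_PDE_PTE


def clear_tf_and_bfs(value):
    return value & ~(AMDGPU_PTE_TF | BFS_MASK)


def set_bfs(value, bfs):
    if not 0 <= bfs <= 0x1F:
        raise ValueError("BFS is a 5-bit field")
    return (value & ~BFS_MASK) | (bfs << BFS_SHIFT)


def wrap(original, transform):
    """Build a replacement that post-processes the original's return value."""
    def hook(*args):
        return transform(original(*args))
    return hook
```

For example, wrap(original_get_pde_value, clear_pde_pte) would correspond to
the first experiment, where original_get_pde_value stands for the saved
original of getPDEValue.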
- Unsetting AMDGPU_PDE_PTE for all PDE values:
Q0: HALT, ReadPtr = 0x00001b90 (0x0000000000006e40), WritePtr = 0x00001d80
(0x0000000000007600)
IB: ENABLED, GPUAddress = 0x0000000400280000, ConsumedSize =
0x00000000, RemainSize=0x00000070
VCN0: Disabled
VM Protection Fault (GFX): YES
Page GPUAddress = 0x0000000400480000, VMID = 1
Failing Protection = VALID, READ, EXECUTE, PDE0
Memory Client ID = 4
Memory Client R/W = READ
Page table: 0x000000040047f000 .. 0x0000000400480000
[060000006645a077] [0000000065b3d077]
VM Protection Fault (MM): YES
Page GPUAddress = 0x0000000400280000, VMID = 1
Failing Protection = VALID, READ, PDE0
Memory Client ID = 0
Memory Client R/W = READ
Page table: 0x000000040027f000 .. 0x0000000400280000
[06000000673d6077] [060000006841e077]
- Unsetting AMDGPU_PTE_TF and AMDGPU_PDE_BFS:
SDMA0: BUSY, MicroEngine: ACTIVE, LastCmd: 0x00000000
Q0: HALT, ReadPtr = 0x00001a10 (0x0000000000006840), WritePtr =
0x00001c00 (0x0000000000007000)
IB: ENABLED, GPUAddress = 0x0000000400280000, ConsumedSize =
0x00000000, RemainSize=0x00000070
VCN0: Disabled
VM Protection Fault (GFX): YES
Page GPUAddress = 0x0000000400480000, VMID = 1
Failing Protection = VALID, READ, EXECUTE, NACK
Memory Client ID = 4
Memory Client R/W = READ
Page table: 0x000000040047f000 .. 0x0000000400480000
[060000006822b077] [0000000066ad7077]
VM Protection Fault (MM): YES
Page GPUAddress = 0x0000000400280000, VMID = 1
Failing Protection = VALID, READ, NACK
Memory Client ID = 0
Memory Client R/W = READ
Page table: 0x000000040027f000 .. 0x0000000400280000
[060000006743b077] [0600000067977077]
- Unsetting AMDGPU_PDE_BFS:
SDMA0: BUSY, MicroEngine: ACTIVE, LastCmd: 0x00000000
Q0: ACTIVE, ReadPtr = 0x00001b10 (0x0000000000006c40), WritePtr =
0x00001d00 (0x0000000000007400)
IB: ENABLED, GPUAddress = 0x0000000400280000, ConsumedSize =
0x00000000, RemainSize=0x00000070
VCN0: Disabled
VM Protection Fault (GFX): NO
VM Protection Fault (MM): NO
No fault, but stuck.
- Setting AMDGPU_PDE_BFS to 12:
SDMA0: BUSY, MicroEngine: ACTIVE, LastCmd: 0x00000000
Q0: HALT, ReadPtr = 0x00001190 (0x0000000000004640), WritePtr =
0x00001380 (0x0000000000004e00)
IB: ENABLED, GPUAddress = 0x0000000400280000, ConsumedSize =
0x00000000, RemainSize=0x00000070
VCN0: Disabled
VM Protection Fault (GFX): YES
Page GPUAddress = 0x0000000400480000, VMID = 1
Failing Protection = VALID, READ, EXECUTE, TRANSLATE FURTHER
Memory Client ID = 4
Memory Client R/W = READ
Page table: 0x000000040047f000 .. 0x0000000400480000
[060000006d266077]
[000000006dcea077]
VM Protection Fault (MM): YES
Page GPUAddress = 0x0000000400280000, VMID = 1
Failing Protection = VALID, READ, NACK
Memory Client ID = 0
Memory Client R/W = READ
Page table: 0x000000040027f000 .. 0x0000000400280000
[000000001018c2f1]
[060000006cccc077]
- Setting AMDGPU_PDE_BFS to 9:
VM Protection Fault (GFX): YES
Page GPUAddress = 0x0000000400480000, VMID = 1
Failing Protection = VALID, READ, EXECUTE, NACK
Memory Client ID = 4
Memory Client R/W = READ
Page table: 0x000000040047f000 .. 0x0000000400480000
[060000006af62077]
[000000006a82d077]
VM Protection Fault (MM): YES
Page GPUAddress = 0x0000000400280000, VMID = 1
Failing Protection = VALID, READ, NACK
Memory Client ID = 0
Memory Client R/W = READ
Page table: 0x000000040027f000 .. 0x0000000400280000
[000000001018c2f1]
[060000006b6f8077]
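The bracketed 64-bit values in the dumps above are easier to compare once they
are split into fields. The following is a minimal sketch of a decoder. It
assumes the bit positions defined in AMDGPU's amdgpu_vm.h as we understand
them, and that the macOS kexts use the same layout. The same bits can mean
different things in a PDE and a PTE, so the output is only a starting point for
interpretation.

```python
# Bit positions as we understand them from AMDGPU's amdgpu_vm.h (assumption:
# they apply unchanged to the entries written by the macOS kexts).
FLAG_BITS = {
    "VALID": 0, "SYSTEM": 1, "SNOOPED": 2, "EXECUTABLE": 4,
    "READABLE": 5, "WRITEABLE": 6, "PRT": 51, "PDE_PTE": 54,
    "LOG": 55, "TF": 56,
}
ADDRESS_MASK = 0x0000FFFFFFFFF000  # bits 12..47 hold the physical address


def decode_entry(value):
    """Split a raw 64-bit PDE/PTE value into its address and flag fields."""
    return {
        "address": value & ADDRESS_MASK,
        "flags": [name for name, bit in FLAG_BITS.items() if (value >> bit) & 1],
        "frag": (value >> 7) & 0x1F,    # AMDGPU_PTE_FRAG
        "mtype": (value >> 57) & 0x3,   # AMDGPU_PTE_MTYPE_VG10
        "bfs": (value >> 59) & 0x1F,    # AMDGPU_PDE_BFS
    }


for raw in (0x060000006645A077, 0x000000001018C2F1):
    fields = decode_entry(raw)
    print(f"{raw:016x}: address={fields['address']:#x} flags={fields['flags']} "
          f"frag={fields['frag']} mtype={fields['mtype']} bfs={fields['bfs']}")
```

Running the decoder over each pair of entries in a dump makes it easy to spot
which flag actually changed between two experiments.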
-- 5.2. Experimentation with VMPTConfig and related settings --
(Prerequisite: Familiarity with Sections 4.1 ~ 4.7)
We have attempted to replicate three-level, two-level, and one-level
configurations based on what we learned from the AMDGPU code.
We adjust the VM Size, the number of levels, and the depth by modifying their
respective fields after calling AMDRadeonX5000_AMDGFX9VMM::init.
PAGE_BLOCK_SIZE is set in
AMDRadeonX5000_AMDGFX9Hardware::initializeVmContextCntlRegs, which in turn
calls AMDRadeonX5000_AMDHWVMM::getVMPTBCoverage to calculate the value to use.
We wrap getVMPTBCoverage and change its return value to indirectly set
PAGE_BLOCK_SIZE.
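To decide which PAGE_BLOCK_SIZE and PAGE_BLOCK_DEPTH a test should aim for, the
adjustment AMDGPU applies can be written down as a small function. This is a
sketch of our reading of gfxhub_v1_0_setup_vmid_config, not a copy of it. The
function name is hypothetical, and "levels" counts tables the way this document
does (three levels = PDB1, PDB0, PTB).

```python
def expected_context_fields(levels, block_size, translate_further):
    """Return the (PAGE_BLOCK_SIZE, PAGE_BLOCK_DEPTH) pair a config should yield."""
    if levels == 1:
        # Flat table: the one-level test in this section uses 0 for both fields.
        return 0, 0
    depth = levels - 1
    if translate_further:
        # With translate_further the depth drops by one and the block size
        # is used as-is.
        depth -= 1
        page_block_size = block_size
    else:
        # Without it, the register stores the block size minus 9.
        page_block_size = block_size - 9
    return page_block_size, depth


print(expected_context_fields(3, 9, True))
print(expected_context_fields(3, 9, False))
print(expected_context_fields(2, 16, False))
```

The three calls correspond to the AMDGPU configurations with translate_further
on and off and to the AMDRadeonX4000 configuration from Section 4.7.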
We verify that PAGE_BLOCK_SIZE and PAGE_BLOCK_DEPTH have been set correctly by
wrapping AMDRadeonX5000_AMDHWRegisters::write and checking that the value
written to mmVM_CONTEXT1_CNTL (0x2881) is correct.
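The check performed inside the register write wrap boils down to extracting
two bit fields. The sketch below shows that extraction in isolation. The shifts
and masks are assumptions based on our reading of AMDGPU's gc_9_0_sh_mask.h
(where the fields are named PAGE_TABLE_DEPTH and PAGE_TABLE_BLOCK_SIZE) and
should be verified against the header. The example value is constructed by
hand and is not a captured write.

```python
# Assumed layout of VM_CONTEXT1_CNTL; verify against gc_9_0_sh_mask.h.
PAGE_TABLE_DEPTH_SHIFT = 1
PAGE_TABLE_DEPTH_MASK = 0x00000006
PAGE_TABLE_BLOCK_SIZE_SHIFT = 3
PAGE_TABLE_BLOCK_SIZE_MASK = 0x00000078


def decode_context_cntl(value):
    """Return (PAGE_BLOCK_SIZE, PAGE_BLOCK_DEPTH) from a 32-bit register value."""
    block_size = (value & PAGE_TABLE_BLOCK_SIZE_MASK) >> PAGE_TABLE_BLOCK_SIZE_SHIFT
    depth = (value & PAGE_TABLE_DEPTH_MASK) >> PAGE_TABLE_DEPTH_SHIFT
    return block_size, depth


def matches(value, want_block_size, want_depth):
    """True if a written value carries the block size and depth we intended."""
    return decode_context_cntl(value) == (want_block_size, want_depth)


# Hypothetical value: bit 0 (assumed to be the enable bit), depth 1, block size 7.
example = 0x1 | (1 << PAGE_TABLE_DEPTH_SHIFT) | (7 << PAGE_TABLE_BLOCK_SIZE_SHIFT)
print(hex(example), decode_context_cntl(example), matches(example, 7, 1))
```

Note that under this layout the block size field is only four bits wide, so
values above 15 cannot be represented.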
We set VMPTConfig by modifying the fields directly and then applying a binary
patch to prevent AMDRadeonX5000_AMDGFX9VMM::init from overriding the values.
No action is required for block_size, as it is equal to the log2 of PTB's
entryCount.
Now, the following are the configurations we tried and the errors they yielded:
Three-level with translate_further on (the default):
- VM Size: 0x2000000000 (128 GB)
- block_size is unknown
- PAGE_BLOCK_SIZE = 7
- Three levels
- PAGE_BLOCK_DEPTH = 1
- VMPTConfig:
{
{0x10000000, 0x200, 0x1000}, // PDB1
{0x10000, 0x1000, 0x8000}, // PDB0
{0x1000, 0x10, 0x1000}, // PTB
}
Results:
VM Protection Fault (GFX): YES
Page GPUAddress = 0x0000000400480000, VMID = 1
Failing Protection = VALID, READ, EXECUTE, NACK
Memory Client ID = 4
Memory Client R/W = READ
Page table: 0x000000040047f000 .. 0x0000000400480000
[060000006a931077]
[0000000069497077]
VM Protection Fault (MM): YES
Page GPUAddress = 0x0000000400280000, VMID = 1
Failing Protection = VALID, READ, NACK
Memory Client ID = 0
Memory Client R/W = READ
Page table: 0x000000040027f000 .. 0x0000000400280000
[000000001018c2f1]
[06000000691b8077]
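To see which entries a faulting address should hit under a given VMPTConfig,
the address can be split into one index per level. This is a minimal sketch
that assumes a plain radix walk, where each level's index is the remaining
address divided by that level's per-entry coverage. The hardware walk,
especially with translate_further on, may differ from this model.

```python
def walk_indices(address, levels):
    """Split a GPU virtual address into per-level indices, root first.

    levels: (coverage per entry, entryCount, size) tuples, root first.
    Returns (indices, offset within the final page).
    """
    indices = []
    for depth, (coverage, count, _) in enumerate(levels):
        index, address = divmod(address, coverage)
        if index >= count:
            raise ValueError(f"level {depth}: index {index:#x} exceeds {count:#x} entries")
        indices.append(index)
    return indices, address


DEFAULT = [(0x10000000, 0x200, 0x1000), (0x10000, 0x1000, 0x8000), (0x1000, 0x10, 0x1000)]

for fault in (0x400480000, 0x400280000):
    indices, offset = walk_indices(fault, DEFAULT)
    print(hex(fault), [hex(i) for i in indices], hex(offset))
```

Comparing these indices with the entries the wraps report being written is one
way to check whether the faulting address has a populated path through all
three levels.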
Three-level with translate_further off:
- VM Size: 0x2000000000 (128 GB)
- block_size = 9
- PAGE_BLOCK_SIZE = 0
- Three levels
- PAGE_BLOCK_DEPTH = 2
- VMPTConfig:
{
{0x40000000, 0x80, 0x1000}, // PDB1
{0x200000, 0x200, 0x1000}, // PDB0
{0x1000, 0x200, 0x1000}, // PTB
}
Results:
VM Protection Fault (GFX): YES
Page GPUAddress = 0x0000000400480000, VMID = 1
Failing Protection = VALID, READ, EXECUTE, NACK
Memory Client ID = 4
Memory Client R/W = READ
Page table: 0x000000040047f000 .. 0x0000000400480000
[06000000683dc077] [0000000067882077]
VM Protection Fault (MM): YES
Page GPUAddress = 0x0000000400280000, VMID = 1
Failing Protection = VALID, READ, NACK
Memory Client ID = 0
Memory Client R/W = READ
Page table: 0x000000040027f000 .. 0x0000000400280000
[06000000689f8077] [0600000067a94077]
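The PDB1 entryCount of 0x80 used in the three-level configuration with
translate_further off follows from the VM size. A minimal sketch of that
arithmetic, assuming 8-byte entries and a table that occupies at least one
4 KiB page (both are assumptions, as is the function name):

```python
ENTRY_BYTES = 8      # assumption: 64-bit PDEs
MIN_TABLE = 0x1000   # assumption: a table occupies at least one 4 KiB page


def root_level(vm_size, coverage_per_entry):
    """Return (entryCount, size in bytes) of the top-level directory."""
    entry_count = -(-vm_size // coverage_per_entry)  # ceiling division
    size = max(MIN_TABLE, entry_count * ENTRY_BYTES)
    return entry_count, size


count, size = root_level(0x2000000000, 0x40000000)
print(f"PDB1: coverage=0x40000000 entryCount={count:#x} size={size:#x}")
```

The same function applied to the default coverage of 0x10000000 gives the 0x200
entries of the stock PDB1.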
Two-level:
- VM Size: 0x2000000000 (128 GB)
- block_size = 16
- PAGE_BLOCK_SIZE = 7
- Two levels
- PAGE_BLOCK_DEPTH = 1
- VMPTConfig:
{
{0x10000000, 0x200, 0x1000}, // PDB0
{0x1000, 0x10000, 0x80000}, // PTB
}
Results:
VM Protection Fault (GFX): YES
Page GPUAddress = 0x0000000400480000, VMID = 1
Failing Protection = VALID, READ, EXECUTE, NACK
Memory Client ID = 4
Memory Client R/W = READ
Page table: 0x000000040047f000 .. 0x0000000400480000
[060000005fc30077] [000000005d537077]
VM Protection Fault (MM): YES
Page GPUAddress = 0x0000000400280000, VMID = 1
Failing Protection = VALID, READ, NACK
Memory Client ID = 0
Memory Client R/W = READ
Page table: 0x000000040027f000 .. 0x0000000400280000
[000000001019b2f1] [060000005f580077]
One-level:
- VM Size: 0x400000000 (16 GB, otherwise the VMPT can't fit within the 256
MB aperture)
- block_size = 0
- PAGE_BLOCK_SIZE = 0 (In accordance with gmc_v9_0_gart_init logic)
- One level
- PAGE_BLOCK_DEPTH = 0
- VMPTConfig:
{
{0x1000, 0x400000, 0x2000000}, // PTB
}
Results:
SDMA0: BUSY, MicroEngine: ACTIVE, LastCmd: 0x00000000
Q0: HALT, ReadPtr = 0x00001b90 (0x0000000000006e40), WritePtr =
0x00001d80 (0x0000000000007600)
IB: ENABLED, GPUAddress = 0x0000000400280000, ConsumedSize =
0x00039ee0, RemainSize=0x00000070
VM Protection Fault (GFX): YES
Page GPUAddress = 0x0000000400480000, VMID = 1
Failing Protection = VALID, READ, EXECUTE, TRANSLATE FURTHER
Memory Client ID = 4
Memory Client R/W = READ
Page table: 0x000000040047f000 .. 0x0000000400480000
[0600000068e40077] [0000000067e84077]
VM Protection Fault (MM): NO
Of all the configurations we have tested, this one has the strangest result.
Two things stand out:
1. According to our wraps, the AMDGPU_PTE_TF flag was never set on any
entry. If so, where did the TRANSLATE FURTHER protection error come from?
2. How does the MM (SDMA0) end up with the wrong physical address while GFX
throws a protection fault?
This strange result is what prompted us to seek help on this mailing list.
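As a side note on the one-level configuration, the reason for shrinking the VM
size to 16 GB can be reproduced with simple arithmetic. This sketch assumes
8-byte PTEs and 4 KiB pages and compares the flat table size with the 256 MB
aperture. The names are made up for illustration.

```python
ENTRY_BYTES = 8        # assumption: 64-bit PTEs
PAGE_SIZE = 0x1000     # 4 KiB pages, matching the PTB coverage per entry
APERTURE = 256 << 20   # the 256 MB aperture mentioned in the VM Size note


def flat_table_bytes(vm_size):
    """Size of a single-level page table that maps vm_size bytes."""
    return (vm_size // PAGE_SIZE) * ENTRY_BYTES


for vm_size in (0x400000000, 0x2000000000):
    size = flat_table_bytes(vm_size)
    print(f"VM {vm_size >> 30} GB -> table {size >> 20} MB, "
          f"{100 * size // APERTURE}% of the aperture")
```

With a 128 GB VM the flat table alone would occupy the entire aperture, while
the 16 GB setting yields the 0x2000000-byte table listed in the VMPTConfig.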
-- 6. How you can help --
First of all, thank you for investing approximately 25 minutes in reading this
far (if you skipped down here, that's fine too).
Now, with all that taken care of, we think it's time to drop the line that
you've been waiting for:
We are asking interested developers to help us with suggestions, guidance,
knowledge, or documents that resolve one or more of the unanswered questions
below, as far as it is practical to do so.
Even if you know nothing about questions 1~10, that's fine too! Just pointing
us to related resources or developers who know this realm better will help us
accelerate this acceleration project and eventually bring the full Hackintosh
experience to hundreds of AMD laptop users.
-- 6.1. Unanswered questions --
Despite having started working on this as early as 2022-07-10, we still have
countless questions about the internals of the driver and the iGPU itself.
However, since we are asking for guidance with the VM issue, we'll keep that
as our focus.
So, here is what we'd like to know and/or figure out:
1. What does the NACK failing protection imply, and what actions can we
take to fix it?
2. What does the PDE0 failing protection imply?
3. What does the value of PAGE_BLOCK_SIZE mean? We know what block_size is,
but we still don't know about PAGE_BLOCK_SIZE.
4. What is the purpose of AMDGPU_PDE_BFS, and why is it only set when
translate_further is on?
5. What do the different AMDGPU_PTE_MTYPE_VG10 values mean, and could they be
what's causing our issue? All we know is that they are related to caching in
the MMU.
6. What are the purposes of the fields in mmVM_L2_CNTL3? Specifically, why are
BANK_SELECT and L2_CACHE_BIGK_FRAGMENT_SIZE set to different values when
translate_further is on in AMDGPU?
7. What might be causing the strange phenomenon during the one-level VMPT
test?
8. Why do none of the VMPTConfigs work, despite them matching everything
we see in the AMDGPU code?
9. Is there any way to gain more insight into the VM address translation
process and where it went wrong?
10. Have we made any mistakes in our analysis in Sections 3 and 4? If so, how
can we correct them?
11. Are there any resources we can refer to, other than the AMDGPU code,
search engines, and this mailing list?
-- 6.2. Ways to contact us --
- Via replying to this thread
- Via our Telegram group: https://t.me/+J6GPgy8g-445NDE1