[Public]
Presently, there is this one also - drm_dev_wedged_event. Perhaps it's better
to modify this to include additional info like pre and post reset along with
cause of reset?
Thanks,
Lijo
-Original Message-
From: amd-gfx On Behalf Of Yang Wang
Sent: Friday, September 26, 2025 12:0
[Public]
The intention is to let kgd2kfd_interrupt thread know that KFD is done with
interrupt handling and exit at the earliest (that is even without going through
kfd node loop). I was thinking of checking ih_wq NULL value, but since that
value is not under lock, it's not necessary that kgd2k
Use the uevent mechanism to expose the GPU reset state,
so that the system tool can more accurately monitor the device reset status.
example:
$ sudo cat /sys/kernel/debug/dri//amdgpu_gpu_recover
KERNEL[172.053149] change
/devices/pci:00/:00:03.1/:03:00.0/:04:00.0/:05:00.0 (
Hi Matthew,
Ping ?
Regards,
Arun.
On 9/23/2025 2:32 PM, Arunpravin Paneer Selvam wrote:
Replace the freelist (O(n)) used for free block management with a
red-black tree, providing more efficient O(log n) search, insert,
and delete operations. This improves scalability and performance
when mana
On 9/25/2025 5:45 AM, Kuehling, Felix wrote:
> On 2025-09-23 03:26, Zhu Lingshan wrote:
>> This commit records the id of the owner
>> kfd_process into a kfd process_info when
>> creating it.
>>
>> Signed-off-by: Zhu Lingshan
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 2 ++
>> d
On 9/25/2025 5:41 AM, Kuehling, Felix wrote:
> On 2025-09-23 03:25, Zhu Lingshan wrote:
>> This commit introduces a new id field for
>> struct kfd process, which helps identify
>> a kfd process among multiple contexts that
>> all belong to a single user space program.
>>
>> The sysfs entry of a se
On 9/25/2025 5:50 AM, Kuehling, Felix wrote:
> On 2025-09-23 03:26, Zhu Lingshan wrote:
>> The user space program passes down a pid to kfd
>> through set_debug_trap ioctl, which can help
>> find the corresponding user space program and
>> its mm struct.
>>
>> However, this information is insufficie
[AMD Official Use Only - AMD Internal Distribution Only]
flush_workqueue(kfd->ih_wq) and destroy_workqueue(kfd->ih_wq) in
kfd_cleanup_nodes clean up pending work items, and the node->interrupts_active
check prevents new work items from being enqueued. So after kfd_cleanup_nodes
frees the kfd node, there
On Thu, Sep 25, 2025 at 5:59 PM Mario Limonciello (AMD)
wrote:
>
> Hybrid sleep will hibernate the system followed by running through
> the suspend routine. Since both the hibernate and the suspend routine
> will call pm_restrict_gfp_mask(), pm_restore_gfp_mask() must be called
> before starting
This commit refactors the AMDGPU userqueue management subsystem to replace
IDR (ID Allocation) with XArray for improved performance, scalability, and
maintainability. The changes address several issues with the previous IDR
implementation and provide better locking semantics.
Key changes:
1. **Gl
Update amdgpu_ttm_tt_get_user_pages and all dependent functions,
along with their callers, to use a user-allocated hmm_range buffer instead of
having the hmm layer allocate the buffer.
This is needed to make hmm_range pointers easily accessible
without accessing the bo, which is a requirement for the
userqueue
On 9/25/2025 4:16 PM, Alex Deucher wrote:
On Thu, Sep 25, 2025 at 3:50 PM Mario Limonciello
wrote:
On 9/25/2025 2:46 PM, Alex Deucher wrote:
On Thu, Sep 25, 2025 at 3:39 PM Mario Limonciello
wrote:
[Why]
Not all renoir hardware supports secure display. If the TA is present
but the fe
On 9/25/2025 12:55 PM, Rafael J. Wysocki wrote:
On Thu, Sep 25, 2025 at 7:51 PM Rafael J. Wysocki wrote:
On Thu, Sep 25, 2025 at 7:47 PM Rafael J. Wysocki wrote:
On Thu, Sep 25, 2025 at 5:59 PM Mario Limonciello (AMD)
wrote:
Hybrid sleep will hibernate the system followed by running t
Alex Deucher wrote (on Thu, Sep 25, 2025, 23:28):
> On Thu, Sep 25, 2025 at 2:45 PM Timur Kristóf
> wrote:
> >
> > Without these, it's impossible to program these registers.
> >
> > Fixes: 102b2f587ac8 ("drm/amd/display: dce_transform: DCE6 Scaling
> Horizontal Filter Init (v2)")
> >
On Thu, Sep 25, 2025 at 5:47 PM Mario Limonciello
wrote:
>
>
>
> On 9/25/2025 4:16 PM, Alex Deucher wrote:
> > On Thu, Sep 25, 2025 at 3:50 PM Mario Limonciello
> > wrote:
> >>
> >>
> >>
> >> On 9/25/2025 2:46 PM, Alex Deucher wrote:
> >>> On Thu, Sep 25, 2025 at 3:39 PM Mario Limonciello
> >>>
On Thu, Sep 25, 2025 at 5:33 PM Timur Kristóf wrote:
>
>
>
> Alex Deucher wrote (on Thu, Sep 25, 2025, 23:28):
>>
>> On Thu, Sep 25, 2025 at 2:45 PM Timur Kristóf
>> wrote:
>> >
>> > Without these, it's impossible to program these registers.
>> >
>> > Fixes: 102b2f587ac8 ("drm/am
On Thu, Sep 25, 2025 at 2:45 PM Timur Kristóf wrote:
>
> Without these, it's impossible to program these registers.
>
> Fixes: 102b2f587ac8 ("drm/amd/display: dce_transform: DCE6 Scaling Horizontal
> Filter Init (v2)")
> Signed-off-by: Timur Kristóf
I think it would make sense to just squash pa
On Thu, Sep 25, 2025 at 3:50 PM Mario Limonciello
wrote:
>
>
>
> On 9/25/2025 2:46 PM, Alex Deucher wrote:
> > On Thu, Sep 25, 2025 at 3:39 PM Mario Limonciello
> > wrote:
> >>
> >> [Why]
> >> Not all renoir hardware supports secure display. If the TA is present
> >> but the feature isn't suppor
On 9/25/2025 2:46 PM, Alex Deucher wrote:
On Thu, Sep 25, 2025 at 3:39 PM Mario Limonciello
wrote:
[Why]
Not all renoir hardware supports secure display. If the TA is present
but the feature isn't supported it will fail to load or send commands.
This shows ERR messages to the user that mak
On Thu, Sep 25, 2025 at 3:39 PM Mario Limonciello
wrote:
>
> [Why]
> Not all renoir hardware supports secure display. If the TA is present
> but the feature isn't supported it will fail to load or send commands.
> This shows ERR messages to the user that make it seem like there is
> a problem.
>
On Thu, Sep 25, 2025 at 8:51 PM Mario Limonciello (AMD)
wrote:
>
> From: Mario Limonciello
>
> Ionut Nechita reported recently a hibernate failure, but in debugging
> the issue it's actually not a hibernate failure; but a hybrid sleep
> failure.
>
> Multiple changes related to the change of when
From: Wang Jiang
The audio detection process in the Radeon driver is as follows:
radeon_dvi_detect/radeon_dp_detect -> radeon_audio_detect ->
radeon_audio_enable -> radeon_audio_component_notify ->
radeon_audio_component_get_eld
When HDMI is unplugged, radeon_dvi_detect is triggered.
At this po
From: Mario Limonciello
Ionut Nechita reported recently a hibernate failure, but in debugging
the issue it's actually not a hibernate failure; but a hybrid sleep
failure.
Multiple changes related to the change of when swap is disabled in
the suspend sequence contribute to the failure. See the i
[Why]
Not all renoir hardware supports secure display. If the TA is present
but the feature isn't supported it will fail to load or send commands.
This shows ERR messages to the user that make it seem like there is
a problem.
[How]
Check the resp_status of the context to see if there was an erro
On 9/24/2025 5:48 PM, Philip Yang wrote:
On 2025-09-24 11:29, Yifan Zhang wrote:
There is a race between amdgpu_amdkfd_device_fini_sw and the interrupt path:
a KGD interrupt can be generated while amdgpu_amdkfd_device_fini_sw is
between kfd_cleanup_nodes and kfree(kfd).
kernel panic log:
BUG: kernel NULL point
This series fixes visual glitches on systems with SI GPUs
where the BIOS sets up a default mode with scaling.
Alex was kind enough to give me an extra register definition
that can actually bypass the scaler on DCE6.
Additionally, while testing the scaler under KDE, I noticed
that it doesn't work w
Scaling doesn't work on DCE6 at the moment: the current
register programming produces incorrect output when using
fractional scaling (between 100-200%) on resolutions higher
than 1080p.
Disable it until we figure out how to program it properly.
Fixes: 7c15fd86aaec ("drm/amd/display: dc/dce: add i
[Why]
commit 530694f54dd5e ("drm/amdgpu: do not resume device in thaw for
normal hibernation") optimized the flow for systems that are going
into S4 where the power would be turned off. Basically the thaw()
callback wouldn't resume the device if the hibernation image was
successfully created since
Some drivers have different flows for hibernation and suspend. If
the driver opportunistically will skip thaw() then it needs a hint
to know what is happening after the hibernate.
Introduce a new symbol pm_hibernation_mode_is_suspend() that drivers
can call to determine if suspending the system fo
Hybrid sleep will hibernate the system followed by running through
the suspend routine. Since both the hibernate and the suspend routine
will call pm_restrict_gfp_mask(), pm_restore_gfp_mask() must be called
before starting the suspend sequence.
Add an explicit call to pm_restore_gfp_mask() to po
SCL_SCALER_ENABLE can be used to enable/disable the scaler
on DCE6. Program it to 0 when scaling isn't used, 1 when used.
Additionally, clear some other registers when scaling is
disabled and program the SCL_UPDATE register as recommended.
This fixes visible glitches for users whose BIOS sets up a
Previously, the code would set a bit field which didn't exist
on DCE6 so it would be effectively a no-op.
Fixes: b70aaf5586f2 ("drm/amd/display: dce_transform: add DCE6 specific
macros,functions")
Signed-off-by: Timur Kristóf
---
drivers/gpu/drm/amd/display/dc/dce/dce_transform.c | 6 ++
1
Without these, it's impossible to program these registers.
Fixes: 102b2f587ac8 ("drm/amd/display: dce_transform: DCE6 Scaling Horizontal
Filter Init (v2)")
Signed-off-by: Timur Kristóf
---
drivers/gpu/drm/amd/display/dc/dce/dce_transform.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/d
From: Alex Deucher
Fixes: 102b2f587ac8 ("drm/amd/display: dce_transform: DCE6 Scaling Horizontal
Filter Init (v2)")
Signed-off-by: Alex Deucher
---
drivers/gpu/drm/amd/include/asic_reg/dce/dce_6_0_d.h | 7 +++
drivers/gpu/drm/amd/include/asic_reg/dce/dce_6_0_sh_mask.h | 2 ++
2 files
On 2025-09-25 04:11, Pekka Paalanen wrote:
On Tue, 23 Sep 2025 11:41:24 -0600
Alex Hung wrote:
On 9/23/25 10:16, Alex Hung wrote:
On 9/23/25 01:59, Pekka Paalanen wrote:
On Mon, 22 Sep 2025 21:16:45 -0600
Alex Hung wrote:
On 9/18/25 02:40, Pekka Paalanen wrote:
...
The problem
On Thu, Sep 25, 2025 at 7:51 PM Rafael J. Wysocki wrote:
>
> On Thu, Sep 25, 2025 at 7:47 PM Rafael J. Wysocki wrote:
> >
> > On Thu, Sep 25, 2025 at 5:59 PM Mario Limonciello (AMD)
> > wrote:
> > >
> > > Hybrid sleep will hibernate the system followed by running through
> > > the suspend routin
On Thu, Sep 25, 2025 at 7:47 PM Rafael J. Wysocki wrote:
>
> On Thu, Sep 25, 2025 at 5:59 PM Mario Limonciello (AMD)
> wrote:
> >
> > Hybrid sleep will hibernate the system followed by running through
> > the suspend routine. Since both the hibernate and the suspend routine
> > will call pm_rest
On 24/09/2025 10:11, Philipp Stanner wrote:
On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
To implement fair scheduling we need a view into the GPU time consumed by
entities. Problem we have is that jobs and entities objects have decoupled
lifetimes, where at the point we have a view
patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url:
https://github.com/intel-lab-lkp/linux/commits/Mario-Limonciello-AMD/PM-hibernate-Fix-hybrid-sleep/20250925-045432
base: https://git.kernel.org/pub/scm/linux/
On Thu, 2025-09-25 at 17:43 +0800, Heng Zhou wrote:
> The reset workqueue can be blocked by KIQ I/O for 10+
> seconds after a GPU hang.
> So we need to add an in_reset check during each KIQ register poll.
>
> Signed-off-by: Heng Zhou
> ---
You should create such patches wit
From: David Laight
[ Upstream commit 495bba17cdf95e9703af1b8ef773c55ef0dfe703 ]
Always pass a 'type' through to __clamp_once(), pass '__auto_type' from
clamp() itself.
The expansion of __types_ok3() is reasonable so it isn't worth the added
complexity of avoiding it when a fixed type is used fo
On Thu, Sep 25, 2025 at 5:59 PM Mario Limonciello (AMD)
wrote:
>
> Ionut Nechita reported recently a hibernate failure, but in debugging
> the issue it's actually not a hibernate failure; but a hybrid sleep
> failure.
>
> Multiple changes related to the change of when swap is disabled in
> the sus
Hybrid sleep will hibernate the system followed by running through
the suspend routine. Since both the hibernate and the suspend routine
will call pm_restrict_gfp_mask(), pm_restore_gfp_mask() must be called
before starting the suspend sequence.
Add an explicit call to pm_restore_gfp_mask() to po
Some drivers have different flows for hibernation and suspend. If
the driver opportunistically will skip thaw() then it needs a hint
to know what is happening after the hibernate.
Introduce a new symbol pm_hibernation_mode_is_suspend() that drivers
can call to determine if suspending the system fo
Ionut Nechita reported recently a hibernate failure, but in debugging
the issue it's actually not a hibernate failure; but a hybrid sleep
failure.
Multiple changes related to the change of when swap is disabled in
the suspend sequence contribute to the failure. See the individual
patches for deta
The reset workqueue can be blocked by KIQ I/O for 10+
seconds after a GPU hang.
So we need to add an in_reset check during each KIQ register poll.
Signed-off-by: Heng Zhou
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 6 ++
1 file changed, 6 insertions(+)
diff --git a/
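The snippet above is cut off before the diff itself, so the following is not the patch. It is only a minimal userspace sketch of the idea described in the commit message: a bounded poll loop that checks a reset flag on every iteration and gives up early instead of blocking the reset path. All names here (mock_dev, mock_fence_signalled, mock_kiq_poll, the enum values) are hypothetical and do not come from amdgpu.

```c
#include <stdbool.h>

/* Hypothetical device state; field names are illustrative only. */
struct mock_dev {
    bool in_reset;         /* set while a GPU reset is in progress */
    int  polls_until_done; /* polls remaining before the fake fence signals */
};

enum poll_result { POLL_DONE, POLL_TIMEOUT, POLL_ABORTED_BY_RESET };

/* One simulated poll: returns true once the fake fence has signalled. */
bool mock_fence_signalled(struct mock_dev *dev)
{
    if (dev->polls_until_done > 0) {
        dev->polls_until_done--;
        return false;
    }
    return true;
}

/* Bounded poll loop that checks the reset flag before every poll. */
enum poll_result mock_kiq_poll(struct mock_dev *dev, int max_retries)
{
    for (int i = 0; i < max_retries; i++) {
        if (dev->in_reset)
            return POLL_ABORTED_BY_RESET; /* do not hold up the reset */
        if (mock_fence_signalled(dev))
            return POLL_DONE;
    }
    return POLL_TIMEOUT;
}
```

Checking the flag inside the loop, rather than once before it, is what lets a reset that starts mid-poll cut the wait short.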
On 24/09/2025 13:01, Philipp Stanner wrote:
On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
Now that the run queue to scheduler relationship is always 1:1 we can
embed it (the run queue) directly in the scheduler struct and save on
some allocation error handling code and such.
Looks
submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url:
https://github.com/intel-lab-lkp/linux/commits/Mario-Limonciello-AMD/PM-hibernate-Fix-hybrid-sleep/20250925-045432
base: https://git.kernel.org/pub/scm/li
On 9/24/2025 6:38 AM, Timur Kristóf wrote:
Reject modes with a pixel clock higher than the maximum display
clock. These were never supported, but we haven't noticed the
issue until the YCbCr 422 fallback was recently added.
For example, the DP 1.2 standard technically supports
4K 120Hz YCbCr 422
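As a rough illustration of the check described above (not the driver's actual code), a mode filter of this kind reduces to comparing the mode's pixel clock against the maximum display clock. The function name, the kHz units, and the parameter names are assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: accept a mode only if its pixel clock does not
 * exceed the maximum display clock. Both values are assumed to be in kHz.
 */
bool mock_mode_clock_ok(uint32_t pixel_clock_khz,
                        uint32_t max_display_clock_khz)
{
    return pixel_clock_khz <= max_display_clock_khz;
}
```

A mode whose clock equals the limit passes here, since the text only says to reject modes that are higher than the maximum.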
[Public]
I meant something like this.
destroy_workqueue(kfd->ih_wq);
kfd->ih_wq = NULL;
Then check for NULL at the beginning of kgd2kfd_interrupt. If there is no IH
workqueue, then there is no interrupt handling capability.
May also be within the loop. Not sure if that is really required; if s
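The suggestion above (clear the pointer after destroying the workqueue, then bail out early in the interrupt path when it is NULL) can be sketched in plain userspace C as follows. The struct and function names are hypothetical stand-ins, not KFD code, and the sketch deliberately ignores locking; elsewhere in this thread it is pointed out that ih_wq is not read under a lock, so a real driver would need to address that separately.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the KFD device; only the fields needed here. */
struct mock_kfd_dev {
    void *ih_wq;          /* NULL once the interrupt workqueue is destroyed */
    unsigned int handled; /* counts interrupts that reached the node loop */
};

/* Models "destroy_workqueue(kfd->ih_wq); kfd->ih_wq = NULL;". */
void mock_cleanup(struct mock_kfd_dev *kfd)
{
    /* a real driver would call destroy_workqueue() before this */
    kfd->ih_wq = NULL;
}

/* Models the early exit at the top of the interrupt handler. */
bool mock_interrupt(struct mock_kfd_dev *kfd)
{
    if (!kfd->ih_wq)
        return false; /* no IH workqueue, so no interrupt handling */
    kfd->handled++;   /* stands in for the per-node loop */
    return true;
}
```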
On 24/09/2025 12:50, Philipp Stanner wrote:
On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
If the new fair policy is at least as good as FIFO and we can afford to
remove round-robin, we can simplify the scheduler code by making the
scheduler to run queue relationship always 1:1 and r
On 24/09/2025 10:15, Philipp Stanner wrote:
On Thu, 2025-09-11 at 16:06 +0100, Tvrtko Ursulin wrote:
On 11/09/2025 15:32, Philipp Stanner wrote:
On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
There is no need to keep entities with no jobs in the tree so lets remove
it once the las
On 24/09/2025 10:38, Philipp Stanner wrote:
On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
Fair scheduling policy is built upon the same concepts as the well known
CFS kernel scheduler - entity run queue is sorted by the virtual GPU time
consumed by entities in a way that the entity
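The sentence above is truncated, so the selection rule is not stated here. The sketch below assumes, as in CFS, that the entity with the smallest virtual time is picked next. It uses a linear scan over an array instead of a sorted run queue to keep the example short, and every name in it is hypothetical rather than taken from the DRM scheduler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entity: only a label and its accumulated virtual GPU time. */
struct mock_entity {
    const char *name;
    uint64_t vruntime_ns; /* virtual GPU time consumed so far */
};

/*
 * Return the entity with the smallest virtual GPU time, or NULL if the
 * array is empty. On ties the earliest entry wins.
 */
const struct mock_entity *mock_pick_next(const struct mock_entity *e, size_t n)
{
    const struct mock_entity *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (!best || e[i].vruntime_ns < best->vruntime_ns)
            best = &e[i];
    }
    return best;
}
```

A sorted structure gives the same answer without scanning every entity, which is presumably why the run queue is kept ordered.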
On 25.09.25 12:32, Jesse.Zhang wrote:
> As KFD no longer uses a separate PASID, the global
> amdgpu_vm_set_pasid() function is no longer necessary.
> Merge its functionality directly into amdgpu_vm_init() to simplify code flow
> and eliminate redundant locking.
>
> Suggested-by: Christian König
>
As KFD no longer uses a separate PASID, the global
amdgpu_vm_set_pasid() function is no longer necessary.
Merge its functionality directly into amdgpu_vm_init() to simplify code flow and
eliminate redundant locking.
Suggested-by: Christian König
Signed-off-by: Jesse Zhang
---
drivers/gpu/drm/am
[Public]
Hi Lijo, Do you mean a change like below ? Besides readability, is there
functional improvement ?
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index e9cfb80bd436..86676acd9cbe 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/d
This patch adds robust reset handling for user queues (userq) to improve
recovery from queue failures. The key components include:
1. Queue detection and reset logic:
- amdgpu_userq_detect_and_reset_queues() identifies failed queues
- Per-IP detect_and_reset callbacks for targeted recovery
From: Linus Torvalds
[ Upstream commit 3a7e02c040b130b5545e4b115aada7bacd80a2b6 ]
The minmax infrastructure is overkill for simple constants, and can
cause huge expansions because those simple constants are then used by
other things.
For example, 'pageblock_order' is a core VM constant, but bec
From: Linus Torvalds
[ Upstream commit 017fa3e89187848fd056af757769c9e66ac3e93d ]
This simplifies the min_t() and max_t() macros by no longer making them
work in the context of a C constant expression.
That means that you can no longer use them for static initializers or
for array sizes in type
From: Linus Torvalds
[ Upstream commit 1a251f52cfdc417c84411a056bc142cbd77baef4 ]
This just standardizes the use of MIN() and MAX() macros, with the very
traditional semantics. The goal is to use these for C constant
expressions and for top-level / static initializers, and so be able to
simplif
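To illustrate what "C constant expressions and top-level / static initializers" means in practice, here is a small sketch with simplified stand-in macros. These are the traditional ternary definitions and may not match the kernel's own; HEADER_LEN, PAYLOAD_LEN, scratch and smallest_len are made-up names for the example.

```c
#include <stddef.h>

/* Simplified stand-ins with the traditional semantics (assumed, not copied). */
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

#define HEADER_LEN  16
#define PAYLOAD_LEN 64

/* A file-scope array size must be an integer constant expression. */
char scratch[MAX(HEADER_LEN, PAYLOAD_LEN)];

/* Static initializers likewise require a constant expression. */
const size_t smallest_len = MIN(HEADER_LEN, PAYLOAD_LEN);
```

Plain ternary macros like these evaluate their arguments more than once, so they suit constants and not expressions with side effects.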
From: Linus Torvalds
[ Upstream commit 21b136cc63d2a9ddd60d4699552b69c214b32964 ]
David Laight pointed out that we should deal with the min3() and max3()
mess too, which still does excessive expansion.
And our current macros are actually rather broken.
In particular, the macros did this:
#d
From: Linus Torvalds
[ Upstream commit 4477b39c32fdc03363affef4b11d48391e6dc9ff ]
Commit 3a7e02c040b1 ("minmax: avoid overly complicated constant
expressions in VM code") added the simpler MIN_T/MAX_T macros in order
to avoid some excessive expansion from the rather complicated regular
min/max m
From: David Laight
[ Upstream commit c3939872ee4a6b8bdcd0e813c66823b31e6e26f7 ]
At some point the definitions for clamp() got added in the middle of the
ones for min() and max(). Re-order the definitions so they are more
sensibly grouped.
Link:
https://lkml.kernel.org/r/8bb285818e4846469121c8
From: Linus Torvalds
[ Upstream commit cb04e8b1d2f24c4c2c92f7b7529031fc35a16fed ]
We only had a couple of array[] declarations, and changing them to just
use 'MAX()' instead of 'max()' fixes the issue.
This will allow us to simplify our min/max macros enormously, since they
can now unconditiona