[AMD Official Use Only - Internal Distribution Only]

I'm not sure it's the same issue I observed.

If you have an XGMI setup, use the latest drm-next and the PMFW I used on my 
XGMI system (I just sent you the vega20_smc.bin by mail), and then give it 
another attempt.

As for the strict time interval, I remember that the XGMI node EnterBaco 
message fails when the interval is around a millisecond.

Regards,
Ma Le

From: Grodzovsky, Andrey <[email protected]>
Sent: Tuesday, December 10, 2019 6:01 AM
To: Ma, Le <[email protected]>; [email protected]; Zhou1, Tao 
<[email protected]>; Deucher, Alexander <[email protected]>; Li, Dennis 
<[email protected]>; Zhang, Hawking <[email protected]>
Cc: Chen, Guchun <[email protected]>
Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for 
XGMI


I reproduced the issue on my side - I consistently observe "amdgpu: [powerplay] 
Failed to send message 0x58, response 0x0", which is a BACO exit failure. Do 
you know what the strict time interval is within which all the BACO enter/exit 
messages need to be sent to all the nodes in the hive?

Andrey
On 12/9/19 6:34 AM, Ma, Le wrote:

Hi Andrey,

I tried your patches on my 2P XGMI platform. BACO works most of the time, but 
randomly hits the following error:
[ 1701.542298] amdgpu: [powerplay] Failed to send message 0x25, response 0x0

This error usually means some sync issue exists in the XGMI BACO case. Feel 
free to debug your patches on my XGMI platform.

Regards,
Ma Le

From: Grodzovsky, Andrey <[email protected]>
Sent: Saturday, December 7, 2019 5:51 AM
To: Ma, Le <[email protected]>; [email protected]; Zhou1, Tao 
<[email protected]>; Deucher, Alexander <[email protected]>; Li, Dennis 
<[email protected]>; Zhang, Hawking <[email protected]>
Cc: Chen, Guchun <[email protected]>
Subject: Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for 
XGMI


Hey Ma, attached is a solution - it's only compile-tested, as I still can't get 
my XGMI setup to work (with the bridge connected, only one device is visible to 
the system while the other is not). Please try it on your system if you have a 
chance.

Andrey
On 12/4/19 10:14 PM, Ma, Le wrote:

AFAIK it's enough for even a single node in the hive to fail to enter the BACO 
state on time to fail the entire hive reset procedure, no?
[Le]: Yeah, agreed. My thinking is that making all nodes enter BACO 
simultaneously reduces the risk of a node failing to enter/exit BACO. For 
example, in an XGMI hive with 8 nodes, the total time interval for 8 nodes to 
enter/exit BACO on 8 CPUs is less than the interval for 8 nodes to enter BACO 
serially and exit BACO serially on one CPU with yield capability. This interval 
is usually strict for the BACO feature itself. Anyway, we need more loop 
testing later on whichever method we choose.
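
To make the timing argument above concrete, here is a rough back-of-the-envelope 
model, not driver code. The function name, the per-message cost of 0.5 ms, and 
the idealized assumption that concurrent dispatch has zero skew between CPUs are 
all hypothetical and chosen only for illustration; real numbers would have to 
come from measurement on an XGMI system.

```python
def dispatch_spread_ms(num_nodes: int, per_message_ms: float, concurrent: bool) -> float:
    """Time between the first and the last node being sent its BACO message.

    Idealized model (assumption): with concurrent dispatch every node's
    message is issued from its own CPU at the same instant, so the spread
    is zero. With serial dispatch from one CPU, each message waits for the
    previous one, so the last node lags the first by (num_nodes - 1) sends.
    """
    if concurrent:
        return 0.0
    return (num_nodes - 1) * per_message_ms


# Hypothetical figures: 8 nodes, 0.5 ms per message on a single CPU.
serial_spread = dispatch_spread_ms(8, 0.5, concurrent=False)
parallel_spread = dispatch_spread_ms(8, 0.5, concurrent=True)
print(f"serial: {serial_spread} ms, concurrent: {parallel_spread} ms")
```

Under these assumed numbers the serial spread grows linearly with the number of 
nodes, while the concurrent spread stays flat, which is the reason for 
preferring one worker per node when the allowed interval is tight.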

Anyway - I see our discussion is blocking your entire patch set. I think you 
can go ahead and commit it your way (I think you got an RB from Hawking), and I 
will then look into whether I can implement my method; if it works, I will just 
revert your patch.

[Le]: OK, fine.

Andrey