[Kernel-packages] [Bug 2026658] Re: CPU frequency governor broken after upgrading from 22.10 to 23.04, stuck at 400Mhz on Alder Lake

Eli Thu, 07 Sep 2023 05:28:11 -0700

I now see somewhat different behavior, but I was still able to reproduce bug 1.

Initially I saw some intel_pstate/max_perf_pct getting set to 90, but it
would recover quickly.


long_term_power would fluctuate from very low values all the way up to
the nominal 65. Previously when triggering the bug it would slowly go
mostly down toward 0.125, getting stuck between 0.125 and ~4.0, with the
cpu clocked at 400Mhz.  Now it will go up, seemingly reset up to 65, and
allow the max frequencies.

I was able to get something like bug 1 to happen twice.
long_term_power got stuck at 19.5 (unusual; it might have been 18.5 the first time)
long_term_time got stuck at 28.0 (typical; it's usually this or 32.0)

This happened when I toggled between one and two pegged cores to keep
the temperature at a maximum. If I killed the stress commands at the
right moment, while the fan was spun up and long_term_power was in the
high teens, and then waited for the fans to spin down, I would get
into a situation where I could not sustain maximum clocks long enough to
reach the temperatures that trigger thermald to poke whatever it is that
resets long_term_power back up to 65.

I was able to do this twice.
On my third attempt I accidentally got a long_term_power value of 12.5. I
waited for the fans to spin down, started stressing again, and it almost
immediately dropped to 0.125 with cpu speeds locked to 400Mhz. I also noticed
at the end that intel_pstate/max_perf_pct was at 90.

Finally, after resetting long_term_power to 65, I noticed that I still
couldn't get the fan up, because intel_pstate/max_perf_pct at 90 results in a
maximum core speed of 4.3Ghz, which is too low to trigger thermald.
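
As a rough check of that 4.3Ghz figure, here is a minimal sketch. It assumes that max_perf_pct scales the top frequency linearly (a simplification of what intel_pstate actually does) and that the top frequency is 4.8GHz, the value used with `cpupower frequency-set -u 4800000` in the bug description. The sysfs paths are the standard intel_pstate and cpufreq ones; the function names are only illustrative.

```python
from pathlib import Path

PSTATE_DIR = Path("/sys/devices/system/cpu/intel_pstate")
CPU0_FREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def capped_freq_khz(max_freq_khz: int, max_perf_pct: int) -> int:
    """Approximate frequency ceiling in kHz, assuming the cap scales linearly."""
    return max_freq_khz * max_perf_pct // 100


def read_int(path: Path) -> int:
    """Read a single integer from a sysfs file."""
    return int(path.read_text().strip())


if __name__ == "__main__":
    pct = read_int(PSTATE_DIR / "max_perf_pct")
    hw_max_khz = read_int(CPU0_FREQ_DIR / "cpuinfo_max_freq")
    ceiling_ghz = capped_freq_khz(hw_max_khz, pct) / 1e6
    print(f"max_perf_pct={pct} -> roughly {ceiling_ghz:.2f} GHz")
```

With the assumed 4800000 kHz maximum, `capped_freq_khz(4800000, 90)` gives 4320000 kHz, which matches the observed ceiling.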

(Likely irrelevant: even though the 6.2-26 kernel folder you shared with
me includes an intel wireless driver deb package, and I installed all of
the packages, my intel wifi card does not work when booted into that
kernel.)

This was with the thermald build in adaptive mode; the full command was:
sudo ./thermald --no-daemon --adaptive --loglevel=debug


** Attachment added: "thermaldLog_Adaptive_202309070444.zip"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2026658/+attachment/5698212/+files/thermaldLog_Adaptive_202309070444.zip

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to thermald in Ubuntu.
https://bugs.launchpad.net/bugs/2026658

Title:
  CPU frequency governor broken after upgrading from 22.10 to 23.04,
  stuck at 400Mhz on Alder Lake

Status in linux package in Ubuntu:
  Incomplete
Status in thermald package in Ubuntu:
  In Progress

Bug description:
  I've tried to include as much detail as possible in this bug report. I
  originally assembled it just after the release of ubuntu 23.04. There
  has been no change since then.

  
  I have had substantial performance problems since updating from ubuntu 22.10
  to 23.04.
  The computer in question is the 17 inch Razer Blade laptop from 2022 with an
  intel i7-12800H.
  Current kernel is 6.2.0-20-generic.  (now I'm on 6.2.0-24-generic and nothing
  has changed.)
  This issue occurs regardless of whether the OpenRazer
  (https://openrazer.github.io/) drivers etc are installed.

  
  Description of problem:
  I have discovered what may be two separate bugs involving low level power
  management details on the cpu. They involve the cpu entering different types
  of throttled states and never recovering. These issues appeared immediately
  after upgrading from ubuntu 22.10. The computer is a large ~gaming laptop
  with plenty of thermal headroom; cpu temperatures cannot reach concerning
  values except when using stress testing tools.

  (I don't know how to properly untangle these two issues, so I'm posting
  them as one. I apologize for the review complexity this causes, but I
  think posting the information all in one spot is more constructive
  here.)

  
  High level testing notes:
  - This issue occurs with use of both the intel_pstate driver and the cpufreq
    driver. (I don't have the same level of detail for cpufreq, but the issue
    still occurs.)
  - I have additionally tested a handful of intel_pstate parameters (and
    others) via grub kernel command line arguments to no effect. All testing
    reported here was done with:
      GRUB_CMDLINE_LINUX_DEFAULT="modprobe.blacklist=nouveau"
      GRUB_CMDLINE_LINUX=""
    (loading nouveau caused problems for me on 22.10, I have not bothered
    reinvestigating it on 23.04)
  - There is a firmware update available from the manufacturer when I boot
    into Windows, I have not installed it (yet).
  - - Update: I installed it. No change.
  - Changing the cpu governor setting from "powersave" to "performance" using
    `cpupower frequency-set -g performance` has no effect. (Note: this action
    is separate from the intel_pstate's power-saver/balanced/performance
    setting visible with the `powerprofilesctl` utility.) It doesn't seem to
    be a governor bug.
  - - (There is a tertiary issue where I also see substantial (+50%)
    performance degradation using the "performance" profile in a test suite I
    run constantly for my job; that is clearly a problem but it is an unrelated
    bug that has existed for quite some time.)

  
  Summary and my own conclusions:
  These are my takeaways, the ~raw data is in the followup section.

  
  Bug 1)
  The reported cpu power limits are progressively constrained over time. Once
  this failure mode starts the performance never recovers.
    - As this situation progresses the observed cpu speeds (I'm using htop)
      list as 2800Mhz at idle, but the instant any load at all is placed on a
      cpu core that core immediately drops to exactly 400Mhz.
    - This situation occurs quite quickly in human terms, frequently within 20
      minutes of normal usage after a boot, but it will also occur when the
      computer is just sitting there unused for a handful of hours.
    - This also occurs when using the cpufreq governor (by including
      "intel_pstate=disable" on the grub command line args.)
    - At boot the default value for short_term_time looks wrong to me. This is
      the duration of higher thermal targets in seconds, ~0.002 seconds seems
      extremely short. A normal value would be a handful of seconds.
    - This situation can be remedied by running the following python script.
      It uses the undervolt package (pip install undervolt==0.3.0) to force
      particular power limits (the provided values are intentional overkill):
       from undervolt import read_power_limit, set_power_limit, PowerLimit, ADDRESSES
       from pprint import pprint

       limits = read_power_limit(ADDRESSES)
       pprint(vars(limits))  # print current values before setting them

       POWER_LIMITS = PowerLimit()
       # lock means don't allow the value to be reset until a reboot.
       POWER_LIMITS.locked = True
       # afaik this is just a backup-on-failure setting, it has no effect here.
       POWER_LIMITS.backup_rest = 281474976776192
       POWER_LIMITS.long_term_enabled = True
       POWER_LIMITS.long_term_power = 160  # values are intentional overkill
       POWER_LIMITS.long_term_time = 2880.0
       POWER_LIMITS.short_term_enabled = True
       POWER_LIMITS.short_term_power = 250
       POWER_LIMITS.short_term_time = 500.0
       set_power_limit(POWER_LIMITS, ADDRESSES)

       limits2 = read_power_limit(ADDRESSES)  # and print the new state
       pprint(vars(limits2))

  
  Bug 2)
  `powerprofilesctl` has unearthed some bug where the cpu performance enters
  the degraded state "high-operating-temperature", and never recovers.
    - This appears to happen for no reason. There is a brief cpu temperature
      spike in the example data below, but it does not hit the listed hardware
      limit values so I am at a loss for its cause.
    - I ran a cpu stress test (prime95/mprime torture test), it immediately
      spikes cpu temperature to 100 degrees and throttles the cpu, but doesn't
      trigger the high temperature degraded state. Go figure.
    - This bug takes quite a while to kick in, uptime in my example below was
      at over 14 hours.
    - When this situation occurs the maximum cpu speed becomes 2400Mhz across
      all cpu cores. The cpu power management appears to behave correctly in
      the 400-2400Mhz range. I believe this means all turbo frequencies are
      disabled.
    - Running the command `sudo cpupower frequency-set -u 4800000` (or any
      value above 2400000) does not correct the reported cpu_policy_range, it
      remains locked at 2400Mhz.
    - The only fix I know is a reboot.
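
  The capped state described in this list can also be checked directly from sysfs. The following is only a sketch: it reads the standard cpufreq files `scaling_max_freq` and `cpuinfo_max_freq` for every cpu, plus intel_pstate's `no_turbo` flag, and lists the cpus whose policy maximum is below the hardware maximum. The helper names are made up for illustration, and whether `no_turbo` is actually set in the degraded state has not been verified here.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def read_int(path):
    """Return the integer in a sysfs file, or None if it is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def policy_caps(cpu_root=CPU_ROOT):
    """Map each cpuN name to (scaling_max_freq, cpuinfo_max_freq) in kHz."""
    caps = {}
    for freq_dir in sorted(cpu_root.glob("cpu[0-9]*/cpufreq")):
        caps[freq_dir.parent.name] = (
            read_int(freq_dir / "scaling_max_freq"),
            read_int(freq_dir / "cpuinfo_max_freq"),
        )
    return caps


def capped_cpus(caps):
    """Names of cpus whose policy maximum is below the hardware maximum."""
    return [
        name
        for name, (policy, hw) in caps.items()
        if policy is not None and hw is not None and policy < hw
    ]


if __name__ == "__main__":
    print("no_turbo:", read_int(CPU_ROOT / "intel_pstate" / "no_turbo"))
    print("capped cpus:", capped_cpus(policy_caps()))
```

  Run before and after the "Degraded: yes" transition, this would show whether the 2400Mhz ceiling appears as a lowered `scaling_max_freq`, a set `no_turbo` flag, or both.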

  
  THE DATA:

  Bug 1:
  This output was gathered with the read_power_limit function of the undervolt
  python package, from a script that starts running at ~boot.
  The long_term_power and short_term_power metrics are values in watts;
  long_term_time and short_term_time are values in seconds.
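
  The logging script itself is not included in this report. A minimal sketch of such a logger might look like the following: it polls a reader function and prints a timestamped entry whenever one of the four fields changes. The names `snapshot`, `format_entry` and `watch` are hypothetical; only `read_power_limit` and `ADDRESSES` come from the undervolt package as used in the bug 1 summary, and reading them needs root and the msr kernel module.

```python
import time
from datetime import datetime

FIELDS = ("long_term_power", "long_term_time",
          "short_term_power", "short_term_time")


def snapshot(limits):
    """Pull the four logged fields off a PowerLimit-like object."""
    return {name: getattr(limits, name) for name in FIELDS}


def format_entry(current, previous=None, when=None):
    """Render one log entry, marking fields that changed since `previous`."""
    when = when or datetime.now()
    lines = [when.strftime("%Y-%m-%d  %H:%M:%S")]
    for name in FIELDS:
        mark = ""
        if previous is not None and current[name] != previous[name]:
            mark = "  <-- up" if current[name] > previous[name] else "  <-- down"
        lines.append(f" {name}: {current[name]}{mark}")
    return "\n".join(lines)


def watch(read_limits, interval_s=5.0):
    """Poll `read_limits()` forever, printing an entry whenever a value changes."""
    previous = None
    while True:
        current = snapshot(read_limits())
        if current != previous:
            print(format_entry(current, previous), flush=True)
            previous = current
        time.sleep(interval_s)


if __name__ == "__main__":
    # Imported here so the helpers above work without undervolt installed.
    from undervolt import ADDRESSES, read_power_limit
    watch(lambda: read_power_limit(ADDRESSES))
```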

  2023-05-12  15:14:32 up 0 min,  0 user,  load average: 0.39, 0.10, 0.03 
  (boot, log starts after normal user login)
   long_term_power: 65.0
   long_term_time: 32.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  15:20:29 up 6 min,  2 users,  load average: 1.90, 0.86, 0.37 
   long_term_power: 20.875  <-- down
   long_term_time: 28.0  <-- down
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  15:20:46 up 6 min,  2 users,  load average: 1.63, 0.87, 0.38 
   long_term_power: 22.625  <-- hey it went up! I was still using the computer at this point
   long_term_time: 28.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  15:46:15 up 32 min,  2 users,  load average: 0.66, 0.84, 0.79 
  (no longer at computer by the time this occurs)
   long_term_power: 20.625  <-- down
   long_term_time: 28.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  16:04:46 up 50 min,  3 users,  load average: 0.46, 0.70, 0.79 
   long_term_power: 16.625  <-- down
   long_term_time: 28.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  17:23:07 up  2:08,  3 users,  load average: 0.49, 0.61, 0.68 
  (by the time long_term_power hits 8.625 all cpu cores throttle to 400Mhz
  under any load. This one was preceded by ~1 second of a single cpu core
  randomly spiking to 78 degrees, output from `powerprofilesctl` remains
  normal. At this point long_term_power will never go up again. I have seen
  one more lowered stage at ~4.3125w.)
   long_term_power: 8.625  <-- way down - I've seen lower, though.
   long_term_time: 28.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  (And then after several hours stuck in this mode I returned to the
  computer and needed to run the script in the bug 1 summary to make it
  usable again.)
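
  The odd-looking values in this log are consistent with the field encoding of Intel's package power limit register (MSR_PKG_POWER_LIMIT), as described in the Intel Software Developer's Manual: power limits are stored in steps of 1/2^PU watts, and time windows as 2^Y * (1 + Z/4) time units. The sketch below assumes the commonly seen default units of 1/8 W and 1/1024 s. Those units were not read from this machine, so treat the decoded raw values as an illustration rather than a confirmed register dump. The function names are made up for this example.

```python
def decode_power_w(raw, power_unit_exp=3):
    """Raw power-limit field -> watts, in steps of 1 / 2**power_unit_exp W."""
    return raw / (1 << power_unit_exp)


def decode_time_window_s(y, z, time_unit_exp=10):
    """Time-window field -> seconds: 2**y * (1 + z/4) time units."""
    return (1 << y) * (1 + z / 4) / (1 << time_unit_exp)


if __name__ == "__main__":
    # Power values seen in the log, with the raw field that would produce them.
    for raw in (520, 167, 69, 1):
        print(f"raw {raw:>3} -> {decode_power_w(raw)} W")
    # Time windows seen in the log, as (Y, Z) pairs.
    for y, z in ((15, 0), (14, 3), (1, 1)):
        print(f"Y={y:>2} Z={z} -> {decode_time_window_s(y, z)} s")
```

  Under those assumed units, 65.0 W is raw value 520, 8.625 W is raw value 69, and the 0.125 W floor reported in the follow-up comment is raw value 1, the smallest nonzero limit the field can hold. Likewise 32.0 s, 28.0 s and 0.00244140625 s decode from (Y, Z) = (15, 0), (14, 3) and (1, 1). This does not say what is writing those values, only that they are exact register steps rather than measurement noise.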

  
  Bug 2:
  (Some cleanup of output, script starts at ~boot)
  2023-05-11  22:21:15 up 14:15,  2 users,  load average: 0.38, 0.42, 0.52 

  Output from powerprofilesctl:
    |  performance:
    |    Driver:     intel_pstate
    |    Degraded:   no
    |* balanced:
    |    Driver:     intel_pstate
    |  power-saver:
    |    Driver:     intel_pstate

  some summarized details from the `cpupower` utility:
    | cpu_number: 2
    | cpu_range: 400 MHz - 4.70 GHz
    | cpu_policy_range: 400 MHz and 4.70 GHz.
    | governor: powersave

  output from `sensors` (slightly compactified, I don't know what's up with the
  cpu core numbers):
    | iwlwifi_1-virtual-0 - Adapter: Virtual device - temp1: +49.0°C  
    | nvme-pci-0300 - Adapter: PCI adapter - Composite:
    |   +40.9°C  (low = -5.2°C, high = +89.8°C) (crit = +93.8°C)
    | nvme-pci-0200 - Adapter: PCI adapter:
    |   Composite:   +36.9°C  (low = -273.1°C, high = +80.8°C) (crit = +84.8°C)
    |   Sensor 1:    +36.9°C  (low = -273.1°C, high = +65261.8°C)
    |   Sensor 2:    +38.9°C  (low = -273.1°C, high = +65261.8°C)
    | coretemp-isa-0000 - Adapter: ISA adapter
    | Package id 0:  +77.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 0:        +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 4:        +54.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 8:        +77.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 12:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 16:       +64.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 20:       +45.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 24:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 25:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 26:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 27:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 28:       +50.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 29:       +50.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 30:       +50.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 31:       +50.0°C  (high = +100.0°C, crit = +100.0°C)
    | acpitz-acpi-0 - Adapter: ACPI interface: temp1: +27.8°C (crit = +105.0°C)

  
  2023-05-11  22:21:17 up 14:15,  2 users,  load average: 0.38, 0.42, 0.52  (2 seconds later)

  output from `powerprofilesctl`:
    |  performance:
    |    Driver:     intel_pstate
    |    Degraded:   yes (high-operating-temperature)
    |* balanced:
    |    Driver:     intel_pstate
    |  power-saver:
    |    Driver:     intel_pstate

  some summarized details from the `cpupower` utility:
    | cpu_number: 8
    | cpu_range: 400 MHz - 4.70 GHz
    | cpu_policy_range: 400 MHz and 2.40 GHz.
    | governor: powersave

  output from `sensors` (slightly compactified, I don't know what's up with the
  cpu core numbers):
    | iwlwifi_1-virtual-0 Adapter: Virtual device temp1: +49.0°C  
    | nvme-pci-0300 - Adapter: PCI adapter
    |   Composite:    +40.9°C  (low =  -5.2°C, high = +89.8°C) (crit = +93.8°C)
    | nvme-pci-0200 - Adapter: PCI adapter
    |   Composite:    +36.9°C  (low = -273.1°C, high = +80.8°C) (crit = +84.8°C)
    |   Sensor 1:     +36.9°C  (low = -273.1°C, high = +65261.8°C)
    |   Sensor 2:     +38.9°C  (low = -273.1°C, high = +65261.8°C)
    | coretemp-isa-0000 - Adapter: ISA adapter
    |   Package id 0:  +60.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 0:        +53.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 4:        +59.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 8:        +54.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 12:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 16:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 20:       +60.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 24:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 25:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 26:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 27:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 28:       +55.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 29:       +55.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 30:       +55.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 31:       +55.0°C  (high = +100.0°C, crit = +100.0°C)
    | acpitz-acpi-0 - Adapter: ACPI interface - temp1: +27.8°C (crit = +105.0°C)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2026658/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
