On 5/21/23 01:14, Christian wrote:
-------- Original Message --------
From: David Christensen <dpchr...@holgerdanske.com>
To: debian-user@lists.debian.org
Subject: Re: Weird behaviour on System under high load
Date: Sat, 20 May 2023 18:00:48 -0700
On 5/20/23 14:46, Christian wrote:
Hi there,
I am having trouble with a newly built system. It works normally and is
stable until I put extreme stress on it, e.g. using all 12 cores with
the stress tool. The system will suddenly lose its network connection
and become unresponsive. Only a reset works. I am not sure what is going
on, but it is reproducible: put stress on the system and it fails. It
seems that something is getting out of step.
Stuff below I found in the logs. I tried quite a bit, even upgraded to
bookworm, to see if the newer kernel works.
If anyone knows how to analyze this issue, it would be very helpful.
Please use inline posting style and proper indentation.
Have you verified that your PSU has sufficient capacity for the load on
each and every rail?
> Hi there,
>
> Let's go through the different topics:
> - Setup: It is an AMD 5600G
https://www.amd.com/en/products/apu/amd-ryzen-5-5600g
65 W
> on an ASRock B550M-ITX/ac,
https://www.asrock.com/mb/AMD/B550M-ITXac/index.asp
> powered by a BeQuiet SP7 300W
>
> - Power: From the specifications it should fit. As it takes 5-20
> minutes for the error to occur, I would take that as an indication
> that the power supply is OK. Otherwise I would expect it to fail right
> away? Is there a way to measure/test if there is any issue with it?
> I also tested limiting PPT to 45W, which also makes no difference.
If all you have is a motherboard, a 65W CPU, and an SSD, that looks like a
good quality 300W PSU and I would think it should support long-term full
loading of the CPU. But there is no substitute for doing the engineering.
I do PSU calculations using a spreadsheet. This requires finding power
specifications (or making estimates) for everything in the system, which
can be tough.
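A minimal sketch of such a power budget, written in Python rather than a spreadsheet. Only the 300 W PSU rating comes from this thread. The 88 W CPU figure assumes the usual package power limit (PPT) for a 65 W TDP AM4 part, every other wattage is a placeholder estimate to be replaced with data-sheet numbers, and the 80% headroom is an assumed rule of thumb. Per-rail limits from the PSU label are not modeled here.

```python
# Rough power budget. Everything except PSU_RATING_W is an assumption.
PSU_RATING_W = 300   # BeQuiet SP7 300W, as stated in this thread
HEADROOM = 0.80      # assumed rule of thumb: stay below 80% of the rating

# Estimated peak draw per component in watts (placeholder values).
loads_w = {
    "cpu_package": 88,   # assumed PPT for a 65 W TDP AM4 CPU
    "motherboard": 30,   # placeholder estimate
    "memory": 10,        # placeholder estimate
    "ssd": 5,            # placeholder estimate
    "fans": 5,           # placeholder estimate
}


def power_budget(loads, rating_w, headroom):
    """Return (total load, allowed load, True if total fits the allowance)."""
    total = sum(loads.values())
    limit = rating_w * headroom
    return total, limit, total <= limit


total, limit, ok = power_budget(loads_w, PSU_RATING_W, HEADROOM)
print(f"estimated load: {total} W, budget: {limit:.0f} W, within budget: {ok}")
```

Adding the disks and any expansion cards to `loads_w` shows how much margin is left once the full system is connected.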
BeQuiet has a PSU calculator. I suggest using it:
https://www.bequiet.com/en/psucalculator
Measuring actual power supply output and system usage would involve
building or buying suitable test equipment. The cost would be non-trivial.
An easy A/B test would be to connect a known-good, high-quality PSU with
a higher power rating (say, 500-1000W). I use:
https://www.fractal-design.com/products/power-supplies/ion/ion-2-platinum-660w/black/
Have you cleaned the system interior, filters, fans, heatsinks, ducts,
etc., recently?
?
Have you tested the thermal solution(s) recently?
> - Thermal: I am observing the temperatures on the stresstest. If I am
> correct in reading Smbusmaster0, Temps haven't been above 71°C, but
> error also occurs earlier, way below 70.
Okay.
What is your CPU thermal solution?
What stresstest are you using?
Have you tested the power supply recently?
I suffered a rash of bad PSUs recently. I was able to figure it out
because I bought an inexpensive PSU tester years ago. It has saved my
sanity more than once. I suggest that you buy something like it:
https://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=antec+atx12+tester&_sacat=0
Have you tested the memory recently?
> - Memory: Yes was tested right after the build with no errors
Okay.
Did you do multi-threaded/ stress tests?
Are you running Debian stable?
Are you running Debian stable packages only? Were they all installed
with the same package manager?
> - OS: I was running Debian stable in quite a minimal configuration
> (fresh install, as most services are dockerized) when I first observed
> the error. I have now moved to Debian 12/Bookworm to see if it makes any
> difference with a newer kernel (it does not). I also exchanged r8169 for
> r8168. That changes the error messages, but the system instability
> stays.
Did you see the problems when running Debian stable OOTB, before adding
anything?
Did you stress test the system before adding anything (other than the
stress test)?
If all of the above are okay and the system is still locking up, I would
disable or remove all disks in the system, install a zeroed SSD, install
Debian stable choosing only "SSH server" and "standard system
utilities", install only the stable packages required for your workload,
put the workload on it, and see what happens.
> I could disconnect the disks and see if it makes any difference.
> However, when reproducing this error, disks other than the system disk
> were unmounted. So I would guess this would be a test to see if it is
> about power?
Stripping the system down to minimum hardware and software is a good
starting point. You will need a tool to load the system and some means
to watch what happens. Assuming the base configuration passes all
tests, then add something, test, and repeat until testing fails.
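One possible means of watching what happens on a headless box is a small logger that syncs every sample to disk, so the last readings survive a hard reset. The following Python sketch assumes the kernel exposes sensors under /sys/class/hwmon (the standard hwmon sysfs layout, with temperatures in millidegrees Celsius); the function names and the log file name are hypothetical, and which chips show up depends on the drivers loaded.

```python
import glob
import os
import time


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {"chip/sensor": degrees C} for every readable temp*_input."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(base, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = os.path.basename(chip_dir)   # fall back to hwmonN
        sensor = os.path.basename(path)[: -len("_input")]
        try:
            with open(path) as f:
                temps[f"{chip}/{sensor}"] = int(f.read()) / 1000.0
        except (OSError, ValueError):
            continue                            # sensor not readable right now
    return temps


def log_temps(logfile, interval=1.0, samples=None, base="/sys/class/hwmon"):
    """Append timestamped readings; fsync each line so it survives a reset."""
    count = 0
    with open(logfile, "a") as out:
        while samples is None or count < samples:
            readings = read_hwmon_temps(base)
            fields = " ".join(f"{k}={v:.1f}" for k, v in sorted(readings.items()))
            out.write(time.strftime("%Y-%m-%d %H:%M:%S") + " " + fields + "\n")
            out.flush()
            os.fsync(out.fileno())
            count += 1
            if samples is None or count < samples:
                time.sleep(interval)


if __name__ == "__main__":
    print(read_hwmon_temps())   # one-off reading; empty dict if no sensors
```

Calling `log_temps("temps.log")` in a second terminal before starting the load test leaves a record of the temperatures up to the moment of the lockup.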
Here is a Perl script I wrote for loading the CPU. It should run on a
base install of Debian OOTB:
2023-05-21 02:24:44 dpchrist@taz ~/home
$ cat exercise-cpu
#!/usr/bin/env perl
# $Id: exercise-cpu,v 1.1 2023/04/10 02:05:22 dpchrist Exp $
# by David Paul Christensen dpchr...@holgerdanske.com
# Public Domain
#
# Exercise central processing unit

use threads;
use strict;
use warnings;
use File::Basename;
use Time::HiRes qw( sleep time );

die sprintf "Usage: %s PERCENT DURATION\n", basename($0)
    unless @ARGV == 2;

my $a = 0.01 * shift;           # periodic exercise duration
my $b = 1 - $a;                 # periodic sleep duration
$_ = qx/lscpu/;                 # Debian GNU/Linux
my ($c) = /CPU.s.:\s+(\d+)/;    # number of virtual cores
my $e = time + shift;           # time to end
my @thr;                        # threads

push @thr, async {
    while (time < $e) {
        my $d = time + $a / 10;
        1 while time < $d;
        sleep $b/10;
    }
} for 1..$c;

$_->join for @thr;
Run it like this:
2023-05-21 02:50:06 dpchrist@taz ~/home
$ ./exercise-cpu
Usage: exercise-cpu PERCENT DURATION
2023-05-21 02:50:52 dpchrist@taz ~/home
$ ./exercise-cpu 25 10
2023-05-21 02:51:33 dpchrist@taz ~/home
$ ./exercise-cpu 50 10
2023-05-21 02:51:48 dpchrist@taz ~/home
$ ./exercise-cpu 75 10
2023-05-21 02:52:01 dpchrist@taz ~/home
$ ./exercise-cpu 100 10
I install Xfce when installing Debian and use the Xfce plugins to watch
CPU loading and CPU temperature. The above tests loaded all virtual
cores at the specified percentage for the specified duration. CPU
temperature peaked at 32 C, 38 C, 66 C and 72 C, respectively.
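For systems without Perl threads, the same duty-cycle idea can be sketched in Python. This is not the script above but a re-expression of its scheme under stated assumptions: one process per logical CPU (processes rather than threads, because CPython threads do not run CPU-bound work in parallel), a 0.1 s period matching the division by 10 in the Perl code, and hypothetical function names.

```python
import multiprocessing as mp
import os
import sys
import time

PERIOD = 0.1  # seconds per busy/idle cycle (assumed, mirrors the Perl "/ 10")


def slices(percent, period=PERIOD):
    """Split one period into (busy seconds, idle seconds) for a load percent."""
    percent = min(max(percent, 0.0), 100.0)
    busy = period * percent / 100.0
    return busy, max(0.0, period - busy)


def exercise(percent, duration):
    """Alternate busy-waiting and sleeping on one core until duration elapses."""
    busy, idle = slices(percent)
    end = time.monotonic() + duration
    while time.monotonic() < end:
        stop = time.monotonic() + busy
        while time.monotonic() < stop:
            pass                      # spin to keep the core busy
        time.sleep(idle)


def main(percent, duration):
    """Run one exercise() worker per logical CPU and wait for all of them."""
    count = os.cpu_count() or 1
    workers = [mp.Process(target=exercise, args=(percent, duration))
               for _ in range(count)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


if __name__ == "__main__" and len(sys.argv) == 3:
    main(float(sys.argv[1]), float(sys.argv[2]))
```

Saved under a name of your choice, it takes the same two arguments as the Perl script, PERCENT and DURATION in seconds.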
Having a Debian install on a USB 3.0 flash drive is very useful for
troubleshooting and for imaging, backup/restore, archiving, integrity
checking, migration, validation, etc.
David