I am running Kubuntu 18.10 with kernel 4.18.0-11-generic on an AMD Ryzen
2700X CPU. I initially believed I had the Ryzen soft lockup issue, and I
posted about it in the AMD community forums:


https://community.amd.com/thread/225795#

But I later realized that the AMD soft lockup issue is one that requires
the motherboard reset button to get out of. My issue is usually not so
bad: most of the time, SSH, the network and the VIRTUAL MACHINES inside
my server still work. I can use the following command via SSH to bring
the desktop back to life:

    sudo systemctl restart sddm

I am now inclined to suspect that the Linux kernel scheduler caused some
of my threads to freeze, along with the X.org console (mouse and
keyboard stuck).

The latest discovery, right after Christmas 2018, was that all CPU cores
(logical and physical) were still running, as seen in the ksysguard
graphs and the top command, while some threads, typically my late-night
crontab backup jobs, HANG FOR HOURS at random and then, hours later,
RESUME BY THEMSELVES. The backup did apparently complete, but with
delays of up to 12 hours!

I have also seen a frozen X.org screen refresh a little after 45
minutes, but I could not wait any longer, so I restarted sddm over SSH
as mentioned above.


Below is a copy of my post of Dec. 27, 2018 on the AMD community forum:

Dear All,

Today my new discovery indicates that we may be heading in the wrong
direction with regard to CPU core voltage and power states. It has got
to be something else.


[Image: ksysguard screenshot, 265px-Ksysguard1.png]

I used the well-known Linux top command and ksysguard (image above), and
I more or less set an AMBUSH for the problem, waiting to catch a frozen
process red-handed.


And my chance came today. I caught my virtual machine backup crontab
jobs frozen at VMware's vmrun suspend command. Info:

https://docs.vmware.com/en/VMware-Fusion/11/com.vmware.fusion.using.doc/GUID-24F54E24-EFB0-4E94-8A07-2AD791F0E497.html

My cron jobs put each virtual machine into suspend mode and back it up
to a hard disk. I got a clue a few days ago when I checked through my
backups: the date and time stamps on their folders suggested that the
usual backup jobs, which should normally all be done within 30 minutes,
had on 2 occasions taken several hours! There was nothing else wrong
besides the long time spent late at night on the backup; the data seemed
to be completely backed up. That means the lockup or freeze could
unfreeze itself and proceed to a long-delayed completion.


So I ssh'd into this Ryzen machine at my crontab job hour today,
forwarded X, and ran ksysguard and top on the remote desktop. Yes, the
cron job froze and the backup was not happening. I also used the Linux
ps -aux | grep crontab and similar commands, which confirmed that the
cron job was hanging while waiting for vmrun to suspend the VM, and this
command simply froze. It stayed frozen for almost 2 hours, and then
completed after this long delay. My script then went on to back up
another virtual machine. After backing up, it is supposed to do a vmrun
resume, but again the resume froze and took more than 1 hour. After
this, even my ssh -X session died, and I could not reconnect.
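
To catch such delays without watching by hand, each step of the backup
script could be wrapped in a small timing helper. The following is only
a minimal sketch, not what my script does today: the function name
timed_run, the 30-minute threshold and the example .vmx path are all
assumptions made for illustration.

```python
import subprocess
import time


def timed_run(cmd, warn_after_s=1800.0, timeout_s=None):
    """Run cmd and return (returncode, elapsed_seconds, slow).

    returncode is None if the command hit timeout_s and was killed.
    slow is True if the command timed out or ran longer than warn_after_s.
    Caveat: a process stuck in uninterruptible sleep may not die when
    killed, so with timeout_s set this call can still block on it.
    """
    start = time.monotonic()
    try:
        returncode = subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        returncode = None
    elapsed = time.monotonic() - start
    slow = returncode is None or elapsed > warn_after_s
    return returncode, elapsed, slow


# Hypothetical usage inside a backup script (the path is a placeholder):
#   rc, elapsed, slow = timed_run(["vmrun", "suspend", "/path/to/vm.vmx"])
#   print(time.strftime("%Y-%m-%d %H:%M:%S"), rc, round(elapsed), slow)
```

Logging one such line per step would show exactly which vmrun call
stalled and for how long, instead of inferring it from folder time
stamps afterwards.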


During these hours, the top command and ksysguard showed me that other
processes and threads were running. ALL 16 of my logical CPUs (8
physical cores) were RUNNING! None of the CPU cores was frozen in C6 or
any other power state while the thread hung for hours. Because of SMT
(hyperthreading), every 2 logical CPUs belong to 1 physical CPU core, so
if any core had locked up in a power state during these hours, the
graphs of both of its logical CPUs would have had to go flat (ZERO %
usage). If 2 physical cores had locked up, then the graphs of 4 logical
CPUs would have had to go flat.
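
The same check can be done mechanically instead of by eye. Below is a
rough sketch, assuming the per-CPU usage percentages have already been
collected by some other means (for example from top or ksysguard); the
function names are made up for illustration. It reads the sibling
pairing from the kernel's thread_siblings_list files in sysfs and
reports only those physical cores whose logical CPUs are ALL at zero.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,8' or '0-1' into a sorted tuple."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return tuple(sorted(cpus))


def read_sibling_groups(sysfs="/sys/devices/system/cpu"):
    """Return one tuple of logical CPU ids per physical core."""
    groups = set()
    for f in Path(sysfs).glob("cpu[0-9]*/topology/thread_siblings_list"):
        groups.add(parse_cpu_list(f.read_text()))
    return groups


def stalled_cores(groups, usage_percent, threshold=0.0):
    """Return the sibling groups whose logical CPUs are all <= threshold."""
    return sorted(
        g for g in groups if all(usage_percent[c] <= threshold for c in g)
    )
```

An empty result from stalled_cores matches what I saw in the graphs: no
physical core had both of its logical CPUs flat. Note that zero usage by
itself would not prove a core is locked up; it is only the necessary
condition used in the reasoning above.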


I am very sure of my observations. They were repeated twice during my
AMBUSH mission today. I am very sure of how my scripts work and how
vmrun works; a similar setup and script has worked for more than 10
years on older AMD and Intel machines. This Ryzen is a recent
replacement for the retired old server.


I am no longer inclined to believe that CPU cores were frozen in deep
sleep power states, nor that this is the Typical Current Idle issue. Not
for my Ryzen machine, anyway. It has to be something else, RANDOMLY
LOCKING UP and RANDOMLY UNLOCKING ITSELF, affecting processes or threads
that also appear to be random. I checked the PIDs of these locked-up
jobs, and top said they were in an idle state.


While it was locked up, I went into various /proc folders and files to
sniff for clues, but did not get anything very useful except to see that
the threads were idle:

    /proc/[PID]/status

    /proc/[PID]/task/[PID]/status
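
For the next ambush, reading these files could be scripted so that the
exact state letter and the kernel wait channel of every thread get
recorded while the job is stuck. This is a hedged sketch only: the
function names are hypothetical, and it assumes the per-thread status
and wchan files under /proc/[PID]/task/ are readable by the user running
it.

```python
from pathlib import Path

# Meaning of the one-letter codes on the "State:" line of a status file.
STATE_HINTS = {
    "R": "running or runnable",
    "S": "interruptible sleep (waiting for an event)",
    "D": "uninterruptible sleep (usually blocked on I/O or a kernel lock)",
    "T": "stopped by a signal",
    "t": "stopped by a tracer",
    "Z": "zombie",
    "I": "idle kernel thread",
}


def parse_state(status_text):
    """Return the one-letter state code from the text of a status file."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split()[1]
    return None


def thread_report(pid, proc="/proc"):
    """Return a list of (tid, state, wchan) for every thread of pid."""
    rows = []
    tasks = Path(proc, str(pid), "task").iterdir()
    for task in sorted(tasks, key=lambda p: int(p.name)):
        try:
            state = parse_state((task / "status").read_text())
            wchan = (task / "wchan").read_text().strip() or "-"
        except OSError:
            continue  # the thread exited while we were reading it
        rows.append((int(task.name), state, wchan))
    return rows
```

A thread shown as S is simply sleeping while it waits for something,
whereas D would point to the kernel being stuck on I/O or a lock, so the
distinction may help narrow down where the hang really is.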

My favorite soft reset, systemctl restart sddm, has worked many times,
nearly without fail. I think this is because it flushes out and kills
the hanging threads: the command kills X and everything running on X,
which is quite a large number of processes, and then restarts the KDE
display manager (sddm).


I am hoping for a further breakthrough to find out what causes the
threads to LOCK UP and UNLOCK themselves.


Cheers.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1798961

Title:
  Random unrecoverable freezes on Ubuntu 18.10
