On 9/19/2021 1:05 AM, Chuck Zmudzinski wrote:
Hello Elliott and Salvatore,

I have noticed this bug on bullseye ever since I started running bullseye as a dom0. My testing indicates that the problem is not in src:linux; it appeared in src:xen with the 4.14 version of Xen on bullseye.

Elliott, are you seeing the problem only on Debian's xen-4.14 hypervisor? Also, on which architecture, arm or amd64? I see the problem only on the Debian xen-4.14 hypervisor, and I have tested only on amd64. I have found a fix for my amd64 system, which is configured as follows:

Motherboard: ASRock B85M Pro4, BIOS P2.50 12/11/2015, with a Haswell CPU (Core i5-4590S)
Xen hypervisor version: 4.14.2+25-gb6a8c4f72d-2, amd64
Linux kernel version: 5.10.46-4 (the current amd64 kernel for bullseye)
Boot system: EFI, not using secure boot. The Xen hypervisor and the bullseye dom0 are booted with the grub-efi package for bullseye, which boots the xen-4.14-amd64.gz file, not the xen-4.14-amd64.efi file.

I also tested a buster dom0 with the 4.19 series kernel on the xen-4.14 hypervisor from bullseye and saw the problem. I did not see the problem with either a buster (linux 4.19) or a bullseye (linux 5.10) dom0 on the xen-4.11 hypervisor, so I think the problem is with the Debian version of the xen-4.14 hypervisor, not with src:linux.

I also found a fix in src:xen. The debian/patches series of the 4.14.2+25-gb6a8c4f72d-2 version of src:xen (and of earlier versions of xen-4.14 on Debian) includes several patches backported from the unstable branch of upstream Xen. After I removed some of these patches from the series of the src:xen package, the dom0 shuts down as expected on my ASRock Haswell motherboard.
I rebuilt the src:xen package after removing the following patches from the debian/patches series, and the computer shuts down as expected when I boot with the rebuilt hypervisor:

0027-xen-rpi4-implement-watchdog-based-reset.patch
0028-tools-python-Pass-linker-to-Python-build-process.patch
0029-xen-arm-acpi-Don-t-fail-if-SPCR-table-is-absent.patch
0030-xen-acpi-Rework-acpi_os_map_memory-and-acpi_os_unmap.patch
0031-xen-arm-acpi-The-fixmap-area-should-always-be-cleare.patch
0032-xen-arm-Check-if-the-platform-is-not-using-ACPI-befo.patch
0033-xen-arm-Introduce-fw_unreserved_regions-and-use-it.patch
0034-xen-arm-acpi-add-BAD_MADT_GICC_ENTRY-macro.patch
0035-xen-arm-traps-Don-t-panic-when-receiving-an-unknown-.patch

Most of these patches seem unrelated to the amd64 architecture and instead affect the arm architecture, so removing all of them is probably more than is needed to fix this bug. I removed them all because I could not find them on the upstream 4.14 branch; I saw them only on the upstream xen unstable branch (I did not check whether they are on the upstream 4.15 branch). I wanted to test a true upstream 4.14 version without these seemingly aggressive patches that Debian added from the upstream unstable branch, and I found that being more conservative and leaving these patches out fixed the problem!

I suspect the following patch is the culprit for the shutdown problems on the amd64 architecture:

0030-xen-acpi-Rework-acpi_os_map_memory-and-acpi_os_unmap.patch

The commit log for this patch states:

From: Julien Grall <jgr...@amazon.com>
Date: Sat, 26 Sep 2020 17:44:29 +0100
Subject: xen/acpi: Rework acpi_os_map_memory() and acpi_os_unmap_memory()

The functions acpi_os_{un,}map_memory() are meant to be arch-agnostic
while the __acpi_os_{un,}map_memory() are meant to be arch-specific.

Currently, the former are still containing x86 specific code.

To avoid this rather strange split, the generic helpers are reworked so
they are arch-agnostic. This requires the introduction of a new helper
__acpi_os_unmap_memory() that will undo any mapping done by
__acpi_os_map_memory().

Currently, the arch-helper for unmap is basically a no-op so it only
returns whether the mapping was arch specific. But this will change
in the future.

Note that the x86 version of acpi_os_map_memory() was already able to
able the 1MB region. Hence why there is no addition of new code.

Signed-off-by: Julien Grall <jgr...@amazon.com>
Reviewed-by: Rahul Singh <rahul.si...@arm.com>
Reviewed-by: Jan Beulich <jbeul...@suse.com>
Acked-by: Stefano Stabellini <sstabell...@kernel.org>
Tested-by: Rahul Singh <rahul.si...@arm.com>
Tested-by: Elliott Mitchell <ehem+...@m5p.com>
(cherry picked from commit 1c4aa69ca1e1fad20b2158051eb152276d1eb973)
---------------------------------------------------

This patch does affect the amd64 ACPI code and is probably what causes the problem on my amd64 system, which would explain why my build of the xen-4.14 hypervisor without this patch fixed the problem.

I think this bug should be reclassified as a bug in src:xen. I would also ask the Debian Xen Team why they are backporting patches from the upstream xen unstable branch into Debian's 4.14 package, which is currently shipping on Debian stable (bullseye). IMHO, the aforementioned patches, which are not in the upstream stable 4.14 branch, should not be included in the xen package for Debian stable.

Regards,

Chuck Zmudzinski
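
For readers who want to repeat the experiment described above, the patch removal can be sketched as a small filter over the series file. This is only a minimal sketch, not the procedure actually used in the message. It assumes that debian/patches/series is a plain quilt-style list with one patch file name per line, optionally followed by options such as -p1. The names filter_series and DROPPED are hypothetical and introduced here purely for illustration.

```python
# Patch file names listed in the message above as removed from the series.
DROPPED = {
    "0027-xen-rpi4-implement-watchdog-based-reset.patch",
    "0028-tools-python-Pass-linker-to-Python-build-process.patch",
    "0029-xen-arm-acpi-Don-t-fail-if-SPCR-table-is-absent.patch",
    "0030-xen-acpi-Rework-acpi_os_map_memory-and-acpi_os_unmap.patch",
    "0031-xen-arm-acpi-The-fixmap-area-should-always-be-cleare.patch",
    "0032-xen-arm-Check-if-the-platform-is-not-using-ACPI-befo.patch",
    "0033-xen-arm-Introduce-fw_unreserved_regions-and-use-it.patch",
    "0034-xen-arm-acpi-add-BAD_MADT_GICC_ENTRY-macro.patch",
    "0035-xen-arm-traps-Don-t-panic-when-receiving-an-unknown-.patch",
}


def filter_series(text, dropped):
    """Return the series text without the entries named in `dropped`.

    Assumption: the first whitespace-separated field of each line is the
    patch file name. Blank lines and '#' comment lines are kept unchanged.
    """
    kept = []
    for line in text.splitlines():
        fields = line.split()
        name = fields[0] if fields else ""
        if name in dropped:
            continue  # skip this patch entry
        kept.append(line)
    return "".join(line + "\n" for line in kept)


# Hypothetical usage, run from the top of an unpacked src:xen tree:
#   from pathlib import Path
#   series = Path("debian/patches/series")
#   series.write_text(filter_series(series.read_text(), DROPPED))
```

The filtered text would be written back to the series file before rebuilding the package. How the series file is maintained in the actual Debian Xen packaging is not stated in the message, so the file layout here is an assumption.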
As a follow-up to my last comment on this bug: the problems I see with my bullseye amd64 dom0 point to an ACPI power-down/reset issue, but only on the Debian version of Xen-4.14. I do not see the problem with any version of the linux kernel, either on bare metal or on the Debian version of the Xen-4.11 hypervisor from buster.

On the Debian Xen-4.14 hypervisor, the problem manifests itself as follows: the Debian dom0 reaches the systemd power off target, but the power does not actually turn off. I can recover only by pressing the physical reset button on the computer or by physically unplugging it.

There is one slight difference from what Elliott reported. Not only does the power supply remain powered after shutdown, but the console messages about powering down also remain on the monitor after the systemd power down target is reached, and power to the monitor persists as well.

For my amd64 system, this bug would probably be fixed on Debian stable by a separate Xen-4.14 package for Debian stable that removes at least the following patches from the debian/patches series of the current Xen-4.14 package for stable:

0027-xen-rpi4-implement-watchdog-based-reset.patch
0029-xen-arm-acpi-Don-t-fail-if-SPCR-table-is-absent.patch
0030-xen-acpi-Rework-acpi_os_map_memory-and-acpi_os_unmap.patch
0031-xen-arm-acpi-The-fixmap-area-should-always-be-cleare.patch
0032-xen-arm-Check-if-the-platform-is-not-using-ACPI-befo.patch
0033-xen-arm-Introduce-fw_unreserved_regions-and-use-it.patch
0034-xen-arm-acpi-add-BAD_MADT_GICC_ENTRY-macro.patch
0035-xen-arm-traps-Don-t-panic-when-receiving-an-unknown-.patch

The 0028-tools-python-Pass-linker-to-Python-build-process.patch is probably not related to this bug, but I have not verified that the bug is still fixed when that patch is kept.
I would defer to people more knowledgeable about the problems of building Xen on Debian with various versions of python to decide whether or not to remove the 0028-tools-python... patch.

I think the aforementioned patches to xen/arm and xen/acpi might be suitable for testing in a Debian Xen package targeting bookworm/testing or sid/unstable, but not for Debian bullseye/stable. As it is now, it appears that the Debian Xen Team makes no distinction between stable, testing, and unstable for its current Xen-4.14 package, and IMHO that is the root cause of this bug on Debian stable.

If the Debian Xen Team wants to experiment with patches from the unstable branch of upstream Xen on a Debian version of Xen-4.14, I respectfully ask that it do so only on bookworm/testing or sid/unstable, and that it ship a separate, more conservative package for bullseye/stable that is closer to the official upstream Xen 4.14.x version than the package currently shipping on bullseye/stable.

Regards,

Chuck Zmudzinski