(+Ard)

On 05/22/19 16:22, Laszlo Ersek wrote:
> On 05/22/19 15:06, Igor Mammedov wrote:
>> On Tue, 21 May 2019 09:26:16 -0400
>> "Michael S. Tsirkin" <m...@redhat.com> wrote:
>>
>>> On Tue, May 21, 2019 at 12:49:48PM +0100, Peter Maydell wrote:
>>>> On Tue, 21 May 2019 at 00:10, Michael S. Tsirkin <m...@redhat.com>
>>>> wrote:
>>>>>
>>>>> The following changes since commit
>>>>> 2259637b95bef3116cc262459271de08e038cc66:
>>>>>
>>>>>   Merge remote-tracking branch 'remotes/kevin/tags/for-upstream'
>>>>>   into staging (2019-05-20 17:22:05 +0100)
>>>>>
>>>>> are available in the Git repository at:
>>>>>
>>>>>   git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git
>>>>>   tags/for_upstream
>>>>>
>>>>> for you to fetch changes up to
>>>>> 0c05ec64c388aea59facbef740651afa78e04f50:
>>>>>
>>>>>   tests: acpi: print error unable to dump ACPI table during
>>>>>   rebuild (2019-05-20 18:40:02 -0400)
>>>>>
>>>>> ----------------------------------------------------------------
>>>>> pci, pc, virtio: features, fixes
>>>>>
>>>>> reconnect for vhost blk
>>>>> tests for UEFI
>>>>> misc other stuff
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
>>>>>
>>>>> ----------------------------------------------------------------
>>>>
>>>> Hi -- this failed 'make check' for 32-bit Arm hosts:
>>>>
>>>> ERROR:/home/peter.maydell/qemu/tests/acpi-utils.c:145:acpi_find_rsdp_address_uefi:
>>>> code should not be reached
>>>> Aborted
>>>> ERROR - too few tests run (expected 1, got 0)
>>>> /home/peter.maydell/qemu/tests/Makefile.include:885: recipe for
>>>> target 'check-qtest-aarch64' failed
>>>>
>>>> thanks
>>>> -- PMM
>>>
>>> Nothing jumps out ... Igor?
>> On a 32-bit ARM host, it looks like UEFI crashes (CCing Laszlo)
>> with:
>>
>> InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 469E52C0
>> ASSERT [DxeCore] 
>> /home/lacos/src/upstream/qemu/roms/edk2/MdePkg/Library/BaseLib/String.c(1090):
>>  Length < _gPcd_FixedAtBuild_PcdMaximumAsciiStringLength
>>
>> CLI to reproduce:
>>
>> qemu-system-aarch64  -display none -machine virt,accel=tcg
>> -nodefaults -nographic -drive
>> if=pflash,format=raw,file=pc-bios/edk2-aarch64-code.fd,readonly
>> -drive if=pflash,format=raw,file=pc-bios/edk2-arm-vars.fd,snapshot=on
>> -cdrom tests/data/uefi-boot-images/bios-tables-test.aarch64.iso.qcow2
>> -cpu cortex-a57 -serial stdio
>
> This is very interesting. I had obviously tested booting
> "bios-tables-test.aarch64.iso.qcow2" against "edk2-aarch64-code.fd",
> using TCG, on my x86_64 laptop. (And, I've run the above exact command
> just now, at commit a4f667b67149 -- it works 100% fine.)
>
> However, I've never been *near* a 32-bit ARM host. Therefore my
> suspicion is that the AARCH64 UEFI guest code tickles something in the
> 32-bit ARM code generator. It could be a bug in 32-bit ARM TCG, or it
> could be a bug in edk2 that is exposed by 32-bit ARM TCG.
>
> The direct assertion failure is mostly useless. The AsciiStrLen()
> function does what you'd expect it to, except it has a kind of "safety
> check" where it trips an assertion if the string length (under
> measurement) exceeds a pre-set platform constant. It helps with
> catching memory corruption errors.
>
> $ git show edk2-stable201903:MdePkg/Library/BaseLib/String.c | less
> 1090g
>
> UINTN
> EFIAPI
> AsciiStrLen (
>   IN      CONST CHAR8               *String
>   )
> {
>   UINTN                             Length;
>
>   ASSERT (String != NULL);
>
>   for (Length = 0; *String != '\0'; String++, Length++) {
>     //
>     // If PcdMaximumUnicodeStringLength is not zero,
>     // length should not more than PcdMaximumUnicodeStringLength
>     //
>     if (PcdGet32 (PcdMaximumAsciiStringLength) != 0) {
>       ASSERT (Length < PcdGet32 (PcdMaximumAsciiStringLength)); <-- HERE
>     }
>   }
>   return Length;
> }
>
> (Never mind that the comment has a typo -- it incorrectly references
> "PcdMaximumUnicodeStringLength", but the PCD that's actually checked
> is (correctly) "PcdMaximumAsciiStringLength".)
>
> The constant is set to decimal 1,000,000 in ArmVirtQemu builds
> (inherited from MdePkg.dec), and that's indeed a quite unlikely length
> for real-world strings seen by firmware.
>
> I'll take a closer look once I have access to a 32-bit ARM host, but
> I'll definitely need help. Basically we should compare the original
> AARCH64 (dis)assembly with the QEMU-generated 32-bit ARM assembly.
> Hopefully I can at least get a backtrace myself.

I have narrowed down the issue sufficiently that I think I can hand it
over to Peter and Ard now -- because they know AARCH32 and AARCH64
assembly, and "target/arm/translate-a64.c" and "tcg/arm/*" too.

To summarize the issue for Ard, the symptom is that AARCH64 ArmVirtQemu
runs perfectly fine with TCG on an x86-64 system, but it crashes on an
AARCH32 host system.

Below is my analysis.

(1) First, I determined a backtrace for the crash. For this, I flipped
the ASSERT() failure disposition from CpuDeadLoop() to CpuBreakpoint(),
via "PcdDebugPropertyMask". This printed a very nice (numeric) stack
trace, which wasn't hard to turn into symbols with "objdump -S", using
edk2's Build directory.
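
For readers who want to repeat step (1), the mask change can be sketched in C as below. This is only an illustration: the two bit values are quoted from memory of "MdePkg/Include/Library/DebugLib.h" and should be checked against your edk2 tree, the helper name is hypothetical, and in practice the resulting value goes into the platform DSC rather than into C code.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as recalled from MdePkg/Include/Library/DebugLib.h;
 * treat them as an assumption and verify against your edk2 tree. */
#define DEBUG_PROPERTY_ASSERT_BREAKPOINT_ENABLED 0x10u
#define DEBUG_PROPERTY_ASSERT_DEADLOOP_ENABLED   0x20u

/* Hypothetical helper: switch the ASSERT() failure disposition from
 * CpuDeadLoop() to CpuBreakpoint() in a PcdDebugPropertyMask value. */
uint8_t assert_deadloop_to_breakpoint(uint8_t mask)
{
    uint8_t out = (uint8_t)((mask & ~DEBUG_PROPERTY_ASSERT_DEADLOOP_ENABLED)
                            | DEBUG_PROPERTY_ASSERT_BREAKPOINT_ENABLED);

    /* Postcondition: deadloop bit cleared, breakpoint bit set. */
    assert((out & DEBUG_PROPERTY_ASSERT_DEADLOOP_ENABLED) == 0);
    assert((out & DEBUG_PROPERTY_ASSERT_BREAKPOINT_ENABLED) != 0);
    return out;
}
```

For example, a starting mask of 0x2F (an assumed value, not taken from this thread) would become 0x1F.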


(2) The actual crash is completely irrelevant, as it occurs on a cleanup
path after the DXE Core fails to load the very first DXE driver that it
attempts to load. The cleanup path should never be entered (i.e. the
attempt to load the DXE driver in question should never fail). BTW the
DXE driver in question is "MdeModulePkg/Universal/DevicePathDxe", but
it's mostly irrelevant.


(3) The function that first encounters a failure -- i.e. where the guest
firmware behavior diverges, dependent on whether qemu-system-aarch64/TCG
executes on an AARCH32 host, or an x86-64 host -- is
PeCoffLoaderRelocateImage(), in
"MdePkg/Library/BasePeCoffLib/BasePeCoff.c". It is invoked when the DXE
Core loads DevicePathDxe. The following check fails in the function (on
the AARCH32 host), when it attempts to process the very first relocation
block in DevicePathDxe:

>       //
>       // Add check for RelocBase->SizeOfBlock field.
>       //
>       if (RelocBase->SizeOfBlock == 0) {
>         ImageContext->ImageError = IMAGE_ERROR_FAILED_RELOCATION;
>         return RETURN_LOAD_ERROR;
>       }

I logged the address and the contents of (*RelocBase). The address is
the same in both the working and the failing case, but the contents differ.
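
To make the failure mode in step (3) concrete, here is a minimal stand-alone C sketch of walking PE/COFF base relocation blocks. It is not the edk2 code: the struct and function names are hypothetical, the header is read in host byte order (PE/COFF is little-endian), and the size check is somewhat stricter than the "SizeOfBlock == 0" test quoted above. It only shows why a zeroed header stops the walk immediately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 8-byte header that starts every base relocation block. */
typedef struct {
    uint32_t VirtualAddress;
    uint32_t SizeOfBlock;   /* header plus entries, in bytes */
} BaseRelocHeader;

/* Hypothetical helper: returns the number of relocation blocks in the
 * directory, or -1 if a block has an impossible size (including 0). */
int count_reloc_blocks(const uint8_t *dir, size_t dir_size)
{
    size_t off = 0;
    int blocks = 0;

    assert(dir != NULL);

    while (dir_size - off >= sizeof(BaseRelocHeader)) {
        BaseRelocHeader hdr;

        /* memcpy avoids unaligned loads from the image buffer. */
        memcpy(&hdr, dir + off, sizeof hdr);

        /* A zeroed header (SizeOfBlock == 0) is rejected here. */
        if (hdr.SizeOfBlock < sizeof hdr ||
            hdr.SizeOfBlock > dir_size - off) {
            return -1;
        }
        off += hdr.SizeOfBlock;
        blocks++;
    }
    return blocks;
}
```

A block header whose eight bytes have been zeroed makes the function return -1 on the first iteration it reaches, which corresponds to the RETURN_LOAD_ERROR path above.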


(4) I tracked back a little bit to CoreLoadImageCommon() in
"MdeModulePkg/Core/Dxe/Image/Image.c", to the spot where the image file
is fetched (for later relocation). The following function call succeeds
in both cases, however it returns *different data* as the
DevicePathDxe.efi image file:

>     FHand.Source = GetFileBufferByFilePath (
>                       BootPolicy,
>                       FilePath,
>                       &FHand.SourceSize,
>                       &AuthenticationStatus
>                       );

Base address and size are identical, but the CRC32s differ. After
hexdumping the image variants (functional vs. broken with garbled
relocations), and diffing the logs, an interesting pattern emerged. In
every 4096-byte block, the 8-byte word at offset 4032 (0xFC0) is zeroed
out in the broken variant. There are no other differences, as far as I
can tell. 4096 = 64*64, and the qword in question is the start of the
last 64-byte block (63*64=4032). I'm attaching the two log sections
("good.txt" (from the x86-64 host) vs "bad.txt" (from the aarch32
host)).
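
The pattern in step (4) was found by diffing hexdumps, but the same comparison can be sketched in C. The helper below is hypothetical and assumes both buffers have the same size and that offset 0 of the buffer is 4096-byte aligned with respect to the pattern. It counts differing 8-byte words per offset within a 4 KiB block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 4096u
#define QWORD_BYTES 8u
#define SLOTS       (BLOCK_BYTES / QWORD_BYTES)

/* Hypothetical helper: for every differing 8-byte word, increment
 * histogram[(offset % 4096) / 8]. Returns the total number of
 * differing qwords. A trailing partial qword is ignored. */
size_t diff_qwords_by_block_offset(const uint8_t *good, const uint8_t *bad,
                                   size_t size, size_t histogram[SLOTS])
{
    size_t total = 0;

    assert(good != NULL && bad != NULL && histogram != NULL);
    memset(histogram, 0, SLOTS * sizeof histogram[0]);

    for (size_t off = 0; off + QWORD_BYTES <= size; off += QWORD_BYTES) {
        if (memcmp(good + off, bad + off, QWORD_BYTES) != 0) {
            histogram[(off % BLOCK_BYTES) / QWORD_BYTES]++;
            total++;
        }
    }
    return total;
}
```

If the corruption matches the description above, only the slot for 0xFC0 (index 0xFC0 / 8 = 504) should be non-zero, with one hit per 4096-byte block.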


(5) Because the DevicePathDxe.efi image originates from the FvMain
firmware volume, which is embedded as an LZMA-compressed file into the
FVMAIN_COMPACT firmware volume, I hooked another CRC32 calculation into
LzmaUefiDecompress(), in
"MdeModulePkg/Library/LzmaCustomDecompressLib/LzmaDecompress.c". The
decompression is performed by the PEI Core with the help of the DXE IPL
PEIM; in other words it happens in the PEI phase. The log confirmed that
the decompressed data was identical on both hosts (x86-64 and aarch32).

Thus, the data corruption was introduced somewhere between the
decompression near the end of PEI, and GetFileBufferByFilePath() in the
DXE Core.
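
For anyone who wants to add similar checkpoints, a CRC-32 can be computed with a few lines of portable C. The sketch below is a generic bitwise implementation of the common CRC-32 (reflected polynomial 0xEDB88320); it is not the routine that was actually hooked into the firmware for this analysis, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32: reflected polynomial 0xEDB88320, initial value and
 * final XOR both 0xFFFFFFFF. Slow, but needs no lookup table. */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(data != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* If the low bit is set, shift and XOR in the polynomial;
             * otherwise just shift. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value applies: the nine ASCII bytes "123456789" yield 0xCBF43926. Logging such a checksum at each stage (after decompression, after each copy) narrows down where the data first changes.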


(6) Here I got a bit frustrated by the many possible paths in the
reading of firmware volumes, in the files
"MdeModulePkg/Core/Dxe/FwVol/FwVolRead.c" and
"MdeModulePkg/Core/Dxe/SectionExtraction/CoreSectionExtraction.c".
However, all those paths seemed to end in CopyMem(), one way or another
-- ultimately, CopyMem() would transfer the data from the decompressed
firmware volume (which was fine) to the caller of
GetFileBufferByFilePath() (which was not fine).


(7) CopyMem() comes from the BaseMemoryLib class.
"ArmVirtPkg/ArmVirt.dsc.inc" resolves it to the following lib instances:

> [LibraryClasses.common]
>   # use the accelerated BaseMemoryLibOptDxe by default, overrides for SEC/PEI below
>   BaseMemoryLib|MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
>
> [LibraryClasses.common.SEC]
>   BaseMemoryLib|MdePkg/Library/BaseMemoryLib/BaseMemoryLib.inf
>
> [LibraryClasses.common.PEI_CORE]
>   BaseMemoryLib|MdePkg/Library/BaseMemoryLib/BaseMemoryLib.inf
>
> [LibraryClasses.common.PEIM]
>   BaseMemoryLib|MdePkg/Library/BaseMemoryLib/BaseMemoryLib.inf

The optimized aarch64 assembly code can be seen here:

  https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CopyMem.S

It has great comments, and the 64-byte chunk size mentioned in the
comments made me realize that 0xFC0 equals 63 decimal * 64 decimal.


(8) I applied the following (proof of concept) patch:

> diff --git a/ArmVirtPkg/ArmVirt.dsc.inc b/ArmVirtPkg/ArmVirt.dsc.inc
> index a5d63751a343..c643a5a76718 100644
> --- a/ArmVirtPkg/ArmVirt.dsc.inc
> +++ b/ArmVirtPkg/ArmVirt.dsc.inc
> @@ -67,8 +67,7 @@ [LibraryClasses.common]
>    #
>    PcdLib|MdePkg/Library/DxePcdLib/DxePcdLib.inf
>
> -  # use the accelerated BaseMemoryLibOptDxe by default, overrides for SEC/PEI below
> -  BaseMemoryLib|MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
> +  BaseMemoryLib|MdePkg/Library/BaseMemoryLib/BaseMemoryLib.inf
>
>    # Networking Requirements
>  !include NetworkPkg/NetworkLibs.dsc.inc
> @@ -160,7 +159,6 @@ [LibraryClasses.common]
>
>  [LibraryClasses.common.SEC]
>    PcdLib|MdePkg/Library/BasePcdLibNull/BasePcdLibNull.inf
> -  BaseMemoryLib|MdePkg/Library/BaseMemoryLib/BaseMemoryLib.inf
>
>    DebugAgentLib|ArmPkg/Library/DebugAgentSymbolsBaseLib/DebugAgentSymbolsBaseLib.inf
>    SerialPortLib|ArmVirtPkg/Library/FdtPL011SerialPortLib/EarlyFdtPL011SerialPortLib.inf
> @@ -171,7 +169,6 @@ [LibraryClasses.common.SEC]
>
>  [LibraryClasses.common.PEI_CORE]
>    PcdLib|MdePkg/Library/PeiPcdLib/PeiPcdLib.inf
> -  BaseMemoryLib|MdePkg/Library/BaseMemoryLib/BaseMemoryLib.inf
>    HobLib|MdePkg/Library/PeiHobLib/PeiHobLib.inf
>    PeiServicesLib|MdePkg/Library/PeiServicesLib/PeiServicesLib.inf
>    MemoryAllocationLib|MdePkg/Library/PeiMemoryAllocationLib/PeiMemoryAllocationLib.inf
> @@ -186,7 +183,6 @@ [LibraryClasses.common.PEI_CORE]
>
>  [LibraryClasses.common.PEIM]
>    PcdLib|MdePkg/Library/PeiPcdLib/PeiPcdLib.inf
> -  BaseMemoryLib|MdePkg/Library/BaseMemoryLib/BaseMemoryLib.inf
>    HobLib|MdePkg/Library/PeiHobLib/PeiHobLib.inf
>    PeiServicesLib|MdePkg/Library/PeiServicesLib/PeiServicesLib.inf
>    MemoryAllocationLib|MdePkg/Library/PeiMemoryAllocationLib/PeiMemoryAllocationLib.inf

which replaces the assembly implementation of CopyMem() -- and of some
other functions -- with C implementations (which are also optimized; see
commit 01f688be90f5, "MdePkg/BaseMemoryLib: widen aligned accesses to 32
or 64 bits", 2016-09-13), in all module types.


(9) With this patch, the boot finished successfully (although it took
very long):

> BiosTablesTest: BiosTablesTest=41200000 Rsdp10=0 Rsdp20=40370000
> BiosTablesTest: Smbios21=0 Smbios30=43EF0000
> BiosTablesTest: press any key to exit


(10) Given that "translate-a64.c" is common between both x86-64 and
aarch32 hosts, I think it must be "tcg/arm/*" that doesn't interoperate
with the guest's "MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CopyMem.S",
for some reason. IOW, the aarch64 binary code is likely parsed correctly
into the internal representation, but the 32-bit ARM code generated from
the IR could hit some corner case.

Thanks,
Laszlo

Attachment: good.txt.xz
Description: application/xz

Attachment: bad.txt.xz
Description: application/xz
