Public bug reported:
[IMPACT]
On the embedded ARM cores of Bluefield-2 and Bluefield-3 (Ubuntu 22.04 Jammy),
ptp4l randomly goes out of sync during long-running operation (after roughly
24 hours) with the error message:
"ptp4l[3416283.946]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)"
Debugging traces show that the failure occurs in the sendto() system call
when ptp4l attempts to send DelayReq messages: the call returns error code -6
(ENXIO, "No such device or address"). The issue undermines PTP synchronization
reliability on Bluefield hardware. It was reproduced consistently on BF2 and
BF3 systems but not on CX6 DX hardware.
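For readers matching the raw trace value against the symbolic name, the following is a small sketch of how a kernel-style negative return code such as -6 can be decoded. The helper name describe_errno is hypothetical and is not part of ptp4l or the kernel; the message text comes from the local C library, so it is shown here as it would appear on Linux.

```python
import errno
import os


def describe_errno(ret: int) -> str:
    """Decode a kernel-style return value (e.g. -6) into its errno name and message.

    Hypothetical helper for illustration only.
    """
    code = -ret if ret < 0 else ret
    # errno.errorcode maps numeric values to symbolic names such as "ENXIO".
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


# On Linux, -6 decodes to ENXIO, the value seen in the sendto() traces.
print(describe_errno(-6))
```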
The root cause is corruption of FPSIMD (floating-point/SIMD) register state
across context switches in kernel mode. When the kernel uses NEON/FPSIMD
instructions for network operations or cryptographic functions and a context
switch occurs, the register state can be lost or corrupted. Subsequent
operations, including network socket calls, then behave unpredictably, which
shows up as the sendto() failures that disrupt PTP synchronization.
[FIX]
Backporting the upstream commit:
aefbab8e77eb16b56e18f24b85a09ebf4dc60e93 ("arm64: fpsimd: Preserve/restore
kernel mode NEON at context switch")
This commit introduces proper preservation and restoration of FPSIMD
register state during context switches when kernel code is using
NEON/FPSIMD instructions. It adds a new thread flag TIF_KERNEL_FPSTATE
to track when tasks are using FPSIMD in kernel mode, and modifies the
context switch hook to save/restore the kernel FPSIMD state to/from
struct thread_struct. This prevents FPSIMD register corruption that can
affect network operations and other kernel code that relies on NEON/FPSIMD
instructions.
The backport adapts the upstream changes to the linux-bluefield-5.15
kernel structure and ensures compatibility with the existing FPSIMD
handling infrastructure in the Jammy kernel base.
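The save/restore idea described in this section can be illustrated with a toy model. This is a deliberately simplified sketch and not the kernel's code: the Task and Cpu classes, the switch_to function, and the four-entry register file are all hypothetical stand-ins. Only the concepts of a per-task flag (TIF_KERNEL_FPSTATE) and a per-task save area (in struct thread_struct) come from the commit description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NUM_REGS = 4  # toy register file; real arm64 has 32 128-bit vector registers


@dataclass
class Task:
    name: str
    # Stands in for TIF_KERNEL_FPSTATE: task is inside a kernel-mode NEON section.
    kernel_fpstate: bool = False
    # Stands in for the kernel FPSIMD save area in struct thread_struct.
    saved_kernel_regs: Optional[List[int]] = None


@dataclass
class Cpu:
    regs: List[int] = field(default_factory=lambda: [0] * NUM_REGS)
    current: Optional[Task] = None


def switch_to(cpu: Cpu, nxt: Task, preserve: bool) -> None:
    """Toy context switch; with preserve=True it saves/restores kernel FPSIMD state."""
    prev = cpu.current
    if preserve and prev is not None and prev.kernel_fpstate:
        # Outgoing task was using the registers in kernel mode: save them.
        prev.saved_kernel_regs = list(cpu.regs)
    if preserve and nxt.kernel_fpstate and nxt.saved_kernel_regs is not None:
        # Incoming task was interrupted mid-section: restore its registers.
        cpu.regs = list(nxt.saved_kernel_regs)
    cpu.current = nxt


def run_scenario(preserve: bool) -> List[int]:
    """Task a is switched out mid-section, task b overwrites the registers, a resumes."""
    cpu = Cpu()
    a, b = Task("a"), Task("b")
    cpu.current = a
    a.kernel_fpstate = True      # a enters a kernel-mode NEON section
    cpu.regs = [1, 2, 3, 4]      # a's in-flight register contents
    switch_to(cpu, b, preserve)
    cpu.regs = [9, 9, 9, 9]      # b uses the same registers
    switch_to(cpu, a, preserve)
    return list(cpu.regs)        # what a sees when it resumes


print("without preservation:", run_scenario(preserve=False))
print("with preservation:   ", run_scenario(preserve=True))
```

Without preservation the resumed task sees the other task's values; with preservation it sees its own. The real implementation handles many details this model ignores, such as user-mode FPSIMD state and SVE.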
[TEST CASE]
Compile tested on linux-bluefield-5.15 on the master-next branch.
Functional testing involved reproducing the original ptp4l synchronization
issue on BF2/BF3 hardware by running extended PTP operations for multiple days.
After applying the patch, the system was tested for 7 consecutive days under
the same conditions that previously triggered the issue within 24 hours. No
ptp4l synchronization failures or ENXIO errors from sendto() calls were
observed during the extended test period.
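For a multi-day run like this, checking the ptp4l log for fault transitions can be automated. The sketch below is one possible way to do that and is not the tooling used in the test above. The function name find_fault_transitions is hypothetical, and the regular expression assumes log lines shaped like the error message quoted in [IMPACT].

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as:
#   ptp4l[3416283.946]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
FAULT_RE = re.compile(
    r"ptp4l\[(?P<ts>\d+(?:\.\d+)?)\]: port (?P<port>\d+): "
    r"(?P<old>\w+) to FAULTY on FAULT_DETECTED"
)


def find_fault_transitions(lines: Iterable[str]) -> List[Tuple[float, int, str]]:
    """Return (timestamp, port, previous_state) for each transition to FAULTY."""
    hits = []
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            hits.append((float(m.group("ts")), int(m.group("port")), m.group("old")))
    return hits


sample = [
    "ptp4l[3416283.946]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)",
]
print(find_fault_transitions(sample))
```

An empty result over the whole test window corresponds to the "no synchronization failures" outcome reported above.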
[REGRESSION POTENTIAL]
The backport introduces new code paths for FPSIMD state management during
context switches. Potential regression areas are added context switch overhead
and incompatibility with existing kernel-mode FPSIMD users. The 7-day test run
described under [TEST CASE] completed without failures, which gives some
confidence in the stability of the backported implementation.
** Affects: linux-bluefield (Ubuntu)
Importance: Undecided
Status: New
https://bugs.launchpad.net/bugs/2119457
Title:
Ubuntu 22.04: ptp4l randomly goes out of sync on BF2/BF3 with ENXIO
errors from sendto() calls
Status in linux-bluefield package in Ubuntu:
New