Public bug reported:

[IMPACT]
On the embedded Arm cores of BlueField-2 and BlueField-3 (Ubuntu 22.04 Jammy),
ptp4l randomly goes out of sync during long-running operation (typically after
about 24 hours) with the error message:
"ptp4l[3416283.946]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)"

Debugging traces show that the failure occurs in the sendto() system
call when ptp4l sends DelayReq messages: the call returns -6 (ENXIO,
"No such device or address"). The issue affects PTP synchronization
reliability on BlueField hardware. It was reproduced consistently on BF2
and BF3 systems but not on CX6 DX hardware.
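
For reference, the -6 in the traces follows the kernel convention of
returning a negated errno value. A minimal Python sketch of decoding such
a return code is shown here; the helper name decode_kernel_rc is
hypothetical and is not part of ptp4l or the kernel.

```python
import errno
import os


def decode_kernel_rc(rc):
    """Map a negated-errno return code (e.g. -6) to (symbolic name, message).

    Hypothetical helper for reading trace output; not part of ptp4l.
    """
    if rc >= 0:
        raise ValueError("expected a negative return code")
    code = -rc
    # errno.errorcode maps numeric errno values to their symbolic names.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = decode_kernel_rc(-6)
print(name, "-", message)
```

The message text comes from the C library of the machine running the
script, so it may differ from the Linux wording quoted above.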

The root cause is corrupted FPSIMD (floating point SIMD) register state
during kernel mode context switches. When the kernel uses NEON/FPSIMD
instructions for network operations or cryptographic functions, the
register state can be lost or corrupted if a context switch occurs,
leading to unpredictable behavior in subsequent operations including
network socket calls. This corruption manifests as the observed sendto()
failures that disrupt PTP synchronization.

[FIX]
Backporting the upstream commit:
aefbab8e77eb16b56e18f24b85a09ebf4dc60e93 ("arm64: fpsimd: Preserve/restore 
kernel mode NEON at context switch")

This commit preserves and restores the FPSIMD register state across
context switches while kernel code is using NEON/FPSIMD instructions.
It adds a new thread flag, TIF_KERNEL_FPSTATE, to track tasks that are
using FPSIMD in kernel mode, and modifies the context switch hook to
save/restore the kernel FPSIMD state to/from struct thread_struct. This
prevents the FPSIMD register corruption that can affect network
operations and other kernel code that uses the FPSIMD registers.
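
The save/restore idea described above can be illustrated with a toy
model. This is only a sketch of the mechanism as described in this
report, not kernel code: the class ToyCPU, its four-entry register list,
and the run() helper are all hypothetical, and the per-task dictionary
merely stands in for the save area in struct thread_struct.

```python
class ToyCPU:
    """Toy model: one shared register file, tasks switched by a scheduler."""

    def __init__(self, preserve_kernel_state):
        self.regs = [0] * 4      # stands in for the FPSIMD register file
        self.preserve = preserve_kernel_state
        self.saved = {}          # per-task save area (stands in for thread_struct)

    def switch(self, prev, nxt, prev_in_kernel_neon):
        # Model of the context-switch hook: save prev's registers only when
        # the fix is present and prev was using NEON in kernel mode (the role
        # the report assigns to TIF_KERNEL_FPSTATE).
        if self.preserve and prev_in_kernel_neon:
            self.saved[prev] = list(self.regs)
        # Restore the incoming task's kernel-mode state if one was saved.
        if nxt in self.saved:
            self.regs = self.saved.pop(nxt)


def run(preserve):
    cpu = ToyCPU(preserve)
    cpu.regs = [1, 2, 3, 4]      # task A is midway through a kernel NEON section
    cpu.switch("A", "B", prev_in_kernel_neon=True)
    cpu.regs = [9, 9, 9, 9]      # task B overwrites the shared registers
    cpu.switch("B", "A", prev_in_kernel_neon=False)
    return cpu.regs              # what task A sees when it resumes


print("without save/restore:", run(False))
print("with save/restore:   ", run(True))
```

Without the save/restore step, task A resumes with task B's register
contents; with it, task A sees the values it had before being switched
out.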

The backport adapts the upstream changes to the linux-bluefield-5.15
kernel structure and ensures compatibility with the existing FPSIMD
handling infrastructure in the Jammy kernel base.

[TEST CASE]
Compile tested on linux-bluefield-5.15 on the master-next branch.
Functional testing involved reproducing the original ptp4l synchronization 
issue on BF2/BF3 hardware by running extended PTP operations for multiple days. 
After applying the patch, the system was tested for 7 consecutive days under 
the same conditions that previously triggered the issue within 24 hours. No 
ptp4l synchronization failures or ENXIO errors from sendto() calls were 
observed during the extended test period.
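
A long soak test like this can be checked by scanning the ptp4l log for
transitions to FAULTY. The following is a minimal sketch, assuming log
lines have the same shape as the error message quoted in [IMPACT]; the
regular expression and the find_faults helper are illustrative and were
not part of the testing described here.

```python
import re

# Pattern modelled on the single log line quoted in this report.
FAULT_RE = re.compile(
    r"ptp4l\[(?P<ts>\d+\.\d+)\]: port (?P<port>\d+): "
    r"(?P<old>\w+) to (?P<new>\w+) on (?P<event>\w+)"
)


def find_faults(lines):
    """Return (timestamp, port, old_state, event) for each transition to FAULTY."""
    faults = []
    for line in lines:
        m = FAULT_RE.search(line)
        if m and m.group("new") == "FAULTY":
            faults.append(
                (float(m.group("ts")), int(m.group("port")),
                 m.group("old"), m.group("event"))
            )
    return faults


sample = [
    "ptp4l[3416283.946]: port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)",
]
print(find_faults(sample))
```

An empty result over the whole test window corresponds to the "no
synchronization failures" outcome reported above.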

[REGRESSION POTENTIAL]
The backport introduces new code paths for FPSIMD state management during 
context switches. Potential regression areas include context switch performance 
overhead and compatibility with existing kernel FPSIMD users. However, the 
extensive 7-day testing provides confidence in the backported implementation's 
stability.

** Affects: linux-bluefield (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-bluefield in Ubuntu.
https://bugs.launchpad.net/bugs/2119457

Title:
  Ubuntu 22.04: ptp4l randomly goes out of sync on BF2/BF3 with ENXIO
  errors from sendto() calls

Status in linux-bluefield package in Ubuntu:
  New

