This looks like a lot of heavily LLM-assisted effort. Please review the
relevant documentation, starting here:
   https://docs.kernel.org/process/submitting-patches.html#using-assisted-by

==> I partly agree. The code and bug analysis are done manually.
LLM use was the out of tree test harness and lightly polishing
the commit message. None of the submitted code is generated.
If you'd prefer, I can reword the changelog in my own words or
add an Assisted-by tag ?

V1         was a simple clamp
v2/V3      was a simple locking mechanism which is pretty straight forward.
V4         bounce buffer, modelled on gpiolib-cdev (acknowledged in patch)
V5 and V6  entirely your review comments (__free, scoped_guard,
           kfifo_out_spinlocked, __guarded_by, context analysis)


I feel the testing strategy is pretty questionable. Any invariant
violation is possible with that type of meddling.

==> The underlying bug is a kfifo SPSC contract violation. My intent with the
test wasn't to simulate the race itself, but to reconstruct the post race state
specifically where (in - out) exceeds the buffer size and show it causes a
usercopy overflow in the unpatched code, handled safely after the fix.

==> I take your point that forcing that state can itself produce violations that
wouldn't occur naturally. Since the bug is provable from the source but hard to
trigger on demand, I'd rather ask what validation you'd accept here?

I was interested in whether you drove the interrupt sequence via
emulated hardware. I asked because upstream qemu doesn't currently
support the snoop device.

==> My apologies for the confusion, I mixed up things. I have not driven the
interrupt sequence in emulation; as you noted, upstream QEMU doesn't model the
snoop device. I've described the actual hardware context below.

In v3 you said:
   The issue was observed on physical AST2600 (dual-core Cortex-A7)
   in production under heavy POST code traffic during concurrent
   userspace reads.
   
https://lore.kernel.org/all/[email protected]/
Is this true? What platform did you test with?

==> Yes, the underlying failure is real. It was observed on an AST2600-based
production BMC running a vendor BSP kernel under continuous host reboot
cycles. Because that platform can't currently be brought up on pure
mainline without substantial out-of-tree board support, I have not run
this exact mainline patch on the physical silicon, observed under the
BSP kernel, not yet verified as the mainline patch. I should have made
that distinction clear earlier in the thread.

==> If there's a way you'd consider valid for validating a fix like this
without a full mainline bring up on the SoC, such as a targeted kfifo unit
test, or something else you'd accept.I'd appreciate the pointer and I'll
do that.

Reply via email to