On Sun, 2023-10-08 at 09:23 +0100, Richard Purdie via
lists.openembedded.org wrote:
> On Sat, 2023-10-07 at 23:05 +0100, Richard Purdie via
> lists.openembedded.org wrote:
> > I thought I'd summarise where things are at with the 6.5 kernel.
> > 
> > We've fixed:
> > * the ARM LTP OOM lockup (kernel patch)
> > * the locale ARM selftest failure which was OOM due to silly buffer 
> >   allocations in 6.5 (kernel commandline)
> > * the ARM jitterentropy errors (kernel patch)
> > * the cryptodev build failures (recipe updated)
> > 
> > We've also:
> > * disabled the strace tests that fail with 6.5.
> > * made sure the serial ports and getty counts match
> > * added ttyrun which wraps serial consoles and avoids hacks
> > * made the qemurunner logging save all the port logs
> > * made the qemurunner write the binary data it is sent verbatim
> > * made sure to use nodelay on qemu's tcpserial
> > 
> > This leaves an annoying serial console problem where ttyS1 never has
> > the getty login prompt appear.
> > 
> > What we know:
> > 
> > * We've only seen this on x86 more recently (yesterday/today) but have
> > seen it on ARM in the days before that.
> > 
> > * It affects both sysvinit and systemd images.
> > 
> > * Systemd does print that it started a getty on ttyS0 and ttyS1 when
> > the failure occurs
> > 
> > * There is a getty running according to "ps" when the failure occurs
> > 
> > * Only one or three characters are ever received on ttyS1 in the
> > failure case (0x0d and 0x0a chars, i.e. CR and LF)
> > 
> > * It can't be any kind of utf-8 conversion issue since the login prompt
> > isn't visible in the binary log
> > 
> > * the kernel boot logs do show the serial port created with the same
> > ioport and irq on x86.
> > 
> > Previously we did see some logs with timing issues on the ttyS0 port
> > but the nodelay parameter may have helped with that.
> > 
> > There are debug patches in master-next against qemurunner which poke
> > around via ttyS0 to gather more debug data when things fail.
> > 
> > The best failure log we have is now this one:
> > 
> > https://autobuilder.yoctoproject.org/typhoon/#/builders/79/builds/5874/steps/14/logs/stdio
> > 
> > where I've saved the logs:
> > 
> > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853
> > and
> > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2
> > 
> > You can see ttyS1 times out after 1000 seconds and the port only has a
> > single byte (in the .2 file). The other log shows ps output showing the
> > getty running for ttyS1.
> > 
> > Ideas welcome on where to go from here.
> > 
> > I've tweaked master-next to keep reading the ttyS1 port after we poke
> > it from ttyS0 to see if that reveals anything next time it fails (build
> > running).
> 
> Testing overnight with the new debug yielded:
> 
> https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5895/steps/14/logs/stdio
> 
> The interesting bit being:
> 
> """
> WARNING: core-image-full-cmdline-1.0-r0 do_testimage: Extra read data: 
> Poky (Yocto Project Reference Distro) 4.2+snapshot-
> 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1
> 
> qemux86-64 login: helloA
> 
> Poky (Yocto Project Reference Distro) 4.2+snapshot-
> 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1
> qemux86-64 login: 
> 
> """
> 
> i.e. the getty prompt didn't appear within 1000s, but at some point during
> shutdown the original prompt, the "helloA" and the new getty prompt all did.
> 
> So the data *is* there but stuck in a buffer somehow. Kernel or qemu
> side, I don't know.

To update, the latest debug is:

https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/5836/steps/14/logs/stdio

which shows that echoing "helloA" to /dev/ttyS1 is enough to flush the
data; we don't need to restart the getty. It also shows that other
system console messages can be "stuck" in the queue.

I also tested forcing a read on the socket, but that just raises a
BlockingIOError exception and doesn't help.
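
For anyone wanting to reproduce the forced-read behaviour outside the
autobuilder, here is a minimal sketch. It is not the qemurunner code:
socket.socketpair() stands in for the qemu TCP serial connection, and
read_if_ready() is a hypothetical helper name. It only shows that recv() on
a non-blocking socket with nothing queued raises BlockingIOError, so a
forced read cannot pull out data that has not reached our end yet.

```python
import select
import socket


def read_if_ready(sock, wait=0.0):
    """Return pending bytes, or None if nothing arrives within `wait` seconds.

    Hypothetical helper for illustration only. An empty bytes result
    means the peer closed the connection.
    """
    readable, _, _ = select.select([sock], [], [], wait)
    if not readable:
        return None
    return sock.recv(4096)


# Stand-in for the serial port socket (assumption, not the real setup).
host_end, guest_end = socket.socketpair()
host_end.setblocking(False)

try:
    host_end.recv(4096)  # nothing has been sent to us yet
except BlockingIOError:
    print("forced read raised BlockingIOError: no data on our side")

print(read_if_ready(host_end))           # nothing pending yet
guest_end.sendall(b"login: ")
print(read_if_ready(host_end, wait=1.0))  # data is now available
```

Polling with select() avoids the exception, but like the forced read it
cannot help if the bytes are held up on the kernel or qemu side.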

I'm going to try writing "\n\n" to the port after 120s if the getty
prompt hasn't appeared, to see if that helps.
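
Roughly, the idea looks like the sketch below. This is illustrative only
and not the actual qemurunner change: wait_for_prompt() and its parameters
are hypothetical names, and the 120s/1000s defaults are just the figures
mentioned in this thread. It reads from the port socket until the prompt
shows up, writes the nudge once if the prompt is late, and gives up at the
overall timeout.

```python
import select
import socket
import time


def wait_for_prompt(sock, prompt=b"login:", nudge_after=120.0,
                    timeout=1000.0, nudge=b"\n\n"):
    """Wait for `prompt` on `sock`, sending `nudge` once if it is late.

    Hypothetical sketch. Returns (found, data, nudged).
    """
    start = time.monotonic()
    data = b""
    nudged = False
    while True:
        elapsed = time.monotonic() - start
        if elapsed >= timeout:
            return False, data, nudged
        if not nudged and elapsed >= nudge_after:
            sock.sendall(nudge)  # poke the port to try to flush stuck data
            nudged = True
        # Sleep until data arrives, the nudge is due, or the timeout expires.
        deadline = timeout if nudged else min(nudge_after, timeout)
        wait = max(0.0, deadline - elapsed)
        readable, _, _ = select.select([sock], [], [], wait)
        if readable:
            chunk = sock.recv(4096)
            if not chunk:  # peer closed the connection
                return False, data, nudged
            data += chunk
            if prompt in data:
                return True, data, nudged
```

The nudge is sent only once, so if the prompt still doesn't appear the
existing timeout path and debug logging are unaffected.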

Cheers,

Richard




