On January 31, 2024 1:28:37 PM PST, hw <h...@adminart.net> wrote:
>On Wed, 2024-01-31 at 09:27 -0500, Gary Dale wrote:
>> On 2024-01-30 15:54, hw wrote:
>> > On Mon, 2024-01-29 at 11:42 -0500, Gary Dale wrote:
>> > > I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability
>> > > to see the root directory even when I am logged in as root (su -).
>> > >
>> > > This has been happening intermittently for several months. I initially
>> > > thought it might be related to a failing NVMe drive that was part of a
>> > > RAID1 array that is mounted as "/" but I replaced the device and the
>> > > problem is still happening.
>> > > [...]
>> > What happens when you put the device you replaced back?
>> >
>> How could putting a known-failing device back in help? The problem
>> existed before I replaced it and continues to exist after the replacement.
>
>It sounded like you were able to list the root directory (at least
>sometimes) before you did the replacement. Manually failing the
>device (perhaps after adding it back first) could make a difference.
>
>I've seen such indefinite hangs only when an NFS share has become
>unreachable after it had been mounted. You could use clonezilla to
>make a copy and then perhaps convert the file system to btrfs.
>
>Do you still have the problem when you remove one of the NVMe
>devices? Perhaps you have the equivalent of a bad SATA cable, or the
>mainboard doesn't like it when you access two of them at the same
>time, or something like that. Even simple network cables can behave
>very strangely, and NVMe may be a bit more complicated than that.
>
>Running fsck on every boot to work around an issue like this is
>certainly a bad idea. Doesn't fsck report anything? If it really
>makes a difference in itself rather than creating some side effect
>that leads to the root directory being readable, it should report
>something. Perhaps you need to increase its verbosity.
>
>If there's no report, then it would look like a side effect and raise
>the question of what that side effect might be. Does fsck run before the
>RAID has been brought up or after? Is the RAID up when booting is
>completed? What does mdadm say about the device(s)? Can you still
>list the root directory when you manually fail either drive? Under
>exactly what circumstances can you list the root directory, and under
>what circumstances can you not?
>
>You need to do some investigating and ask questions like those ...
>
Also, instead of running "ls -l /", which will stat() every entry under
root, try "/bin/ls -f /" and see whether that succeeds. That will only do a
readdir() on root itself. It might also be interesting to capture a log of
"strace ls -l /" to confirm exactly where the hang happens.
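
If it helps, the readdir() versus stat() distinction above can be sketched in a few lines of Python. This is only an approximation of what ls does (ls -l makes other calls as well, for example for symlinks and extended attributes), and the function name probe_directory is just something I made up for illustration. It reads the directory first, then calls lstat() on each entry one at a time, printing the name before touching it, so that if a call hangs, the last line printed should point at the entry responsible.

```python
import os
import time


def probe_directory(path="/"):
    """Read the directory first, then lstat() each entry one at a time."""
    # Step 1: directory read only, roughly what "ls -f" does.
    # os.listdir() does not stat the individual entries.
    names = os.listdir(path)
    print(f"readdir OK: {len(names)} entries in {path}", flush=True)

    timings = {}
    for name in names:
        full = os.path.join(path, name)
        # Announce the entry before touching it, so a hang leaves the
        # suspect entry as the last line printed.
        print(f"lstat {full} ...", end=" ", flush=True)
        start = time.monotonic()
        try:
            os.lstat(full)
            status = "ok"
        except OSError as exc:
            status = f"error: {exc.strerror}"
        elapsed = time.monotonic() - start
        timings[name] = elapsed
        print(f"{status} ({elapsed:.3f}s)", flush=True)
    return timings
```

Calling probe_directory("/") as root should show whether the directory read itself completes and, if not everything does, which entry the per-entry lstat() stalls on. Note that a process stuck in uninterruptible sleep may not respond to Ctrl-C, so run it from a terminal you can afford to lose.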
-Loren