m4/getcwd-path-max.m4 hangs with busybox + btrfs
See https://bugs.gentoo.org/447970 for extra details. While that Gentoo report has a workaround posted just a few months ago, it seems that the test itself is faulty, since a failure mode should not be to hang the entire computer. When running this configure test on my particular system, a NAS running busybox and btrfs, it hung for at least 12 hours before I decided to power cycle it.

While hung, the system is mostly unresponsive. Pings work, for instance, but no new processes can start, and the RCU kernel thread spins its core at 100%. The test states that it is needed for old glibc and kernel versions; my NAS is running glibc 2.27 and kernel 5.13, so presumably the test is no longer useful there. Can it be modified to be less destructive?

I mention btrfs because the Gentoo link above highlights that the test is particularly troublesome on that file system. I mention busybox because, when I inspect the leftover directory structure after rebooting, I find that the busybox shell cannot handle the path length, with tab completion eventually failing. In my case, I hit this while configuring m4 version 1.4.19, which uses the latest (serial 25) version of this file.
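For context, the test probes how getcwd() behaves once the working directory's absolute path grows past PATH_MAX, by building an ever-deeper directory chain. The following is only a minimal sketch of that mechanism, not the actual gnulib conftest; the component name, depth cap, and buffer size are invented for illustration:

    /* Sketch of a getcwd-vs-deep-paths probe (illustration only, not
       the real gnulib conftest).  */
    #include <errno.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int
    main (void)
    {
      /* A long component name makes the absolute path grow quickly.  */
      static const char component[] = "confdir-very-long-component-name";
      size_t depth;

      for (depth = 1; depth <= 2000; depth++)
        {
          char buf[8 * 1024];   /* oversized probe buffer */

          if (mkdir (component, 0700) != 0 || chdir (component) != 0)
            return 1;

          /* The real test inspects getcwd's result and errno once the
             path exceeds PATH_MAX; old glibc/kernel combinations got
             this wrong, which is what the conftest detects.  */
          if (getcwd (buf, sizeof buf) == NULL)
            {
              printf ("getcwd failed with errno %d at depth %zu\n",
                      errno, depth);
              break;
            }
        }

      /* The real conftest then walks back up and rmdir()s the chain;
         the leftover tree I found after rebooting suggests the process
         died before reaching that cleanup.  */
      return 0;
    }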
Re: m4/getcwd-path-max.m4 hangs with busybox + btrfs
On Sun, Mar 31, 2024, 18:44 Bruno Haible wrote:

> NightStrike wrote:
> > See https://bugs.gentoo.org/447970 for extra details. While this
> > gentoo report has a workaround posted just a few months ago, it seems
> > that the test itself is faulty, as a failure mode should not be to
> > hang the entire computer.
>
> No. What I understand from the above ticket, is that the problem is
> in the "libsandbox".
>
> Quote: "Also noticable on openvz guests, where getcwd test takes
> about minute and without sandbox its below 3s."
>
> Quote: "The underlying problem appears to be that libsandbox's
> mkdir() scales very poorly."
>
> Quote: "The slowness comes from glibc own getcwd calls. But this
> isn't a problem when running outside sandbox ..."
>
> etc.
>
> Bruno

Later on, there are data points that show that the algorithm is bad even outside the sandbox, it's just less noticeable. And as mentioned, on btrfs, it's crazy. In my setup, there is no sandbox. So sandboxing has nothing to do with the poor scalability of the function.
Re: m4/getcwd-path-max.m4 hangs with busybox + btrfs
On Sun, Mar 31, 2024, 19:02 Bruno Haible wrote:

> NightStrike wrote:
> > In my setup, there is no sandbox.
> > ...
> > Later on, there are data points that show that the algorithm is bad even
> > outside the sandbox, it's just less noticeable. And as mentioned, on btrfs,
> > it's crazy.
>
> If an algorithm is fast on ext4 and slow on btrfs: what is the foundation
> of your logic, that you blame the algorithm? My logic is to blame btrfs
> in this case.
>
> And, frankly, 8 seconds on tmpfs vs. 63 seconds on btrfs, on a machine with
> 12 BogoMIPS (!!) is not _that_ bad. 12 BogoMIPS is more than 300 times
> slower than the CPU of a current desktop machine.
>
> Bruno

I don't quite understand your animosity here. Gnulib is supposed to help with porting to other systems, and I'm highlighting a situation where it doesn't work. Why such antagonism instead of trying to understand why it doesn't work? Surely a configure test that causes the RCU thread to spin for 12 hours and hangs a whole system isn't a great way to achieve portability. Are you responsible for this particular script, or is there someone else who can help with a lot less hostility?
Re: m4/getcwd-path-max.m4 hangs with busybox + btrfs
On Mon, Apr 1, 2024 at 12:19 AM Paul Eggert wrote:

> On 2024-03-31 18:07, NightStrike wrote:
> > I don't quite understand your animosity here.
>
> I don't see any animosity in Bruno's comments. Clearly the system you're
> talking about has a severe performance bug, and the question is whether
> it's worth our limited time to port to buggy systems like that. Since we
> don't have easy access to these systems, you can't expect us to fix the
> problems ourselves - you'll need to pitch in if you want a workaround
> installed.
>
> That being said, does the attached patch (which I have neither tested nor
> installed) fix the problem for you? If not, perhaps you can adjust the
> patch so that it does work.

Thanks, Paul! I'll try it as soon as I get the machine back online. We've been trying various ideas in #gnu for the past couple of hours, and it's currently hung. I have to wait for a RAID verify after every attempt :(

I ran the conftest directly instead of through configure. Notably, when I filled it with fprintf(stderr, ...) calls around every syscall, it worked correctly and finished after 1366 iterations. When I ran it unmodified, it segfaulted and left the system in the aforementioned hung state. This leads me to believe the hang is timing-dependent: presumably the printfs slowed the loop down enough that RCU didn't get stuck.

For your patch, it'll be interesting to see whether a SIGALRM gets through (see the sketch at the end of this message), because currently no signals get through: the SIGINT from Ctrl-C doesn't, at least, and I can't run kill from another shell because all disk access is blocked at that point, so the kill program never launches.

I'm also curious to try your suggestion from the last time this came up: https://lists.gnu.org/r/bug-tar/2019-10/msg3.html (with additional info here: https://www.cs.rug.nl/~jurjen/ApprenticesNotes/tstcg_tar.html), but I haven't yet compared that file to the current one to see whether you changed it or just extracted it unmodified.

I don't have strace on this system, so unfortunately I can't directly apply your debugging method from that thread. This is why I tried the printfs, but then the problem Heisenbug'd away. I can try building strace if it doesn't have too many dependencies. Maybe I should have done that first :)
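For reference, here is roughly the shape I'd expect a SIGALRM guard to take. This is my own minimal sketch, since I haven't yet opened the attached patch; the 10-second limit and handler name are invented:

    /* Sketch of a SIGALRM watchdog around the probe loop (my guess at
       the shape of the fix, not Paul's actual patch).  */
    #include <signal.h>
    #include <unistd.h>

    static void
    time_out (int sig)
    {
      (void) sig;
      /* _exit is async-signal-safe; exiting with a failure status lets
         configure fall back to a safe default instead of hanging.  */
      _exit (1);
    }

    int
    main (void)
    {
      signal (SIGALRM, time_out);
      alarm (10);               /* invented timeout value */

      /* ... mkdir/chdir/getcwd probe loop would run here ... */

      alarm (0);
      return 0;
    }

The caveat is the one above: if the conftest is stuck in an uninterruptible disk sleep, the kernel won't deliver SIGALRM until the blocked syscall returns, which would match the fact that Ctrl-C doesn't get through either.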