m4/getcwd-path-max.m4 hangs with busybox + btrfs

2024-03-31 Thread NightStrike
See https://bugs.gentoo.org/447970 for extra details.  While this
Gentoo report has a workaround posted just a few months ago, it seems
that the test itself is faulty: a failing configure test should not
be able to hang the entire computer.

When I ran this configure test on my particular system, a NAS
running busybox and btrfs, it hung for at least 12 hours before I
decided to power cycle it.  While hung, the system is mostly
unresponsive: pings work, for instance, but no new processes can
start, and the RCU kernel thread spins its core at 100%.  The test
states that it is needed for old glibc and kernel versions; my NAS
runs glibc 2.27 and kernel 5.13, so presumably the test is no longer
useful there.  Can it be modified to be less destructive?

I mention btrfs because the Gentoo report highlights that the test is
particularly troublesome on that file system.

I mention busybox because when I inspect the leftover directory
structure after rebooting, I find that the busybox shell cannot handle
the path length: tab completion eventually fails.

In my case, I hit this while configuring m4 version 1.4.19, which uses
the latest (serial 25) version of this file.
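
For context, the general shape of the test, as I understand it, is
roughly the following.  This is a simplified sketch, not the actual
conftest source; the directory name, depth limit, and buffer size are
illustrative:

    #include <errno.h>
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #ifndef PATH_MAX
    # define PATH_MAX 4096
    #endif

    int
    main (void)
    {
      /* A long component name makes the absolute path grow quickly.  */
      static const char dir_name[] = "confdir-xxxxxxxxxxxxxxxxxxxxxxx";
      char buf[PATH_MAX * 4];
      size_t depth;

      /* Nest directories until the cwd is well past PATH_MAX, probing
         getcwd at each level.  Old glibc/kernel combinations fail here
         even when the buffer is large enough.  */
      for (depth = 1; depth * sizeof dir_name < 2 * PATH_MAX; depth++)
        {
          if (mkdir (dir_name, 0700) != 0 || chdir (dir_name) != 0)
            return 1;
          if (getcwd (buf, sizeof buf) == NULL)
            {
              fprintf (stderr, "getcwd failed at depth %zu: %s\n",
                       depth, strerror (errno));
              break;
            }
        }
      /* The real test also walks back up and removes each directory;
         without that cleanup, a hang leaves the deep tree behind,
         which is what I found after rebooting.  */
      return 0;
    }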



Re: m4/getcwd-path-max.m4 hangs with busybox + btrfs

2024-03-31 Thread NightStrike
On Sun, Mar 31, 2024, 18:44 Bruno Haible wrote:

> NightStrike wrote:
> > See https://bugs.gentoo.org/447970 for extra details.  While this
> > Gentoo report has a workaround posted just a few months ago, it seems
> > that the test itself is faulty: a failing configure test should not
> > be able to hang the entire computer.
>
> No. What I understand from the above ticket is that the problem is
> in the "libsandbox".
>
> Quote: "Also noticable on openvz guests, where getcwd test takes
> about minute and without sandbox its below 3s."
>
> Quote: "The underlying problem appears to be that libsandbox's
> mkdir() scales very poorly."
>
> Quote: "The slowness comes from glibc own getcwd calls. But this
> isn't a problem when running outside sandbox ..."
>
> etc.
>
> Bruno

Later on, there are data points showing that the algorithm is bad even
outside the sandbox; it's just less noticeable. And as mentioned, on
btrfs it's far worse.

In my setup, there is no sandbox. So sandboxing has nothing to do with the
poor scalability of the function.



Re: m4/getcwd-path-max.m4 hangs with busybox + btrfs

2024-03-31 Thread NightStrike
On Sun, Mar 31, 2024, 19:02 Bruno Haible wrote:

> NightStrike wrote:
> > In my setup, there is no sandbox.
> > ...
> > Later on, there are data points showing that the algorithm is bad even
> > outside the sandbox; it's just less noticeable. And as mentioned, on
> > btrfs it's far worse.
>
> If an algorithm is fast on ext4 and slow on btrfs, on what basis do
> you blame the algorithm? My logic is to blame btrfs in this case.
>
> And, frankly, 8 seconds on tmpfs vs. 63 seconds on btrfs, on a machine
> with 12 BogoMIPS (!!), is not _that_ bad. 12 BogoMIPS is more than 300
> times slower than the CPU of a current desktop machine.
>
> Bruno

I don't quite understand your animosity here. Gnulib is supposed to help
with porting to systems, and I'm highlighting a situation where it
doesn't work. Why such antagonism instead of trying to understand why it
doesn't work? Surely a configure test that causes the RCU thread to spin
for 12 hours and hang a whole system isn't a great way to achieve
portability. Are you responsible for this particular script, or is there
someone else who can help with a lot less hostility?



Re: m4/getcwd-path-max.m4 hangs with busybox + btrfs

2024-03-31 Thread NightStrike
On Mon, Apr 1, 2024 at 12:19 AM Paul Eggert wrote:
>
> On 2024-03-31 18:07, NightStrike wrote:
> > I don't quite understand your animosity here.
>
> I don't see any animosity in Bruno's comments. Clearly the system you're
> talking about has a severe performance bug, and the question is whether
> it's worth our limited time to port to buggy systems like that. Since we
> don't have easy access to these systems, you can't expect us to fix the
> problems ourselves - you'll need to pitch in if you want a workaround
> installed.
>
> That being said, does the attached patch (which I have neither tested
> nor installed) fix the problem for you? If not, perhaps you can adjust
> the patch so that it does work.

Thanks Paul!

I'll try it as soon as I get the machine back online.  We've been trying
various ideas in #gnu for the past couple of hours, and it's currently
hung.  I have to wait for a RAID verify after every attempt :(

I ran the conftest directly instead of through configure.  Notably,
when I filled it with fprintf (stderr, ...) calls around every syscall,
it worked correctly and finished after 1366 iterations.  When I ran it
unmodified, it segfaulted and left the system in the aforementioned
hung state.  So this leads me to believe the problem is timing-related:
presumably the printfs slowed the loop down enough that the RCU thread
never got stuck.
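
The instrumentation was nothing fancier than wrapping each call,
roughly like this (illustrative, not the exact lines I added):

    #include <stdio.h>

    /* Trace each syscall in the conftest to stderr; the last line
       printed before the hang shows which call stalls.  The comma
       operator passes the call's return value through unchanged.  */
    #define TRACE_CALL(expr) \
      (fprintf (stderr, "calling %s\n", #expr), (expr))

    /* Used in the conftest loop like:
         if (TRACE_CALL (mkdir (dir_name, 0700)) != 0)
           return 1;  */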

For your patch, it'll be interesting to see whether a SIGALRM gets
through, because currently no signals get through (or at least the
SIGINT from Ctrl-C doesn't, and I can't run kill from another shell
because all disk access is blocked at that point, so I can't even
launch the kill program).
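
For reference, I'd expect the guard in a patch like that to look
something like the following.  This is my own sketch of the idea, not
Paul's actual patch; the timeout and exit status are made-up values:

    #include <signal.h>
    #include <unistd.h>

    /* Sketch of a SIGALRM guard: if the mkdir/getcwd loop wedges, the
       alarm fires and the conftest exits instead of hanging configure.
       _exit is async-signal-safe, unlike exit.  */

    static void
    timeout_handler (int sig)
    {
      (void) sig;
      _exit (77);  /* distinctive status; the value is illustrative */
    }

    static void
    install_timeout (unsigned int seconds)
    {
      signal (SIGALRM, timeout_handler);
      alarm (seconds);
    }

    /* In main (), before the test loop:
         install_timeout (30);
       Whether the signal is even delivered is the open question here:
       a process stuck in uninterruptible disk sleep won't see it.  */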

I'm also curious to try your suggestion from another time this came up:
https://lists.gnu.org/r/bug-tar/2019-10/msg3.html

(with additional info here:
https://www.cs.rug.nl/~jurjen/ApprenticesNotes/tstcg_tar.html)

but I haven't yet compared that file to the current one to see whether
you changed it or just extracted it unmodified.  I don't have strace on
this system, so unfortunately I can't directly apply your debugging
method from that thread.  This is why I tried the printfs, but then it
Heisenbug'd away.  I can try building strace if it doesn't have too
many dependencies.  Maybe I should have done that first :)