https://sourceware.org/bugzilla/show_bug.cgi?id=33530

            Bug ID: 33530
           Summary: ld --gc-sections is quadratic-slow on input with huge
                    list of sections (20K)
           Product: binutils
           Version: 2.45
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: ld
          Assignee: unassigned at sourceware dot org
          Reporter: slyich at gmail dot com
  Target Milestone: ---

It's reproducible on the x86_64-linux target, but I expect the problem to be
more generic.

I'll start with the reproducer:

$ printf "int f_0() __attribute__ ((section (\".text.0\"))); int f_0() { return 0; };\n" > main.c
$ for (( i=1; i<20000; i++ )); do
>   printf "int f_$i() __attribute__ ((section (\".text.$i\"))); int f_$i() { return f_$((i-1))(); };\n"
> done >> main.c
$ printf "int main() { return f_19999(); }" >> main.c
$ gcc -O0 -c main.c -o main.o
$ echo "bfd:"; time gcc main.o -o main -fuse-ld=bfd -Wl,--gc-sections
$ echo "gold:"; time gcc main.o -o main -fuse-ld=gold -Wl,--gc-sections

bfd:

real    0m5,175s
user    0m2,586s
sys     0m2,575s
gold:

real    0m0,049s
user    0m0,036s
sys     0m0,014s

Note how much faster `gold` is in this case. I hope `bfd` can be brought on par
with it.

In the example we generate one function per section and build a chain of
references:

$ head -n 5 main.c
int f_0() __attribute__ ((section (".text.0"))); int f_0() { return 0; };
int f_1() __attribute__ ((section (".text.1"))); int f_1() { return f_0(); };
int f_2() __attribute__ ((section (".text.2"))); int f_2() { return f_1(); };
int f_3() __attribute__ ((section (".text.3"))); int f_3() { return f_2(); };
int f_4() __attribute__ ((section (".text.4"))); int f_4() { return f_3(); };

$ tail -n 5 main.c
int f_19996() __attribute__ ((section (".text.19996"))); int f_19996() { return f_19995(); };
int f_19997() __attribute__ ((section (".text.19997"))); int f_19997() { return f_19996(); };
int f_19998() __attribute__ ((section (".text.19998"))); int f_19998() { return f_19997(); };
int f_19999() __attribute__ ((section (".text.19999"))); int f_19999() { return f_19998(); };
int main() { return f_19999(); }

From what I understand staring at the profiles, the trigger here is
`-Wl,--gc-sections`. Without it `bfd` is just as fast.

Here is where I think the time is spent: `_bfd_elf_gc_mark` /
`_bfd_elf_gc_mark_reloc`, which find the unused sections for removal:
https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=bfd/elflink.c;h=91c77c211ef065a77883004eb696adacd92a00be;hb=815d9a14cbbb3b81843f7566222c87fb22e7255d#l14063

```c
bool
_bfd_elf_gc_mark_reloc (struct bfd_link_info *info,
                        asection *sec,
                        elf_gc_mark_hook_fn gc_mark_hook,
                        struct elf_reloc_cookie *cookie)
{
  asection *rsec;
  bool start_stop = false;

  rsec = _bfd_elf_gc_mark_rsec (info, sec, gc_mark_hook, cookie, &start_stop);
  while (rsec != NULL)
    {
      if (!rsec->gc_mark)
        {
          if (bfd_get_flavour (rsec->owner) != bfd_target_elf_flavour
              || (rsec->owner->flags & DYNAMIC) != 0)
            rsec->gc_mark = 1;
          else if (!_bfd_elf_gc_mark (info, rsec, gc_mark_hook))
            return false;
        }
      if (!start_stop)
        break;
      rsec = bfd_get_next_section_by_name (rsec->owner, rsec);
    }
  return true;
}
```

This `while ()` loop looks suspicious. If we do such marking for each of the
sections, the whole pass probably has quadratic complexity.

The performance drop comes from a real-world example: such a huge section
count is typical for binaries built by the Haskell GHC compiler in
`-split-sections` mode:
https://downloads.haskell.org/ghc/9.12.2/docs/users_guide/phases.html#ghc-flag-fsplit-sections

I think it's similar, at least in spirit, to the `-ffunction-sections` option
of `gcc`.

`gold` seems to handle such inputs quite a bit faster, while `bfd` uses a lot
of RAM and gets slower and slower as I increase the number of functions in the
file.

This was noticed in
https://discourse.nixos.org/t/removing-gold-from-nixpkgs/70496 where `nixpkgs`
is considering switching away from the `gold` linker, as `gold` is deprecated
and slated for removal. So far the switch to `bfd` looks like a big link-time
regression: on larger inputs it adds extra minutes of link time to individual
binaries.

I hope that `bfd` can be fixed not to degrade as much on such pathological
inputs.

Thanks!
