On Sun, 5 May 2019 at 06:41, Martin Pieuchot <m...@openbsd.org> wrote:
>
> On 27/04/19(Sat) 21:45, Nathanael Rensen wrote:
> > The diff below speeds up ld.so library loading where the dependency tree
> > is broad and deep, such as samba's smbd which links over 100 libraries.
> >
> > See for example https://marc.info/?l=openbsd-misc&m=155007285712913&w=2
> >
> > The timings below are for ldd /usr/local/sbin/smbd:
> >
> > Timing without diff: 2m02.50s real     2m02.48s user     0m00.02s system
> > Timing with diff   : 0m00.03s real     0m00.02s user     0m00.01s system
> >
> > Note that these timings are for a build of a recent samba master tree
> > (linked with kerberos) which is probably slower than the OpenBSD port.
> >
> > While this diff speeds up loading (e.g. ldd) a second diff (part 2) is
> > also helpful to speed up library initialisation.
>
> As for the other diff, could you explain in words what does your change
> do?

The broad intent is the same as for the other diff - avoid traversing a
node multiple times when it appears in multiple places within the tree.
This situation is more complex than the part 2 diff. More detail in
response to your specific questions below.

> Comments below :)
>
> > Index: libexec/ld.so/dlfcn.c
> > ===================================================================
> > RCS file: /cvs/src/libexec/ld.so/dlfcn.c,v
> > retrieving revision 1.102
> > diff -u -p -p -u -r1.102 dlfcn.c
> > --- libexec/ld.so/dlfcn.c     22 Oct 2018 01:59:08 -0000      1.102
> > +++ libexec/ld.so/dlfcn.c     27 Apr 2019 13:24:02 -0000
> > @@ -93,7 +93,7 @@ dlopen(const char *libname, int flags)
> >               /* if opened but grpsym_list has not been created */
> >               if (OBJECT_DLREF_CNT(object) == 1) {
> >                       /* add first object manually */
> > -                     _dl_link_grpsym(object, 1);
> > +                     _dl_link_grpsym(object);
> >                       _dl_cache_grpsym_list(object);
> >               }
> >               goto loaded;
> > Index: libexec/ld.so/library_subr.c
> > ===================================================================
> > RCS file: /cvs/src/libexec/ld.so/library_subr.c,v
> > retrieving revision 1.48
> > diff -u -p -p -u -r1.48 library_subr.c
> > --- libexec/ld.so/library_subr.c      28 Aug 2017 14:06:22 -0000      1.48
> > +++ libexec/ld.so/library_subr.c      27 Apr 2019 13:24:02 -0000
> > @@ -527,19 +527,13 @@ _dl_link_child(elf_object_t *dep, elf_ob
> >  static unsigned int _dl_grpsym_gen = 0;
> > 
> >  void
> > -_dl_link_grpsym(elf_object_t *object, int checklist)
> > +_dl_link_grpsym(elf_object_t *object)
> >  {
> >       struct dep_node *n;
> > 
> > -     if (checklist) {
> > -             TAILQ_FOREACH(n, &_dl_loading_object->grpsym_list, next_sib)
> > -                     if (n->data == object)
> > -                             return; /* found, dont bother adding */
> > -     } else {
> > -             if (object->grpsym_gen == _dl_grpsym_gen) {
> > +     TAILQ_FOREACH(n, &_dl_loading_object->grpsym_list, next_sib)
> > +             if (n->data == object)
> >                       return; /* found, dont bother adding */
> > -             }
> > -     }
>
> Currently this loop isn't executed when being called from
> _dl_cache_grpsym_list().  Is there any risk of adding a duplicate?

I don't believe so. As far as I can tell the grpsym_gen mechanism works
as intended to avoid the overhead of the dupe-check during recursion.

> Does it buy us much to remove the lookup?

In my testing I found that the cost was negligible. It's a linear search
through a list bounded by the number of distinct child libraries, however
there may be dependency trees where it does make a noticeable difference.

The reason that my diff removes the dupe-check bypass is aesthetic. The
bypass is only used by _dl_cache_grpsym_list(). I chose to prioritise
cleaning up the _dl_link_grpsym() interface and gain obvious avoidance
of duplicates over performance (as my testing didn't find that
performance was adversely affected).

I can see an irony here - I'm proposing a speed-up diff that removes
one of the existing performance improvements. Your point is correct:
If the goal is optimal performance then the dupe-check bypass should
be restored. It can be - more later.

> >       object->grpsym_gen = _dl_grpsym_gen;
> > 
> >       n = _dl_malloc(sizeof *n);
> > @@ -572,6 +566,7 @@ _dl_cache_grpsym_list_setup(elf_object_t
> >  void
> >  _dl_cache_grpsym_list(elf_object_t *object)
> >  {
> > +     int recurse = 0;
> >       struct dep_node *n;
> > 
> >       /*
> > @@ -581,8 +576,12 @@ _dl_cache_grpsym_list(elf_object_t *obje
> >        */
> > 
> >       TAILQ_FOREACH(n, &object->child_list, next_sib)
> > -             _dl_link_grpsym(n->data, 0);
> > +             if (n->data->grpsym_gen != _dl_grpsym_gen) {
> > +                     _dl_link_grpsym(n->data);
> > +                     recurse = 1;
> > +             }
>
> Do I understand correctly that you're not recursing if not object is
> inserted?

Yes.

You have probably also noticed that using a single recurse flag is not
optimally efficient. It is only necessary to recurse child nodes where
  n->data->grpsym_gen != _dl_grpsym_gen
With my approach if any child node satisfies that then all child nodes
are recursed.

An aside: I notice the following comment:

        /*
         * grpsym_list is an ordered list of all child libs of the
         * _dl_loading_object with no dups. The order is equivalent
         * to a breadth-first traversal of the child list without dups.
         */

I also noticed this 13 year old commit comment from revision 1.21 of
library_subr.c:

        Split grpsym_list creation away from child_list creation and change
        grpsym_list order to match Sun's docs. Also corrects bugs where
        grpsym_list was either not created or partially created.

I don't believe that grpsym_list follows a strict breadth-first
traversal, at least not as I understand the meaning of that term.
It would be interesting to know what Sun docs are referred to above
and what they say precisely.

Nevertheless my interntion was to reproduce the existing grpsym_list
order with less overhead.

> Would it makes sense to have _dl_link_grpsym() return a value
> reflecting if it inserted an object or not?

It could be done that way, and in fact that is what I did when I first
started working on this. I later chose to remove the checklist flag
from _dl_link_grpsym() for aesthetic reasons. With my diff the primary
purpose of grpsym_gen is to control the recursion rather than avoid the
dupe-check loop and I felt it was clearer to keep that logic in
_dl_cache_grpsym_list().

The way I have refactored _dl_cache_grpsym_list(), it only calls
_dl_link_grpsym() when the grpsym_gen does not match. With the original
_dl_link_grpsym() code that would always mean an insertion occurs (since
the checklist flag was not set). So providing the grpsym_gen mechanism
is working properly having _dl_link_grpsym() indicate whether it inserted
an object or not would not be interesting. It might be interesting to use
that to assert that an insert did occur as a check on the grpsym_gen
mechanism.

I decided it is better to not involve _dl_link_grpsym() with the
grpsym_gen logic and remove the checklist flag. That way it is clear
that the grpsym_gen logic is particular to _dl_cache_grpsym_list() and
_dl_link_grpsym() has a cleaner interface which guarantees no duplicates.

What do you think about the approach below?

static void
_dl_insert_grpsym(elf_object_t *object)
{
        struct dep_node *n;

        n = _dl_malloc(sizeof *n);
        if (n == NULL)
                _dl_oom();
        n->data = object;
        TAILQ_INSERT_TAIL(&_dl_loading_object->grpsym_list, n, next_sib);
}

void
_dl_link_grpsym(elf_object_t *object)
{
        struct dep_node *n;

        TAILQ_FOREACH(n, &_dl_loading_object->grpsym_list, next_sib)
                if (n->data == object && object->grpsym_gen == _dl_grpsym_gen)
                        return; /* found, dont bother adding */

        _dl_insert_grpsym(object);
        object->grpsym_gen = _dl_grpsym_gen;
}

void
_dl_cache_grpsym_list(elf_object_t *object)
{
        struct dep_node *n;

        /*
         * grpsym_list is an ordered list of all child libs of the
         * _dl_loading_object with no dups. The order is equivalent
         * to a breadth-first traversal of the child list without dups.
         */

        TAILQ_FOREACH(n, &object->child_list, next_sib)
                if (n->data->grpsym_gen != _dl_grpsym_gen)
                        _dl_insert_grpsym(n->data);

        TAILQ_FOREACH(n, &object->child_list, next_sib)
                if (n->data->grpsym_gen != _dl_grpsym_gen) {
                        object->grpsym_gen = _dl_grpsym_gen;
                        _dl_cache_grpsym_list(n->data);
                }
}

That addresses both of the performance issues noted above, avoids the
need for a checklist flag on _dl_link_grpsym() and keeps the grpsym_gen
logic in _dl_cache_grpsym_list(). This is barely tested, but does load
smbd like a rocket and survives a kernel build and ld.so regress tests
on amd64.

That pretty much covers the totality of my thinking. Sorry for the
length - you did ask for words. :)

Nathanael

>
> > 
> > -     TAILQ_FOREACH(n, &object->child_list, next_sib)
> > -             _dl_cache_grpsym_list(n->data);
> > +     if (recurse)
> > +             TAILQ_FOREACH(n, &object->child_list, next_sib)
> > +                     _dl_cache_grpsym_list(n->data);
> >  }
> > Index: libexec/ld.so/loader.c
> > ===================================================================
> > RCS file: /cvs/src/libexec/ld.so/loader.c,v
> > retrieving revision 1.177
> > diff -u -p -p -u -r1.177 loader.c
> > --- libexec/ld.so/loader.c    3 Dec 2018 05:29:56 -0000       1.177
> > +++ libexec/ld.so/loader.c    27 Apr 2019 13:24:02 -0000
> > @@ -360,7 +360,7 @@ _dl_load_dep_libs(elf_object_t *object,
> >       }
> > 
> >       /* add first object manually */
> > -     _dl_link_grpsym(object, 1);
> > +     _dl_link_grpsym(object);
> >       _dl_cache_grpsym_list_setup(object);
> > 
> >       return(0);
> > @@ -543,7 +543,7 @@ _dl_boot(const char **argv, char **envp,
> >       _dl_add_object(dyn_obj);
> > 
> >       dyn_obj->refcount++;
> > -     _dl_link_grpsym(dyn_obj, 1);
> > +     _dl_link_grpsym(dyn_obj);
> > 
> >       dyn_obj->status |= STAT_RELOC_DONE;
> >       _dl_set_sod(dyn_obj->load_name, &dyn_obj->sod);
> > Index: libexec/ld.so/resolve.h
> > ===================================================================
> > RCS file: /cvs/src/libexec/ld.so/resolve.h,v
> > retrieving revision 1.90
> > @@ -271,7 +272,7 @@ int _dl_load_dep_libs(elf_object_t *obje
> >  int _dl_rtld(elf_object_t *object);
> >  void _dl_call_init(elf_object_t *object);
> >  void _dl_link_child(elf_object_t *dep, elf_object_t *p);
> > -void _dl_link_grpsym(elf_object_t *object, int checklist);
> > +void _dl_link_grpsym(elf_object_t *object);
> >  void _dl_cache_grpsym_list(elf_object_t *object);
> >  void _dl_cache_grpsym_list_setup(elf_object_t *object);
> >  void _dl_link_grpref(elf_object_t *load_group, elf_object_t *load_object);
> >

Reply via email to