On Sun, May 15, 2022 at 12:20:57AM +0200, Johannes Schauer Marin Rodrigues 
wrote:
> I now have a patch (attached).

Thanks for your patch!

I found a few problems with it.  Don't worry about sending a new patch
to address these; I have fixes in my local tree for most things I've
commented on here and am working on the rest.  I'm just letting you know
where things stand.

First, we should add the appropriate Gnulib modules for upstream
portability.  Then:

> @@ -356,8 +356,8 @@ static void add_dir_entries (MYDBM_FILE
>        * or . files (such as current, parent dir).
>        */
>  
> -     dir = opendir (infile);
> -     if (!dir) {
> +     n = scandir(infile, &namelist, NULL, alphasort);

IMO we might as well move the filtering currently being done in the
while loop below to a scandir filter function.

I would prefer not to use alphasort here, because it's locale-dependent
(not that it will matter very much in practice with the sorts of file
names that typically appear in manual page directories, but I can
imagine edge cases).  A variant that's deliberately locale-independent
is only a few lines of code.

> @@ -367,13 +367,13 @@ static void add_dir_entries (MYDBM_FILE
>  
>          /* strlen(newdir->d_name) could be replaced by newdir->d_reclen */
>  
> -     while ((newdir = readdir (dir)) != NULL) {
> -             if (*newdir->d_name == '.' &&
> -                 strlen (newdir->d_name) < (size_t) 3)
> +     while (n--) {

I guess this might have been borrowed from the example in the scandir(3)
manual page, because it goes in reverse order; I don't think there's a
good reason to do that.

> +     free(namelist);

This leaks memory.  We need to free all the elements of this list as
well.

>       order_files (infile, &names);

This means that the conversion to scandir in this function is in
practice going to be ineffective, because we immediately turn around and
re-sort the list by the physical locations of the first blocks of the
corresponding files.  Won't this have just as much of an effect on
reproducibility in principle, even if it doesn't happen to affect
mmdebstrap in your tests, presumably due to something like disk order
typically being similar between runs if your disk isn't too full?

I'm experimenting with simply removing the order_files call here, on the
basis that other performance improvements to mandb(8) have made it less
critical.  It does seem to slow things down slightly even on an SSD,
though that may be measurement error; I still need to compare timings on
a rotational drive (but I will).

More interestingly, it changes accessdb(8) output in ways that aren't
just obvious consequences of sorting (multi keys change due to that, but
as far as I can tell that's fine).  So far I've spotted the following
issues:

 * A number of new entries are introduced for some reason (and the
   question is why they were missing beforehand).

 * The targets of WHATIS_MAN references change (admittedly in cases
   where there are multiple possibilities, but we should be picking a
   deterministic one rather than just the first).

 * Symlinks flip between ULT_MAN and SO_MAN depending on whether the
   symlink happens to sort before its target.

 * The database's idea of whether pages require processing via tbl(1)
   (and probably other preprocessors) changes in some cases.  (Fixed in
   https://gitlab.com/cjwatson/man-db/-/commit/1873051fdb.)

I think these are mostly existing bugs, but I intend to fix them first
before attempting to land your patch, because this is all delicate
enough that I really want to be sure of exactly what's changing.  (Since
these are all order-dependent bugs, it may even be that fixing all these
will make parts of your patch unnecessary; but we'll see.)

I will keep working on this, and expect to be able to get a new release
out in a week or two.

-- 
Colin Watson (he/him)                              [cjwat...@debian.org]

Reply via email to