On Sun, May 15, 2022 at 12:20:57AM +0200, Johannes Schauer Marin Rodrigues wrote: > I now have a patch (attached).
Thanks for your patch! I found a few problems with it. Don't worry about sending a new patch to address these; I have fixes in my local tree for most things I've commented on here and am working on the rest. I'm just letting you know where things stand. First, we should add the appropriate Gnulib modules for upstream portability. Then: > @@ -356,8 +356,8 @@ static void add_dir_entries (MYDBM_FILE > * or . files (such as current, parent dir). > */ > > - dir = opendir (infile); > - if (!dir) { > + n = scandir(infile, &namelist, NULL, alphasort); IMO we might as well move the filtering currently being done in the while loop below to a scandir filter function. I would prefer not to use alphasort here, because it's locale-dependent (not that it will matter very much in practice with the sorts of file names that typically appear in manual page directories, but I can imagine edge cases). A variant that's deliberately locale-independent is only a few lines of code. > @@ -367,13 +367,13 @@ static void add_dir_entries (MYDBM_FILE > > /* strlen(newdir->d_name) could be replaced by newdir->d_reclen */ > > - while ((newdir = readdir (dir)) != NULL) { > - if (*newdir->d_name == '.' && > - strlen (newdir->d_name) < (size_t) 3) > + while (n--) { I guess this might have been borrowed from the example in the scandir(3) manual page, because it goes in reverse order; I don't think there's a good reason to do that. > + free(namelist); This leaks memory. We need to free all the elements of this list as well. > order_files (infile, &names); This means that the conversion to scandir in this function is in practice going to be ineffective, because we immediately turn around and re-sort the list by the physical locations of the first blocks of the corresponding files. Won't this have just as much of an effect on reproducibility in principle, even if it doesn't happen to affect mmdebstrap in your tests, presumably due to something like disk order typically being similar between runs if your disk isn't too full? I'm experimenting with simply removing the order_files call here, on the basis that other performance improvements to mandb(8) have made it less critical. It does seem to slow things down slightly even on an SSD, though that may be measurement error; I still need to compare timings on a rotational drive (but I will). More interestingly, it changes accessdb(8) output in ways that aren't just obvious consequences of sorting (multi keys change due to that, but as far as I can tell that's fine). So far I've spotted the following issues: * A number of new entries are introduced for some reason (and the question is why they were missing beforehand). * The targets of WHATIS_MAN references change (admittedly in cases where there are multiple possibilities, but we should be picking a deterministic one rather than just the first). * Symlinks flip between ULT_MAN and SO_MAN depending on whether the symlink happens to sort before its target. * The database's idea of whether pages require processing via tbl(1) (and probably other preprocessors) changes in some cases. (Fixed in https://gitlab.com/cjwatson/man-db/-/commit/1873051fdb.) I think these are mostly existing bugs, but I intend to fix them first before attempting to land your patch, because this is all delicate enough that I really want to be sure of exactly what's changing. (Since these are all order-dependent bugs, it may even be that fixing all these will make parts of your patch unnecessary; but we'll see.) I will keep working on this, and expect to be able to get a new release out in a week or two. -- Colin Watson (he/him) [cjwat...@debian.org]