Hi Frank, On Tue, 2024-10-15 at 18:12 -0400, Frank Ch. Eigler wrote: > > Is this 4-byte hash enough to prevent accidental collisions? > > Yeah, it only needs to be uniqueish among source files of the same > binary, and with the same #-escaped tail string.
Aha, it is "under" an unique build-id already. > > > This is a transparent change to clients, who are not to make any > > > assumptions about cache file naming structure. > > > > How does it interact with existing source file cache files? > > Existing files will be ignored. All files will be periodically > garbage collected. OK, so for source files an update will kind of invalidate the cache. That is no major disaster, just a minor inconvenience. > > [...] > > Are you sure you want to use elf_hash? It returns an (arch dependent) > > unsigned long int but the actual implementation only produces an > > unsigned int. To simplify things and not need extra build dependencies > > you might want to just include a hand-coded string hash function like > > djb2? Which is really just 4 lines of code and probably a better hash > > for strings than elf_hash. > > Good idea, done. Nice. It looks even simpler than I thought. > > [...] > > What do these "Reduce" comments mean? > > So we hope to get at least the original input path length plus 10 (8 > > char hash plus dash plus zero), but we are fine with less. > > New comments elaborate on this. > > > > [...] > > OK, so this always adds (or overwrites) a 8 char hash plus dash '-' in > > front. Have you considered to only add it if the src doesn't fit the > > dest_len? > > Yes, new comments explain why. Thanks, these new comments are clear. And also explain why we always want the hash-prefix to guard against "escape collisions". > > [...] > > sizeof(suffix) == PATH_MAX > > What is the relation to NAME_MAX? > > Nothing, just PATH_MAX >> NAME_MAX. Aha. But since we now never use more than NAME_MAX and suffix is stack allocated would it be good to make it NAME_MAX lenght? > patch v2: > > From 4023ad1a81f31ae404c2959bad752d05ad2bb3b9 Mon Sep 17 00:00:00 2001 > From: "Frank Ch. Eigler" <f...@redhat.com> > Date: Thu, 10 Oct 2024 16:30:19 -0400 > Subject: [PATCH] PR32218: debuginfod-client: support very long source file > names > > debuginfod clients & servers may sometimes encounter very long source > file names. Previously, the client would synthesize a path name like > $CACHEDIR/$BUILDID/source-$PATHNAME > where $PATHNAME was a funky ##-encoded version of the entire source > path name. See https://sourceware.org/PR32218 for a horror case. > This can get too long to store as a single component of a file system > pathname (e.g. linux/limits.h NAME_MAX), resulting on client-side > errors even after a successful download. > > New code switches encoding of the $PATHNAME part to use less escaping, > and a merciless truncation to the tail part of the filename. (We keep > the tail rather than the head, so that the extension is preserved, > which makes some consumers happier.) To limit collision damage from > truncation, we add also insert a goofy hash (4-byte DJBX33A) of the > name into the path name. The result is a relatively short name: > > $CACHEDIR/$BUILDID/source-$HASH-$NAMETAIL > > This is a transparent change to clients, who are not to make any > assumptions about cache file naming structure. However, one existing > test did make such assumptions, so is fixed with some globness. A new > test is also added, using a pre-baked tarball with a very long srcfile > name. Looks good with or without the suggested s/PATH_MAX/NAME_MAX/ tweak suggested above. Thanks, Mark