Hi Frank,

On Tue, 2024-10-15 at 18:12 -0400, Frank Ch. Eigler wrote:
> > Is this 4-byte hash enough to prevent accidental collisions?
> 
> Yeah, it only needs to be uniqueish among source files of the same
> binary, and with the same #-escaped tail string.

Aha, it is "under" an unique build-id already.

> > > This is a transparent change to clients, who are not to make any
> > > assumptions about cache file naming structure.
> > 
> > How does it interact with existing source file cache files?
> 
> Existing files will be ignored.  All files will be periodically
> garbage collected.

OK, so for source files an update will kind of invalidate the cache.
That is no major disaster, just a minor inconvenience.

> > [...]
> > Are you sure you want to use elf_hash? It returns an (arch dependent)
> > unsigned long int but the actual implementation only produces an
> > unsigned int. To simplify things and not need extra build dependencies
> > you might want to just include a hand-coded string hash function like
> > djb2? Which is really just 4 lines of code and probably a better hash
> > for strings than elf_hash.
> 
> Good idea, done.

Nice. It looks even simpler than I thought.

> > [...]
> > What do these "Reduce" comments mean?
> > So we hope to get at least the original input path length plus 10 (8
> > char hash plus dash plus zero), but we are fine with less.
> 
> New comments elaborate on this.
> 
> 
> > [...]
> > OK, so this always adds (or overwrites) a 8 char hash plus dash '-' in
> > front. Have you considered to only add it if the src doesn't fit the
> > dest_len?
> 
> Yes, new comments explain why.

Thanks, these new comments are clear. And also explain why we always
want the hash-prefix to guard against "escape collisions".

> > [...]
> > sizeof(suffix) == PATH_MAX
> > What is the relation to NAME_MAX?
> 
> Nothing, just PATH_MAX >> NAME_MAX.

Aha. But since we now never use more than NAME_MAX and suffix is stack
allocated would it be good to make it NAME_MAX lenght?

> patch v2:
> 
> From 4023ad1a81f31ae404c2959bad752d05ad2bb3b9 Mon Sep 17 00:00:00 2001
> From: "Frank Ch. Eigler" <f...@redhat.com>
> Date: Thu, 10 Oct 2024 16:30:19 -0400
> Subject: [PATCH] PR32218: debuginfod-client: support very long source file
>  names
> 
> debuginfod clients & servers may sometimes encounter very long source
> file names.  Previously, the client would synthesize a path name like
>    $CACHEDIR/$BUILDID/source-$PATHNAME
> where $PATHNAME was a funky ##-encoded version of the entire source
> path name.  See https://sourceware.org/PR32218 for a horror case.
> This can get too long to store as a single component of a file system
> pathname (e.g. linux/limits.h NAME_MAX), resulting on client-side
> errors even after a successful download.
> 
> New code switches encoding of the $PATHNAME part to use less escaping,
> and a merciless truncation to the tail part of the filename.  (We keep
> the tail rather than the head, so that the extension is preserved,
> which makes some consumers happier.)  To limit collision damage from
> truncation, we add also insert a goofy hash (4-byte DJBX33A) of the
> name into the path name.  The result is a relatively short name:
> 
>    $CACHEDIR/$BUILDID/source-$HASH-$NAMETAIL
> 
> This is a transparent change to clients, who are not to make any
> assumptions about cache file naming structure.  However, one existing
> test did make such assumptions, so is fixed with some globness.  A new
> test is also added, using a pre-baked tarball with a very long srcfile
> name.

Looks good with or without the suggested s/PATH_MAX/NAME_MAX/ tweak
suggested above.

Thanks,

Mark

Reply via email to