Hi Simon,

On Mon, Apr 14, 2025 at 05:23:01PM +0100, Simon McVittie wrote:
> Loaders are expected to be able to recognise that a particular driver is not
> for them, and gracefully not load it. In practice this works fine, because all
> of our architectures can be distinguished by their ELF headers (and if that
> wasn't the case, multiarch co-installation of ordinary shared libraries would
> go badly wrong).

I'm sorry to disappoint you, but reality is not like that.

You can actually run kfreebsd-amd64 binaries on a Linux kernel, as their
ELF headers look the same. Not that they do anything useful, but they may
get far enough to reset your system clock. I've actually encountered
that.

Then, if you combine armel and armhf, those architectures also have ELF
headers that are mostly indistinguishable. I'm not sure what happens
exactly, but it isn't good.

What also gets interesting is when you try to combine e.g. amd64 and
musl-linux-amd64. Those cannot be told apart by their ELF headers either.

The elf-arch tool from arch-test attempts to map ELF headers to Debian
architectures, but it can only do so much.

So no, as long as we support armel and armhf simultaneously, we cannot
tell architectures apart by their ELF header.
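To make this concrete, here is a minimal Python sketch that extracts the ELF file-header fields a loader could compare against its own ABI: EI_CLASS, EI_DATA, EI_OSABI, e_machine and e_flags. The helper names are hypothetical and are not taken from elf-arch or any real loader. Two headers built with the same values, as reported above for amd64 and musl-linux-amd64, produce identical keys, because none of these fields records the libc. x32, by contrast, differs from amd64 in EI_CLASS even though both use EM_X86_64.

```python
import struct
from typing import NamedTuple

# Values from the System V ELF specification.
ELFMAG = b"\x7fELF"
ELFCLASS32, ELFCLASS64 = 1, 2
ELFDATA2LSB = 1
ET_DYN = 3
EM_X86_64 = 62


class ElfKey(NamedTuple):
    """File-header fields a loader could compare against its own ABI."""
    elf_class: int  # EI_CLASS: 32- or 64-bit
    data: int       # EI_DATA: byte order
    osabi: int      # EI_OSABI: commonly 0 (System V)
    machine: int    # e_machine: instruction set
    flags: int      # e_flags: processor-specific bits


def elf_key(header: bytes) -> ElfKey:
    """Extract the comparison key from the start of an ELF file."""
    if header[:4] != ELFMAG:
        raise ValueError("not an ELF file")
    elf_class, data, osabi = header[4], header[5], header[7]
    endian = "<" if data == ELFDATA2LSB else ">"
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    # e_flags follows e_entry, e_phoff and e_shoff, whose size depends on the class.
    flags_offset = 48 if elf_class == ELFCLASS64 else 36
    (flags,) = struct.unpack_from(endian + "I", header, flags_offset)
    return ElfKey(elf_class, data, osabi, machine, flags)


def fake_header(elf_class: int, machine: int, flags: int = 0, osabi: int = 0) -> bytes:
    """Build a minimal little-endian ELF file header (hypothetical test data)."""
    buf = bytearray(64 if elf_class == ELFCLASS64 else 52)
    buf[0:4] = ELFMAG
    buf[4:8] = bytes([elf_class, ELFDATA2LSB, 1, osabi])  # class, data, version, OS/ABI
    struct.pack_into("<HHI", buf, 16, ET_DYN, machine, 1)  # e_type, e_machine, e_version
    struct.pack_into("<I", buf, 48 if elf_class == ELFCLASS64 else 36, flags)
    return bytes(buf)


# Hypothetical shared objects whose file headers carry the same values:
# no field here identifies the libc, so the keys compare equal.
glibc_like = elf_key(fake_header(ELFCLASS64, EM_X86_64))
musl_like = elf_key(fake_header(ELFCLASS64, EM_X86_64))
# x32 shares e_machine with amd64 but is ELFCLASS32.
x32_like = elf_key(fake_header(ELFCLASS32, EM_X86_64))
```

The headers come from the hypothetical fake_header() builder rather than from real binaries. On a real system one would read the first 64 bytes of the file instead.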

> The problem here is that Mesa's upstream build system is trying to disambiguate
> the manifests' filenames in order to avoid collisions, but is doing so with an
> architecture name that is not sufficiently unique: namely Meson's cpu(), which
> does not vary between architectures that run on essentially the same hardware
> and differ only by ABI design choices, like amd64/x32 (word size) and
> armel/armhf (whether to assume and use hardware floating point support).

I concur.

> Meson's cpu() also does not distinguish between ABIs that have the same
> instruction set but different endianness, so I believe we would have a
> similar collision between ppc64 and ppc64el, or between mips and mipsel.

And libc (glibc vs. musl)!
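Here is a small Python sketch of the collision described above. If the manifest name is derived from a per-CPU string that each of these pairs shares, two architectures claim the same path. Using the Debian architecture name keeps them apart. The cpu() strings in MESON_CPU are illustrative placeholders, not values read from Meson, and "lvp" is only an example driver name.

```python
from collections import defaultdict

# Illustrative Debian architecture -> cpu() string mapping. Only the fact that
# each pair shares one string comes from the discussion; the strings are made up.
MESON_CPU = {
    "amd64": "x86_64", "x32": "x86_64",
    "armel": "arm", "armhf": "arm",
    "ppc64": "ppc64", "ppc64el": "ppc64",
    "mips": "mips", "mipsel": "mips",
}


def manifest_name(driver: str, tag: str) -> str:
    """Build an ICD manifest filename of the form <driver>_icd.<tag>.json."""
    return f"{driver}_icd.{tag}.json"


def collisions(names_by_arch: dict) -> dict:
    """Return every filename claimed by more than one architecture."""
    owners = defaultdict(list)
    for arch, name in names_by_arch.items():
        owners[name].append(arch)
    return {name: sorted(archs) for name, archs in owners.items() if len(archs) > 1}


by_cpu = {arch: manifest_name("lvp", cpu) for arch, cpu in MESON_CPU.items()}
by_deb_arch = {arch: manifest_name("lvp", arch) for arch in MESON_CPU}
```

collisions(by_cpu) reports one shared filename per pair, while collisions(by_deb_arch) is empty.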

> I can see two ways to resolve #980148 without needing to change the
> search path for Vulkan drivers:
> 
> 1. As far as I'm aware, the basename of these files never matters: all
>     that matters is their content. So Mesa's debian/rules could do something
>     like this (assuming file-rename(1p) from the rename package):
> 
>         file-rename 's/(.*)\.([^.]+?)\.json$/$1.$ENV{DEB_HOST_ARCH}.json/' \
>         debian/tmp/usr/share/vulkan/icd.d/*.json
> 
>     to replace the "x86_64" or "armv8l" part of the filename with a string
>     that is definitely distinct for each pair of Debian architectures,
>     resulting in filenames like intel_icd.amd64.json and intel_icd.x32.json.
> 
>     Or it could use $ENV{DEB_HOST_MULTIARCH} for longer-but-maybe-clearer
>     filenames like intel_icd.x86_64-linux-gnu.json, which would be necessary
>     if we want to allow mesa-vulkan-drivers:amd64,
>     mesa-vulkan-drivers:hurd-amd64 and mesa-vulkan-drivers:kfreebsd-amd64
>     to be co-installed.
> 
>     Mesa upstream probably will not want to do this because they don't have
>     a better taxonomy of architectures than what Meson provides, but it
>     would be fairly easy to do in debian/rules between dh_auto_install and
>     dh_install, for example with the file-rename(1p) invocation above.
> 
>     Or, maybe Mesa upstream would be willing to accept a patch adding a
>     build option like 'architecture_string', to be used in these filenames
>     instead of Meson's cpu() if non-empty, and we could run
>     `meson setup -Darchitecture_string="${DEB_HOST_ARCH}"`?
>     But that seems like something that would be better done post-trixie.

This sounds very reasonable to me. Including the post-trixie part.
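For illustration, here is the same renaming as a Python sketch. It reuses the regular expression from the file-rename(1p) invocation quoted above. The function names are hypothetical, and the caller supplies the architecture string (DEB_HOST_ARCH or DEB_HOST_MULTIARCH). Names without an architecture component, such as intel_icd.json, are left alone.

```python
import re
from pathlib import Path

# Same pattern as the quoted file-rename expression: the last dot-separated
# component before ".json" is taken to be the architecture tag.
ARCH_COMPONENT = re.compile(r"(.*)\.([^.]+?)\.json$")


def debianised_name(name: str, arch: str) -> str:
    """Replace the architecture component of a manifest name with arch."""
    # A function replacement avoids backslash-escape surprises in arch.
    return ARCH_COMPONENT.sub(lambda m: f"{m.group(1)}.{arch}.json", name)


def rename_manifests(icd_dir: Path, arch: str) -> None:
    """Rename every *.json manifest in icd_dir in place."""
    for path in sorted(icd_dir.glob("*.json")):
        new_name = debianised_name(path.name, arch)
        if new_name != path.name:
            path.rename(path.with_name(new_name))
```

Pointed at debian/tmp/usr/share/vulkan/icd.d after dh_auto_install, rename_manifests() would turn intel_icd.x86_64.json into intel_icd.amd64.json or intel_icd.x32.json, depending on the architecture string passed in.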

> 2. Or, Mesa could give its Vulkan drivers the same file layout as its
>     Vulkan layers (which happens to be the same as the Nvidia proprietary
>     driver's Vulkan driver), taking advantage of the fact that on Debian, each
>     of its drivers is installed into ld.so's default load path for shared
>     libraries. So instead of hard-coding the full path of the library, it could
>     set the library_path field to be just the basename, resulting in the same
>     JSON content on every architecture:
> 
>         {
>              "ICD": {
>                  "api_version": "1.2.145",
>                  "library_path": "libvulkan_intel.so"
>              },
>              "file_format_version": "1.0.0"
>         }
> 
>     and then rename the file to a name that is intentionally the same
>     for every architecture (like intel_icd.json), so that they *always*
>     collide, and dpkg's multiarch refcounting resolves this by only keeping
>     one copy.
> 
>     Mesa upstream probably will not want to do this by default because they
>     have to assume that their users might be installing Mesa into a non-default
>     prefix like /opt/mesa25 where their driver library would not be found
>     without using an absolute or relative path, but it would be reasonably easy
>     to implement this in debian/rules with some file-rename and sed, again
>     between dh_auto_install and dh_install.
> 
>     Or maybe Mesa upstream would accept a build option to make it switch
>     the generated JSON to be this way instead, although, again, that seems
>     like something for post-trixie.
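Here is a minimal Python sketch of the JSON side of the second option. The function name is hypothetical. It reduces library_path to its basename, which the Vulkan loader then resolves through the dynamic linker's default search path. It also serialises with a fixed key order and indentation, so every architecture produces byte-identical content. That matters because dpkg only allows a file in a Multi-Arch: same package to be shared between architectures if its contents are identical. The multiarch paths in the test data are hypothetical examples.

```python
import json
from pathlib import PurePosixPath


def arch_independent_manifest(text: str) -> str:
    """Rewrite an ICD manifest so that library_path is a bare filename."""
    manifest = json.loads(text)
    icd = manifest["ICD"]
    # Keep only the basename, e.g. "libvulkan_intel.so", so that no
    # architecture-specific directory remains in the JSON.
    icd["library_path"] = PurePosixPath(icd["library_path"]).name
    # Fixed key order and indentation: identical input data gives identical bytes.
    return json.dumps(manifest, indent=4, sort_keys=True) + "\n"
```

With this normalisation, manifests generated on amd64 and i386 differ only in the dropped directory, so the rewritten files compare equal byte for byte.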

Given what I said earlier about the inability to tell ELF headers apart
and the real problems observed in trying to do so, I have a preference
for the first option.

> I don't think that would be a great idea. There is a non-Debian-specific
> specification for how Vulkan drivers and layers are to be discovered, and
> components outside Debian can and do rely on it (in particular Steam's
> container runtime framework relies on knowing how to find Vulkan drivers,
> independent of libvulkan).

Fair enough.

> If this was done, it's the sort of coordinated transition (mesa + vulkan-loader
> + possibly others) that we shouldn't be doing for trixie at this stage unless
> there's no alternative. So I would recommend choosing one of the two strategies
> I suggested above, or some similar option that doesn't involve a transition.

In general, I doubt we will fix this for trixie, other than perhaps by
dropping M-A:same.

Helmut
