On Wed, 07 Feb 2024 at 12:42:01 +0100, Sébastien Noel wrote:
A fix could be as simple as

--- a/game_data_packager/build.py
+++ b/game_data_packager/build.py
@@ -903,7 +903,7 @@ class PackagingTask:
        if provider is None:
            try_to_unpack: Collection[str] = self.game.files
            should_provide = set()
-            distinctive_dirs = False
+            distinctive_dirs = True
        else:
            try_to_unpack = set(f.name for f in
provider.provides_files)
            should_provide = set(try_to_unpack)


but i'm not familiar enough with that part of the code to have a full
view of the implications of that change...

I don't think that's the correct fix, but I've been able to identify a different change that I find easier to justify, so I pushed that instead.

Context:
- the file provided on the CLI is "unknown/untrusted" by g-d-p and so
there isn't any "provider" for the stream and for that case,
'distinctive_dirs' was set to False (what's the reasoning for that ?)

The origin of this seems to be commit ef4dc6c7 "zipfile: match on basename for unknown & quircky archives" [sic]. Originally, it was only for zip files, where the equivalent of this bug couldn't actually cause a crash, because zip files are seekable (although it might have led to some inefficiency, unpacking the same file more than once). Later, I refactored the same function into consider_stream(), in commit f7915ca5 "Unify code to stream members from a TarFile or ZipFile". After that point, it *can* cause a crash when unpacking a non-seekable archive, like a tar file or (as in this case) the tar file embedded in a .deb.

What distinctive_dirs means is: if it's true (which is the default), we will assume that a file named "foo/bar/abc.bin" can only be found in a directory named "foo/bar", and if we see an "abc.bin" somewhere else, it's a different file. This is usually what we want: we know the layout that the directory tree will have, and there's no point in unpacking a file named something like directx/setup.exe in the hope that it might secretly be mygame/setup.exe.

However, if the user of g-d-p has passed an unknown archive file on the command-line, after ef4dc6c7 we will consider any "abc.bin" in any directory to be potentially the one we are looking for.

I don't know why this was done: the commit message doesn't say, and the change isn't part of a merge request. But I can guess: probably Alexandre had (or was thinking about) an unofficial archive containing files in a layout that was not the one we expect, for example:

    quake2-pak0.zip
     \- pak0.pak    (same content we expect to find in baseq2/pak0.pak)

and in that scenario, we will not successfully match pak0.pak to anything unless we are willing to disregard the path.

- quake2 have multiple (different) gamefiles that have the same
basename()

Not just that, but they are the same size, for example:

    31329     14d330fe2a9af05a6254b2c2bda9f384 
baseq2/players/crakhor/a_grenades.md2
    ...
    31329     8b26b6a4863b7e2c30b4dcd7867a6d10 
baseq2/players/cyborg/a_grenades.md2

This means that g-d-p cannot tell which one it's looking at by just looking at the metadata available in the tar file: to disambiguate, it needs to know either the path (which we have told it to ignore), or the content (which requires actually unpacking).

When I run

    /path/to/run-gdp-uninstalled \
    -d. --no-search --no-compress --debug \
    quake2 input/quake2-full-data_85_all.deb

with this temporary change

--- a/game_data_packager/build.py
+++ b/game_data_packager/build.py
@@ -945,6 +945,12 @@ class PackagingTask:
                     # proceed to next entry
                     continue

+                logger.debug(
+                    'Unpacking %s from %s because it might be %s',
+                    entry.name,
+                    name,
+                    wanted.name,
+                )
                 should_provide.discard(filename)

                 if filename in self.found:

it becomes obvious what the problem is:

DEBUG:game_data_packager.build:Unpacking 
./usr/share/games/quake2/baseq2/players/crakhor/a_grenades.md2 from 
ref/quake2-full-data_85_all.deb because it might be 
baseq2/players/crakhor/a_grenades.md2
extracting ./usr/share/games/quake2/baseq2/players/crakhor/a_grenades.md2 from 
ref/quake2-full-data_85_all.deb
DEBUG:game_data_packager.build:found baseq2/players/crakhor/a_grenades.md2 at 
/tmp/gdptmp.sun12x3u/tmp/baseq2/players/crakhor/a_grenades.md2
DEBUG:game_data_packager.build:... matches baseq2/players/crakhor/a_grenades.md2
DEBUG:game_data_packager.build:Unpacking 
./usr/share/games/quake2/baseq2/players/crakhor/a_grenades.md2 from 
ref/quake2-full-data_85_all.deb because it might be 
baseq2/players/cyborg/a_grenades.md2

... we are unpacking the same member of the archive more than once, which at best is inefficient, and at worst fails because the archive isn't seekable.

Instead, we should unpack it once, then loop through all the files that it might possibly be, and check whether it does in fact match any of them. That's what I've pushed.

While fixing this, I noticed that if we have more than one file with identical content (quake2 often does this for weapon.md2 and shotgun.md2), we would inefficiently unpack all of them, even though the first one we unpacked can be used as the data source for all of the duplicate copies. Now we only unpack the first, and skip the others.

    smcv

Reply via email to