On Wed, 07 Feb 2024 at 12:42:01 +0100, Sébastien Noel wrote:
A fix could be as simple as
--- a/game_data_packager/build.py
+++ b/game_data_packager/build.py
@@ -903,7 +903,7 @@ class PackagingTask:
if provider is None:
try_to_unpack: Collection[str] = self.game.files
should_provide = set()
- distinctive_dirs = False
+ distinctive_dirs = True
else:
try_to_unpack = set(f.name for f in
provider.provides_files)
should_provide = set(try_to_unpack)
but i'm not familiar enough with that part of the code to have a full
view of the implications of that change...
I don't think that's the correct fix, but I've been able to identify a
different change that I find easier to justify, so I pushed that
instead.
Context:
- the file provided on the CLI is "unknown/untrusted" by g-d-p and so
there isn't any "provider" for the stream and for that case,
'distinctive_dirs' was set to False (what's the reasoning for that ?)
The origin of this seems to be commit ef4dc6c7 "zipfile: match on
basename for unknown & quircky archives" [sic]. Originally, it was only
for zip files, where the equivalent of this bug couldn't actually cause
a crash, because zip files are seekable (although it might have led to
some inefficiency, unpacking the same file more than once). Later, I
refactored the same function into consider_stream(), in commit f7915ca5
"Unify code to stream members from a TarFile or ZipFile". After that
point, it *can* cause a crash when unpacking a non-seekable archive,
like a tar file or (as in this case) the tar file embedded in a .deb.
What distinctive_dirs means is: if it's true (which is the default), we
will assume that a file named "foo/bar/abc.bin" can only be found in a
directory named "foo/bar", and if we see an "abc.bin" somewhere else,
it's a different file. This is usually what we want: we know the layout
that the directory tree will have, and there's no point in unpacking a
file named something like directx/setup.exe in the hope that it might
secretly be mygame/setup.exe.
However, if the user of g-d-p has passed an unknown archive file on the
command-line, after ef4dc6c7 we will consider any "abc.bin" in any
directory to be potentially the one we are looking for.
I don't know why this was done: the commit message doesn't say, and the
change isn't part of a merge request. But I can guess: probably
Alexandre had (or was thinking about) an unofficial archive containing
files in a layout that was not the one we expect, for example:
quake2-pak0.zip
\- pak0.pak (same content we expect to find in baseq2/pak0.pak)
and in that scenario, we will not successfully match pak0.pak to
anything unless we are willing to disregard the path.
- quake2 have multiple (different) gamefiles that have the same
basename()
Not just that, but they are the same size, for example:
31329 14d330fe2a9af05a6254b2c2bda9f384
baseq2/players/crakhor/a_grenades.md2
...
31329 8b26b6a4863b7e2c30b4dcd7867a6d10
baseq2/players/cyborg/a_grenades.md2
This means that g-d-p cannot tell which one it's looking at by just
looking at the metadata available in the tar file: to disambiguate, it
needs to know either the path (which we have told it to ignore), or the
content (which requires actually unpacking).
When I run
/path/to/run-gdp-uninstalled \
-d. --no-search --no-compress --debug \
quake2 input/quake2-full-data_85_all.deb
with this temporary change
--- a/game_data_packager/build.py
+++ b/game_data_packager/build.py
@@ -945,6 +945,12 @@ class PackagingTask:
# proceed to next entry
continue
+ logger.debug(
+ 'Unpacking %s from %s because it might be %s',
+ entry.name,
+ name,
+ wanted.name,
+ )
should_provide.discard(filename)
if filename in self.found:
it becomes obvious what the problem is:
DEBUG:game_data_packager.build:Unpacking
./usr/share/games/quake2/baseq2/players/crakhor/a_grenades.md2 from
ref/quake2-full-data_85_all.deb because it might be
baseq2/players/crakhor/a_grenades.md2
extracting ./usr/share/games/quake2/baseq2/players/crakhor/a_grenades.md2 from
ref/quake2-full-data_85_all.deb
DEBUG:game_data_packager.build:found baseq2/players/crakhor/a_grenades.md2 at
/tmp/gdptmp.sun12x3u/tmp/baseq2/players/crakhor/a_grenades.md2
DEBUG:game_data_packager.build:... matches baseq2/players/crakhor/a_grenades.md2
DEBUG:game_data_packager.build:Unpacking
./usr/share/games/quake2/baseq2/players/crakhor/a_grenades.md2 from
ref/quake2-full-data_85_all.deb because it might be
baseq2/players/cyborg/a_grenades.md2
... we are unpacking the same member of the archive more than once,
which at best is inefficient, and at worst fails because the archive
isn't seekable.
Instead, we should unpack it once, then loop through all the files that
it might possibly be, and check whether it does in fact match any of
them. That's what I've pushed.
While fixing this, I noticed that if we have more than one file with
identical content (quake2 often does this for weapon.md2 and
shotgun.md2), we would inefficiently unpack all of them, even though the
first one we unpacked can be used as the data source for all of the
duplicate copies. Now we only unpack the first, and skip the others.
smcv