On Wed, Dec 30, 2020 at 10:38:58PM -0800, Josh Triplett wrote: > With a large number of path exclusions specified (around 500), > mmtarfilter starts to become a noticeable performance bottleneck. > > It looks like mmtarfilter checks each file linearly against each filter > using fnmatch. > > Python's fnmatch implementation works by translating shell patterns into > regular > expressions. Python also provides a function to do that translation > separate from fnmatch. One fairly simple optimization would be to walk the > list of > patterns *once*, take each series of consecutive exclude or include > filters, turn each one into a regex, join all the regexes in > each group together using (?:...)|(?:...) , and compile the resulting > regexes once. That should provide a substantial performance improvement.
Turns out there's a much simpler explanation with a simpler fix. fnmatch has a 256-entry LRU cache for the translated regular expressions. Once there are more than 256 path filters, the cache stops working entirely, and every shell pattern gets re-translated and re-compiled on every invocation of fnmatch. I wrote the attached patch for mmtarfilter to address this. On an invocation of mmdebstrap with around 500 path filters, this saves more than a minute. - Josh Triplett
>From 2218e9d882516c4cdb4a6ba41121355e475da06d Mon Sep 17 00:00:00 2001 Message-Id: <2218e9d882516c4cdb4a6ba41121355e475da06d.1609448641.git.j...@joshtriplett.org> From: Josh Triplett <j...@joshtriplett.org> Date: Thu, 31 Dec 2020 12:49:16 -0800 Subject: [PATCH] Optimize mmtarfilter to handle many path exclusions mmtarfilter uses fnmatch to handle path exclusions and inclusions. Python's fnmatch handles shell patterns by translating them to regular expressions, with a 256-entry LRU cache. With more than 256 path exclusions or inclusions, this LRU cache no longer works, and every invocation of fnmatch on every file in every package will re-translate and re-compile a regular expression, resulting in much worse performance. Translate all the shell patterns to regular expressions once. For an mmdebstrap invocation with around 500 path filters, this speeds up mmdebstrap by more than a minute. --- debian/changelog | 7 +++++++ tarfilter | 13 +++++++------ 2 files changed, 14 insertions(+), 6 deletions(-) diff --git a/debian/changelog b/debian/changelog index 87d1f32..06227cf 100644 --- a/debian/changelog +++ b/debian/changelog @@ -1,3 +1,10 @@ +mmdebstrap (0.7.4-1) UNRELEASED; urgency=medium + + [ Josh Triplett ] + * Optimize mmtarfilter to handle many path exclusions (closes: #978742) + + -- Johannes 'josch' Schauer <jo...@debian.org> Thu, 31 Dec 2020 12:56:02 -0800 + mmdebstrap (0.7.3-1) unstable; urgency=medium * new upstream release diff --git a/tarfilter b/tarfilter index 838e4f5..ab76683 100755 --- a/tarfilter +++ b/tarfilter @@ -21,14 +21,15 @@ import tarfile import sys import argparse -from fnmatch import fnmatch +import fnmatch import re class FilterAction(argparse.Action): def __call__(self, parser, namespace, values, option_string=None): items = getattr(namespace, "filter", []) - items.append((self.dest, values)) + regex = re.compile(fnmatch.translate(values)) + items.append((self.dest, regex)) setattr(namespace, "filter", items) @@ -64,17 +65,17 @@ dpkg(1) for information on how these two options work in detail. skip = False if not args.filter: return False - for (t, f) in args.filter: - if fnmatch(member.name[1:], f): + for (t, r) in args.filter: + if r.match(member.name[1:]) is not None: if t == "path_include": skip = False else: skip = True if skip and (member.isdir() or member.issym()): - for (t, f) in args.filter: + for (t, r) in args.filter: if t != "path_include": continue - prefix = re.sub(r"^([^*?[\\]*).*", r"\1", f) + prefix = re.sub(r"^([^*?[\\]*).*", r"\1", r.pattern) prefix = prefix.rstrip("/") if member.name[1:].startswith(prefix): if member.name == "./usr/share/doc/doc-debian": -- 2.30.0