On Wed, Dec 30, 2020 at 10:38:58PM -0800, Josh Triplett wrote:
> With a large number of path exclusions specified (around 500),
> mmtarfilter starts to become a noticeable performance bottleneck.
> 
> It looks like mmtarfilter checks each file linearly against each filter
> using fnmatch.
> 
> Python's fnmatch implementation works by translating shell patterns into 
> regular
> expressions. Python also provides a function to do that translation
> separate from fnmatch. One fairly simple optimization would be to walk the 
> list of
> patterns *once*, take each series of consecutive exclude or include
> filters, turn each one into a regex, join all the regexes in
> each group together using (?:...)|(?:...) , and compile the resulting
> regexes once. That should provide a substantial performance improvement.

Turns out there's a much simpler explanation with a simpler fix. fnmatch
has a 256-entry LRU cache for the translated regular expressions. Once
there are more than 256 path filters, the cache stops working entirely,
and every shell pattern gets re-translated and re-compiled on every
invocation of fnmatch.

I wrote the attached patch for mmtarfilter to address this. On an
invocation of mmdebstrap with around 500 path filters, this saves more
than a minute.

- Josh Triplett
>From 2218e9d882516c4cdb4a6ba41121355e475da06d Mon Sep 17 00:00:00 2001
Message-Id: <2218e9d882516c4cdb4a6ba41121355e475da06d.1609448641.git.j...@joshtriplett.org>
From: Josh Triplett <j...@joshtriplett.org>
Date: Thu, 31 Dec 2020 12:49:16 -0800
Subject: [PATCH] Optimize mmtarfilter to handle many path exclusions

mmtarfilter uses fnmatch to handle path exclusions and inclusions.
Python's fnmatch handles shell patterns by translating them to regular
expressions, with a 256-entry LRU cache. With more than 256 path
exclusions or inclusions, this LRU cache no longer works, and every
invocation of fnmatch on every file in every package will re-translate
and re-compile a regular expression, resulting in much worse
performance.

Translate all the shell patterns to regular expressions once. For an
mmdebstrap invocation with around 500 path filters, this speeds up
mmdebstrap by more than a minute.
---
 debian/changelog |  7 +++++++
 tarfilter        | 13 +++++++------
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/debian/changelog b/debian/changelog
index 87d1f32..06227cf 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,10 @@
+mmdebstrap (0.7.4-1) UNRELEASED; urgency=medium
+
+  [ Josh Triplett ]
+  * Optimize mmtarfilter to handle many path exclusions (closes: #978742)
+
+ -- Johannes 'josch' Schauer <jo...@debian.org>  Thu, 31 Dec 2020 12:56:02 -0800
+
 mmdebstrap (0.7.3-1) unstable; urgency=medium
 
   * new upstream release
diff --git a/tarfilter b/tarfilter
index 838e4f5..ab76683 100755
--- a/tarfilter
+++ b/tarfilter
@@ -21,14 +21,15 @@
 import tarfile
 import sys
 import argparse
-from fnmatch import fnmatch
+import fnmatch
 import re
 
 
 class FilterAction(argparse.Action):
     def __call__(self, parser, namespace, values, option_string=None):
         items = getattr(namespace, "filter", [])
-        items.append((self.dest, values))
+        regex = re.compile(fnmatch.translate(values))
+        items.append((self.dest, regex))
         setattr(namespace, "filter", items)
 
 
@@ -64,17 +65,17 @@ dpkg(1) for information on how these two options work in detail.
         skip = False
         if not args.filter:
             return False
-        for (t, f) in args.filter:
-            if fnmatch(member.name[1:], f):
+        for (t, r) in args.filter:
+            if r.match(member.name[1:]) is not None:
                 if t == "path_include":
                     skip = False
                 else:
                     skip = True
         if skip and (member.isdir() or member.issym()):
-            for (t, f) in args.filter:
+            for (t, r) in args.filter:
                 if t != "path_include":
                     continue
-                prefix = re.sub(r"^([^*?[\\]*).*", r"\1", f)
+                prefix = re.sub(r"^([^*?[\\]*).*", r"\1", r.pattern)
                 prefix = prefix.rstrip("/")
                 if member.name[1:].startswith(prefix):
                     if member.name == "./usr/share/doc/doc-debian":
-- 
2.30.0

Reply via email to