Hi apt and devscripts maintainers, Tools parsing deb822-style documents currently disagree about where paragraphs should be split within the document. The result is that wrap-and-sort from devscripts currently eats control files if there are separator lines between paragraphs that contain whitespace, while the rest of our build system accepts whitespace-only lines as separators.
wrap-and-sort uses python-debian's deb822 module to work out what's in debian/control so bugs about wrap-and-sort's behaviour here have ended up with python-debian. I've sought to address this problem and attached is a patch for python-debian that fixes it by explicitly including whitespace in python- debian's test for new lines between paragraphs. The problem with this patch is that python-debian's deb822 parser isn't necessarily used by devscripts... python-debian will use python-apt's apt_pkg.TagFile if it is available and it also has this same behaviour. If I were to apply this patch to python-debian, there would be different behaviour depending on whether python-apt is installed and whether the text being interpreted is a file or a sequence/string: * if python-apt is not installed or the text is not a file or the call to iter_paragraphs includes "use_apt_pkg=False", then the code path in deb822.py is used and the split is done correctly. Using the same test case as in #655988 where the correct answer should be '3': $ python -c 'import deb822; print len([p for p in deb822.Deb822.iter_paragraphs(open("control-wspace"), use_apt_pkg=False)])' 3 $ python -c 'import deb822; print len([p for p in deb822.Deb822.iter_paragraphs(open("control-wspace").readlines())])' 3 * if python-apt is installed and the text is in a file and "use_apt_pkg=False" is not given, then libapt will *not* split the paragraphs as desired. $ python -c 'import deb822; print len([p for p in deb822.Deb822.iter_paragraphs(open("control-wspace"))])' 2 :( So... we could change python-debian with the attached patch and *also* change devscripts/control.py to pass "use_apt_pkg=False" and that would fix wrap-and- sort. However, making those two changes alone would mean that the use of apt_pkg.TagFile is not just a performance boost for iter_paragraphs as advertised in its documentation, it will also have different behaviour. I don't think that's a particularly good idea. It's creating a horrible interface and potentially making for very difficult debugging for other users of deb822.py. python-apt maintainers: do you think it's reasonable to change apt_pkg.TagFile (presumably by changing libapt-pkg) to split paragraphs not only on blank lines but also on whitespace-only lines? For reference, policy §5.1 permits such control files with pretty rubbery language: The paragraphs are separated by empty lines. Parsers may accept lines consisting solely of spaces and tabs as paragraph separators, but control files should use empty lines. I tend to err on the side of the parser being lax and the generator being strict, which makes me think that both deb822.iter_paragraphs and apt_pkg.TagFile should split on these whitespace-only lines. cheers Stuart -- Stuart Prescott http://www.nanonanonano.net/ stu...@nanonanonano.net Debian Developer http://www.debian.org/ stu...@debian.org GPG fingerprint 90E2 D2C1 AD14 6A1B 7EBB 891D BBC1 7EBB 1396 F2F7
From c8bbd5a9776fa03811ef11f73aa598059f64e002 Mon Sep 17 00:00:00 2001 From: Stuart Prescott <stu...@debian.org> Date: Wed, 4 Jun 2014 00:49:01 +1000 Subject: [PATCH] Allow whitespace-only lines to separate paragraphs iter_paragraphs should split paragraphs on whitespace-only lines. Paragraphs should be separated by a blank line but policy permits parsers to permit whitespace only lines too. Closes #715558. --- lib/debian/deb822.py | 4 +++- tests/test_deb822.py | 24 ++++++++++++++++++++++++ 2 files changed, 27 insertions(+), 1 deletion(-) diff --git a/lib/debian/deb822.py b/lib/debian/deb822.py index 98bf9f3..ed933e4 100644 --- a/lib/debian/deb822.py +++ b/lib/debian/deb822.py @@ -627,7 +627,9 @@ class Deb822(Deb822Dict): gpg_post_lines = [] state = b'SAFE' gpgre = re.compile(br'^-----(?P<action>BEGIN|END) PGP (?P<what>[^-]+)-----$') - blank_line = re.compile(b'^$') + # Include whitespace-only lines in blank lines to split paragraphs. + # (see #715558) + blank_line = re.compile(b'^\s*$') first_line = True for line in sequence: diff --git a/tests/test_deb822.py b/tests/test_deb822.py index a989a32..614bd5a 100755 --- a/tests/test_deb822.py +++ b/tests/test_deb822.py @@ -445,6 +445,30 @@ class TestDeb822(unittest.TestCase): self.assertWellParsed(d, PARSED_PACKAGE) self.assertEqual(count, 2) + def test_iter_paragraphs_with_extra_whitespace(self): + """ Paragraphs not elided when stray whitespace is between + + From policy §5.1: + + The paragraphs are separated by empty lines. Parsers may accept + lines consisting solely of spaces and tabs as paragraph separators, + but control files should use empty lines. + + On the principle of "be strict in what you send; be generous in + what you receive", deb822 should permit such extra whitespace between + deb822 stanzas. + + See #715558 for further details. + """ + for extra_space in (" ", " ", "\t"): + text = (UNPARSED_PACKAGE + '%s\n' % extra_space + + UNPARSED_PACKAGE).splitlines() + count = 0 + for d in deb822.Deb822.iter_paragraphs(text): + count += 1 + self.assertEqual(count, 2, + "Wrong number paragraphs were found: %d != 2" % count) + def _test_iter_paragraphs(self, filename, cls, **kwargs): """Ensure iter_paragraphs consistency""" -- 1.9.1
signature.asc
Description: This is a digitally signed message part.