Package: libgumbo1
Version: 0.10.1+dfsg-2.1
Severity: minor

When parsing a file, a spurious newline is added before </body>.
For instance, on

<!DOCTYPE html>
<html><head></head><body><p>Test</p></body></html>

I get

<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p>Test</p>
</body></html>

when converting it with the following Perl script:

----------------------------------------------------------------------------
#!/usr/bin/env perl

use strict;
use HTML::Gumbo;
use XML::LibXML;

use open ':encoding(UTF-8)';
binmode STDIN, ':encoding(UTF-8)';

my %voidelem = map { $_ => 1 } qw(area base br col embed hr img input keygen
                                  link meta param source track wbr);

my $doc = XML::LibXML::Document->createDocument('1.0', 'utf-8');
my @nodes;  # stack of currently open elements

HTML::Gumbo->new->parse
  (do { local $/; <> }, format => 'callback', callback => sub
   {
     my ($event) = shift;
     if ($event =~ /^document (start|end)$/ )
       { }
     elsif ($event eq 'start' )
       {
         my ($tag, $attr) = @_;
         my $element;
         if (@nodes)
           {
             $element = $doc->createElement($tag);
             $nodes[-1]->appendChild($element);
           }
         else
           {
             $element = $doc->createElementNS('http://www.w3.org/1999/xhtml',
                                              $tag);
             $doc->setDocumentElement($element);
           }
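         # $attr is a flat list of name/value pairs; consume two at a time.
         while (@$attr)
           { $element->setAttribute(splice @$attr, 0, 2); }
         # Void elements cannot have children, so keep them off the stack.
         push @nodes, $element unless $voidelem{$tag};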
       }
     elsif ($event eq 'end')
       {
         $_[0] eq $nodes[-1]->nodeName or die "internal error";
         pop @nodes;
       }
     else
       {
         my $node;
         if ($event =~ /^(text|space)$/)
           { $node = $doc->createTextNode($_[0]); }
         elsif ($event eq 'comment')
           { $node = $doc->createComment($_[0]); }
         elsif ($event eq 'cdata')
           { $node = $doc->createCDATASection($_[0]); }
         else
           { die "unknown event"; }
         $nodes[-1]->appendChild($node);
       }
   }
  );

$doc->toFH(*STDOUT, 0);
----------------------------------------------------------------------------

I also obtain the newline with the "string" output format
instead of "callback":

----------------------------------------------------------------------------
#!/usr/bin/env perl

use strict;
use HTML::Gumbo;

use open ':encoding(UTF-8)';
binmode STDIN, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

print HTML::Gumbo->new->parse(do { local $/; <> });
----------------------------------------------------------------------------

This is with the Perl HTML::Gumbo module, but I suppose that the bug
is in the library itself.

In practice, this whitespace should not matter, but it could in theory
affect some CSS rules, for instance. It also breaks idempotence: when
the output of the above script (with the string output format) is piped
back through it several times, each invocation adds another newline.
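
The repeated invocation can also be reproduced in a single process.
The following is only a minimal sketch: the number of passes, the
variable names and the regular expression are illustrative choices,
and it assumes that the "string" output contains a literal </body>.
It feeds the serialized output back into the parser and reports how
many newlines precede </body> after each pass; according to the
behavior described above, that count should grow with each pass.

----------------------------------------------------------------------------
#!/usr/bin/env perl

use strict;
use warnings;
use HTML::Gumbo;

# The test document from this report. The heredoc ends with a newline,
# as a file on disk would.
my $html = <<'EOF';
<!DOCTYPE html>
<html><head></head><body><p>Test</p></body></html>
EOF

# Feed the serialized output back into the parser, as the pipe does.
for my $pass (1 .. 3)
  {
    $html = HTML::Gumbo->new->parse($html);
    # Capture the whitespace (if any) directly before </body>.
    my ($ws) = $html =~ m{(\s*)</body>};
    # tr/// with an empty replacement only counts; $ws is not modified.
    my $count = defined $ws ? ($ws =~ tr/\n//) : 0;
    print "pass $pass: $count newline(s) before </body>\n";
  }
----------------------------------------------------------------------------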

-- System Information:
Debian Release: buster/sid
  APT prefers unstable-debug
  APT policy: (500, 'unstable-debug'), (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 4.11.0-2-amd64 (SMP w/8 CPU cores)
Locale: LANG=POSIX, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=POSIX (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages libgumbo1 depends on:
ii  libc6  2.24-14

libgumbo1 recommends no packages.

libgumbo1 suggests no packages.

-- no debconf information
