Package: libhtml-gumbo-perl
Version: 0.18-4+b1
Severity: serious
Tags: security upstream
Justification: security
Forwarded: https://github.com/ruz/HTML-Gumbo/issues/6
X-Debbugs-Cc: Debian Security Team <t...@security.debian.org>

I get erratic behavior on the template HTML element, e.g. on
the HTML file "<template>". For instance:

$ perl -C -MHTML::Gumbo -e "print HTML::Gumbo->new->parse('<template>', format => 'string');"
<html><head>\217¥�¾U</head><body></body></html>
$ perl -C -MHTML::Gumbo -e "print HTML::Gumbo->new->parse('<template>', format => 'string');"
<html><head>)�>\220U</head><body></body></html>
$ perl -C -MHTML::Gumbo -e "print HTML::Gumbo->new->parse('<template>', format => 'string');"
<html><head>q'N$uU</head><body></body></html>

One can see random output, which may include control characters
(above, I have changed them to \217 and \220 as Emacs shows them,
to avoid such control characters in the mail message).

With valgrind:

$ valgrind perl -C -MHTML::Gumbo -e "print HTML::Gumbo->new->parse('<template>', format => 'string');"
==64955== Memcheck, a memory error detector
==64955== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==64955== Using Valgrind-3.24.0 and LibVEX; rerun with -h for copyright info
==64955== Command: perl -C -MHTML::Gumbo -e print\ HTML::Gumbo-\>new-\>parse('\<template\>',\ format\ =\>\ 'string');
==64955==
==64955== Conditional jump or move depends on uninitialised value(s)
==64955==    at 0x484DC89: strlen (vg_replace_strmem.c:505)
==64955==    by 0x2AD7DF: ??? (in /usr/bin/perl)
==64955==    by 0x486D6CE: tree_to_string (Gumbo.xs:189)
==64955==    by 0x486E2C4: walk_tree.isra.0 (Gumbo.xs:55)
==64955==    by 0x486E2C4: walk_tree.isra.0 (Gumbo.xs:55)
==64955==    by 0x486E2C4: walk_tree.isra.0 (Gumbo.xs:55)
==64955==    by 0x486E41B: parse_to_string_cb (Gumbo.xs:505)
==64955==    by 0x486ED4B: common_parse.isra.0 (Gumbo.xs:545)
==64955==    by 0x486F09C: XS_HTML__Gumbo_parse_to_string (Gumbo.xs:559)
==64955==    by 0x20B3E7: ??? (in /usr/bin/perl)
==64955==    by 0x290C95: Perl_runops_standard (in /usr/bin/perl)
==64955==    by 0x179E51: perl_run (in /usr/bin/perl)
==64955==
<html><head></head><body></body></html>
==64955==
==64955== HEAP SUMMARY:
==64955==     in use at exit: 592,160 bytes in 2,369 blocks
==64955==   total heap usage: 7,166 allocs, 4,797 frees, 1,159,576 bytes allocated
==64955==
==64955== LEAK SUMMARY:
==64955==    definitely lost: 18,102 bytes in 19 blocks
==64955==    indirectly lost: 50,698 bytes in 23 blocks
==64955==      possibly lost: 514,100 bytes in 2,318 blocks
==64955==    still reachable: 9,260 bytes in 9 blocks
==64955==                       of which reachable via heuristic:
==64955==                         newarray           : 1,056 bytes in 33 blocks
==64955==         suppressed: 0 bytes in 0 blocks
==64955== Rerun with --leak-check=full to see details of leaked memory
==64955==
==64955== Use --track-origins=yes to see where uninitialised values come from
==64955== For lists of detected and suppressed errors, rerun with: -s
==64955== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

So, uninitialized data are used for the output.

If I use "format => 'callback'" (with a callback) instead of
"format => 'string'", then I get the following error:

Unknown node type at /usr/lib/x86_64-linux-gnu/perl5/5.40/HTML/Gumbo.pm line 298, <> line 1.

(which is better from the security point of view, but prevents one
from parsing some modern HTML documents).

It apparently comes from Gumbo.xs, where there are two occurrences of

  croak("Unknown node type");

I suspect that it is the first one, as the second one corresponds to
text node types.

The cause is probably the most recent node type GUMBO_NODE_TEMPLATE
from the Gumbo library (libgumbo):

typedef enum {
  /** Document node.  v will be a GumboDocument. */
  GUMBO_NODE_DOCUMENT,
  /** Element node.  v will be a GumboElement. */
  GUMBO_NODE_ELEMENT,
  /** Text node.  v will be a GumboText. */
  GUMBO_NODE_TEXT,
  /** CDATA node. v will be a GumboText. */
  GUMBO_NODE_CDATA,
  /** Comment node.  v will be a GumboText, excluding comment delimiters. */
  GUMBO_NODE_COMMENT,
  /** Text node, where all contents is whitespace.  v will be a GumboText. */
  GUMBO_NODE_WHITESPACE,
  /** Template node.  This is separate from GUMBO_NODE_ELEMENT because many
   * client libraries will want to ignore the contents of template nodes, as
   * the spec suggests.  Recursing on GUMBO_NODE_ELEMENT will do the right thing
   * here, while clients that want to include template contents should also
   * check for GUMBO_NODE_TEMPLATE.  v will be a GumboElement.  */
  GUMBO_NODE_TEMPLATE
} GumboNodeType;

This node type was added in 2015:

https://github.com/google/gumbo-parser/commit/4383a40605ee7872a8e2de58553383a13d919153

but most of the HTML::Gumbo code predates this change.
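
To illustrate the kind of dispatch the enum comment above suggests, here is a
minimal sketch in C. It is not the actual Gumbo.xs code: the enum is a local
copy of the one quoted above (so the sketch compiles without gumbo.h), and
PayloadKind and payload_kind() are hypothetical names introduced only for
this example. The point is that GUMBO_NODE_TEMPLATE is grouped with
GUMBO_NODE_ELEMENT, since v is a GumboElement in both cases, and that any
unrecognized type is reported explicitly instead of falling through.

```c
#include <stddef.h>

/* Local mirror of the GumboNodeType enum quoted above, so that this sketch
 * compiles on its own. Real code would include <gumbo.h> instead. */
typedef enum {
  GUMBO_NODE_DOCUMENT,
  GUMBO_NODE_ELEMENT,
  GUMBO_NODE_TEXT,
  GUMBO_NODE_CDATA,
  GUMBO_NODE_COMMENT,
  GUMBO_NODE_WHITESPACE,
  GUMBO_NODE_TEMPLATE
} GumboNodeType;

/* Hypothetical helper type (not part of libgumbo or HTML::Gumbo): which
 * member of the node's "v" union is valid for a given node type. */
typedef enum {
  PAYLOAD_UNKNOWN,
  PAYLOAD_DOCUMENT,
  PAYLOAD_ELEMENT,
  PAYLOAD_TEXT
} PayloadKind;

/* Map a node type to the kind of payload it carries. */
static PayloadKind payload_kind(GumboNodeType type) {
  switch (type) {
    case GUMBO_NODE_DOCUMENT:
      return PAYLOAD_DOCUMENT;
    case GUMBO_NODE_ELEMENT:
    case GUMBO_NODE_TEMPLATE: /* v is a GumboElement here too */
      return PAYLOAD_ELEMENT;
    case GUMBO_NODE_TEXT:
    case GUMBO_NODE_CDATA:
    case GUMBO_NODE_COMMENT:
    case GUMBO_NODE_WHITESPACE:
      return PAYLOAD_TEXT;
  }
  /* A type added by a newer libgumbo ends up here and is reported as
   * unknown, so the caller never reads an unset string. */
  return PAYLOAD_UNKNOWN;
}
```

A tree walker written this way would branch on payload_kind(node->type) and
fail (e.g. with croak) on PAYLOAD_UNKNOWN before computing any tag name or
text, rather than continuing with variables that were never set.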

-- System Information:
Debian Release: trixie/sid
  APT prefers unstable-debug
  APT policy: (500, 'unstable-debug'), (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable-debug'), (500, 'proposed-updates-debug'), (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 6.11.10-amd64 (SMP w/12 CPU threads; PREEMPT)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=C.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages libhtml-gumbo-perl depends on:
ii  libc6                       2.41-7
ii  libgumbo3                   0.13.0+dfsg-2
ii  libhtml-tree-perl           5.07-3
ii  perl                        5.40.1-3
ii  perl-base [perlapi-5.40.0]  5.40.1-3

libhtml-gumbo-perl recommends no packages.

libhtml-gumbo-perl suggests no packages.

-- no debconf information

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / Pascaline project (LIP, ENS-Lyon)
