Package: html2text
Version: 1.3.2a-14
Severity: normal
Tags: upstream patch

If the <html> element has an attribute whose name contains a digit,
html2text stops processing the element at that point and the remaining
attributes show up in the output (the debug output shows that it treats the
remainder as PCDATA).

Example input, misparsed at the xmlns:x2 attribute:
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:p="urn:schemas-microsoft-com:office:powerpoint"
xmlns:a="urn:schemas-microsoft-com:office:access"
xmlns:mt="http://schemas.microsoft.com/sharepoint/soap/meetings/"
xmlns:x2="http://schemas.microsoft.com/office/excel/2003/xml"
xmlns:ppda="http://www.passport.com/NameSpace.xsd"
><body>test</body></html>

Output:
="http://schemas.microsoft.com/office/excel/2003/xml" xmlns:ppda="http://
www.passport.com/NameSpace.xsd" >

test


Patch:
--- HTMLControl.C.orig  2003-11-23 04:05:29.000000000 -0700
+++ HTMLControl.C       2010-05-18 19:33:54.000000000 -0600
@@ -372,7 +372,7 @@
             attribute.first = c;
             for (;;) {
               c = get_char();
-              if (!isalpha(c) && c != '-' && c != '_' && c != ':') break;
+              if (!isalnum(c) && c != '-' && c != '_' && c != ':') break;
              // Same as in line 352 - Arno
               attribute.first += c;
             }
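
The effect of the changed test can be checked in isolation. The following is
only a minimal sketch, not the code from HTMLControl.C: scan_attribute_name
and its allow_digits flag are hypothetical names made up for illustration,
and the sketch reads from a std::string instead of calling get_char(). It
uses the same break condition as the loop in the patch, with isalpha for the
old behaviour and isalnum for the patched one.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the attribute-name loop in HTMLControl.C.
// Collects characters from `input`, starting at `pos`, for as long as they
// are accepted as part of an attribute name, and returns what was collected.
// allow_digits == false mimics the old test (isalpha),
// allow_digits == true mimics the patched test (isalnum).
std::string scan_attribute_name(const std::string &input, std::size_t pos,
                                bool allow_digits)
{
  std::string name;
  for (; pos < input.size(); ++pos) {
    // Cast to unsigned char: passing a negative char to the <cctype>
    // functions is undefined behaviour.
    unsigned char c = static_cast<unsigned char>(input[pos]);
    bool ok = allow_digits ? (std::isalnum(c) != 0) : (std::isalpha(c) != 0);
    if (!ok && c != '-' && c != '_' && c != ':') break;
    name += static_cast<char>(c);
  }
  return name;
}
```

Called on the text xmlns:x2="..." from the example input, the sketch returns
"xmlns:x" with allow_digits set to false and "xmlns:x2" with it set to true,
which corresponds to the point where the misparse in the report begins.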


-- System Information:
Debian Release: squeeze/sid
  APT prefers oldstable
  APT policy: (500, 'oldstable'), (500, 'testing'), (500, 'stable')
Architecture: i386 (i686)

Kernel: Linux 2.6.30-2-686 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages html2text depends on:
ii  libc6                         2.10.2-6   Embedded GNU C Library: Shared lib
ii  libgcc1                       1:4.4.2-9  GCC support library
ii  libstdc++6                    4.4.2-9    The GNU Standard C++ Library v3

html2text recommends no packages.

Versions of packages html2text suggests:
ii  curl                          7.20.0-2   Get a file from an HTTP, HTTPS or 
ii  wget                          1.12-1.1   retrieves files from the web

-- no debconf information