Package: html2text
Version: 1.3.2a-14
Severity: normal
Tags: upstream patch
If the <html> element has an attribute whose name contains a digit,
html2text stops processing the element at that digit, and the remaining
attributes show up in the output (debug output shows that it treats the
remainder as PCDATA).
Example input (parsing goes wrong at the 2 in xmlns:x2):
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:p="urn:schemas-microsoft-com:office:powerpoint"
xmlns:a="urn:schemas-microsoft-com:office:access"
xmlns:mt="http://schemas.microsoft.com/sharepoint/soap/meetings/"
xmlns:x2="http://schemas.microsoft.com/office/excel/2003/xml"
xmlns:ppda="http://www.passport.com/NameSpace.xsd"
><body>test</body></html>
Output:
="http://schemas.microsoft.com/office/excel/2003/xml" xmlns:ppda="http://
www.passport.com/NameSpace.xsd" >
test
Patch:
--- HTMLControl.C.orig 2003-11-23 04:05:29.000000000 -0700
+++ HTMLControl.C 2010-05-18 19:33:54.000000000 -0600
@@ -372,7 +372,7 @@
attribute.first = c;
for (;;) {
c = get_char();
- if (!isalpha(c) && c != '-' && c != '_' && c != ':') break;
+ if (!isalnum(c) && c != '-' && c != '_' && c != ':') break;
// Same as in line 352 - Arno
attribute.first += c;
}
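
To illustrate what the one-character change does, here is a minimal stand-alone sketch of the name-scanning loop. It is not the html2text code: scan_attribute_name and its allow_digits flag are hypothetical names made up for this example, and the sketch reads from a std::string instead of calling get_char(). It also applies the same test to the first character, which the real code handles separately before the loop.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of html2text): collect the attribute name
// that starts at `pos` in `input`.
// allow_digits == false mimics the original isalpha() test,
// allow_digits == true mimics the patched isalnum() test.
std::string scan_attribute_name(const std::string &input, std::size_t pos,
                                bool allow_digits) {
    std::string name;
    for (; pos < input.size(); ++pos) {
        // Cast to unsigned char: passing a negative char to the <cctype>
        // classification functions is undefined behaviour.
        unsigned char c = static_cast<unsigned char>(input[pos]);
        bool is_name_char = allow_digits ? std::isalnum(c) != 0
                                         : std::isalpha(c) != 0;
        if (!is_name_char && c != '-' && c != '_' && c != ':') break;
        name += static_cast<char>(c);
    }
    return name;
}
```

Given the string xmlns:x2="..." from the example input, the isalpha() variant stops at the 2 and returns xmlns:x, while the isalnum() variant returns the full name xmlns:x2.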
-- System Information:
Debian Release: squeeze/sid
APT prefers oldstable
APT policy: (500, 'oldstable'), (500, 'testing'), (500, 'stable')
Architecture: i386 (i686)
Kernel: Linux 2.6.30-2-686 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash
Versions of packages html2text depends on:
ii libc6 2.10.2-6 Embedded GNU C Library: Shared lib
ii libgcc1 1:4.4.2-9 GCC support library
ii libstdc++6 4.4.2-9 The GNU Standard C++ Library v3
html2text recommends no packages.
Versions of packages html2text suggests:
ii curl 7.20.0-2 Get a file from an HTTP, HTTPS or
ii wget 1.12-1.1 retrieves files from the web
-- no debconf information