How about this?
Regards,
mpsuzuki
On 11/15/2013 04:26 PM, suzuki toshiya wrote:
I'm trying to fix this issue by inserting a call to myXmlTokenReplace()
into printInfoString().
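
For readers who do not have the source at hand, here is a minimal sketch of the kind of escaping such a helper performs. This is only an illustration: escapeXml() is a hypothetical name, and the body is not the actual myXmlTokenReplace() from pdftotext.cc, which may differ in which characters it replaces and how.

```cpp
#include <string>

// Hypothetical helper, NOT poppler's myXmlTokenReplace(): replace the five
// characters that XML treats as markup with their entity references.
std::string escapeXml(const std::string &in) {
  std::string out;
  out.reserve(in.size());
  for (char c : in) {
    switch (c) {
      case '&':  out += "&amp;";  break;
      case '<':  out += "&lt;";   break;
      case '>':  out += "&gt;";   break;
      case '"':  out += "&quot;"; break;
      case '\'': out += "&#39;";  break;
      default:   out += c;        break;  // everything else passes through
    }
  }
  return out;
}
```

With a helper like this, a title such as Preface&Contents would be written as Preface&amp;Contents, which an XML parser accepts.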
Regards,
mpsuzuki
On 11/14/2013 10:42 PM, Paweł Leń wrote:
These are the contents of output.xml, generated by the command
pdftotext -bbox -htmlmeta 'myfile.pdf' 'output.xml':
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Microsoft Word -
Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc</title>
<meta name="Author" content="Teodora"/>
<meta name="Creator" content="PScript5.dll Version 5.2.2"/>
<meta name="Producer" content="Acrobat Distiller 8.0.0 (Windows)"/>
<meta name="CreationDate" content=""/>
</head>
<body>
<doc>
<page width="482.000000" height="680.000000">
<word xMin="255.120000" yMin="190.576860" xMax="338.055540"
yMax="207.269700">Advances</word>
<word xMin="344.000562" yMin="190.576860" xMax="359.331702"
yMax="207.269700">in</word>
<word xMin="365.276724" yMin="190.576860" xMax="425.239584"
yMax="207.269700">Lasers</word>
<word xMin="256.260624" yMin="207.256884" xMax="288.954240"
yMax="223.949724">and</word>
<word xMin="294.884844" yMin="207.256884" xMax="363.168492"
yMax="223.949724">Electro</word>
<word xMin="369.099096" yMin="207.256884" xMax="425.265216"
yMax="223.949724">Optics</word>
</page>
</doc>
</body>
</html>
As you can see, the <title> tag on line 3 contains an invalid character
sequence: a bare "&". The title is extracted from myfile.pdf. CDATA or some
kind of htmlspecialchars-style escaping is needed.
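
To illustrate the CDATA option, here is a rough sketch. wrapCdata() is a hypothetical name and is not part of pdftotext. The only subtlety is that the sequence "]]>" cannot appear inside a CDATA section, so the sketch splits it across two adjacent sections.

```cpp
#include <string>

// Hypothetical helper: wrap text in a CDATA section so that "&" and "<"
// need no escaping. Any "]]>" in the input is split across two sections,
// because that sequence would otherwise terminate the section early.
std::string wrapCdata(const std::string &in) {
  std::string out = "<![CDATA[";
  std::string::size_type pos = 0;
  std::string::size_type hit;
  while ((hit = in.find("]]>", pos)) != std::string::npos) {
    out.append(in, pos, hit - pos);   // text before the "]]>"
    out += "]]]]><![CDATA[>";         // "]]" ends one section, ">" starts the next
    pos = hit + 3;
  }
  out.append(in, pos, std::string::npos);  // remaining tail
  out += "]]>";
  return out;
}
```

One caveat on this design: CDATA sections are only honoured when the file is parsed as XML. A parser that treats the output as plain HTML would show the CDATA markers as literal text in the title, so entity escaping is probably the more portable choice.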
--
Paweł Leń
2013/11/14 suzuki toshiya <[email protected]>:
Hi,
If you could post a sample XML file, that is, the pdftotext output
modified by hand so that your XML parser accepts it, it would
help anyone kind enough to develop a patch.
Regards,
mpsuzuki
On 11/14/2013 10:04 PM, Paweł Leń wrote:
Hello,
I get an error when running:
pdftotext -bbox -htmlmeta 'myfile.pdf' 'tempFile.xml'
The output XML has a <title> tag at the beginning of the document (meta section).
The error appears when the title contains a "&" character. The title field is not
wrapped in CDATA and is not escaped, so it causes an error in my xmllib parser.
Can I (or you :) ) fix it somehow?
Best regards
--
Paweł Leń
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
diff --git a/utils/pdftotext.cc b/utils/pdftotext.cc
index f7b2b0e..7946ba7 100644
--- a/utils/pdftotext.cc
+++ b/utils/pdftotext.cc
@@ -437,7 +437,7 @@ static void printInfoString(FILE *f, Dict *infoDict, const char *key,
GooString *s1;
GBool isUnicode;
Unicode u;
- char buf[8];
+ char buf[9];
int i, n;
if (infoDict->lookup(key, &obj)->isString()) {
@@ -461,7 +461,9 @@ static void printInfoString(FILE *f, Dict *infoDict, const char *key,
++i;
}
n = uMap->mapUnicode(u, buf, sizeof(buf));
- fwrite(buf, 1, n, f);
+ buf[n] = '\0';
+ const std::string myString = myXmlTokenReplace(buf);
+ fputs(myString.c_str(), f);
}
fputs(text2, f);
}