Hello :) Everything works fine, thank You very much!
Best regards,
--
Paweł Leń

2013/11/15 suzuki toshiya <[email protected]>

> How about this?
>
> Regards,
> mpsuzuki
>
> On 11/15/2013 04:26 PM, suzuki toshiya wrote:
>
>> I'm trying to fix this issue by an insertion of myXmlTokenReplace()
>> into printInfoString().
>>
>> Regards,
>> mpsuzuki
>>
>> On 11/14/2013 10:42 PM, Paweł Leń wrote:
>>
>>> This is the contents of the file output.xml generated by the command
>>> pdftotext -bbox -htmlmeta 'myfile.pdf' 'output.xml':
>>>
>>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
>>> <head>
>>> <title>Microsoft Word - Preface&Contents_Advances_in_Lasers_and_Electro_Optics.doc</title>
>>> <meta name="Author" content="Teodora"/>
>>> <meta name="Creator" content="PScript5.dll Version 5.2.2"/>
>>> <meta name="Producer" content="Acrobat Distiller 8.0.0 (Windows)"/>
>>> <meta name="CreationDate" content=""/>
>>> </head>
>>> <body>
>>> <doc>
>>> <page width="482.000000" height="680.000000">
>>> <word xMin="255.120000" yMin="190.576860" xMax="338.055540" yMax="207.269700">Advances</word>
>>> <word xMin="344.000562" yMin="190.576860" xMax="359.331702" yMax="207.269700">in</word>
>>> <word xMin="365.276724" yMin="190.576860" xMax="425.239584" yMax="207.269700">Lasers</word>
>>> <word xMin="256.260624" yMin="207.256884" xMax="288.954240" yMax="223.949724">and</word>
>>> <word xMin="294.884844" yMin="207.256884" xMax="363.168492" yMax="223.949724">Electro</word>
>>> <word xMin="369.099096" yMin="207.256884" xMax="425.265216" yMax="223.949724">Optics</word>
>>> </page>
>>> </doc>
>>> </body>
>>> </html>
>>>
>>> As You can see, in line 3 the <title> tag contains an invalid character
>>> sequence with "&". The title is extracted from myfile.pdf. CDATA or some
>>> kind of htmlspecialchars is needed.
>>>
>>> --
>>> Paweł Leń
>>>
>>> 2013/11/14 suzuki toshiya <[email protected]>
>>>
>>>     Hi,
>>>
>>>     If you could post a sample XML file in which you modified the
>>>     output of pdftotext to fit the XML parser, it would be
>>>     helpful for some kind people to develop a patch.
>>>
>>>     Regards,
>>>     mpsuzuki
>>>
>>>     On 11/14/2013 10:04 PM, Paweł Leń wrote:
>>>
>>>         Hello,
>>>
>>>         I get an error when running:
>>>         pdftotext -bbox -htmlmeta 'myfile.pdf' 'tempFile.xml'
>>>
>>>         The output XML has a <title> tag at the beginning of the
>>>         document (meta section); the error appears when the title
>>>         contains an "&" character. The title field has no CDATA and
>>>         it is not quoted, so it causes an error in my xmllib parser.
>>>         Can I (or You :) ) fix it somehow?
>>>
>>>         Best regards
>>>
>>>         --
>>>         Paweł Leń
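
For anyone reading this thread in the archive, the kind of escaping discussed above can be sketched roughly as follows. This is only an illustration in Python, not the actual poppler patch: the names `escape_xml_text` and `render_title` are hypothetical, and the real change is the C++ one described in the quoted message (a call to myXmlTokenReplace() inside printInfoString()).

```python
from xml.etree import ElementTree

# Hypothetical helper, not poppler code: replace the characters that are
# special in XML with their predefined entities. '&' is handled first so
# that the entities produced for the other characters are not escaped again.
_XML_ENTITIES = (
    ("&", "&amp;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ('"', "&quot;"),
    ("'", "&apos;"),
)


def escape_xml_text(text: str) -> str:
    """Return text with XML special characters replaced by entities."""
    for char, entity in _XML_ENTITIES:
        text = text.replace(char, entity)
    return text


def render_title(title: str) -> str:
    """Build a <title> element whose content is safe for an XML parser."""
    return "<title>" + escape_xml_text(title) + "</title>"


if __name__ == "__main__":
    raw = ("Microsoft Word - Preface&Contents_Advances_in_"
           "Lasers_and_Electro_Optics.doc")
    element = render_title(raw)
    print(element)
    # Once '&' is written as '&amp;' the element parses, and the parser
    # hands back the original title text.
    assert ElementTree.fromstring(element).text == raw
```

With the title escaped this way, a strict XML parser should accept the metadata section, which matches the behaviour reported at the top of this message.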
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
