Josh Richardson <jric <at> chegg.com> writes:
>
> 1. I'd like to point out that pdftohtml also has a "coalescence" function
> which attempts to make paragraphs out of PDF, but is so far very
> rudimentary and inaccurate, and could definitely benefit from some good
> algorithmic sauce. Perhaps we could figure out how to create functions at
> the poppler library level to be leveraged across applications. I'd be
> happy to contribute.
> 2. Dave, why do you say that you cannot read unicode, and you want 8-bit
> in plain English? Unicode is great for describing English, as well as
> every other human language. ASCII is encoded in 7 bits, and once you get
> into that eighth bit, you better know what the encoding is, otherwise you
> may misinterpret the meaning. What exactly is the problem you're facing?
> For pdftohtml, we found that many documents were encoded with glyphs from
> embedded fonts that had no unicode mapping. If you need to be able to
> interpret that text without reference to the embedded font, then I think
> you'll have to do pattern-matching on the rendered glyph. Not something
> I'm planning to undertake, but sounds like fun!
>
> --josh
>
Hi Josh,

Thanks for your reply.

In the Gfx file I read the commands, and I have direct access to the character strings from those commands, because the text is a parameter of TJ or Tj. Since all the pieces of text from the same paragraph are always between BT (begin text) and ET (end text), I can extract the whole paragraph correctly, so I don't need to guess or do any more complex processing.

The problem with this approach is that sometimes, instead of letters, I get some strange output (each character prints like a 2x2 table of numbers). If I extract the text before rendering instead of from the commands (which is what most people do), I can actually read the string of characters. So my question is: which piece of code does that translation? I have not been able to find it.

So far I have also written some heuristics to separate paragraphs. They work most of the time, but not always. I think that if I can find a way to translate those codes, I will have something that works all the time.

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
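
To make the question above concrete: the strings passed to Tj and TJ hold character codes, not text, and those codes only become readable once they are mapped through the font's encoding or its ToUnicode CMap. When a code has no mapping, it typically shows up as the boxed-number placeholder described above. The sketch below is a minimal Python illustration of that translation step and is not poppler's code. It assumes hex-string operands, 2-byte character codes, and a ToUnicode CMap that uses only simple `bfchar` and `bfrange` entries. The names `parse_to_unicode`, `text_blocks`, `decode`, and the sample data are all hypothetical. In poppler itself the equivalent mapping appears to live in the font classes (`GfxFont::getNextChar` together with `CharCodeToUnicode`), which would be the place to look.

```python
import re

# Hypothetical sample data: a tiny ToUnicode CMap and a content stream.
SAMPLE_CMAP = """
2 beginbfchar
<0001> <0048>
<0002> <0069>
endbfchar
1 beginbfrange
<0003> <0004> <0041>
endbfrange
"""

SAMPLE_CONTENT = """
BT /F1 12 Tf <00010002> Tj ET
BT /F1 12 Tf [<0003> -20 <0004>] TJ ET
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_to_unicode(cmap_text):
    """Build {character code: Unicode string} from bfchar/bfrange sections."""
    table = {}
    # bfchar entries: <code> <UTF-16BE destination>
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, body):
            table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange entries: <low> <high> <first destination>
    # (simplification: the array form and non-BMP destinations are ignored)
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, body):
            base = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                table[code] = chr(base + offset)
    return table


def text_blocks(content):
    """Yield one list of character codes per BT ... ET block."""
    for block in re.findall(r"\bBT\b(.*?)\bET\b", content, re.S):
        codes = []
        # Operand of Tj is a string; operand of TJ is an array of
        # strings and spacing adjustments. Only hex strings are handled.
        operands = re.findall(r"(<[0-9A-Fa-f\s]*>|\[.*?\])\s*T[jJ]", block, re.S)
        for operand in operands:
            for hexstr in re.findall(r"<([0-9A-Fa-f\s]*)>", operand):
                data = bytes.fromhex(hexstr)
                # Assumption: every character code is 2 bytes wide.
                for i in range(0, len(data), 2):
                    codes.append(int.from_bytes(data[i:i + 2], "big"))
        yield codes


def decode(codes, table):
    """Translate character codes to text; unmapped codes become U+FFFD."""
    return "".join(table.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    table = parse_to_unicode(SAMPLE_CMAP)
    for codes in text_blocks(SAMPLE_CONTENT):
        print(codes, "->", decode(codes, table))
```

With the sample data, the first block decodes to "Hi" and the second to "AB". A real document also needs literal `(...)` strings, 1-byte codes for simple fonts, and the font's own encoding as a fallback, none of which this sketch attempts.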
