John Gardner <gardnerjo...@gmail.com> wrote:

> It's easier than you think. You just have to separate presentational
> semantics from structural and content-related ones.

I’m fond of saying that ‘All you have to do is…’ is one of the biggest lies ever told. ;-D

> I've seen grohtml's complexity and was bewildered. Hence why I intend to
> write my own. The procedures for inferring structural or semantic metadata
> from low-level intermediate output commands will be an entertaining
> challenge. =)

Good luck! I’ve actually had some limited success using ps2ascii’s more complex output modes to derive structure from font/size specifications and spacing/location, converting a PDF file to some kind of markup. Of course, it’s very specific to individual documents. It’s actually a collection of scripts, one of which returns a list of the fonts and sizes used in a document and the number of characters set in each. I would use that list to build a table specifying whether strings in each format were inline or block, and the kind of markup to wrap them in.

Paragraph detection is, um, fun. Some books use indents, others use vertical space. And don’t get me started on definition lists or tables…

Larry
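
A minimal sketch of the font-table approach described above, not the actual scripts. It assumes the text has already been extracted into (font, size, text) runs; the names RUNS, RULES, font_histogram and to_markup are hypothetical, the sample fonts and tags are made up for illustration, and ps2ascii's real output format is not modelled here.

```python
from collections import Counter

# Hypothetical input: (font, size, text) runs already pulled out of the
# extractor's output. The sample values are invented for illustration.
RUNS = [
    ("Times-Bold", 18.0, "Chapter 1"),
    ("Times-Roman", 10.0, "Plain text with "),
    ("Times-Italic", 10.0, "emphasis"),
    ("Times-Roman", 10.0, " inside it."),
]


def font_histogram(runs):
    """Return ((font, size), character count) pairs, most used first."""
    counts = Counter()
    for font, size, text in runs:
        counts[(font, size)] += len(text)
    return counts.most_common()


# Hand-built table, one entry per format found by font_histogram:
# (font, size) -> (kind, tag). A tag of None means "emit the text bare".
RULES = {
    ("Times-Bold", 18.0): ("block", "h1"),
    ("Times-Roman", 10.0): ("inline", None),
    ("Times-Italic", 10.0): ("inline", "em"),
}


def to_markup(runs, rules):
    """Wrap each run according to the table; unknown formats pass through."""
    out = []
    for font, size, text in runs:
        kind, tag = rules.get((font, size), ("inline", None))
        piece = f"<{tag}>{text}</{tag}>" if tag else text
        if kind == "block":
            # Block-level strings get a line of their own.
            piece = "\n" + piece + "\n"
        out.append(piece)
    return "".join(out).strip()


if __name__ == "__main__":
    for (font, size), chars in font_histogram(RUNS):
        print(f"{font} {size}: {chars} characters")
    print(to_markup(RUNS, RULES))
```

The histogram is what makes the table practical to write by hand: the most-used format is almost always body text, and the rarely used ones tend to be headings or emphasis. Paragraph detection is deliberately left out of this sketch, since it depends on indents or vertical spacing that the runs above do not carry.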