Hi *: I work on PDF files, some of which might be image-based (with or without embedded text), while others are searchable PDFs that include images of varying quality and text embedded in various ways. This would be a typical file I deal with:
https://www.nysedregents.org/USHistoryGov/Archive/20020122exam.pdf

Which tools could be used to extract the text from the images? As Liam on the gimpusers forum pointed out to me, you need: (1) feature extraction, to find the writing; (2) OCR of some sort, to turn pictures of letters into letters; and then (3) linguistic analysis. Which tools and/or strategies could be used for steps 1-3? (To make the question concrete, a sketch of the kind of pipeline I mean follows at the end of this message.)

Another example of the textual files I work with would be:

https://scholarworks.iu.edu/dspace/bitstream/handle/2022/18961/Notes and Texts.pdf

On that searchable file, pdftohtml produces one background image per page, but when you stratify that content (simply using hash signatures; that step is also sketched below) you realize most files are of the same kind: blank background images, images containing just a single line (for example, underlining a title) or framing a blocked message, full-page blank images with segments of Greek text, ...

Why don't the poppler utils: a) underline text segments, since they know their exact X,Y offsets; b) encode blocked text using HTML blocks; c) include the images of textual characters in foreign languages as character sequences; instead of creating a background image for each page for such purposes? Maybe there is a way to work around such hurdles that I don't know of, and/or someone has already written code to take care of it. Do you know of such code?

Thank you,
lbrtchx
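To make steps (1)-(3) concrete, here is a minimal sketch of the kind of pipeline I have in mind, using poppler's pdftoppm to rasterize the pages and Tesseract (via its pytesseract wrapper) for the recognition itself; the file name, the 300 dpi resolution, and the choice of pytesseract are only assumptions for illustration:

import glob
import subprocess

import pytesseract           # assumes Tesseract plus its Python wrapper are installed
from PIL import Image

PDF = "20020122exam.pdf"     # the exam file from the first example above

# Step (1), roughly: render each PDF page to a raster image with poppler's
# pdftoppm, so the OCR engine sees the page the way a scanner would.
subprocess.run(["pdftoppm", "-r", "300", "-png", PDF, "page"], check=True)

# Steps (2) and (3): Tesseract performs the character recognition and,
# through its per-language models (lang="eng"), a basic linguistic pass.
for png in sorted(glob.glob("page-*.png")):
    text = pytesseract.image_to_string(Image.open(png), lang="eng")
    print("=== %s ===" % png)
    print(text)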

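And this is roughly the hash-signature stratification mentioned above (a sketch, assuming the pdftohtml background images sit in the current directory with .png/.jpg extensions):

import hashlib
import os
from collections import defaultdict

# Group pdftohtml's per-page background images by content hash; the many
# identical blanks and single-line images collapse into one bucket each.
buckets = defaultdict(list)
for name in sorted(os.listdir(".")):
    if name.endswith((".png", ".jpg")):
        with open(name, "rb") as f:
            buckets[hashlib.sha256(f.read()).hexdigest()].append(name)

# Print each stratum, largest first; most buckets hold many identical files.
for digest, names in sorted(buckets.items(), key=lambda kv: -len(kv[1])):
    print("%4d file(s)  %s...  e.g. %s" % (len(names), digest[:12], names[0]))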