On 5/2/23 6:29 AM, Dave Trombley wrote:
> I'm trying to extract some specially formatted text from a PDF, and it
> seems like it will be impossible to use PDFTextStripper for this task.  In
> particular, some of the font style (bold / italic /etc) and color
> information is semantically relevant, and what is considered a "paragraph"
> depends on this information.
>
> What would be ideal is if there were a way to have a callback of mine
> called for each glyph on the page, containing its font, color, size, glyph,
> and location in translated / simple page coordinates.  Is there a way to do
> something like that?
>
> I've looked at some of the classes that PDFTextStripper derives from, but
> it's not clear to me how these work and they seem to have TOO much
> information, not at all a simple view of the characters / text
> themselves.

I use something like this fairly often:

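    import java.util.ArrayList;
    import java.util.List;
    import org.apache.pdfbox.contentstream.operator.Operator;
    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.pdfparser.PDFStreamParser;

    // page is the PDPage whose content stream should be read;
    // parseNextToken() can throw IOException.
    PDFStreamParser     parser = new PDFStreamParser(page);
    List<COSBase>       operands = new ArrayList<>();
    Object              token;

    while ((token = parser.parseNextToken()) != null)
    {
        // Operands come first in a content stream, so collect them
        // until the operator they belong to shows up.
        if (token instanceof COSBase)
        {
            operands.add((COSBase) token);

            continue;
        }

        if (!(token instanceof Operator))
            throw new IllegalArgumentException("Unknown token " + token);

        // parseOperator is my own method; it sees the operator together
        // with the operands collected for it.
        parseOperator((Operator) token, operands);

        operands.clear();
    }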

That's very low level: you have to track positions, transformation matrices, graphics state, and everything else yourself. In return, you get to see everything the PDF says to do, and you can set state and react accordingly.
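
To give an idea of the bookkeeping involved, here is a minimal sketch of the kind of state tracking a parseOperator method ends up doing. It is not PDFBox code: GraphicsStateTracker and its methods are hypothetical names, and operators and operands are simplified to plain String and Double values instead of Operator and COSBase. It only handles q, Q, cm and Tf; a real handler would also need the text positioning operators (Tm, Td, and so on) before it could place glyphs.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: tracks the slice of graphics state a text extractor needs. */
final class GraphicsStateTracker
{
    /** The values that q saves and Q restores (only the ones this sketch cares about). */
    private static final class State
    {
        AffineTransform ctm = new AffineTransform();   // identity to start with
        String          fontName;                      // resource name from Tf, e.g. "F1"
        double          fontSize;

        State copy()
        {
            State s = new State();
            s.ctm = new AffineTransform(ctm);
            s.fontName = fontName;
            s.fontSize = fontSize;
            return s;
        }
    }

    private final Deque<State> saved = new ArrayDeque<>();
    private State current = new State();

    /** Simplified operands: Double for numbers, String for names. */
    void handle(String operator, List<Object> operands)
    {
        switch (operator)
        {
            case "q":       // save graphics state
                saved.push(current.copy());
                break;
            case "Q":       // restore graphics state (ignored if nothing was saved)
                if (!saved.isEmpty())
                    current = saved.pop();
                break;
            case "cm":      // a b c d e f: the new matrix is applied before the old CTM
                current.ctm.concatenate(new AffineTransform(
                        num(operands, 0), num(operands, 1), num(operands, 2),
                        num(operands, 3), num(operands, 4), num(operands, 5)));
                break;
            case "Tf":      // font resource name and size
                current.fontName = (String) operands.get(0);
                current.fontSize = num(operands, 1);
                break;
            default:        // everything else is ignored in this sketch
                break;
        }
    }

    /** Maps a point from the current user space to page space using the tracked CTM. */
    Point2D toPage(double x, double y)
    {
        return current.ctm.transform(new Point2D.Double(x, y), null);
    }

    String fontName() { return current.fontName; }
    double fontSize() { return current.fontSize; }

    private static double num(List<Object> operands, int index)
    {
        return ((Number) operands.get(index)).doubleValue();
    }
}
```

With something like that in place, toPage(x, y) gives you simple page coordinates for a point, and the font that was active before a q comes back on the matching Q, since the text font is part of the graphics state.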

Also, I don't think PDF has specific font style operators the way, say, HTML has style tags. So you won't see something like:

turn on italic
draw some text
turn off italic

Instead it will be:

choose a font that happens to be italic
draw some text
reset the current font
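
So bold or italic has to be inferred from the font itself. Here is a rough sketch of one way to guess, assuming you have already pulled the font's name and its font descriptor Flags value out of the PDF (I believe PDFBox exposes both through PDFont and PDFontDescriptor, but check the API docs). FontStyleGuess is a made-up helper and the name matching is only a heuristic; the two flag bits are the Italic and ForceBold bits from the font descriptor flags in the PDF specification.

```java
import java.util.Locale;

/** Hypothetical helper: guesses bold / italic from a font name and descriptor flags. */
final class FontStyleGuess
{
    static final int FLAG_ITALIC     = 1 << 6;    // bit 7 of the font descriptor Flags
    static final int FLAG_FORCE_BOLD = 1 << 18;   // bit 19 of the font descriptor Flags

    /** Removes a subset tag such as "ABCDEF+" (six uppercase letters and a plus sign). */
    static String stripSubsetPrefix(String name)
    {
        if (name.length() > 7 && name.charAt(6) == '+')
        {
            for (int i = 0; i < 6; i++)
            {
                char c = name.charAt(i);

                if (c < 'A' || c > 'Z')
                    return name;
            }

            return name.substring(7);
        }

        return name;
    }

    /** True if the Italic flag is set or the name mentions italic / oblique. */
    static boolean looksItalic(String fontName, int flags)
    {
        String lower = stripSubsetPrefix(fontName).toLowerCase(Locale.ROOT);

        return (flags & FLAG_ITALIC) != 0
            || lower.contains("italic")
            || lower.contains("oblique");
    }

    /** True if the ForceBold flag is set or the name mentions bold. */
    static boolean looksBold(String fontName, int flags)
    {
        String lower = stripSubsetPrefix(fontName).toLowerCase(Locale.ROOT);

        return (flags & FLAG_FORCE_BOLD) != 0 || lower.contains("bold");
    }
}
```

Fonts whose names say nothing about style and whose flags are not set will slip through, so treat the result as a hint and not as a fact.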

Of course I'm nowhere near an expert at PDF or PDFBox, so maybe someone else will have a better suggestion.

Brian

