Hi,

I'd like to add a small but important detail to Robert's suggestion. If you are interested in the font size, you should prefer fontSizeInPt from TextPosition. The font size in a PDF is split into two fields: the font size itself and the scaling factor from the text matrix. The attribute fontSizeInPt is a combination of both. This feature is available in the current trunk only. See [1] for further details.
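
To make this concrete, here is a rough sketch of Robert's two-pass idea (quoted below), using the combined font size. It deliberately does not call PDFBox: the Token class and the FontHighlights helper are hypothetical stand-ins, and I assume you fill each Token from a TextPosition yourself (font name from the font object returned by getFont, size from fontSizeInPt). The bold check matches "bold" anywhere in the font name rather than only as a suffix, which is my own loosening of Robert's heuristic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox. Each Token stands in for one
// TextPosition: fontName is assumed to be the font's PostScript name and
// sizeInPt the combined size (font size times text matrix scaling).
class FontHighlights {

    static final class Token {
        final String text;
        final String fontName;
        final float sizeInPt;

        Token(String text, String fontName, float sizeInPt) {
            this.text = text;
            this.fontName = fontName;
            this.sizeInPt = sizeInPt;
        }
    }

    // First pass: median of the effective font sizes on one page.
    static float medianSize(List<Token> tokens) {
        if (tokens.isEmpty()) {
            throw new IllegalArgumentException("no tokens on this page");
        }
        float[] sizes = new float[tokens.size()];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = tokens.get(i).sizeInPt;
        }
        Arrays.sort(sizes);
        int mid = sizes.length / 2;
        return sizes.length % 2 == 1
                ? sizes[mid]
                : (sizes[mid - 1] + sizes[mid]) / 2f;
    }

    // Heuristic only: many bold fonts carry "bold" in their name, but not all.
    static boolean looksBold(String fontName) {
        return fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    // Second pass: keep tokens that look bold or are larger than the median.
    static List<Token> highlighted(List<Token> tokens) {
        float median = medianSize(tokens);
        List<Token> result = new ArrayList<>();
        for (Token t : tokens) {
            if (looksBold(t.fontName) || t.sizeInPt > median) {
                result.add(t);
            }
        }
        return result;
    }
}
```

You would collect the tokens of one page into a list, call FontHighlights.highlighted(tokens), and treat the returned tokens as candidate key terms. Using the median rather than the mean keeps a few very large headings from shifting the threshold.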

Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-450

> Hello André,
>
> have a look at the PDFTextStripper. It collects tokens from a given
> document (so-called TextPositions). A TextPosition object has a
> method called getFont which returns the font object encapsulating
> font information for the current token. What you can do is retrieve
> the base font name from the font object (the PostScript name of the
> font) and check whether it ends with the postfix -bold or whatever
> (this is at least what I did to detect bold text blocks). Further, a
> TextPosition object contains the attribute fontSize. With this
> attribute you should be able to detect larger text tokens by (just a
> suggestion) parsing an entire page, computing the median font size,
> parsing the page again and checking whether the fontSize of a token
> is above the median.
>
> I hope I could help you.
>
> With kind regards,
> Robert
>
>
>
> André Ramos wrote:
> > Hello,
> >
> > I'd like to use PDFBox to extract text with special features like: bold
> > text, italicized text, text whose font size is above average and so on.
> > The idea is that any kind of highlighted text or any text formatted out
> > of the ordinary within a document must contain relevant terms to
> > describe the document.
> >
> > How can I do it?
> >
> > Thank you.
> >
> >

--- End of original message ---
