Re: Re: Extracting Features From Text

Andreas Lehmkühler Wed, 08 Jul 2009 03:06:08 -0700

Hi,

I'd like to add a small but important detail to Robert's suggestion. If you are 
interested in the font size, you should prefer the fontSizeInPt attribute of 
TextPosition. The font size in a PDF is split into two fields: the font size 
itself and the scaling factor from the text matrix. The attribute fontSizeInPt 
is a combination of both. This feature is available in the current trunk only. 
See [1] for further details.
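
To illustrate why the combined value matters, here is a minimal sketch. It 
assumes the combination is a plain multiplication of the two fields; the class 
and method names are made up for this example and are not part of PDFBox.

```java
// Illustrative sketch only; EffectiveFontSize is a hypothetical helper,
// not a PDFBox class.
final class EffectiveFontSize {

    /**
     * Assumption: the size in points is the raw font size multiplied by
     * the scaling factor taken from the text matrix.
     */
    static float inPoints(float fontSize, float textMatrixScale) {
        return fontSize * textMatrixScale;
    }

    public static void main(String[] args) {
        // Font size 1 with a text matrix scaled by 12.
        System.out.println(inPoints(1f, 12f));
        // Font size 12 with an unscaled text matrix.
        System.out.println(inPoints(12f, 1f));
    }
}
```

Both calls describe text that is rendered at the same size, although the raw 
font size field differs by a factor of 12. Comparing raw font sizes alone 
would therefore be misleading for documents of the first kind.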


Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-450
> Hello André,
> 
> have a look at the PDFTextStripper. It collects tokens (so-called 
> TextPositions) from a given document. A TextPosition object has a 
> method called getFont, which returns the font object encapsulating the 
> font information for the current token. What you can do is retrieve 
> the base font name from the font object (the PostScript name of the 
> font) and check whether it ends with the suffix -bold or similar (this 
> is at least what I did to detect bold text blocks). Furthermore, a 
> TextPosition object contains the attribute fontSize. With this attribute 
> you should be able to detect larger text tokens by (just a suggestion) 
> parsing an entire page, computing the median font size, parsing the page 
> again, and checking whether the fontSize of a token is above the median.
> 
> I hope this helps.
> 
> With kind regards,
> Robert
> 
> 
> 
> André Ramos wrote:
> > Hello,
> >
> > I'd like to use PDFBox to extract text with special features like: bold
> > text, italicized text, text whose font size is above average and so on.
> > The idea is that any kind of highlighted text or any text formatted out
> > of the ordinary within a document must contain relevant terms to describe
> > the document.
> >
> > How can I do it?
> >
> > Thank you.
> >
> >   
> 
> 
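
The two heuristics from Robert's message quoted above (bold detection via the 
base font name, and a two-pass comparison against the median font size) could 
be sketched roughly as follows. This is only an illustration: Token is a 
hypothetical holder for values you would copy out of each TextPosition, and 
the bold check looks for "bold" anywhere in the name, which is looser than the 
suffix test Robert describes, so that names ending in -BoldItalic also match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; none of these types are part of PDFBox.
final class TextFeatureSketch {

    /** Hypothetical holder for values copied out of a TextPosition. */
    static final class Token {
        final String text;
        final String baseFontName;
        final float fontSizeInPt;

        Token(String text, String baseFontName, float fontSizeInPt) {
            this.text = text;
            this.baseFontName = baseFontName;
            this.fontSizeInPt = fontSizeInPt;
        }
    }

    /** Assumption: a bold face carries "bold" somewhere in its base font name. */
    static boolean looksBold(String baseFontName) {
        return baseFontName != null
                && baseFontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    /** First pass: median font size over all tokens of a page. */
    static float medianFontSize(List<Token> tokens) {
        if (tokens.isEmpty()) {
            throw new IllegalArgumentException("no tokens");
        }
        float[] sizes = new float[tokens.size()];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = tokens.get(i).fontSizeInPt;
        }
        Arrays.sort(sizes);
        int mid = sizes.length / 2;
        return sizes.length % 2 == 1
                ? sizes[mid]
                : (sizes[mid - 1] + sizes[mid]) / 2f;
    }

    /** Second pass: tokens whose font size is strictly above the median. */
    static List<Token> largerThanMedian(List<Token> tokens) {
        float median = medianFontSize(tokens);
        List<Token> result = new ArrayList<Token>();
        for (Token t : tokens) {
            if (t.fontSizeInPt > median) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for one parsed page.
        List<Token> page = Arrays.asList(
                new Token("Heading", "Helvetica-Bold", 18f),
                new Token("some", "Helvetica", 10f),
                new Token("body", "Helvetica", 10f),
                new Token("text", "Helvetica", 10f),
                new Token("footnote", "Helvetica-Oblique", 8f));

        for (Token t : largerThanMedian(page)) {
            System.out.println("large: " + t.text);
        }
        for (Token t : page) {
            if (looksBold(t.baseFontName)) {
                System.out.println("bold: " + t.text);
            }
        }
    }
}
```

The median is used rather than the mean so that a few very large headings do 
not shift the threshold. In a real setup the Token values would be filled 
while the PDFTextStripper processes a page, ideally from fontSizeInPt rather 
than the raw fontSize.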

--- End of original message ----
