Re: Re: Extracting Features From Text

André Ramos Thu, 09 Jul 2009 12:57:26 -0700

Hello,

Thank you guys, you've helped me a lot. I've managed to find out which tokens
are above the average font size and so on. But now I'd like to know how I could
group these tokens back into the original paragraphs, sentences or phrases
they were taken from in the document. For example, if we had the sentence:


Apache *PDFBox* is an open source Java PDF library for working with *PDF
documents.*

I can find out that the tokens "PDFBox", "PDF" and "documents" are bold.
But I'd like to know that "PDF documents" is bold as a unit, given that the
two tokens are next to each other. What is the simplest way to do that?
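
One possible way to approach this, sketched without any PDFBox dependency: walk
the tokens in reading order and join each maximal run of consecutive bold tokens
into one phrase. The Token class and the boldPhrases method below are
hypothetical names made up for illustration; they are not part of PDFBox, and
the bold flag is assumed to have been computed already by whatever detection is
in use.

```java
import java.util.ArrayList;
import java.util.List;

final class StyledRunGrouper {

    /** Hypothetical stand-in for one extracted token; not a PDFBox class. */
    static final class Token {
        final String text;
        final boolean bold;

        Token(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    /**
     * Walks the tokens in reading order and joins each maximal run of
     * consecutive bold tokens into a single space-separated phrase.
     */
    static List<String> boldPhrases(List<Token> tokens) {
        List<String> phrases = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (Token token : tokens) {
            if (token.bold) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(token.text);
            } else if (current.length() > 0) {
                // A non-bold token ends the current bold run.
                phrases.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            // Flush a run that reaches the end of the token list.
            phrases.add(current.toString());
        }
        return phrases;
    }

    public static void main(String[] args) {
        String[] words = {"Apache", "PDFBox", "is", "an", "open", "source", "Java",
                          "PDF", "library", "for", "working", "with", "PDF", "documents"};
        boolean[] bold = {false, true, false, false, false, false, false,
                          false, false, false, false, false, true, true};
        List<Token> sentence = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            sentence.add(new Token(words[i], bold[i]));
        }
        System.out.println(boldPhrases(sentence)); // prints [PDFBox, PDF documents]
    }
}
```

This only looks at token order. On a real page it would probably also be
necessary to check that neighbouring tokens sit on the same line and close to
each other before merging them.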

Best regards,
André Ramos

2009/7/8 Andreas Lehmkühler

> Hi,
>
> I'd like to add a small but important detail to Robert's suggestion. If you
> are interested in the font size, you will prefer the fontSizeInPt from
> TextPosition. The font size in PDFs is split into 2 fields: the font size
> and the scaling factor from the text matrix. The attribute fontSizeInPt is a
> combination of both. You will find this feature in the current trunk only.
> See [1] for further details.
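
To illustrate why the raw font size alone can mislead, here is a minimal
sketch. It assumes the two fields combine by simple multiplication, which is an
assumption made for illustration and not a statement about how fontSizeInPt is
actually computed. The class and method names are hypothetical.

```java
final class EffectiveFontSize {

    /**
     * Assumption for illustration: the size seen on the page is the nominal
     * font size multiplied by the scaling factor of the text matrix.
     */
    static float inPoints(float nominalFontSize, float textMatrixScale) {
        return nominalFontSize * textMatrixScale;
    }

    public static void main(String[] args) {
        // Two texts that look identical on the page but store the size differently:
        // one as font size 12 with scale 1, the other as font size 1 with scale 12.
        System.out.println(inPoints(12f, 1f));
        System.out.println(inPoints(1f, 12f));
    }
}
```

Under that assumption both calls give the same value, so comparing only the
nominal font size would wrongly treat the second text as tiny.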
>
> Andreas Lehmkühler
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-450
> > Hello André,
> >
> > Have a look at the PDFTextStripper. It collects tokens from a given
> > document (so-called TextPositions). A TextPosition object has a
> > method called getFont which returns the font object encapsulating
> > font information for the current token. What you can do is retrieve
> > the base font name from the font object (the PostScript name of the
> > font) and check if it ends with the suffix -bold or similar (this is
> > at least what I did to detect bold text blocks). Furthermore, a TextPosition
> > object contains the attribute fontSize. With this attribute you should
> > be able to detect larger text tokens by (just a suggestion) parsing an
> > entire page, computing the median font size, parsing the page again and
> > checking if the fontSize of a token is above the median.
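
The two heuristics described above can be sketched roughly as follows. The
sketch works on plain strings and floats, assumed to have been collected
beforehand from the font name and font size of each token, so it has no PDFBox
dependency. FontHeuristics, looksBold, median and indexesAboveMedian are
hypothetical names. Note that looksBold uses a looser "contains" check instead
of a strict suffix check, so that names with something after "Bold" are also
matched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class FontHeuristics {

    /**
     * Name-based bold guess. Looser than a strict "-bold" suffix check:
     * any base font name containing "bold" (case-insensitive) matches.
     */
    static boolean looksBold(String baseFontName) {
        return baseFontName != null
                && baseFontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    /** First pass: median of all font sizes collected from a page. */
    static float median(List<Float> sizes) {
        if (sizes.isEmpty()) {
            throw new IllegalArgumentException("no font sizes collected");
        }
        List<Float> sorted = new ArrayList<>(sizes);
        Collections.sort(sorted);
        int mid = sorted.size() / 2;
        if (sorted.size() % 2 == 1) {
            return sorted.get(mid);
        }
        // Even count: average the two middle values.
        return (sorted.get(mid - 1) + sorted.get(mid)) / 2f;
    }

    /** Second pass: indexes of the tokens whose size is above the median. */
    static List<Integer> indexesAboveMedian(List<Float> sizes) {
        float median = median(sizes);
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < sizes.size(); i++) {
            if (sizes.get(i) > median) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(looksBold("Helvetica-Bold")); // prints true
        System.out.println(looksBold("Helvetica"));      // prints false

        List<Float> sizes = new ArrayList<>();
        Collections.addAll(sizes, 10f, 10f, 10f, 18f, 10f, 12f);
        System.out.println(indexesAboveMedian(sizes));   // prints [3, 5]
    }
}
```

The median is used here instead of the mean because a few very large headings
would pull an average upwards, while the median stays at the body text size.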
> >
> > I hope I could help you.
> >
> > With kind regards,
> > Robert
> >
> >
> >
> > André Ramos wrote:
> > > Hello,
> > >
> > > I'd like to use PDFBox to extract text with special features like: bold
> > > text, italicized text, text whose font size is above average and so on.
> > > The idea is that any kind of highlighted text or any text formatted out of
> > > the ordinary within a document must contain relevant terms to describe the
> > > document.
> > >
> > > How can I do it?
> > >
> > > Thank you.
> > >
> > >
> >
> >
>
> --- End of original message ---
>
>


-- 
Best regards,
André Ramos
