RE: Reading page using PDFTextStripper

James Kelly Mon, 23 Nov 2020 12:24:12 -0800

Below is an outline of the solution I'm using (C#, with the PdfPig library).
It will likely need some adaptation for your case, Hesham: the "height" of
subscript/superscript text will probably be substantially smaller than that of
the rest of the text on the line, and I could see that throwing my algorithm
off for a line with a lot of subscript/superscript text.



The class "Line" is a list of "Words" with:
  - a method for determining if a "word" is on the "Line" (if the word list
    is empty, the "word" is on the "Line");
  - a method for sorting the words on the line based on their leftmost x
    coordinate;
  - a method to retrieve the list of words.

        Lines = new List<Line>();
        var line = new Line() { Words = new List<UglyToad.PdfPig.Content.Word>() };
        Lines.Add(line);
        foreach (var page in document.GetPages())
        {
            foreach (var word in page.GetWords())
            {
                if (line.IsWordOnLine(word))
                {
                    line.Add(word);
                }
                else
                {
                    // Word falls outside the current line's window: start a new line.
                    line = new Line() { Words = new List<UglyToad.PdfPig.Content.Word>() };
                    Lines.Add(line);
                    line.Add(word);
                }
            }
        }

...
Lines.ForEach(l => l.Sort());                // sort by left x coordinate of bounding box
var words = Lines.SelectMany(l => l.Words);  // retrieve words in the document in correct order

I use a factor of 0.8 times text height as a decision window for determining if 
a "word" is on the "Line".
"Bottom" and "Height" are averages of the "Bottom" and "Height" of all the 
words currently on the line.
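        internal bool IsWordOnLine(Word word)
        {
            if (Words.Any())
            {
                // Accept the word if its bottom edge is within 0.8 * average
                // height of the line's average bottom edge.
                return Math.Abs(Bottom - word.BoundingBox.Bottom) < 0.8 * Height;
            }
            else
            {
                // An empty line accepts any word.
                return true;
            }
        }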

The sort routine is pretty simple:
        internal void Sort()
        {
            Words = Words.OrderBy(w => w.BoundingBox.Left).ToList();
        }
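
Since this list is about PDFBox, here is a rough Java sketch of the same idea, in case it helps with porting. It is not tied to PDFBox or PdfPig: "Word" is a hypothetical stand-in holding just the text and the left/bottom/height of a bounding box, and the names "LineGrouping" and "orderedText" are made up for illustration. It assumes the running averages of "Bottom" and "Height" are simple arithmetic means over the words already on the line, and that a larger "bottom" means higher on the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class LineGrouping {

    /** Hypothetical stand-in for an extracted word and its bounding box. */
    static final class Word {
        final String text;
        final double left, bottom, height;

        Word(String text, double left, double bottom, double height) {
            this.text = text;
            this.left = left;
            this.bottom = bottom;
            this.height = height;
        }
    }

    static final class Line {
        // Decision window: fraction of the average text height (0.8 as in the outline above).
        private static final double WINDOW = 0.8;

        private final List<Word> words = new ArrayList<>();
        private double bottomSum;
        private double heightSum;

        /** An empty line accepts any word; otherwise compare against the running averages. */
        boolean isWordOnLine(Word w) {
            if (words.isEmpty()) {
                return true;
            }
            double avgBottom = bottomSum / words.size();
            double avgHeight = heightSum / words.size();
            return Math.abs(avgBottom - w.bottom) < WINDOW * avgHeight;
        }

        void add(Word w) {
            words.add(w);
            bottomSum += w.bottom;
            heightSum += w.height;
        }

        /** Sort by the left x coordinate of the bounding box. */
        void sort() {
            words.sort(Comparator.comparingDouble(w -> w.left));
        }

        List<Word> words() {
            return words;
        }
    }

    /** Groups words into lines in input order, sorts each line, and returns the text in order. */
    static List<String> orderedText(List<Word> input) {
        List<Line> lines = new ArrayList<>();
        Line line = new Line();
        lines.add(line);
        for (Word w : input) {
            if (!line.isWordOnLine(w)) {
                line = new Line();
                lines.add(line);
            }
            line.add(w);
        }
        List<String> out = new ArrayList<>();
        for (Line l : lines) {
            l.sort();
            for (Word w : l.words()) {
                out.add(w.text);
            }
        }
        return out;
    }
}
```

Note that, like the outline above, this only merges a word into the line currently being built, so it relies on the words of one line arriving consecutively. A line that is mostly small subscript text would have a small average height and therefore a narrow window, which is the case I'd expect to need tuning.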

-----Original Message-----
From: [email protected] <[email protected]> 
Sent: Monday, November 23, 2020 8:59 AM
To: [email protected]
Subject: Re: Reading page using PDFTextStripper


Hi,
On Sunday, 22.11.2020, at 07:10 +0200, Hesham Gneady wrote:
> I've tried it now, but it made no difference. I've actually explained 
> the problem wrong, here's what actually happens:
>
> The 1st line in the PDF file is:
>
> 131 Comments are made from 1905, / See: Certain Neurotic Mechanisms in
>
> Where "131" is normal text, while the rest of the line has "Subscript"
> formatting. If I copy/paste the line from the PDF manually it copies
> it in the right order, but when extracting the text using PDFBox it
> extracts it like this:
>
> Comments are made from 1905, / See: Certain Neurotic Mechanisms in
> 131
>
> The text is being read before the "131" number.


This is what I get with the -sort option in PDFBox 2.0.21:

131 Comments are made from 1905, / See: Certain Neurotic Mechanisms in 
Jealousy, Paranoia, and Homosexuality. (Internat. Journ. Psycho- Analysis, vol. 
iv, April, 1923.) Freud, S. / A response to a mother’s concern about her son’s 
homosexuality 1935 -Letters of Sigmund Freud. E. L. Freud (Ed.). New York, NY:
Basic Books. P 423. In this letter Freud links homosexuality to ‘arrested 
development.’
132 Allan Schore, Affect Regulation and the Origin of the self, Lawrence 
Erlbaum 1994. p 24

BR
Maruan


>
> Best regards,
> Hesham
>
> ----------------------------------------------------------------------
>
> Included Message:
>
> On 17.11.20 at 07:54, Hesham Gneady wrote:
>
> > Hi,
> >
> > I am trying to read this PDF file using
> > PDFTextStripper.processTextPosition():
> >
> > https://dl.dropboxusercontent.com/s/o660xrp4sgp9tbv/PDFTextStripper%20reading%20sample.pdf?dl=0
> >
> > But when I do that it reads it with wrong order. It reads the 2nd line
> > before the 1st line because the 1st line has Subscript effect. Is
> > there a way to read it right ordered?
>
> In a PDF the text doesn't necessarily appear in the rendering order.
> You should give the sort option a try:
>
> org.apache.pdfbox.text.PDFTextStripper.setSortByPosition(boolean)
>
> Andreas
>
