Below is an outline of the solution I'm using.
It will likely need some adaptation for your case, Hesham: the "height" of
subscript/superscript text will probably be substantially smaller than that of
the rest of the text on the line, and I could see that throwing my algorithm
off for a line with a substantial amount of subscript/superscript text.
The class "Line" is a list of "Words" with: a method for determining whether a
"word" is on the "Line" (if the word list is empty, the "word" is on the
"Line"); a method for sorting the words on the line by their leftmost x
coordinate; and a method to retrieve the list of words.
    Lines = new List<Line>();
    var line = new Line() { Words = new List<UglyToad.PdfPig.Content.Word>() };
    Lines.Add(line);
    foreach (var page in document.GetPages())
    {
        foreach (var word in page.GetWords())
        {
            if (line.IsWordOnLine(word))
            {
                line.Add(word);
            }
            else
            {
                line = new Line() { Words = new List<UglyToad.PdfPig.Content.Word>() };
                Lines.Add(line);
                line.Add(word);
            }
        }
    }
    ...
    // Sort each line by the left x coordinate of the bounding box.
    Lines.ForEach(l => l.Sort());
    // Retrieve the words in the document in the correct order.
    var words = Lines.SelectMany(l => l.Words);
I use a factor of 0.8 times text height as a decision window for determining if
a "word" is on the "Line".
"Bottom" and "Height" are averages of the "Bottom" and "Height" of all the
words currently on the line.
    internal bool IsWordOnLine(Word word)
    {
        if (Words.Any())
        {
            return Math.Abs(Bottom - word.BoundingBox.Bottom) < 0.8 * Height;
        }
        else
        {
            return true;
        }
    }
The sort routine is pretty simple:
    internal void Sort()
    {
        Words = Words.OrderBy(w => w.BoundingBox.Left).ToList();
    }
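
Since this list is about PDFBox, here is a rough Java port of the same idea,
offered only as a sketch and not as anything PDFBox provides. "Word" below is a
hypothetical stand-in holding the text, left x, bottom y and height; it would
have to be filled in from whatever position data your extractor exposes, and
that mapping is not shown. The class and method names are my own, and the 0.8
factor is the same decision window described above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class LineGrouping {

    /** Hypothetical stand-in for a positioned word (not a PDFBox or PdfPig type). */
    static final class Word {
        final String text;
        final double left;
        final double bottom;
        final double height;

        Word(String text, double left, double bottom, double height) {
            this.text = text;
            this.left = left;
            this.bottom = bottom;
            this.height = height;
        }
    }

    static final class Line {
        private static final double WINDOW = 0.8; // fraction of the average height
        private final List<Word> words = new ArrayList<>();
        private double bottomSum; // running sums, so the averages stay cheap
        private double heightSum;

        /** An empty line accepts any word; otherwise compare against the averages. */
        boolean isWordOnLine(Word w) {
            if (words.isEmpty()) {
                return true;
            }
            double n = words.size();
            double avgBottom = bottomSum / n;
            double avgHeight = heightSum / n;
            return Math.abs(avgBottom - w.bottom) < WINDOW * avgHeight;
        }

        void add(Word w) {
            words.add(w);
            bottomSum += w.bottom;
            heightSum += w.height;
        }

        /** Sort by the left x coordinate. */
        void sort() {
            words.sort(Comparator.comparingDouble((Word w) -> w.left));
        }

        List<Word> words() {
            return words;
        }
    }

    /** Groups words (in the order they were extracted) into lines, then sorts each line. */
    static List<Line> group(List<Word> input) {
        List<Line> lines = new ArrayList<>();
        Line line = new Line();
        lines.add(line);
        for (Word w : input) {
            if (!line.isWordOnLine(w)) {
                line = new Line();
                lines.add(line);
            }
            line.add(w);
        }
        lines.forEach(Line::sort);
        return lines;
    }

    /** Joins the words of each line with spaces and the lines with newlines. */
    static String toText(List<Line> lines) {
        return lines.stream()
                .map(l -> l.words().stream().map(w -> w.text).collect(Collectors.joining(" ")))
                .collect(Collectors.joining("\n"));
    }
}
```

Like the C# version, this only compares each word with the line currently being
built, so it relies on the extractor delivering the words of one line together.
On a line that is mostly subscript text the average height shrinks, which
narrows the window; that is the case I'd expect to need tuning.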
-----Original Message-----
From: [email protected] <[email protected]>
Sent: Monday, November 23, 2020 8:59 AM
To: [email protected]
Subject: Re: Reading page using PDFTextStripper
Hi,
On Sunday, 22.11.2020 at 07:10 +0200, Hesham Gneady wrote:
> I've tried it now, but it made no difference. I've actually explained
> the problem wrong, here's what actually happens:
>
> The 1st line in the PDF file is:
>
> 131 Comments are made from 1905, / See: Certain Neurotic Mechanisms in
>
> Where "131" is normal text, while the rest of the line has "Subscript"
> formatting. If I copy/paste the line from the PDF manually it copies
> it right ordered, but when extracting the text using PDFBox it
> extracts it like
> this:
>
> Comments are made from 1905, / See: Certain Neurotic Mechanisms in
> 131
>
> The text is being read before the "131" number.
That's what I get with the -sort option in PDFBox 2.0.21:
131 Comments are made from 1905, / See: Certain Neurotic Mechanisms in
Jealousy, Paranoia, and Homosexuality. (Internat. Journ. Psycho- Analysis, vol.
iv, April, 1923.) Freud, S. / A response to a mother’s concern about her son’s
homosexuality 1935 -Letters of Sigmund Freud. E. L. Freud (Ed.). New York, NY:
Basic Books. P 423. In this letter Freud links homosexuality to ‘arrested
development.’
132 Allan Schore, Affect Regulation and the Origin of the self, Lawrence
Erlbaum 1994. p 24
BR
Maruan
>
> Best regards,
>
> Hesham
>
> ----------------------------------------------------------------------
>
> Included Message:
>
> On 17.11.20 at 07:54, Hesham Gneady wrote:
>
> > Hi,
> >
> > I am trying to read this PDF file using
> > PDFTextStripper.processTextPosition():
> >
> > https://dl.dropboxusercontent.com/s/o660xrp4sgp9tbv/PDFTextStripper%20reading%20sample.pdf?dl=0
> >
> > But when I do that it reads it with wrong order. It reads the 2nd line
> > before the 1st line because the 1st line has Subscript effect. Is
> > there a way to read it right ordered?
>
> In a PDF, the text doesn't necessarily appear in rendering order. You
> should give the sort option a try:
>
> org.apache.pdfbox.text.PDFTextStripper.setSortByPosition(boolean)
>
> Andreas