I am attaching the patch file.
And yes, this patch is simply PDFBOX-3774 as an option, a small cosmetic
change to use idiomatic Java for PDFBOX-5487, and a unit test that
demonstrates the overlapping.
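For anyone skimming the thread, here is a minimal stand-alone sketch of what the option does. It is a simplified model and not the PDFBox code: the class name SpaceGlyphFilterSketch and the method filterGlyphs are made up for illustration, and glyphs are modelled as plain strings rather than TextPosition objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the text position loop touched by the patch.
class SpaceGlyphFilterSketch {

    // Returns the glyphs that would still be processed. When the flag is
    // set, any glyph whose Unicode value is a single space is skipped.
    static List<String> filterGlyphs(List<String> glyphs, boolean ignoreSpaceGlyphs) {
        List<String> kept = new ArrayList<>();
        for (String glyph : glyphs) {
            if (ignoreSpaceGlyphs && " ".equals(glyph)) {
                continue; // same condition as the check added in the patch
            }
            kept.add(glyph);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> glyphs = Arrays.asList("(", " ", " ", ")");
        System.out.println(filterGlyphs(glyphs, false).size()); // all four glyphs kept
        System.out.println(filterGlyphs(glyphs, true).size());  // only the two parentheses
    }
}
```

With the flag on, word breaks are left entirely to the stripper's own spacing heuristics, which is why the Javadoc warns that some breaks might go missing.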
A couple of additional thoughts:
1. I feel that PDFBOX-5487 isn't doing very much. The PDFBOX-3774 feature
will address the problem fixed by PDFBOX-5487, and the "problem" of having
a space glyph entirely within the previous character is a very restricted
edge case. In the end, the performance hit is not a big deal, but it is
code that needs to be maintained. I thought I'd mention it in case the
PDFBOX-5487 requester would be happy with PDFBOX-3774 as a solution.
2. I noticed that there is a note about JDK7+ sorting requiring transitive
comparators. Given that the build requires JDK8+, I wonder if it is time
to remove the Collections.sort path (and get rid of an exception throw,
etc.)?
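To illustrate the transitivity point in item 2, here is a small sketch, assuming the concern in that note is a position comparator that treats nearby coordinates as equal. The class and method names are hypothetical and the tolerance value is arbitrary. It only shows how such a comparator breaks the Comparator contract, which is the situation in which the JDK's sort may throw IllegalArgumentException.

```java
import java.util.Comparator;

// Hypothetical example, not taken from PDFBox.
class TolerantComparatorSketch {

    // Treats two coordinates as equal when they differ by at most tol.
    static Comparator<Float> withTolerance(float tol) {
        return (a, b) -> Math.abs(a - b) <= tol ? 0 : Float.compare(a, b);
    }

    // True when a == b and b == c according to the comparator, but a != c.
    static boolean violatesTransitivity(Comparator<Float> cmp, float a, float b, float c) {
        return cmp.compare(a, b) == 0
                && cmp.compare(b, c) == 0
                && cmp.compare(a, c) != 0;
    }

    public static void main(String[] args) {
        Comparator<Float> cmp = withTolerance(1.0f);
        // 0 and 0.8 are "equal", 0.8 and 1.6 are "equal", but 0 and 1.6 are not.
        System.out.println(violatesTransitivity(cmp, 0f, 0.8f, 1.6f));
    }
}
```

Whether the sort actually throws depends on the input order, so a fallback path is only exercised for some documents.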
- K
On Mon, Dec 16, 2024 at 6:21 AM Tilman Hausherr <[email protected]>
wrote:
> On 16.12.2024 14:02, Kevin Day wrote:
> > I just realized that there is an incorrect note in the getter/setter
> > Javadocs about the setting only taking effect if sorting is enabled.
> >
> > That note can be removed. The new setting is valid regardless of whether
> > sorting is enabled.
>
> Hi,
>
> Could you please resend the patch as text attachment? Somehow the mail
> program messed this up.
>
> From what I understand, the patch is the suggestion from PDFBOX-3774 but
> as an option, plus a test. The other change (re PDFBOX-5487) is a
> (useful) cosmetic change. I wonder why I missed that when I committed it.
>
> Tilman
>
Index: pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java
===================================================================
--- pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java	(revision 1922522)
+++ pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java	(working copy)
@@ -40,6 +40,7 @@
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern;
+import java.util.stream.Collectors;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
@@ -147,6 +148,7 @@
private boolean shouldSeparateByBeads = true;
private boolean sortByPosition = false;
private boolean addMoreFormatting = false;
+ private boolean ignoreContentStreamSpaceGlyphs = false;
private float indentThreshold = defaultIndentThreshold;
private float dropThreshold = defaultDropThreshold;
@@ -524,11 +526,10 @@
{
IterativeMergeSort.sort(textList, comparator);
}
- finally
- {
- // PDFBOX-5487: Remove all space characters if contained within the adjacent letters
- removeContainedSpaces(textList);
- }
+
+ // PDFBOX-5487: Remove all space characters if contained within the adjacent letters
+ removeContainedSpaces(textList);
+
}
startArticle();
@@ -556,6 +557,10 @@
PositionWrapper current = new PositionWrapper(position);
String characterValue = position.getUnicode();
+ // PDFBOX-3774 - conditionally ignore spaces from the content stream
+ if (" ".equals(characterValue) && getIgnoreContentStreamSpaceGlyphs())
+ continue;
+
// Resets the average character width when we see a change in font
// or a change in the font size
if (lastPosition != null &&
@@ -1273,6 +1278,29 @@
sortByPosition = newSortByPosition;
}
+
+ /**
+ * Determines whether spaces in the content stream text rendering instructions will be ignored during text extraction.
+ *
+ * @return true if space glyphs in the content stream text rendering instructions will be ignored - default is false
+ */
+ public boolean getIgnoreContentStreamSpaceGlyphs() {
+ return ignoreContentStreamSpaceGlyphs;
+ }
+
+ /**
+ * Instruct the algorithm to ignore any spaces in the text rendering instructions in the content stream, and
+ * instead rely purely on the algorithm to determine where word breaks are.
+ *
+ * This can improve text extraction results where the content stream is sorted by position and has text overlapping
+ * spaces, but could cause some word breaks to not be added to the output.
+ *
+ * @param newIgnoreContentStreamSpaceGlyphs whether PDFBox should ignore content stream spaces
+ */
+ public void setIgnoreContentStreamSpaceGlyphs(boolean newIgnoreContentStreamSpaceGlyphs) {
+ ignoreContentStreamSpaceGlyphs = newIgnoreContentStreamSpaceGlyphs;
+ }
+
/**
* Get the current space width-based tolerance value that is being used to estimate where spaces in text should be
* added. Note that the default value for this has been determined from trial and error.
Index: pdfbox/src/test/java/org/apache/pdfbox/text/PDFTextStripperOverlapTest.java
===================================================================
--- pdfbox/src/test/java/org/apache/pdfbox/text/PDFTextStripperOverlapTest.java	(revision 0)
+++ pdfbox/src/test/java/org/apache/pdfbox/text/PDFTextStripperOverlapTest.java	(revision 0)
@@ -0,0 +1,58 @@
+package org.apache.pdfbox.text;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+import org.apache.pdfbox.pdmodel.PDDocument;
+import org.apache.pdfbox.pdmodel.PDPage;
+import org.apache.pdfbox.pdmodel.PDPageContentStream;
+import org.apache.pdfbox.pdmodel.font.PDFont;
+import org.apache.pdfbox.pdmodel.font.PDType1Font;
+import org.apache.pdfbox.pdmodel.font.Standard14Fonts.FontName;
+import org.junit.jupiter.api.Test;
+
+public class PDFTextStripperOverlapTest {
+
+ @Test
+ void testIgnoreContentStreamSpaceGlyphs() throws Exception
+ {
+ try (PDDocument doc = new PDDocument())
+ {
+ PDPage page = new PDPage();
+ try (PDPageContentStream cs = new PDPageContentStream(doc, page))
+ {
+ float fontHeight = 8;
+ float x = 50;
+ float y = page.getMediaBox().getHeight() - 50;
+ PDFont font = new PDType1Font(FontName.HELVETICA);
+ cs.beginText();
+ cs.setFont(font, fontHeight);
+ cs.newLineAtOffset(x, y);
+ cs.showText("( )");
+ cs.endText();
+
+ int indent = 6;
+ float overlapX = x + indent * font.getAverageFontWidth()/1000f*fontHeight;
+ PDFont overlapFont = new PDType1Font(FontName.TIMES_ROMAN);
+ cs.beginText();
+ cs.setFont(overlapFont, fontHeight*2f);
+ cs.newLineAtOffset(overlapX, y);
+ cs.showText("overlap");
+ cs.endText();
+ }
+ doc.addPage(page);
+
+ PDFTextStripper stripper = new PDFTextStripper();
+ stripper.setLineSeparator("\n");
+ stripper.setPageEnd("\n");
+ stripper.setStartPage(1);
+ stripper.setEndPage(1);
+ stripper.setSortByPosition(true);
+
+ stripper.setIgnoreContentStreamSpaceGlyphs(true);
+ String text = stripper.getText(doc);
+ assertEquals("( overlap )\n", text);
+
+ }
+ }
+
+}