Bug#987057: gscan2pdf: fixes to hOCR

Peter Marschall Fri, 16 Apr 2021 09:54:15 -0700

Package: gscan2pdf
Version: 2.11.0-1
Severity: wishlist
Tags: patch upstream


Hi,

Please find attached three patches that improve hOCR support:
- recognize more tags when reading hOCR
  This also helps with rotated text (covered in a separate report)
- preserve more properties when writing hOCR
- fix indentation of closing tags of intermediate non-leaf elements

The patches are against upstream version 2.11.2.

Please consider incorporating them into the next version of gscan2pdf.

Thanks in advance
Peter

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-5-amd64 (SMP w/12 CPU threads)
Kernel taint flags: TAINT_CRAP
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8), LANGUAGE=en_GB
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages gscan2pdf depends on:
ii  imagemagick                            8:6.9.11.60+dfsg-1
ii  imagemagick-6.q16 [imagemagick]        8:6.9.11.60+dfsg-1
ii  libconfig-general-perl                 2.63-1
ii  libdate-calc-perl                      6.4-1.1
ii  libfilesys-df-perl                     0.92-6+b6
ii  libgoocanvas2-perl                     0.06-2
ii  libgtk3-imageview-perl                 6-1
ii  libgtk3-perl                           0.038-1
ii  libgtk3-simplelist-perl                0.21-1
ii  libhtml-parser-perl                    3.75-1+b1
ii  libimage-magick-perl                   8:6.9.11.60+dfsg-1
ii  libimage-sane-perl                     5-1+b1
ii  liblist-moreutils-perl                 0.430-2
ii  liblocale-codes-perl                   3.66-1
ii  liblocale-gettext-perl                 1.07-4+b1
ii  liblog-log4perl-perl                   1.54-1
ii  libossp-uuid-perl [libdata-uuid-perl]  1.6.2-1.5+b9
ii  libpdf-builder-perl                    3.021-2
ii  libproc-processtable-perl              0.59-2+b1
ii  libreadonly-perl                       2.050-3
ii  librsvg2-common                        2.50.3+dfsg-1
ii  libset-intspan-perl                    1.19-1.1
ii  libtiff-tools                          4.2.0-1
ii  libtry-tiny-perl                       0.30-1
hi  sane-utils                             1.0.31-4pm1

Versions of packages gscan2pdf recommends:
ii  djvulibre-bin       3.5.28-1
ii  gocr                0.52-3
ii  pdftk-java [pdftk]  3.2.2-1
ii  tesseract-ocr       4.1.1-2.1
ii  unpaper             6.1-2+b2
ii  xdg-utils           1.1.3-4

gscan2pdf suggests no packages.

-- no debconf information

From f9d32fbeb11619637ea6263881b6333afc4ebeaf Mon Sep 17 00:00:00 2001
From: Peter Marschall <pe...@adpm.de>
Date: Wed, 14 Apr 2021 17:41:56 +0200
Subject: [PATCH 1/3] Bboxtree: preserve more information in to_hocr()

Keep 'textangle' and 'baseline' properties in to_hocr() method.

Signed-off-by: Peter Marschall <pe...@adpm.de>
---
 lib/Gscan2pdf/Bboxtree.pm | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/lib/Gscan2pdf/Bboxtree.pm b/lib/Gscan2pdf/Bboxtree.pm
index 59edaf61..2ee47834 100644
--- a/lib/Gscan2pdf/Bboxtree.pm
+++ b/lib/Gscan2pdf/Bboxtree.pm
@@ -514,6 +514,12 @@ EOS
         $string .= $SPACE x ( 2 + $bbox->{depth} ) . "<$tag class='$type'";
         if ( defined $bbox->{id} ) { $string .= " id='$bbox->{id}'" }
         $string .= " title='bbox $x1 $y1 $x2 $y2";
+        if ( defined $bbox->{baseline} ) {
+            $string .= '; baseline ' . join( $SPACE, @{ $bbox->{baseline} } );
+        }
+        if ( defined $bbox->{textangle} ) {
+            $string .= "; textangle $bbox->{textangle}";
+        }
         if ( defined $bbox->{confidence} ) {
             $string .= "; x_wconf $bbox->{confidence}";
         }
-- 
2.30.2
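
Illustration only, not part of the patch: a minimal sketch of the title
string that to_hocr() would assemble once 'baseline' and 'textangle' are
kept. The helper name hocr_title() and the 'bbox' array key are assumptions
made for this example; only the 'baseline', 'textangle' and 'confidence'
keys are taken from the hunk above.

```perl
use strict;
use warnings;

# Hypothetical helper (not in gscan2pdf): build the value of an hOCR title
# attribute from a hash shaped roughly like the $bbox used in the patch.
sub hocr_title {
    my ($bbox) = @_;

    # Assumption: the four coordinates live in an array ref under 'bbox'.
    my $title = 'bbox ' . join( q{ }, @{ $bbox->{bbox} } );

    # Same order as in the patched to_hocr(): baseline, textangle, x_wconf.
    if ( defined $bbox->{baseline} ) {
        $title .= '; baseline ' . join( q{ }, @{ $bbox->{baseline} } );
    }
    if ( defined $bbox->{textangle} ) {
        $title .= "; textangle $bbox->{textangle}";
    }
    if ( defined $bbox->{confidence} ) {
        $title .= "; x_wconf $bbox->{confidence}";
    }
    return $title;
}

# Made-up sample values for a rotated line.
my $line = {
    bbox      => [ 36, 92, 580, 122 ],
    baseline  => [ 0.015, -18 ],
    textangle => 90,
};
print hocr_title($line), "\n";
# prints: bbox 36 92 580 122; baseline 0.015 -18; textangle 90
```

Properties that are undefined are simply skipped, so boxes without a
baseline or text angle should produce the same title as before the patch.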

From 11ed93483c800082525cd6a3afcfb08f0e15be9c Mon Sep 17 00:00:00 2001
From: Peter Marschall <pe...@adpm.de>
Date: Wed, 14 Apr 2021 18:02:16 +0200
Subject: [PATCH 2/3] Bboxtree: recognize more hOCR elements in _hocr2boxes()

Recognize additional elements 'ocr_header', 'ocr_footer', 'ocr_caption'
as well as their 'ocrx_...' counterparts when parsing hOCR into a Bboxtree.

Recent versions of tesseract seem to generate some of these elements
instead of 'ocrx_line'.

As the line-like elements contain important information, gscan2pdf needs
to recognize them in order to
* properly display the OCR'ed text
* preserve as much information as possible when storing hOCR files

Signed-off-by: Peter Marschall <pe...@adpm.de>
---
 lib/Gscan2pdf/Bboxtree.pm | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/lib/Gscan2pdf/Bboxtree.pm b/lib/Gscan2pdf/Bboxtree.pm
index 2ee47834..e8fa83d3 100644
--- a/lib/Gscan2pdf/Bboxtree.pm
+++ b/lib/Gscan2pdf/Bboxtree.pm
@@ -138,6 +138,15 @@ sub _hocr2boxes {
                         when (/_par$/xsm) {
                             $data->{type} = 'para';
                         }
+                        when (/_header$/xsm) {
+                            $data->{type} = 'header';
+                        }
+                        when (/_footer$/xsm) {
+                            $data->{type} = 'footer';
+                        }
+                        when (/_caption$/xsm) {
+                            $data->{type} = 'caption';
+                        }
                         when (/_line$/xsm) {
                             $data->{type} = 'line';
                         }
-- 
2.30.2
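
Illustration only, not part of the patch: a stand-alone sketch of the suffix
matching that the hunk above extends. The function hocr_class_to_type() and
the table @SUFFIX_TO_TYPE are hypothetical names; the table lists only the
suffixes visible in this hunk, and the real _hocr2boxes() may handle more
classes than shown here.

```perl
use strict;
use warnings;

# Ordered list of [ pattern, type ] pairs, mirroring the when() clauses
# visible in the hunk above (existing ones plus the three new ones).
my @SUFFIX_TO_TYPE = (
    [ qr/_par$/xsm,     'para' ],
    [ qr/_header$/xsm,  'header' ],
    [ qr/_footer$/xsm,  'footer' ],
    [ qr/_caption$/xsm, 'caption' ],
    [ qr/_line$/xsm,    'line' ],
);

# Hypothetical helper: map an hOCR class name to a Bboxtree-style type.
# Returns undef for classes this sketch does not know about.
sub hocr_class_to_type {
    my ($class) = @_;
    for my $rule (@SUFFIX_TO_TYPE) {
        my ( $pattern, $type ) = @{$rule};
        return $type if $class =~ $pattern;
    }
    return;
}

for my $class (qw(ocr_header ocrx_footer ocr_caption ocr_line)) {
    printf "%-12s -> %s\n", $class, hocr_class_to_type($class);
}
```

Because only the suffix is tested, the 'ocr_...' and 'ocrx_...' variants of
a class map to the same type, which is what the commit message describes.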

From 32b921376cb61ab8750ecf11c2152ec30517a579 Mon Sep 17 00:00:00 2001
From: Peter Marschall <pe...@adpm.de>
Date: Wed, 14 Apr 2021 18:41:47 +0200
Subject: [PATCH 3/3] Bboxtree: fix indentation of intermediate closing tags in
 to_hocr()

When writing hOCR files using to_hocr(), make sure closing tags
of intermediate non-leaf elements are correctly indented, even
if sibling elements follow.

This makes sure to also keep the "visual" structure of hOCR files
generated by the OCR engines.

Signed-off-by: Peter Marschall <pe...@adpm.de>
---
 lib/Gscan2pdf/Bboxtree.pm | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/Gscan2pdf/Bboxtree.pm b/lib/Gscan2pdf/Bboxtree.pm
index e8fa83d3..4c267e02 100644
--- a/lib/Gscan2pdf/Bboxtree.pm
+++ b/lib/Gscan2pdf/Bboxtree.pm
@@ -495,8 +495,10 @@ EOS
     while ( my $bbox = $iter->() ) {
         if ( defined $prev_depth ) {
             if ( $prev_depth >= $bbox->{depth} ) {
+                if (@tags) { $string .= '</' . pop(@tags) . ">\n" }
+                $prev_depth--;
                 while ( $prev_depth-- >= $bbox->{depth} ) {
-                    $string .= '</' . pop(@tags) . ">\n";
+                    $string .= $SPACE x ( 2 + $prev_depth + 1 ) . '</' . pop(@tags) . ">\n";
                 }
             }
             else {
-- 
2.30.2
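
Illustration only, not part of the patch: a simplified sketch of the
indentation behaviour described in patch 3/3. It is not the to_hocr() loop
itself; the function to_nested_markup(), the flat list of boxes and the
one-space-per-level indent are assumptions chosen to keep the example short.
The closing tag of a leaf stays on the same line as its text, while closing
tags of intermediate elements go on their own line, indented to the depth of
the element they close.

```perl
use strict;
use warnings;

# Hypothetical writer: takes boxes in document order, each with a 'depth',
# a 'tag' and (for leaves only) a 'text', and returns indented markup.
sub to_nested_markup {
    my @boxes  = @_;
    my $out    = q{};
    my @open;          # stack of [ tag, depth ] for elements not yet closed
    my $inline = 0;    # true while the current output line is unterminated

    # Close every open element at or below the given depth.
    my $close_down_to = sub {
        my ($depth) = @_;
        while ( @open and $open[-1][1] >= $depth ) {
            my ( $tag, $d ) = @{ pop @open };

            # A leaf is closed inline; everything else gets its own
            # line, indented like its opening tag.
            $out .= ( $inline ? q{} : q{ } x $d ) . "</$tag>\n";
            $inline = 0;
        }
    };

    for my $box (@boxes) {
        $close_down_to->( $box->{depth} );
        if ($inline) { $out .= "\n"; $inline = 0 }
        $out .= q{ } x $box->{depth} . "<$box->{tag}>";
        if ( defined $box->{text} ) {
            $out .= $box->{text};
            $inline = 1;
        }
        else {
            $out .= "\n";
        }
        push @open, [ $box->{tag}, $box->{depth} ];
    }
    $close_down_to->(0);
    return $out;
}

my @boxes = (
    { depth => 0, tag => 'div' },
    { depth => 1, tag => 'p' },
    { depth => 2, tag => 'span', text => 'Hello' },
    { depth => 2, tag => 'span', text => 'world' },
    { depth => 1, tag => 'p' },
    { depth => 2, tag => 'span', text => 'again' },
);
print to_nested_markup(@boxes);
```

With the sample input, the first closing 'p' tag is followed by a sibling
'p' element and is still written on its own line with one space of indent,
which is the case the commit message calls out.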
