Package: gscan2pdf
Version: 2.11.0-1
Severity: wishlist
Tags: patch upstream

Hi,

please find attached 3 patches improving hOCR support.
- recognize more tags when reading hOCR
  In addition this also helps for rotated texts (in a separate report)
- preserve more properties when writing hOCR
- fix indentation of intermediate non-leaf elements

The patches are against upstream's version 2.11.2.

Please consider incorporating them into gscan2pdf's next version.

Thanks in advance
Peter

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-5-amd64 (SMP w/12 CPU threads)
Kernel taint flags: TAINT_CRAP
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8), LANGUAGE=en_GB
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages gscan2pdf depends on:
ii  imagemagick                            8:6.9.11.60+dfsg-1
ii  imagemagick-6.q16 [imagemagick]        8:6.9.11.60+dfsg-1
ii  libconfig-general-perl                 2.63-1
ii  libdate-calc-perl                      6.4-1.1
ii  libfilesys-df-perl                     0.92-6+b6
ii  libgoocanvas2-perl                     0.06-2
ii  libgtk3-imageview-perl                 6-1
ii  libgtk3-perl                           0.038-1
ii  libgtk3-simplelist-perl                0.21-1
ii  libhtml-parser-perl                    3.75-1+b1
ii  libimage-magick-perl                   8:6.9.11.60+dfsg-1
ii  libimage-sane-perl                     5-1+b1
ii  liblist-moreutils-perl                 0.430-2
ii  liblocale-codes-perl                   3.66-1
ii  liblocale-gettext-perl                 1.07-4+b1
ii  liblog-log4perl-perl                   1.54-1
ii  libossp-uuid-perl [libdata-uuid-perl]  1.6.2-1.5+b9
ii  libpdf-builder-perl                    3.021-2
ii  libproc-processtable-perl              0.59-2+b1
ii  libreadonly-perl                       2.050-3
ii  librsvg2-common                        2.50.3+dfsg-1
ii  libset-intspan-perl                    1.19-1.1
ii  libtiff-tools                          4.2.0-1
ii  libtry-tiny-perl                       0.30-1
hi  sane-utils                             1.0.31-4pm1

Versions of packages gscan2pdf recommends:
ii  djvulibre-bin      3.5.28-1
ii  gocr               0.52-3
ii  pdftk-java [pdftk] 3.2.2-1
ii  tesseract-ocr      4.1.1-2.1
ii  unpaper            6.1-2+b2
ii  xdg-utils          1.1.3-4

gscan2pdf suggests no packages.

-- no debconf information

From f9d32fbeb11619637ea6263881b6333afc4ebeaf Mon Sep 17 00:00:00 2001
From: Peter Marschall <pe...@adpm.de>
Date: Wed, 14 Apr 2021 17:41:56 +0200
Subject: [PATCH 1/3] Bboxtree: preserve more information in to_hocr()

Keep 'textangle' and 'baseline' properties in to_hocr() method.

Signed-off-by: Peter Marschall <pe...@adpm.de>
---
 lib/Gscan2pdf/Bboxtree.pm | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/lib/Gscan2pdf/Bboxtree.pm b/lib/Gscan2pdf/Bboxtree.pm
index 59edaf61..2ee47834 100644
--- a/lib/Gscan2pdf/Bboxtree.pm
+++ b/lib/Gscan2pdf/Bboxtree.pm
@@ -514,6 +514,12 @@ EOS
         $string .= $SPACE x ( 2 + $bbox->{depth} ) . "<$tag class='$type'";
         if ( defined $bbox->{id} ) { $string .= " id='$bbox->{id}'" }
         $string .= " title='bbox $x1 $y1 $x2 $y2";
+        if ( defined $bbox->{baseline} ) {
+            $string .= '; baseline ' . join( $SPACE, @{ $bbox->{baseline} } );
+        }
+        if ( defined $bbox->{textangle} ) {
+            $string .= "; textangle $bbox->{textangle}";
+        }
         if ( defined $bbox->{confidence} ) {
             $string .= "; x_wconf $bbox->{confidence}";
         }
-- 
2.30.2

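For illustration only, here is a minimal standalone sketch of how the title attribute is assembled once 'baseline' and 'textangle' are kept. The helper name hocr_title() and the sample values are hypothetical and not part of gscan2pdf; the hash keys are assumed to match the ones used in the patch.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of gscan2pdf): build the value of the
# hOCR title attribute from a bbox hash shaped like those in the patch.
sub hocr_title {
    my ($bbox) = @_;
    my $title = 'bbox ' . join( q{ }, @{ $bbox->{bbox} } );

    # Optional properties are appended in the same order as in the patch.
    if ( defined $bbox->{baseline} ) {
        $title .= '; baseline ' . join( q{ }, @{ $bbox->{baseline} } );
    }
    if ( defined $bbox->{textangle} ) {
        $title .= "; textangle $bbox->{textangle}";
    }
    if ( defined $bbox->{confidence} ) {
        $title .= "; x_wconf $bbox->{confidence}";
    }
    return $title;
}

# Sample data, made up for this example.
my %line = (
    bbox       => [ 10, 20, 110, 40 ],
    baseline   => [ 0.015, -18 ],
    textangle  => 90,
    confidence => 93,
);
print hocr_title( \%line ), "\n";
```

A box without 'baseline' or 'textangle' yields the same string as before the patch, so existing output should be unaffected.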
From 11ed93483c800082525cd6a3afcfb08f0e15be9c Mon Sep 17 00:00:00 2001
From: Peter Marschall <pe...@adpm.de>
Date: Wed, 14 Apr 2021 18:02:16 +0200
Subject: [PATCH 2/3] Bboxtree: recognize more hOCR elements in _hocr2boxes()

Recognize additional elements 'ocr_header', 'ocr_footer', 'ocr_caption'
as well as their 'ocrx_...' counterparts when parsing hOCR into a Bboxtree.

Recent versions of tesseract seem to generate some of these elements
instead of 'ocrx_line'.
As the line-like elements contain important information, gscan2pdf needs
to recognize them, in order to
* properly display the OCR'ed text
* preserve as much information as possible when storing hOCR files

Signed-off-by: Peter Marschall <pe...@adpm.de>
---
 lib/Gscan2pdf/Bboxtree.pm | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/lib/Gscan2pdf/Bboxtree.pm b/lib/Gscan2pdf/Bboxtree.pm
index 2ee47834..e8fa83d3 100644
--- a/lib/Gscan2pdf/Bboxtree.pm
+++ b/lib/Gscan2pdf/Bboxtree.pm
@@ -138,6 +138,15 @@ sub _hocr2boxes {
                         when (/_par$/xsm) {
                             $data->{type} = 'para';
                         }
+                        when (/_header$/xsm) {
+                            $data->{type} = 'header';
+                        }
+                        when (/_footer$/xsm) {
+                            $data->{type} = 'footer';
+                        }
+                        when (/_caption$/xsm) {
+                            $data->{type} = 'caption';
+                        }
                         when (/_line$/xsm) {
                             $data->{type} = 'line';
                         }
-- 
2.30.2

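To make the mapping easy to try outside gscan2pdf, the following rough sketch expresses the same class-suffix matching as a lookup table. The table and hocr_type() are hypothetical names. Only the suffixes visible in the diff are listed, and the sketch uses a plain loop instead of given/when.

```perl
use strict;
use warnings;

# Hypothetical lookup table; each entry pairs a class-suffix regex from
# the diff with the Bboxtree type it maps to.
my @SUFFIX_TO_TYPE = (
    [ qr/_par$/xsm     => 'para' ],
    [ qr/_header$/xsm  => 'header' ],
    [ qr/_footer$/xsm  => 'footer' ],
    [ qr/_caption$/xsm => 'caption' ],
    [ qr/_line$/xsm    => 'line' ],
);

# Return the type for an hOCR class name, or undef if no suffix matches.
sub hocr_type {
    my ($class) = @_;
    for my $rule (@SUFFIX_TO_TYPE) {
        my ( $regex, $type ) = @{$rule};
        return $type if $class =~ $regex;
    }
    return;
}

for my $class (qw(ocr_header ocrx_footer ocr_caption ocr_line)) {
    print "$class => ", hocr_type($class), "\n";
}
```

Because only the suffix is matched, the 'ocr_...' and 'ocrx_...' variants of a class map to the same type, as the commit message describes.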
From 32b921376cb61ab8750ecf11c2152ec30517a579 Mon Sep 17 00:00:00 2001
From: Peter Marschall <pe...@adpm.de>
Date: Wed, 14 Apr 2021 18:41:47 +0200
Subject: [PATCH 3/3] Bboxtree: fix indentation of intermediate closing tags in
 to_hocr()

When writing hOCR files using to_hocr(), make sure closing tags of
intermediate non-leaf elements are correctly indented, even if sibling
elements follow.

This makes sure to also keep the "visual" structure of hOCR files
generated by the OCR engines.

Signed-off-by: Peter Marschall <pe...@adpm.de>
---
 lib/Gscan2pdf/Bboxtree.pm | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/Gscan2pdf/Bboxtree.pm b/lib/Gscan2pdf/Bboxtree.pm
index e8fa83d3..4c267e02 100644
--- a/lib/Gscan2pdf/Bboxtree.pm
+++ b/lib/Gscan2pdf/Bboxtree.pm
@@ -495,8 +495,10 @@ EOS
     while ( my $bbox = $iter->() ) {
         if ( defined $prev_depth ) {
             if ( $prev_depth >= $bbox->{depth} ) {
+                if (@tags) { $string .= '</' . pop(@tags) . ">\n" }
+                $prev_depth--;
                 while ( $prev_depth-- >= $bbox->{depth} ) {
-                    $string .= '</' . pop(@tags) . ">\n";
+                    $string .= $SPACE x ( 2 + $prev_depth + 1 ) . '</' . pop(@tags) . ">\n";
                 }
             }
             else {
-- 
2.30.2
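
The effect of the indentation fix can be reproduced with a small standalone sketch. It is not the code from to_hocr(): render() is a hypothetical function, the node list stands in for the Bboxtree iterator, and the open-element stack stores each tag's depth instead of recomputing it from $prev_depth. The idea is the same, though: a leaf is closed on its own line, and every intermediate closing tag is indented to the depth of its opening tag.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Bboxtree iterator: a pre-order list of
# nodes, each with a depth, a tag name and optional text.
sub render {
    my (@nodes) = @_;
    my $out = q{};
    my @open;    # stack of [ depth, tag ] for elements not yet closed
    my $prev_depth;
    for my $node (@nodes) {
        if ( defined $prev_depth ) {
            if ( $prev_depth >= $node->{depth} ) {

                # The previous node was a leaf: close it on its own line.
                $out .= '</' . ( pop @open )->[1] . ">\n";

                # Close finished ancestors, each indented to its depth.
                while ( @open and $open[-1][0] >= $node->{depth} ) {
                    my ( $depth, $tag ) = @{ pop @open };
                    $out .= q{ } x $depth . "</$tag>\n";
                }
            }
            else {
                $out .= "\n";
            }
        }
        $out .= q{ } x $node->{depth} . "<$node->{tag}>";
        $out .= $node->{text} if defined $node->{text};
        push @open, [ $node->{depth}, $node->{tag} ];
        $prev_depth = $node->{depth};
    }

    # Close whatever is still open at the end of the input.
    if (@open) { $out .= '</' . ( pop @open )->[1] . ">\n" }
    while (@open) {
        my ( $depth, $tag ) = @{ pop @open };
        $out .= q{ } x $depth . "</$tag>\n";
    }
    return $out;
}

print render(
    { depth => 0, tag => 'div' },
    { depth => 1, tag => 'p' },
    { depth => 2, tag => 'span', text => 'a' },
    { depth => 2, tag => 'span', text => 'b' },
    { depth => 1, tag => 'p' },
    { depth => 2, tag => 'span', text => 'c' },
);
```

In this example the first closing 'p' tag is followed by a sibling 'p' element, which is the case the patch addresses.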