Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

Mateusz Matela Thu, 05 Sep 2024 12:15:12 -0700

See my first answer: I ran an experiment, and the training went exactly
the same with both approaches (a separate box per character, or the same
line box for all characters).

Mateusz

On Thursday, 5 September 2024 at 17:41:50 UTC+2 Danny wrote:

Hi Zdenko,
Thanks for the response. However, ocrd-testset.zip contains training
images and ground truth text without boxes.

True, the images contain a full line of text:
[image: alexis_ruhe01_1852_0099_012.png]

But there are no box files in the training set.

I'd like to confirm whether the LSTM training set's xxx.box file is expected
to contain one box per line (wrapping the entire line) or one box per
character in the line... Any insight?

On Thursday, September 5, 2024 at 9:15:12 PM UTC+8 zdenop wrote:

Have a look at the provided example, ocrd-testset.zip
<https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip>

Zdenko

On Tue, 3 Sep 2024 at 16:04, 'Danny' via tesseract-ocr
<tesser...@googlegroups.com> wrote:

@zdenop wrote:
| Tesseract LSTM engine (tesseract >=v4) training script is based on lines
| (group of words)
| Box files reflect that. And yes - box files are important.

Zdenko, does this mean a "box file" for LSTM training should wrap the
entire text line and NOT the individual characters?
Which is correct for LSTM training:

A) Individual boxes, like this:
[image: sub_2.png]
B) One box for the entire line:
[image: sub_2 line.png]
Thanks.

On Sunday, July 14, 2024 at 9:05:48 PM UTC+8 zdenop wrote:

Ehm:

1. Tesseract v3 (legacy) engine training is based on characters.
2. Tesseract LSTM engine (tesseract >=v4) training script is based on
lines (group of words)

Box files reflect that. And yes - box files are important.

Zdenko

On Fri, 12 Jul 2024 at 14:14, Mateusz Matela <mateusz...@gmail.com> wrote:

As an experiment, I ran the training on a small sample produced with
text2image. Then I converted the .box files so that each character is
assigned the common bounding rectangle of all the characters, and ran the
training again. The outputs were identical in both cases. Then I removed
the .box files and let the training script autogenerate them. In that case
the reported error rates were crazy, like 99% instead of 0.5%.
This suggests that conclusion 3 is correct.
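
For illustration, the conversion step described above could look roughly
like the Python sketch below. It is not the script used in the experiment:
the function name merge_to_common_box is hypothetical, and it assumes each
.box file covers a single text line, with entries in the form
"symbol left bottom right top page".

```python
def merge_to_common_box(box_lines):
    """Give every entry the bounding rectangle that encloses all entries.

    Assumption: all entries belong to one text line, and each entry is
    "<symbol> <left> <bottom> <right> <top> <page>".
    """
    parsed = []
    for raw in box_lines:
        raw = raw.rstrip("\r\n")
        if not raw.strip():
            continue  # skip empty lines
        # Split from the right so that a symbol which is itself a space
        # or a tab is preserved as the first field.
        symbol, left, bottom, right, top, page = raw.rsplit(" ", 5)
        parsed.append((symbol, int(left), int(bottom), int(right), int(top), page))

    if not parsed:
        return []

    # Union of all character rectangles.
    left = min(p[1] for p in parsed)
    bottom = min(p[2] for p in parsed)
    right = max(p[3] for p in parsed)
    top = max(p[4] for p in parsed)

    return [f"{p[0]} {left} {bottom} {right} {top} {p[5]}" for p in parsed]
```

Given the entries "H 10 5 20 30 0", "i 22 5 26 28 0" and "! 28 4 31 30 0",
every output entry keeps its symbol but carries the coordinates
"10 4 31 30 0".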

On Wednesday, 10 July 2024 at 15:17:07 UTC+2 Mateusz Matela wrote:

Hi all,

Sorry if double posting, my previous message didn't appear and I don't see
any info about waiting for acceptance or something.
I was searching for this topic in this forum and it was mentioned a few
times, but I couldn't find a clear and definitive explanation.

How does the information put in the .box files affect the training process?
The file contains coordinates for each character in the txt file, but the
documentation says that since Tesseract 4.0 the model operates on the level
of whole lines. Some tools like text2image generate the .box files with
accurate coordinates for each character. When the .box files are missing
the tesstrain Makefile generates them using generate_line_box.py, which
assigns the same full image area to each character.
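
To make the fallback described above concrete, here is a minimal Python
sketch of what "the same full image area for each character" looks like in
a .box file. It is not the actual generate_line_box.py: the function name
full_image_boxes is hypothetical, it assumes the box-file entry form
"symbol left bottom right top page", and it does not reproduce whatever the
real script does for combining characters or end-of-line markers.

```python
def full_image_boxes(text, width, height):
    """Return one box-file entry per character of `text`.

    Every entry spans the whole image: (0, 0) to (width, height), page 0.
    Illustrative only; not the tesstrain implementation.
    """
    return [f"{ch} 0 0 {width} {height} 0" for ch in text]


if __name__ == "__main__":
    # Example: a 100x20 pixel line image containing the text "ab".
    for entry in full_image_boxes("ab", 100, 20):
        print(entry)
```

With this layout, the per-character coordinates carry no positional
information beyond the extent of the line image itself.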

I see 3 possible conclusions, which one is closest to the truth?

1. The .box files do not affect the LSTM training at all and are just a
leftover from the times of Tesseract 3. In that case, ideally in the future
they could be completely dropped or only required/generated when
specifically working with the legacy engine.

2. There is still a chance that training will work better with exact
coordinates and the generate_line_box.py is just a cheap workaround that
could be improved on in the future.

3. The .box file is still important in case you prefer to define the
coordinates for the text in the image instead of cropping the image. The
granularity of the coordinates is not important, as Tesseract will just work
on a box that encapsulates all of the character boxes. Even if confusing,
this approach is still better than having different .box file formats for
LSTM and the legacy engine.

I'll be grateful for any wisdom on this.

Thanks
Mateusz

--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/b17225d5-2b78-41bd-994f-05305b9a443dn%40googlegroups.com

<https://groups.google.com/d/msgid/tesseract-ocr/b17225d5-2b78-41bd-994f-05305b9a443dn%40googlegroups.com?utm_medium=email&utm_source=footer>
.

