Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-07 Thread Zdenko Podobny
tesstrain is a tested method to train/improve a tesseract language model. It creates box files for you. You can try your own ways, but your problems are your problems, and you should not expect somebody to adjust the code to your needs. Of course, you are welcome to contribute your solution. Zdenko

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-06 Thread 'Danny' via tesseract-ocr
I think this Google group is having technical troubles. I got an email about a new post from Menelik Berhan but his message doesn't appear on the web. He said: | This might be helpful: https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-06 Thread Menelik Berhan
This might be helpful: https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html And also some details in: https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#making-box-files On Thursday, September 5, 2024 at 6:41:50 PM UTC+3 Danny wrote: > Hi Zdenko, > Thanks f

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-06 Thread Tom Morris
That's weird. I posted an answer to this thread yesterday and now, in its place, Google Groups says "Message has been deleted." Let me try again... This page https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html says "lstmbox - Generated by tesseract using lstmbox config from image

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-05 Thread Mateusz Matela
See my first answer: I've run an experiment, and the training went exactly the same with both approaches (a separate box per character or the same line box for all characters). Mateusz On Thursday, September 5, 2024 at 17:41:50 UTC+2 Danny wrote: Hi Zdenko, Thanks for the response. However, ocrd-

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-05 Thread Zdenko Podobny
What about reading the tesstrain README and using the example data to understand the training process better? Zdenko On Thu, Sep 5, 2024 at 17:41 'Danny' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote: > Hi Zdenko, > Thanks for the response. However, ocrd-testset.zip contains training > i

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-05 Thread 'Danny' via tesseract-ocr
Hi Zdenko, Thanks for the response. However, ocrd-testset.zip contains training images and ground truth text without boxes. True, the images contain a full line of text: [image: alexis_ruhe01_1852_0099_012.png] But there are no box files in the training set. I'd like to confirm if the LSTM t

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-05 Thread Zdenko Podobny
Have a look at the provided example ocrd-testset.zip. Zdenko On Tue, Sep 3, 2024 at 16:04 'Danny' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote: > @zdenop wrote: > | Tesseract LSTM engine (tesseract >=v4) training scr

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-03 Thread 'Danny' via tesseract-ocr
@zdenop wrote: | Tesseract LSTM engine (tesseract >=v4) training script is based on lines (group of words) | Box files reflect that. And yes - box files are important. Zdenko, does this mean a "box file" for LSTM training should wrap the entire text line and NOT the individual characters? Which

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-07-14 Thread Zdenko Podobny
Ehm: 1. Tesseract v3 (legacy) engine training is based on characters. 2. Tesseract LSTM engine (tesseract >=v4) training script is based on lines (groups of words). Box files reflect that. And yes - box files are important. Zdenko On Fri, Jul 12, 2024 at 14:14 Mateusz Matela wrote: > A
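
For illustration only, here is a minimal sketch of what a line-based box file could look like, assuming the `<symbol> <left> <bottom> <right> <top> <page>` layout and the tab end-of-line marker described on the tessdoc pages linked in this thread. The function name `line_box_entries` and its parameters are hypothetical; this is not tesstrain's own code.

```python
import unicodedata


def line_box_entries(text, width, height, page=0):
    """Return box-file lines in which every glyph shares the text line's box.

    Hypothetical helper: width and height are the pixel dimensions of a
    single-line image, and box coordinates use a bottom-left origin.
    """
    glyphs = []
    for ch in unicodedata.normalize("NFC", text):
        # Keep combining marks together with the preceding base character.
        if glyphs and unicodedata.combining(ch):
            glyphs[-1] += ch
        else:
            glyphs.append(ch)

    bbox = f"0 0 {width} {height} {page}"
    lines = [f"{glyph} {bbox}" for glyph in glyphs]
    # A tab "symbol" marks the end of the text line.
    lines.append(f"\t {bbox}")
    return lines


if __name__ == "__main__":
    for entry in line_box_entries("Ruhe", 640, 48):
        print(repr(entry))
```

Each glyph gets the same rectangle (the whole line image), which is the "box files reflect that" idea in its simplest form.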

[tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-07-12 Thread Mateusz Matela
As an experiment, I ran the training on a small sample produced with text2image. Then I converted the .box files so that each character is assigned the common bounding rectangle of all the characters, and ran the training again. The outputs were identical in both cases. Then I removed the box file
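
A conversion like the one described in this message could be sketched roughly as follows. This is not the script actually used in the experiment. It assumes box entries of the form `<symbol> <left> <bottom> <right> <top> <page>` with a tab symbol marking the end of each text line, and it computes the common rectangle per text line rather than per file. The names `parse_entry` and `unify_line_boxes` are made up for the example.

```python
def parse_entry(line):
    # rsplit from the right keeps a space or tab symbol intact.
    symbol, left, bottom, right, top, page = line.rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top), int(page)


def unify_line_boxes(box_text):
    """Give every entry of a text line the union of that line's glyph boxes."""
    out, pending = [], []

    def flush():
        if not pending:
            return
        # Take the union over real glyphs; fall back to all entries if
        # the group holds only an end-of-line marker.
        glyphs = [e for e in pending if e[0] != "\t"] or pending
        left = min(e[1] for e in glyphs)
        bottom = min(e[2] for e in glyphs)
        right = max(e[3] for e in glyphs)
        top = max(e[4] for e in glyphs)
        for symbol, *_, page in pending:
            out.append(f"{symbol} {left} {bottom} {right} {top} {page}")
        pending.clear()

    for raw in box_text.split("\n"):
        if not raw:
            continue
        entry = parse_entry(raw)
        pending.append(entry)
        if entry[0] == "\t":  # end-of-line marker closes the group
            flush()
    flush()  # handle a trailing line without a marker
    return "".join(line + "\n" for line in out)


if __name__ == "__main__":
    sample = "a 1 2 5 10 0\nb 6 1 9 12 0\n\t 9 1 10 12 0\n"
    print(unify_line_boxes(sample), end="")
```

Comparing training runs on the original and converted files is then a matter of swapping the .box files and keeping everything else fixed.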