Tesseract can do multiple languages in one file. Try "-l eng+ita" for example.
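
For example, something along these lines should work on the image extracted in the quoted message below (untested sketch; it assumes the ita traineddata is installed, which --list-langs will confirm, and osd.traineddata for the script-detection step):

$ tesseract --list-langs
$ tesseract -l eng+ita im-000.png im-000
$ tesseract im-000.png stdout --psm 0

The last command only reports orientation and the detected script, but that can help decide which languages to pass.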

John Muccigrosso

> On 4 Feb 2020, at 19:21, William Bader <[email protected]> wrote:
>
> > which tools could be used to extract the text from the images?
>
> $ pdfimages -png 20020122exam.pdf im
> $ tesseract im-000.png im-000
> Tesseract Open Source OCR Engine v4.1.0 with Leptonica
> $ cat im-000.txt
> ‘The Vultures’ Roost
> ‘Sasca Ga ay, Te Goa Rapa: Poa
> $ tesseract -l eng im-000.png im-000 hocr
> Tesseract Open Source OCR Engine v4.1.0 with Leptonica
> $ grep word im-000.hocr | grep -v '> <' | head -10
> <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
> <span class='ocrx_word' id='word_1_5' title='bbox 60 254 80 262; x_wconf 83'>‘The</span>
> <span class='ocrx_word' id='word_1_6' title='bbox 84 254 129 262; x_wconf 89'>Vultures’</span>
> <span class='ocrx_word' id='word_1_7' title='bbox 133 254 163 262; x_wconf 91'>Roost</span>
> <span class='ocrx_word' id='word_1_9' title='bbox 5 274 30 280; x_wconf 0'>‘Sasca</span>
> <span class='ocrx_word' id='word_1_10' title='bbox 35 274 52 280; x_wconf 58'>Ga</span>
> <span class='ocrx_word' id='word_1_11' title='bbox 57 274 77 282; x_wconf 71'>ay,</span>
> <span class='ocrx_word' id='word_1_12' title='bbox 83 274 95 280; x_wconf 52'>Te</span>
> <span class='ocrx_word' id='word_1_13' title='bbox 98 274 127 280; x_wconf 47'>Goa</span>
> <span class='ocrx_word' id='word_1_14' title='bbox 130 274 160 281; x_wconf 18'>Rapa:</span>
>
> You could post-process this or maybe write a more powerful class using CSS.
>
> I don't know of any open source OCR that supports multiple languages in the same file. Supporting a single language is hard enough.
>
> > Why don't poppler utils:
> > a) underline text segments since they know their exact X,Y offsets;
>
> You could add an option for that or maybe write CSS.
>
> $ pdftotext -bbox 20020122exam.pdf
> $ grep xMin 20020122exam.html | head -10
> <word xMin="207.337000" yMin="48.999400" xMax="226.855400" yMax="60.395400">The</word>
> <word xMin="229.970600" yMin="48.999400" xMax="280.375900" yMax="60.395400">University</word>
> <word xMin="283.491100" yMin="48.999400" xMax="293.354800" yMax="60.395400">of</word>
> <word xMin="296.470000" yMin="48.999400" xMax="312.523400" yMax="60.395400">the</word>
> <word xMin="315.638600" yMin="48.999400" xMax="340.617400" yMax="60.395400">State</word>
> <word xMin="343.732600" yMin="48.999400" xMax="353.596300" yMax="60.395400">of</word>
> <word xMin="356.711500" yMin="48.999400" xMax="379.078900" yMax="60.395400">New</word>
> <word xMin="382.194100" yMin="48.999400" xMax="404.647300" yMax="60.395400">York</word>
> <word xMin="187.461100" yMin="71.999300" xMax="242.047500" yMax="83.395300">REGENTS</word>
> <word xMin="248.771800" yMin="71.999300" xMax="280.536500" yMax="83.395300">HIGH</word>
>
> Regards, William
>
>
> From: poppler <[email protected]> on behalf of Albretch Mueller <[email protected]>
> Sent: Tuesday, February 4, 2020 7:37 AM
> To: [email protected] <[email protected]>
> Subject: [poppler] approaches used for language detection on images ...
>
> Hi *:
>
> I work on pdf files, some of which might be image-based (with or without the text included), or searchable pdfs which include images of varying quality and with text embedded in various ways. This would be the typical text I would be dealing with:
>
> https://www.nysedregents.org/USHistoryGov/Archive/20020122exam.pdf
>
> which tools could be used to extract the text from the images?
> As Liam on the gimpusers Forum pointed out to me, you need:
>
> (1) feature extraction, finding the writing,
> (2) OCR of some sort, to turn pictures of letters into letters, and then
> (3) the linguistic analysis.
>
> which tools and/or strategies could be used for steps 1-3?
>
> Another example of a textual file I work with would be:
>
> https://scholarworks.iu.edu/dspace/bitstream/handle/2022/18961/Notes and Texts.pdf
>
> on that searchable file pdftohtml produces one background image file per page, but when you stratify the content (simply using hash signatures) you realize most files are of the same kind (just blank background images, or files containing a single line, for example underlining a title or framing a blocked message); then there are mostly blank full-page images with segments of Greek text, ...
>
> Why don't poppler utils:
>
> a) underline text segments since they know their exact X,Y offsets;
>
> b) encode blocked text using html blocks;
>
> c) include the image of textual characters in foreign languages as character sequences;
>
> instead of creating for such purposes a background image for each page?
>
> Maybe there is a way to work around such hurdles that I don't know, and/or someone has already written code to take care of that.
>
> Do you know of such code?
>
> Thank you,
> lbrtchx
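
To illustrate William's "post-process this or maybe write CSS" suggestion above: the hOCR output is plain HTML, so one rough way to underline every recognized word is to inject a stylesheet rule into the file, e.g. with GNU sed (untested sketch):

$ tesseract -l eng im-000.png im-000 hocr
$ sed -i "s|</head>|<style>.ocrx_word { text-decoration: underline; }</style></head>|" im-000.hocr

The same idea applies to the pdftotext -bbox output, except there you would have to generate the HTML yourself from the xMin/yMin/xMax/yMax attributes.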
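
And regarding the hash-signature stratification mentioned in the original message: a quick way to see how many of those per-page background images are really duplicates is to hash them and count identical digests, e.g. (untested sketch; GNU coreutils assumed, and "bg" is just an arbitrary output prefix):

$ pdfimages -png "Notes and Texts.pdf" bg
$ md5sum bg-*.png | sort | uniq -c -w32 | sort -rn | head
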
_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
