Re: TIKA OCR not working

Konstantin Gribov Mon, 27 Apr 2015 09:44:13 -0700

FYI, there are no tesseract or leptonica packages for CentOS 6/RHEL 6 (not even
in EPEL), so I keep specs for building tesseract and leptonica (its
dependency) on GitHub (https://github.com/grossws/tesseract-ocr-specs). Feel
free to use them if you're on CentOS/RHEL.


Also, each tesseract language pack is trained for a single language, so a
dual-language document will give quite poor OCR results even when both
languages use Latin characters. You can use
o.a.tika.parser.ocr.TesseractOCRConfig.setLanguage(String) to set the language
pack for OCR.
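
When tesseract is run directly from the command line rather than through
Tika, the language pack is selected with the `-l` option. Below is a minimal
sketch in Java of building and running that command, assuming the tesseract
executable is on the PATH. The class `TesseractCommand` and its methods are
hypothetical helpers written for illustration; they are not part of Tika or
tesseract.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or tesseract.
final class TesseractCommand {

    // Builds the argument list: tesseract <image> <outputBase> -l <lang>
    // tesseract writes the recognized text to <outputBase>.txt.
    static List<String> build(Path image, Path outputBase, String lang) {
        if (lang == null || lang.isEmpty()) {
            throw new IllegalArgumentException("language pack code must not be empty");
        }
        return Arrays.asList(
                "tesseract", image.toString(), outputBase.toString(), "-l", lang);
    }

    // Runs tesseract and returns its exit code.
    // Assumption: the tesseract executable is installed and on the PATH.
    static int run(Path image, Path outputBase, String lang)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(build(image, outputBase, lang))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Expected arguments: image path, output base name, language pack code.
        if (args.length != 3) {
            System.err.println("usage: TesseractCommand <image> <outputBase> <lang>");
            return;
        }
        System.exit(run(Paths.get(args[0]), Paths.get(args[1]), args[2]));
    }
}
```

For example, `java TesseractCommand scan.png out deu` would ask tesseract to
use the German language pack (if installed) and write the text to `out.txt`;
the file names here are placeholders.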

-- 
Best regards,
Konstantin Gribov

On Mon, 27 Apr 2015 at 17:36, Uwe Schindler <u...@thetaphi.de> wrote:

> Yes that is fixed.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -----Original Message-----
> > From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> > Sent: Monday, April 27, 2015 4:29 PM
> > To: u...@tika.apache.org
> > Cc: trung...@anlab.vn; solr-user@lucene.apache.org
> > Subject: Re: TIKA OCR not working
> >
> > It should work out of the box in Solr as long as Tesseract is installed and on
> > the class path. Solr had an issue with it since Tika sends 2 startDocument calls,
> > but I fixed that with Uwe and it was shipped in 4.10.4 and in 5.x I think?
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattm...@nasa.gov
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: "Allison, Timothy B." <talli...@mitre.org>
> > Reply-To: "u...@tika.apache.org" <u...@tika.apache.org>
> > Date: Monday, April 27, 2015 at 10:26 AM
> > To: "u...@tika.apache.org" <u...@tika.apache.org>
> > Cc: "trung...@anlab.vn" <trung...@anlab.vn>,
> >     "solr-u...@lucene.apache.org" <solr-user@lucene.apache.org>
> > Subject: FW: TIKA OCR not working
> >
> > >Trung,
> > >
> > >I haven't experimented with our OCR parser yet, but this should give a
> > >good start: https://wiki.apache.org/tika/TikaOCR .
> > >
> > >Have you installed tesseract?
> > >
> > >Tika colleagues,
> > >  Any other tips?  What else has to be configured and how?
> > >
> > >-----Original Message-----
> > >From: trung.ht [mailto:trung...@anlab.vn]
> > >Sent: Friday, April 24, 2015 11:22 PM
> > >To: solr-user@lucene.apache.org
> > >Subject: Re: TIKA OCR not working
> > >
> > >Hi everyone,
> > >
> > >Does anyone have an answer to this problem? :)
> > >
> > >
> > >> I saw the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses Tika
> > >> 1.7, but it looks like it does not work. Does anyone know whether Tika OCR
> > >> works automatically with Solr, or do I have to change some settings?
> > >>
> > >Trung.
> > >
> > >
> > >>> It's not clear if OCR would happen automatically in Solr Cell, or if
> > >>> changes to Solr would be needed.
> > >>>
> > >>> For Tika OCR info, see:
> > >>>
> > >>> https://issues.apache.org/jira/browse/TIKA-93
> > >>> https://wiki.apache.org/tika/TikaOCR
> > >>>
> > >>>
> > >>>
> > >>> -- Jack Krupansky
> > >>>
> > >>> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <
> > >>> arafa...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen it
> > >>> > in use yet.
> > >>> >
> > >>> > Regards,
> > >>> >     Alex
> > >>> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <iori...@yahoo.com.invalid> wrote:
> > >>> >
> > >>> > > Hi Trung,
> > >>> > >
> > >>> > > I didn't know about the OCR capabilities of Tika.
> > >>> > > Someone who is familiar with solr-cell can tell us whether
> > >>> > > this functionality has been added to Solr or not.
> > >>> > >
> > >>> > > Ahmet
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > > On Thursday, April 23, 2015 2:06 PM, trung.ht <trung...@anlab.vn> wrote:
> > >>> > > Hi Ahmet,
> > >>> > >
> > >>> > > I used a PNG file, not a PDF file. From the documentation, I understand that
> > >>> > > Solr will post the file to Tika, and since Tika 1.7, OCR is included. Is
> > >>> > > there something I misunderstood?
> > >>> > >
> > >>> > > Trung.
> > >>> > >
> > >>> > >
> > >>> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
> > >>> > > wrote:
> > >>> > >
> > >>> > > > Hi Trung,
> > >>> > > >
> > >>> > > > solr-cell (Tika) does not do OCR. It cannot extract text from image-based
> > >>> > > > PDFs.
> > >>> > > >
> > >>> > > > Ahmet
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht <trung...@anlab.vn> wrote:
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > > Hi,
> > >>> > > >
> > >>> > > > I want to use Solr to index some scanned documents. After setting up a Solr
> > >>> > > > document with two fields, "content" and "filename", I tried to upload the
> > >>> > > > attached file, but the content of the file seems to be only "\n \n
> > >>> > > > \n....".
> > >>> > > > But when I ran tesseract from the command line, I got the correct result.
> > >>> > > >
> > >>> > > > The log when solr receive my request:
> > >>> > > > -----------
> > >>> > > > INFO  - 2015-04-23 03:49:25.941; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update/extract params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
> > >>> > > >
> > >>> > > > ------------
> > >>> > > >
> > >>> > > > The document when I check on solr admin page:
> > >>> > > > -------------
> > >>> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
> > >>> > > > "createddate": "2015-04-22T15:00:00Z",
> > >>> > > > "filename": "\\\\trunght\\test\\tesseract_3.png",
> > >>> > > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ],
> > >>> > > > "content": " \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
> > >>> > > > \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n ",
> > >>> > > > "_version_": 1499213034586898400 }
> > >>> > > >
> > >>> > > > -----------
> > >>> > > >
> > >>> > > > Since I am a Solr newbie, I do not know where to look. Can anyone give me
> > >>> > > > advice on where to look for errors or settings to make it work?
> > >>> > > > Thanks in advance.
> > >>> > > >
> > >>> > > > Trung.
> > >>> > > >
> > >>> > >
> > >>> >
> > >>>
> > >>
> > >>
>
>
>
