JFYI, there are no tesseract or leptonica packages for CentOS 6/RHEL 6 (even in EPEL), so I keep spec files for building tesseract and leptonica (its dependency) on GitHub (https://github.com/grossws/tesseract-ocr-specs). Feel free to use them if you're on CentOS/RHEL.
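Once the packages are installed, it may be worth confirming that the JVM running Solr/Tika can actually see the binary. As far as I know, Tika's OCR parser shells out to the `tesseract` executable, so it has to be reachable for the user running Solr. Below is a minimal sketch that only uses the JDK; the class and method names (`TesseractPathCheck`, `findOnPath`) are made up for illustration and are not part of Tika or Solr.

```java
import java.io.File;
import java.util.Optional;

// Hypothetical helper, not part of Tika or Solr: looks for an executable
// with the given name in a PATH-style list of directories.
public class TesseractPathCheck {

    static Optional<File> findOnPath(String binary, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as "::"
            }
            File candidate = new File(dir, binary);
            if (candidate.isFile() && candidate.canExecute()) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption: the binary is called "tesseract" (no .exe suffix handling here).
        Optional<File> hit = findOnPath("tesseract", System.getenv("PATH"));
        System.out.println(hit
                .map(f -> "tesseract found at " + f.getAbsolutePath())
                .orElse("tesseract not found on PATH"));
    }
}
```

Run it as the same user (and with the same environment) as the Solr process, since the PATH seen by a service often differs from the one in an interactive shell.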
Also, tesseract language packs are each trained for a single language, so a dual-language document will give quite poor OCR results even when both languages use Latin characters. You can use o.a.tika.parser.ocr.TesseractOCRConfig.setLanguage(String) to set the language pack for OCR.

--
Best regards,
Konstantin Gribov

Mon, 27 Apr 2015 at 17:36, Uwe Schindler <u...@thetaphi.de>:

> Yes that is fixed.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -----Original Message-----
> > From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> > Sent: Monday, April 27, 2015 4:29 PM
> > To: u...@tika.apache.org
> > Cc: trung...@anlab.vn; solr-user@lucene.apache.org
> > Subject: Re: TIKA OCR not working
> >
> > It should work out of the box in Solr as long as Tesseract is installed and on
> > the class path. Solr had an issue with it since Tika sends 2 startDocument calls,
> > but I fixed that with Uwe and it was shipped in 4.10.4 and in 5.x I think?
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattm...@nasa.gov
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> > -----Original Message-----
> > From: <Allison>, "Timothy B."
> > <talli...@mitre.org>
> > Reply-To: "u...@tika.apache.org" <u...@tika.apache.org>
> > Date: Monday, April 27, 2015 at 10:26 AM
> > To: "u...@tika.apache.org" <u...@tika.apache.org>
> > Cc: "trung...@anlab.vn" <trung...@anlab.vn>, "solr-u...@lucene.apache.org"
> > <solr-user@lucene.apache.org>
> > Subject: FW: TIKA OCR not working
> >
> > >Trung,
> > >
> > >I haven't experimented with our OCR parser yet, but this should give a
> > >good start: https://wiki.apache.org/tika/TikaOCR .
> > >
> > >Have you installed tesseract?
> > >
> > >Tika colleagues,
> > > Any other tips? What else has to be configured and how?
> > >
> > >-----Original Message-----
> > >From: trung.ht [mailto:trung...@anlab.vn]
> > >Sent: Friday, April 24, 2015 11:22 PM
> > >To: solr-user@lucene.apache.org
> > >Subject: Re: TIKA OCR not working
> > >
> > >Hi everyone,
> > >
> > >Does anyone have the answer for this problem :)?
> > >
> > >> I saw the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses Tika 1.7,
> > >> but it looks like it does not work. Does anyone know whether Tika OCR
> > >> works automatically with Solr, or do I have to change some settings?
> > >>
> > >Trung.
> > >
> > >>> It's not clear if OCR would happen automatically in Solr Cell, or if
> > >>> changes to Solr would be needed.
> > >>>
> > >>> For Tika OCR info, see:
> > >>>
> > >>> https://issues.apache.org/jira/browse/TIKA-93
> > >>> https://wiki.apache.org/tika/TikaOCR
> > >>>
> > >>> -- Jack Krupansky
> > >>>
> > >>> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <arafa...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen it
> > >>> > in use yet.
> > >>> >
> > >>> > Regards,
> > >>> > Alex
> > >>> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <iori...@yahoo.com.invalid> wrote:
> > >>> >
> > >>> > > Hi Trung,
> > >>> > >
> > >>> > > I didn't know about the OCR capabilities of Tika.
> > >>> > > Someone who is familiar with solr-cell can tell us whether
> > >>> > > this functionality has been added to Solr or not.
> > >>> > >
> > >>> > > Ahmet
> > >>> > >
> > >>> > > On Thursday, April 23, 2015 2:06 PM, trung.ht <trung...@anlab.vn> wrote:
> > >>> > > Hi Ahmet,
> > >>> > >
> > >>> > > I used a png file, not a pdf file. From the documentation, I understand that
> > >>> > > Solr will post the file to Tika, and since Tika 1.7, OCR is included. Is
> > >>> > > there something I misunderstood?
> > >>> > >
> > >>> > > Trung.
> > >>> > >
> > >>> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
> > >>> > > wrote:
> > >>> > >
> > >>> > > > Hi Trung,
> > >>> > > >
> > >>> > > > solr-cell (tika) does not do OCR. It cannot extract text from image-based
> > >>> > > > pdfs.
> > >>> > > >
> > >>> > > > Ahmet
> > >>> > > >
> > >>> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht <trung...@anlab.vn> wrote:
> > >>> > > >
> > >>> > > > Hi,
> > >>> > > >
> > >>> > > > I want to use Solr to index some scanned documents. After setting up a Solr
> > >>> > > > document with two fields, "content" and "filename", I tried to upload the
> > >>> > > > attached file, but it seems that the content of the file is only "\n \n
> > >>> > > > \n....".
> > >>> > > > But if I use tesseract from the command line I get the correct result.
> > >>> > > >
> > >>> > > > The log when Solr receives my request:
> > >>> > > > -----------
> > >>> > > > INFO - 2015-04-23 03:49:25.941;
> > >>> > > > org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
> > >>> > > > webapp=/solr path=/update/extract
> > >>> > > > params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
> > >>> > > >
> > >>> > > > ------------
> > >>> > > >
> > >>> > > > The document when I check on the Solr admin page:
> > >>> > > > -------------
> > >>> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
> > >>> > > > "createddate": "2015-04-22T15:00:00Z",
> > >>> > > > "filename": "\\\\trunght\\test\\tesseract_3.png",
> > >>> > > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ],
> > >>> > > > "content": "
> > >>> > > > \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
> > >>> > > > \n \n \n
> > >>> > > > \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
> > >>> > > > \n \n ",
> > >>> > > > "_version_": 1499213034586898400 }
> > >>> > > >
> > >>> > > > -----------
> > >>> > > >
> > >>> > > > Since I am a Solr newbie I do not know where to look. Can anyone advise me
> > >>> > > > where to look for errors or settings to make it work?
> > >>> > > > Thanks in advance.
> > >>> > > >
> > >>> > > > Trung.