It should work out of the box in Solr as long as Tesseract is installed and on the system PATH. Solr had an issue with it because Tika sends two startDocument calls, but I fixed that with Uwe, and I think the fix shipped in 4.10.4 and in 5.x.
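A quick way to check the first condition (Tesseract installed and visible to the process) is a small script like the one below. It is only a sketch: the binary name `tesseract` and the `--version` flag are assumptions about a typical install, and the helper names are made up for illustration. Run it as the same user that runs Solr, since that is the environment Tika will see.

```python
import shutil
import subprocess


def find_tesseract(binary="tesseract"):
    """Return the absolute path of the executable, or None if it is not on the PATH."""
    # Assumption: the executable is called "tesseract"; adjust if your install differs.
    return shutil.which(binary)


def tesseract_version(binary="tesseract"):
    """Return the first line the executable prints for --version, or None if unavailable."""
    path = find_tesseract(binary)
    if path is None:
        return None
    # Some builds write the version banner to stderr, so look at both streams.
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("tesseract was not found on the PATH of this process")
    else:
        print("found:", location)
        print("version:", tesseract_version())
```

If this reports that the binary is missing while `tesseract` works in your own shell, the Solr process is probably started with a different PATH.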
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: <Allison>, "Timothy B." <talli...@mitre.org>
Reply-To: "u...@tika.apache.org" <u...@tika.apache.org>
Date: Monday, April 27, 2015 at 10:26 AM
To: "u...@tika.apache.org" <u...@tika.apache.org>
Cc: "trung...@anlab.vn" <trung...@anlab.vn>, "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Subject: FW: TIKA OCR not working

>Trung,
>
>I haven't experimented with our OCR parser yet, but this should give a
>good start: https://wiki.apache.org/tika/TikaOCR .
>
>Have you installed tesseract?
>
>Tika colleagues,
>  Any other tips? What else has to be configured and how?
>
>-----Original Message-----
>From: trung.ht [mailto:trung...@anlab.vn]
>Sent: Friday, April 24, 2015 11:22 PM
>To: solr-user@lucene.apache.org
>Subject: Re: TIKA OCR not working
>
>HI everyone,
>
>Does anyone have the answer for this problem :)?
>
>> I saw the document of Tika. Tika 1.7 support OCR and Solr 5.0 use Tika 1.7,
>> but it looks like it does not work. Does anyone know that TIKA OCR works
>> automatically with Solr or I have to change some settings?
>
>Trung.
>
>>> It's not clear if OCR would happen automatically in Solr Cell, or if
>>> changes to Solr would be needed.
>>>
>>> For Tika OCR info, see:
>>>
>>> https://issues.apache.org/jira/browse/TIKA-93
>>> https://wiki.apache.org/tika/TikaOCR
>>>
>>> -- Jack Krupansky
>>>
>>> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>
>>> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen it
>>> > in use yet.
>>> >
>>> > Regards,
>>> >    Alex
>>> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <iori...@yahoo.com.invalid> wrote:
>>> >
>>> > > Hi Trung,
>>> > >
>>> > > I didn't know about OCR capabilities of tika.
>>> > > Someone who is familiar with solr-cell can inform us whether this
>>> > > functionality is added to solr or not.
>>> > >
>>> > > Ahmet
>>> > >
>>> > > On Thursday, April 23, 2015 2:06 PM, trung.ht <trung...@anlab.vn> wrote:
>>> > > Hi Ahmet,
>>> > >
>>> > > I used a png file, not a pdf file. From the document, I understand that
>>> > > solr will post the file to tika, and since tika 1.7, OCR is included. Is
>>> > > there something I misunderstood?
>>> > >
>>> > > Trung.
>>> > >
>>> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
>>> > > wrote:
>>> > >
>>> > > > Hi Trung,
>>> > > >
>>> > > > solr-cell (tika) does not do OCR. It cannot extract text from image based
>>> > > > pdfs.
>>> > > >
>>> > > > Ahmet
>>> > > >
>>> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht <trung...@anlab.vn> wrote:
>>> > > >
>>> > > > Hi,
>>> > > >
>>> > > > I want to use solr to index some scanned document, after settings solr
>>> > > > document with a two field "content" and "filename", I tried to upload the
>>> > > > attached file, but it seems that the content of the file is only "\n \n \n....".
>>> > > > But if I used the tesseract from command line I got the result correctly.
>>> > > >
>>> > > > The log when solr receive my request:
>>> > > > -----------
>>> > > > INFO - 2015-04-23 03:49:25.941;
>>> > > > org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
>>> > > > webapp=/solr path=/update/extract
>>> > > > params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
>>> > > >
>>> > > > ------------
>>> > > >
>>> > > > The document when I check on solr admin page:
>>> > > > -------------
>>> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
>>> > > > "createddate": "2015-04-22T15:00:00Z",
>>> > > > "filename": "\\\\trunght\\test\\tesseract_3.png",
>>> > > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ],
>>> > > > "content": " \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ",
>>> > > > "_version_": 1499213034586898400 }
>>> > > >
>>> > > > -----------
>>> > > >
>>> > > > Since I am a solr newbie I do not know where to look, can anyone give me
>>> > > > an advice for where to look for error or settings to make it work.
>>> > > > Thanks in advance.
>>> > > >
>>> > > > Trung.
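
For anyone retracing the quoted report, the following is a rough sketch of how the request from the log could be rebuilt and how the symptom (a "content" field holding only whitespace and newlines) could be detected. The path /update/extract and the parameter names come from the log above; the host and port in the usage part are assumptions about a default local install, and both helper names are made up for illustration.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, params):
    """Join a core URL, the /update/extract path from the log, and the query parameters."""
    return f"{base_url}/update/extract?{urlencode(params)}"


def is_blank_extraction(doc, field="content"):
    """Return True when the extracted field is missing or holds only whitespace."""
    value = doc.get(field, "")
    if isinstance(value, list):
        # Multi-valued fields come back as lists; join them before checking.
        value = " ".join(str(item) for item in value)
    return str(value).strip() == ""


if __name__ == "__main__":
    # Assumption: Solr listens on localhost:8983; "collection1" is taken from the log.
    url = build_extract_url(
        "http://localhost:8983/solr/collection1",
        {"literal.id": "4", "fmap.content": "content", "commit": "true", "wt": "json"},
    )
    print(url)

    # Shortened copy of the indexed document from the report.
    indexed = {"id": "4", "content": " \n \n \n \n "}
    if is_blank_extraction(indexed):
        print("content is blank: OCR did not produce any text for this file")
```

A blank result like this only shows that no text reached the index; it does not say whether Tesseract was never called or was called and failed, so the Solr log is still the place to look next.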