Re: TIKA OCR not working

Mattmann, Chris A (3980) Mon, 27 Apr 2015 07:33:07 -0700

It should work out of the box in Solr as long as Tesseract is
installed and on the class path. Solr had an issue with it because
Tika sends two startDocument calls, but I fixed that with Uwe, and
the fix shipped in 4.10.4 and in 5.x, I think.
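
A quick way to check the "installed" condition above is to ask whether
the tesseract executable can be found from the environment that starts
Solr. This is a minimal sketch, assuming the binary is named `tesseract`
and is looked up through PATH; the helper names `find_tesseract` and
`describe` are hypothetical and are not part of Tika or Solr.

```python
import shutil


def find_tesseract(executable="tesseract", path=None):
    """Return the absolute path of the executable, or None if it is not found.

    `path` is an optional PATH-style string; None means the current
    process environment is used.
    """
    return shutil.which(executable, path=path)


def describe(executable="tesseract", path=None):
    """Return a one-line, human-readable result of the lookup."""
    found = find_tesseract(executable, path)
    if found is None:
        return f"{executable} not found on PATH; OCR will likely be skipped"
    return f"{executable} found at {found}"


if __name__ == "__main__":
    print(describe())
```

Run it as the same user, and with the same environment, that launches
Solr; a binary visible in an interactive shell is not necessarily
visible to the server process.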


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: <Allison>, "Timothy B." <talli...@mitre.org>
Reply-To: "u...@tika.apache.org" <u...@tika.apache.org>
Date: Monday, April 27, 2015 at 10:26 AM
To: "u...@tika.apache.org" <u...@tika.apache.org>
Cc: "trung...@anlab.vn" <trung...@anlab.vn>, "solr-user@lucene.apache.org"
<solr-user@lucene.apache.org>
Subject: FW: TIKA OCR not working

>Trung,
>
>I haven't experimented with our OCR parser yet, but this should give a
>good start: https://wiki.apache.org/tika/TikaOCR .
>
>Have you installed tesseract?
>
>Tika colleagues,
>  Any other tips?  What else has to be configured and how?
>
>-----Original Message-----
>From: trung.ht [mailto:trung...@anlab.vn]
>Sent: Friday, April 24, 2015 11:22 PM
>To: solr-user@lucene.apache.org
>Subject: Re: TIKA OCR not working
>
>Hi everyone,
>
>Does anyone have an answer to this problem? :)
>
>
>> I read the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses
>> Tika 1.7, but it does not seem to work. Does anyone know whether Tika OCR
>> works automatically with Solr, or whether I have to change some settings?
>>
>Trung.
>
>
>>> It's not clear if OCR would happen automatically in Solr Cell, or if
>>> changes to Solr would be needed.
>>>
>>> For Tika OCR info, see:
>>>
>>> https://issues.apache.org/jira/browse/TIKA-93
>>> https://wiki.apache.org/tika/TikaOCR
>>>
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <
>>> arafa...@gmail.com>
>>> wrote:
>>>
>>> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't
>>> > seen it in use yet.
>>> >
>>> > Regards,
>>> >     Alex
>>> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <iori...@yahoo.com.invalid>
>>> wrote:
>>> >
>>> > > Hi Trung,
>>> > >
>>> > > I didn't know about the OCR capabilities of Tika.
>>> > > Someone who is familiar with solr-cell can tell us whether this
>>> > > functionality has been added to Solr or not.
>>> > >
>>> > > Ahmet
>>> > >
>>> > >
>>> > >
>>> > > On Thursday, April 23, 2015 2:06 PM, trung.ht <trung...@anlab.vn>
>>> wrote:
>>> > > Hi Ahmet,
>>> > >
>>> > > I used a png file, not a pdf file. From the documentation, I understand
>>> > > that Solr will post the file to Tika, and since Tika 1.7, OCR is
>>> > > included. Is there something I misunderstood?
>>> > >
>>> > > Trung.
>>> > >
>>> > >
>>> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan
>>> > > <iori...@yahoo.com.invalid> wrote:
>>> > >
>>> > > > Hi Trung,
>>> > > >
>>> > > > solr-cell (tika) does not do OCR. It cannot extract text from
>>> > > > image-based PDFs.
>>> > > >
>>> > > > Ahmet
>>> > > >
>>> > > >
>>> > > >
>>> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht <trung...@anlab.vn>
>>> > wrote:
>>> > > >
>>> > > >
>>> > > >
>>> > > > Hi,
>>> > > >
>>> > > > I want to use Solr to index some scanned documents. After setting up
>>> > > > a Solr document with two fields, "content" and "filename", I tried to
>>> > > > upload the attached file, but the content of the file comes back as
>>> > > > only "\n \n \n....".
>>> > > > But if I run tesseract from the command line, I get the correct result.
>>> > > >
>>> > > > The log when Solr receives my request:
>>> > > > -----------
>>> > > > INFO  - 2015-04-23 03:49:25.941;
>>> > > > org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
>>> > > > webapp=/solr path=/update/extract
>>> > > > params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
>>> > > >
>>> > > > ------------
>>> > > >
>>> > > > The document as shown on the Solr admin page:
>>> > > > -------------
>>> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
>>> > > > "createddate": "2015-04-22T15:00:00Z",
>>> > > > "filename": "\\\\trunght\\test\\tesseract_3.png",
>>> > > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ],
>>> > > > "content": "
>>> > > > \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
>>> > > > \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n ",
>>> > > > "_version_": 1499213034586898400 }
>>> > > >
>>> > > > -----------
>>> > > >
>>> > > > Since I am a Solr newbie, I do not know where to look. Can anyone
>>> > > > advise me on where to look for errors, or which settings to change
>>> > > > to make it work?
>>> > > > Thanks in advance.
>>> > > >
>>> > > > Trung.
>>> > > >
>>> > >
>>> >
>>>
>>
>>
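
The symptom reported in the original message, a "content" field that
holds nothing but newlines and spaces, can be detected after indexing.
The sketch below assumes the document has already been fetched from Solr
and parsed from JSON into a dict; the function name
`is_blank_extraction` and the sample document are hypothetical, and the
field name "content" follows the fmap.content=content mapping in the
log above.

```python
import json


def is_blank_extraction(doc, field="content"):
    """Return True when the extracted text field is missing or whitespace-only."""
    value = doc.get(field, "")
    if isinstance(value, list):
        # Multi-valued fields come back as lists; join them before checking.
        value = " ".join(str(v) for v in value)
    return str(value).strip() == ""


if __name__ == "__main__":
    # Shortened stand-in for the document shown in the thread.
    sample = json.loads('{"id": "4", "content": " \\n \\n  \\n  \\n "}')
    print(is_blank_extraction(sample))
```

A True result means extraction produced no text, which in this thread
would be consistent with OCR not having run at all.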
