RE: TIKA OCR not working

Uwe Schindler Mon, 27 Apr 2015 07:36:53 -0700

Hi,
TIKA OCR is definitely working automatically with Solr 5.x.

The important part is to install TesseractOCR (the native tool that does the 
actual work) and to make sure it is on the PATH. On Ubuntu Linux, this should 
be quite simple ("apt-get install tesseract-ocr" or similar). You may also need 
to install additional language packs for better results.
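
A quick way to verify the installation is a small script like the one below. 
This is only a sketch under a few assumptions: the binary is called 
"tesseract", it supports the "--list-langs" option (older releases may not), 
and the helper names (find_tesseract, parse_langs, installed_langs) are made up 
for illustration and are not part of Tika or Solr.

```python
import shutil
import subprocess


def find_tesseract(name="tesseract"):
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(name)


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` style output.

    Assumption: the first line is a header ("List of available languages ...")
    and the remaining non-empty lines are language codes such as "eng".
    Any warning lines the tool prints would be picked up as well.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs(binary):
    """Run the binary and return its language packs (empty list on failure)."""
    try:
        result = subprocess.run(
            [binary, "--list-langs"],
            capture_output=True, text=True, timeout=30, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    # Depending on the release, the list may go to stdout or to stderr,
    # so both streams are parsed.
    return parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract is not on PATH, so OCR will most likely be skipped")
    else:
        print("tesseract found at", path)
        print("language packs:", installed_langs(path))
```

Run it as the same user (and with the same environment) that starts Solr, 
because the PATH of your login shell may differ from the PATH the Solr process 
sees.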


As long as the native tools are installed and you are not on a Turkish-localized 
machine (which triggers a bug in the JDK when spawning external processes), it 
should work out of the box, no configuration needed. Please also check the log 
files.
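
To rule out the locale issue, you can look at the environment the Solr process 
is started with. The following is a rough heuristic only: it assumes a 
Unix-like system where the locale is selected through the usual LC_* and LANG 
variables, and the function name is hypothetical. It does not detect the JDK 
bug itself, it only tells you whether a Turkish locale is configured.

```python
import os

# Environment variables that commonly select the locale on Unix-like systems.
LOCALE_VARS = ("LC_ALL", "LC_CTYPE", "LC_MESSAGES", "LANG")


def turkish_locale_vars(env=None):
    """Return the locale variables whose language part is "tr".

    `env` is a mapping of environment variables; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    hits = {}
    for var in LOCALE_VARS:
        value = env.get(var, "")
        # "tr_TR.UTF-8" -> "tr_TR" -> "tr"
        language = value.split(".")[0].split("_")[0].lower()
        if language == "tr":
            hits[var] = value
    return hits


if __name__ == "__main__":
    hits = turkish_locale_vars()
    if hits:
        print("Turkish locale configured via:", hits)
    else:
        print("No Turkish locale found in", ", ".join(LOCALE_VARS))
```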

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]


> -----Original Message-----
> From: Allison, Timothy B. [mailto:[email protected]]
> Sent: Monday, April 27, 2015 4:27 PM
> To: [email protected]
> Cc: [email protected]; [email protected]
> Subject: FW: TIKA OCR not working
> 
> Trung,
> 
> I haven't experimented with our OCR parser yet, but this should give a good
> start: https://wiki.apache.org/tika/TikaOCR .
> 
> Have you installed tesseract?
> 
> Tika colleagues,
>   Any other tips?  What else has to be configured and how?
> 
> -----Original Message-----
> From: trung.ht [mailto:[email protected]]
> Sent: Friday, April 24, 2015 11:22 PM
> To: [email protected]
> Subject: Re: TIKA OCR not working
> 
> Hi everyone,
> 
> Does anyone have the answer for this problem :)?
> 
> 
> I read the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses Tika 1.7,
> > but it looks like it does not work. Does anyone know whether TIKA OCR
> > works automatically with Solr, or do I have to change some settings?
> >
> >>
> Trung.
> 
> 
> > It's not clear if OCR would happen automatically in Solr Cell, or if
> >> changes to Solr would be needed.
> >>
> >> For Tika OCR info, see:
> >>
> >> https://issues.apache.org/jira/browse/TIKA-93
> >> https://wiki.apache.org/tika/TikaOCR
> >>
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <
> >> [email protected]>
> >> wrote:
> >>
> >> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't
> >> > seen
> >> it
> >> > in use yet.
> >> >
> >> > Regards,
> >> >     Alex
> >> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <[email protected]>
> >> wrote:
> >> >
> >> > > Hi Trung,
> >> > >
> >> > > I didn't know about OCR capabilities of tika.
> >> > > Someone who is familiar with solr-cell can inform us whether this
> >> > > functionality is added to solr or not.
> >> > >
> >> > > Ahmet
> >> > >
> >> > >
> >> > >
> >> > > On Thursday, April 23, 2015 2:06 PM, trung.ht <[email protected]>
> >> wrote:
> >> > > Hi Ahmet,
> >> > >
> >> > > I used a png file, not a pdf file. From the document, I
> >> > > understand
> >> that
> >> > > solr will post the file to tika, and since tika 1.7, OCR is included.
> >> Is
> >> > > there something I misunderstood.
> >> > >
> >> > > Trung.
> >> > >
> >> > >
> >> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan
> >> <[email protected]
> >> > >
> >> > > wrote:
> >> > >
> >> > > > Hi Trung,
> >> > > >
> >> > > > solr-cell (tika) does not do OCR. It cannot extract text from
> >> > > > image
> >> based
> >> > > > pdfs.
> >> > > >
> >> > > > Ahmet
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht
> >> > > > <[email protected]>
> >> > wrote:
> >> > > >
> >> > > >
> >> > > >
> >> > > > Hi,
> >> > > >
> >> > > > I want to use solr to index some scanned document, after
> >> > > > settings
> >> solr
> >> > > > document with a two field "content" and "filename", I tried to
> >> upload
> >> > the
> >> > > > attached file, but it seems that the content of the file is
> >> > > > only
> >> "\n \n
> >> > > > \n....".
> >> > > > But if I used the tesseract from command line I got the result
> >> > correctly.
> >> > > >
> >> > > > The log when solr receive my request:
> >> > > > -----------
> >> > > > INFO  - 2015-04-23 03:49:25.941;
> >> > > > org.apache.solr.update.processor.LogUpdateProcessor;
> >> > > > [collection1] webapp=/solr path=/update/extract
> >> > > > params={literal.groupid=2&json.nl
> >> > > =flat&
> >> > > > resource.name=phplNiPrs&literal.id
> >> > > >
> >> > >
> >> >
> >>
> =4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&
> >> literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.conten
> >> t=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
> >> > > >
> >> > > > ------------
> >> > > >
> >> > > > The document when I check on solr admin page:
> >> > > > -------------
> >> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
> >> "createddate":
> >> > > > "2015-04-22T15:00:00Z", "filename":
> >> > "\\\\trunght\\test\\tesseract_3.png",
> >> > > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ],
> >> > > "content": "
> >> > > > \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
> >> > > > \n
> >> \n
> >> > \n
> >> > > > \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
> >> > > > \n
> >> \n
> >> > ",
> >> > > > "_version_": 1499213034586898400 }
> >> > > >
> >> > > > -----------
> >> > > >
> >> > > > Since I am a solr newbie I do not know where to look, can
> >> > > > anyone
> >> give
> >> > me
> >> > > > an advice for where to look for error or settings to make it work.
> >> > > > Thanks in advance.
> >> > > >
> >> > > > Trung.
> >> > > >
> >> > >
> >> >
> >>
> >
> >
