Re: TIKA OCR not working

Jack Krupansky Thu, 23 Apr 2015 08:09:08 -0700

It's not clear if OCR would happen automatically in Solr Cell, or if
changes to Solr would be needed.


For Tika OCR info, see:

https://issues.apache.org/jira/browse/TIKA-93
https://wiki.apache.org/tika/TikaOCR
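
As a rough diagnostic sketch (not a confirmed fix): as far as I know, Tika's OCR support calls out to the tesseract executable, so it is worth checking that the binary is visible on the PATH of the user running Solr, and then asking Solr Cell to return the extracted text instead of indexing it. The Python below only performs the PATH check and builds an `extractOnly=true` request URL. The host, port, and helper names are assumptions for illustration; `collection1` and `tesseract_3.png` are taken from the log quoted below.

```python
import shutil
from urllib.parse import urlencode


def tesseract_available(binary="tesseract"):
    """Return True if the given executable is found on the current PATH.

    Note: this checks the PATH of the process running this script, which
    may differ from the PATH seen by the Solr/servlet container process.
    """
    return shutil.which(binary) is not None


def extract_only_url(base_url, resource_name):
    """Build an /update/extract URL that returns the extracted text
    (extractOnly=true) rather than indexing the document."""
    params = {
        "extractOnly": "true",        # return Tika's output, do not index
        "extractFormat": "text",      # plain text instead of XHTML
        "resource.name": resource_name,
        "wt": "json",
    }
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical local Solr URL; adjust host, port, and core name.
    print("tesseract on PATH:", tesseract_available())
    print(extract_only_url("http://localhost:8983/solr/collection1",
                           "tesseract_3.png"))
```

Posting the png to the printed URL (for example with curl) should show whether any text comes back from Tika at all, which separates "OCR is not running" from "the field mapping is wrong".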



-- Jack Krupansky

On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen it
> in use yet.
>
> Regards,
>     Alex
> On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <iori...@yahoo.com.invalid> wrote:
>
> > Hi Trung,
> >
> > I didn't know about the OCR capabilities of Tika.
> > Someone who is familiar with solr-cell can tell us whether this
> > functionality has been added to Solr or not.
> >
> > Ahmet
> >
> >
> >
> > On Thursday, April 23, 2015 2:06 PM, trung.ht <trung...@anlab.vn> wrote:
> > Hi Ahmet,
> >
> > I used a png file, not a pdf file. From the documentation, I understand that
> > Solr posts the file to Tika, and since Tika 1.7, OCR is included. Is
> > there something I misunderstood?
> >
> > Trung.
> >
> >
> > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
> > wrote:
> >
> > > Hi Trung,
> > >
> > > solr-cell (Tika) does not do OCR. It cannot extract text from image-based
> > > PDFs.
> > >
> > > Ahmet
> > >
> > >
> > >
> > > On Thursday, April 23, 2015 7:33 AM, trung.ht <trung...@anlab.vn> wrote:
> > >
> > >
> > >
> > > Hi,
> > >
> > > I want to use Solr to index some scanned documents. After setting up a Solr
> > > document with two fields, "content" and "filename", I tried to upload the
> > > attached file, but it seems that the content of the file is only "\n \n
> > > \n....".
> > > But if I run tesseract from the command line, I get the correct result.
> > >
> > > The log when Solr receives my request:
> > > -----------
> > > INFO  - 2015-04-23 03:49:25.941;
> > > org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
> > > webapp=/solr path=/update/extract params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
> > >
> > > ------------
> > >
> > > The document when I check on solr admin page:
> > > -------------
> > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3, "createddate":
> > > "2015-04-22T15:00:00Z", "filename": "\\\\trunght\\test\\tesseract_3.png",
> > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ], "content": "
> > > \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
> > > \n
> > > \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n \n
> > > ",
> > > "_version_": 1499213034586898400 }
> > >
> > > -----------
> > >
> > > Since I am a Solr newbie, I do not know where to look. Can anyone advise
> > > me on where to look for errors, or which settings would make it work?
> > > Thanks in advance.
> > >
> > > Trung.
> > >
> >
>
