Re: TIKA OCR not working

trung.ht Thu, 23 Apr 2015 19:11:53 -0700

Hi Jack, Alexandre,

Thanks for answering.
I read the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses
Tika 1.7, but it does not seem to work. Does anyone know whether Tika OCR
works automatically with Solr, or do I have to change some settings?
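
One thing that might be worth ruling out on my side: as far as I understand,
Tika's OCR parser runs the external tesseract program, so the Solr process
itself has to be able to find it, not just my shell. Below is a minimal
sketch of that check in Python. The helper name find_tesseract is only for
illustration and is not part of Tika or Solr, and the result reflects the
PATH of whoever runs the script, so it would need to run as the same user
and in the same environment as Solr.

```python
import shutil
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None.

    Hypothetical helper for illustration; it only inspects the PATH of the
    current process and says nothing about how Solr or Tika are configured.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH for this user")
    else:
        print(f"tesseract found at {path}")
```

If this reports that tesseract is missing for the Solr user, that could be
one possible reason for the empty content, although I have not confirmed it.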


Trung.




On Thu, Apr 23, 2015 at 10:02 PM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> It's not clear if OCR would happen automatically in Solr Cell, or if
> changes to Solr would be needed.
>
> For Tika OCR info, see:
>
> https://issues.apache.org/jira/browse/TIKA-93
> https://wiki.apache.org/tika/TikaOCR
>
>
>
> -- Jack Krupansky
>
> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen it
> > in use yet.
> >
> > Regards,
> >     Alex
> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <iori...@yahoo.com.invalid>
> > wrote:
> >
> > > Hi Trung,
> > >
> > > I didn't know about the OCR capabilities of Tika.
> > > Someone who is familiar with solr-cell can tell us whether this
> > > functionality has been added to Solr or not.
> > >
> > > Ahmet
> > >
> > >
> > >
> > > On Thursday, April 23, 2015 2:06 PM, trung.ht <trung...@anlab.vn>
> > > wrote:
> > > Hi Ahmet,
> > >
> > > I used a png file, not a pdf file. From the documentation, I understand
> > > that Solr posts the file to Tika, and since Tika 1.7, OCR is included.
> > > Is there something I misunderstood?
> > >
> > > Trung.
> > >
> > >
> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
> > > wrote:
> > >
> > > > Hi Trung,
> > > >
> > > > solr-cell (Tika) does not do OCR. It cannot extract text from
> > > > image-based PDFs.
> > > >
> > > > Ahmet
> > > >
> > > >
> > > >
> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht <trung...@anlab.vn>
> > > > wrote:
> > > >
> > > >
> > > >
> > > > Hi,
> > > >
> > > > I want to use Solr to index some scanned documents. After setting up a
> > > > Solr document with two fields, "content" and "filename", I tried to
> > > > upload the attached file, but the content of the file seems to be only
> > > > "\n \n \n....".
> > > > But when I ran tesseract from the command line, I got the correct result.
> > > >
> > > > The log when Solr receives my request:
> > > > -----------
> > > > INFO  - 2015-04-23 03:49:25.941;
> > > > org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
> > > > webapp=/solr path=/update/extract params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
> > > >
> > > > ------------
> > > >
> > > > The document as shown on the Solr admin page:
> > > > -------------
> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3, "createddate":
> > > > "2015-04-22T15:00:00Z", "filename": "\\\\trunght\\test\\tesseract_3.png",
> > > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ], "content": "
> > > > \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n \n
> > > > \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n \n ",
> > > > "_version_": 1499213034586898400 }
> > > >
> > > > -----------
> > > >
> > > > Since I am a Solr newbie, I do not know where to look. Can anyone
> > > > advise me on where to look for errors, or which settings to change to
> > > > make it work?
> > > > Thanks in advance.
> > > >
> > > > Trung.
> > > >
> > >
> >
>
