Re: TIKA OCR not working

Ahmet Arslan Thu, 23 Apr 2015 04:00:38 -0700

Hi Trung,

solr-cell (Tika) does not do OCR. It cannot extract text from image-based PDFs.
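
One possible workaround, since tesseract already gives the right result from the command line, is to run the OCR step outside Solr and index the recognized text as an ordinary field. The following is only a rough sketch under assumptions, not a tested setup: the Solr URL, the core name, and the helper names (ocr_image, build_solr_doc, post_to_solr) are hypothetical, it assumes the tesseract binary is on the PATH, and it assumes the core accepts JSON documents on /update. Any other fields the schema requires (groupid, userid, and so on) would have to be added to the document.

```python
import json
import subprocess
import urllib.request

# Assumed endpoint: host, port, and core name are placeholders, not taken from the thread.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def ocr_image(path):
    """Run the tesseract CLI on an image and return the recognized text."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing it to a .txt file.
    result = subprocess.run(
        ["tesseract", path, "stdout"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")


def build_solr_doc(doc_id, filename, text):
    """Build a document using the field names mentioned in the thread."""
    return {"id": doc_id, "filename": filename, "content": text.strip()}


def post_to_solr(doc, url=SOLR_UPDATE_URL):
    """POST a single document to Solr as a JSON array and return the HTTP status."""
    payload = json.dumps([doc]).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    image_path = "tesseract_3.png"  # example file name from the original message
    text = ocr_image(image_path)
    print(post_to_solr(build_solr_doc("4", image_path, text)))
```

With this approach the image never goes through /update/extract, so the "content" field holds whatever tesseract produced rather than the whitespace returned by the extraction handler.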


Ahmet 



On Thursday, April 23, 2015 7:33 AM, trung.ht <trung...@anlab.vn> wrote:



Hi,

I want to use Solr to index some scanned documents. After setting up a Solr 
document with two fields, "content" and "filename", I tried to upload the 
attached file, but it seems that the content of the file is only "\n \n \n....". 
However, when I ran tesseract from the command line, I got the correct result.

The log when Solr receives my request:
-----------
INFO  - 2015-04-23 03:49:25.941; 
org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr 
path=/update/extract 
params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}

------------

The document as shown on the Solr admin page:
-------------
{ "groupid": 2, "id": "4", "historyid": 4, "userid": 3, "createddate": 
"2015-04-22T15:00:00Z", "filename": "\\\\trunght\\test\\tesseract_3.png", 
"autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ], "content": " \n 
\n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  
\n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n \n  ", "_version_": 
1499213034586898400 }

-----------

Since I am a Solr newbie, I do not know where to look. Can anyone advise me on 
where to look for errors, or which settings to change to make it work?
Thanks in advance.

Trung.
