Hi Trung,

Solr Cell (Tika) does not do OCR. It cannot extract text from image-based PDFs.
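
Since tesseract already gives you correct text on the command line, one possible workaround is to run the OCR yourself and send the result to Solr as an ordinary document instead of going through /update/extract. The following is only a rough sketch under assumptions: the Solr URL, the helper names (ocr_image, build_update_request, index_scanned_file) and the minimal set of fields are all made up for illustration, and your schema may require more fields (groupid, userid and so on).

```python
import json
import subprocess
import urllib.request

# Assumed URL: adjust host, port and core name to your own setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def ocr_image(path):
    """Run the tesseract CLI on an image and return the recognized text.

    `tesseract <image> stdout` prints the text instead of writing a file.
    """
    result = subprocess.run(
        ["tesseract", path, "stdout"],
        stdout=subprocess.PIPE,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout.decode("utf-8")


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON update request for Solr."""
    body = json.dumps(docs).encode("utf-8")
    # Passing data makes urllib issue a POST request.
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
    )


def index_scanned_file(doc_id, path):
    """OCR one image and index its text in the "content" field."""
    doc = {"id": doc_id, "filename": path, "content": ocr_image(path)}
    with urllib.request.urlopen(build_update_request([doc])) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical id and file name, taken from the request in your mail.
    print(index_scanned_file("4", "tesseract_3.png"))
```

The request building is kept separate from the OCR step so that each part can be checked on its own before pointing it at a real Solr instance.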

Ahmet 



On Thursday, April 23, 2015 7:33 AM, trung.ht <trung...@anlab.vn> wrote:



Hi,

I want to use Solr to index some scanned documents. After setting up a Solr 
document with two fields, "content" and "filename", I tried to upload the 
attached file, but the content of the file seems to be only "\n \n \n....". 
However, when I run tesseract from the command line, I get the correct result.

The log when Solr receives my request:
-----------
INFO  - 2015-04-23 03:49:25.941; 
org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr 
path=/update/extract 
params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}

------------

The document as shown on the Solr admin page:
-------------
{ "groupid": 2, "id": "4", "historyid": 4, "userid": 3, "createddate": 
"2015-04-22T15:00:00Z", "filename": "\\\\trunght\\test\\tesseract_3.png", 
"autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ], "content": " \n 
\n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  
\n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n \n  ", "_version_": 
1499213034586898400 }

-----------

Since I am a Solr newbie, I do not know where to look. Can anyone advise me 
where to look for errors, or which settings to change to make it work?
Thanks in advance.

Trung.
