Hi Ahmet,

I used a PNG file, not a PDF file. From the documentation, I understand that
Solr posts the file to Tika, and that since Tika 1.7, OCR is included. Is
there something I misunderstood?
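
One way to narrow this down is to ask Solr Cell for the extracted text
without indexing it, using the extractOnly=true parameter. Below is a minimal
sketch (Python, standard library only) that builds such a request URL. The
host, port, and the function name build_extract_url are assumptions for
illustration; only the /update/extract path, the collection1 name, and the
parameter names come from the log further down.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, filename, extract_only=True):
    """Build a Solr Cell /update/extract URL (hypothetical helper).

    base_url is an assumed core URL, e.g. http://localhost:8983/solr/collection1.
    With extractOnly=true, Solr returns the extracted content in the response
    instead of indexing the document.
    """
    params = {
        "resource.name": filename,  # hint used for content-type detection
        "extractOnly": "true" if extract_only else "false",
        "wt": "json",
    }
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Assumed local Solr instance; adjust host, port, and core as needed.
    print(build_extract_url("http://localhost:8983/solr/collection1",
                            "tesseract_3.png"))
```

Posting the PNG to that URL should show whether the response body contains
any recognized text or only whitespace, independent of the field mapping.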

Trung.

On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
wrote:

> Hi Trung,
>
> solr-cell (tika) does not do OCR. It cannot extract text from image-based
> pdfs.
>
> Ahmet
>
>
>
> On Thursday, April 23, 2015 7:33 AM, trung.ht <trung...@anlab.vn> wrote:
>
>
>
> Hi,
>
> I want to use Solr to index some scanned documents. After setting up a Solr
> document with two fields, "content" and "filename", I tried to upload the
> attached file, but the content of the file seems to be only "\n \n
> \n....".
> However, when I run tesseract from the command line, I get the correct result.
>
> The log when Solr receives my request:
> -----------
> INFO  - 2015-04-23 03:49:25.941;
> org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
> webapp=/solr path=/update/extract params={literal.groupid=2&json.nl=flat&
> resource.name=phplNiPrs&literal.id
> =4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
>
> ------------
>
> The document as shown on the Solr admin page:
> -------------
> { "groupid": 2, "id": "4", "historyid": 4, "userid": 3, "createddate":
> "2015-04-22T15:00:00Z", "filename": "\\\\trunght\\test\\tesseract_3.png",
> "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ], "content": "
> \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
> \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n \n  ",
> "_version_": 1499213034586898400 }
>
> -----------
>
> Since I am a Solr newbie, I do not know where to look. Can anyone advise me
> on where to look for errors, or which settings to change to make it work?
> Thanks in advance.
>
> Trung.
>
