Re: TIKA OCR not working

trung.ht Fri, 24 Apr 2015 20:24:07 -0700

Hi everyone,

Does anyone have an answer to this problem? :)



> I read the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses Tika 1.7,
> but it does not seem to work. Does anyone know whether Tika OCR works
> automatically with Solr, or do I have to change some settings?
>
Trung.


>> It's not clear if OCR would happen automatically in Solr Cell, or if
>> changes to Solr would be needed.
>>
>> For Tika OCR info, see:
>>
>> https://issues.apache.org/jira/browse/TIKA-93
>> https://wiki.apache.org/tika/TikaOCR
>>
>>
>>
>> -- Jack Krupansky
>>
>> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <[email protected]> wrote:
>>
>> > I think OCR is in Tika 1.8, so it might be in Solr 5.?. But I haven't
>> > seen it in use yet.
>> >
>> > Regards,
>> >     Alex
>> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <[email protected]> wrote:
>> >
>> > > Hi Trung,
>> > >
>> > > I didn't know about the OCR capabilities of Tika.
>> > > Someone who is familiar with solr-cell can tell us whether this
>> > > functionality has been added to Solr or not.
>> > >
>> > > Ahmet
>> > >
>> > >
>> > >
>> > > On Thursday, April 23, 2015 2:06 PM, trung.ht <[email protected]> wrote:
>> > > Hi Ahmet,
>> > >
>> > > I used a PNG file, not a PDF file. From the documentation, I understand
>> > > that Solr posts the file to Tika, and since Tika 1.7, OCR is included.
>> > > Is there something I misunderstood?
>> > >
>> > > Trung.
>> > >
>> > >
>> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan <[email protected]> wrote:
>> > >
>> > > > Hi Trung,
>> > > >
>> > > > solr-cell (Tika) does not do OCR. It cannot extract text from
>> > > > image-based PDFs.
>> > > >
>> > > > Ahmet
>> > > >
>> > > >
>> > > >
>> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht <[email protected]> wrote:
>> > > >
>> > > >
>> > > >
>> > > > Hi,
>> > > >
>> > > > I want to use Solr to index some scanned documents. After setting up a
>> > > > Solr document with two fields, "content" and "filename", I tried to
>> > > > upload the attached file, but the content of the file seems to be only
>> > > > "\n \n \n....".
>> > > > But when I ran tesseract from the command line, I got the correct result.
>> > > >
>> > > > The log when Solr receives my request:
>> > > > -----------
>> > > > INFO  - 2015-04-23 03:49:25.941;
>> > > > org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
>> > > > webapp=/solr path=/update/extract params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
>> > > >
>> > > > ------------
>> > > >
>> > > > The document as shown on the Solr admin page:
>> > > > -------------
>> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
>> > > > "createddate": "2015-04-22T15:00:00Z",
>> > > > "filename": "\\\\trunght\\test\\tesseract_3.png",
>> > > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ],
>> > > > "content": "
>> > > > \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
>> \n
>> > \n
>> > > > \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
>> \n
>> > ",
>> > > > "_version_": 1499213034586898400 }
>> > > >
>> > > > -----------
>> > > >
>> > > > Since I am a Solr newbie, I do not know where to look. Can anyone
>> > > > advise me where to look for errors, or which settings to change to
>> > > > make it work?
>> > > > Thanks in advance.
>> > > >
>> > > > Trung.
>> > > >
>> > >
>> >
>>
>
>
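
For anyone debugging the same symptom, here is a minimal sketch in Python of
how one might ask Solr Cell to return what Tika extracted instead of indexing
it, and then check whether that text is blank. The /update/extract path, the
collection1 core, and the resource.name and extractOnly parameters come from
the log quoted above; extractFormat is a Solr Cell parameter that applies when
extractOnly is true. The base URL (localhost:8983) and the helper names
build_extract_only_url and is_blank are assumptions made for illustration,
not part of Solr.

```python
from urllib.parse import urlencode


def build_extract_only_url(base_url, resource_name):
    """Build a Solr Cell URL that returns the extracted text without indexing.

    base_url is the core URL, e.g. http://localhost:8983/solr/collection1
    (an assumed default; adjust to your installation).
    """
    params = {
        "extractOnly": "true",      # return the extraction result, do not index
        "extractFormat": "text",    # plain text instead of the default XHTML
        "resource.name": resource_name,  # file name hint for type detection
        "wt": "json",
    }
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


def is_blank(content):
    """True if the extracted content holds nothing but whitespace."""
    return content.strip() == ""


if __name__ == "__main__":
    url = build_extract_only_url(
        "http://localhost:8983/solr/collection1", "tesseract_3.png"
    )
    print(url)
    # The stored content in the original report looked like this:
    print(is_blank(" \n \n  \n  \n"))
```

The file itself would be sent as the body of a POST to that URL. If the text
that comes back is blank for an image that tesseract reads correctly on the
command line, that suggests the OCR step did not run inside Solr.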
