Re: TIKA OCR not working

trung.ht Tue, 28 Apr 2015 20:55:07 -0700

Hi Uwe,

Today I downloaded Solr 5.1 and it worked fine. It seems that the bug fix
SOLR-7139 is only included in 5.1, not 5.0.
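
For anyone who finds this thread later, here is a rough Python sketch of the
two checks discussed below: whether the tesseract binary is visible on PATH,
and what the /update/extract request looks like. The base URL
(http://localhost:8983/solr/collection1) and the helper names are my own
assumptions for illustration; the request parameters are the ones from my
log further down.

```python
import shutil
from urllib.parse import urlencode


def find_tesseract(binary="tesseract"):
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(binary)


def build_extract_url(base_url, doc_id, filename):
    """Build a Solr Cell /update/extract URL (hypothetical helper).

    fmap.content=content maps the text extracted by Tika to the
    "content" field, as in the original request.
    """
    params = {
        "literal.id": doc_id,
        "literal.filename": filename,
        "fmap.content": "content",
        "commit": "true",
        "wt": "json",
    }
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Assumed local Solr core URL; adjust to your own setup.
    print("tesseract found at:", find_tesseract())
    print(build_extract_url("http://localhost:8983/solr/collection1",
                            "4", "tesseract_3.png"))
```

Note that this only checks the PATH of the shell that runs the script; the
Solr process may be started with a different environment.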


Thanks, everyone, for your support.

Trung.

On Tue, Apr 28, 2015 at 10:21 AM, trung.ht <trung...@anlab.vn> wrote:

> Hi Uwe,
>
> Thanks for the answer, but it looks like it does not work on my machine.
>
> I use Mac OS 10.10.3, tesseract is installed through homebrew, and tested
> with the same file I post to solr.
> I think tesseract is on path since I run this command successfully: "tesseract
> test_tesseract.png output"
>
> On command line, I got correct result (output is the correct content of
> the image), but when I upload to solr, the content is only some new line
> characters. (I used
>
> As for the log file, I did not see anything abnormal in the Solr log
> (nothing abnormal after my POST request). Am I missing another log file?
>
> With best regards,
> Trung.
>
>
> On Mon, Apr 27, 2015 at 9:34 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>
>> Hi,
>> TIKA OCR is definitely working automatically with Solr 5.x.
>>
>> It is just important to install TesseractOCR on the path (it is a native
>> tool that does the actual work). On Ubuntu Linux, this should be quite
>> simple ("apt-get install tesseract-ocr" or something like that). You may
>> also need to install additional languages for better results.
>>
>> As long as the native tools are installed and you are not on a
>> Turkish-localized machine (which triggers a bug in the JDK when spawning
>> external processes), it should work out of the box, with no configuration
>> needed. Please also check the log files.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>> > -----Original Message-----
>> > From: Allison, Timothy B. [mailto:talli...@mitre.org]
>> > Sent: Monday, April 27, 2015 4:27 PM
>> > To: u...@tika.apache.org
>> > Cc: trung...@anlab.vn; solr-user@lucene.apache.org
>> > Subject: FW: TIKA OCR not working
>> >
>> > Trung,
>> >
>> > I haven't experimented with our OCR parser yet, but this should give a
>> > good start: https://wiki.apache.org/tika/TikaOCR .
>> >
>> > Have you installed tesseract?
>> >
>> > Tika colleagues,
>> >   Any other tips?  What else has to be configured and how?
>> >
>> > -----Original Message-----
>> > From: trung.ht [mailto:trung...@anlab.vn]
>> > Sent: Friday, April 24, 2015 11:22 PM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: TIKA OCR not working
>> >
>> > Hi everyone,
>> >
>> > Does anyone have the answer for this problem :)?
>> >
>> >
>> > I saw the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses
>> > > Tika 1.7, but it looks like it does not work. Does anyone know whether
>> > > Tika OCR works automatically with Solr, or do I have to change some
>> > > settings?
>> > >
>> > >>
>> > Trung.
>> >
>> >
>> > > It's not clear if OCR would happen automatically in Solr Cell, or if
>> > >> changes to Solr would be needed.
>> > >>
>> > >> For Tika OCR info, see:
>> > >>
>> > >> https://issues.apache.org/jira/browse/TIKA-93
>> > >> https://wiki.apache.org/tika/TikaOCR
>> > >>
>> > >>
>> > >>
>> > >> -- Jack Krupansky
>> > >>
>> > >> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch
>> > >> <arafa...@gmail.com> wrote:
>> > >>
>> > >> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't
>> > >> > seen it in use yet.
>> > >> >
>> > >> > Regards,
>> > >> >     Alex
>> > >> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <iori...@yahoo.com.invalid>
>> > >> > wrote:
>> > >> >
>> > >> > > Hi Trung,
>> > >> > >
>> > >> > > I didn't know about OCR capabilities of tika.
>> > >> > > Someone who is familiar with solr-cell can inform us whether this
>> > >> > > functionality is added to solr or not.
>> > >> > >
>> > >> > > Ahmet
>> > >> > >
>> > >> > >
>> > >> > >
>> > >> > > On Thursday, April 23, 2015 2:06 PM, trung.ht <trung...@anlab.vn>
>> > >> > > wrote:
>> > >> > > Hi Ahmet,
>> > >> > >
>> > >> > > I used a png file, not a pdf file. From the documentation, I
>> > >> > > understand that Solr will post the file to Tika, and since Tika 1.7,
>> > >> > > OCR is included. Is there something I misunderstood?
>> > >> > >
>> > >> > > Trung.
>> > >> > >
>> > >> > >
>> > >> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan
>> > >> > > <iori...@yahoo.com.invalid> wrote:
>> > >> > >
>> > >> > > > Hi Trung,
>> > >> > > >
>> > >> > > > solr-cell (tika) does not do OCR. It cannot extract text from
>> > >> > > > image-based PDFs.
>> > >> > > >
>> > >> > > > Ahmet
>> > >> > > >
>> > >> > > >
>> > >> > > >
>> > >> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht
>> > >> > > > <trung...@anlab.vn> wrote:
>> > >> > > >
>> > >> > > >
>> > >> > > >
>> > >> > > > Hi,
>> > >> > > >
>> > >> > > > I want to use Solr to index some scanned documents. After setting up
>> > >> > > > a Solr document with two fields, "content" and "filename", I tried to
>> > >> > > > upload the attached file, but it seems that the content of the file
>> > >> > > > is only "\n \n \n....".
>> > >> > > > But if I use tesseract from the command line, I get the correct
>> > >> > > > result.
>> > >> > > >
>> > >> > > > The log when Solr receives my request:
>> > >> > > > -----------
>> > >> > > > INFO  - 2015-04-23 03:49:25.941;
>> > >> > > > org.apache.solr.update.processor.LogUpdateProcessor;
>> > >> > > > [collection1] webapp=/solr path=/update/extract
>> > >> > > > params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
>> > >> > > >
>> > >> > > > ------------
>> > >> > > >
>> > >> > > > The document when I check on solr admin page:
>> > >> > > > -------------
>> > >> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
>> > >> > > > "createddate": "2015-04-22T15:00:00Z",
>> > >> > > > "filename": "\\\\trunght\\test\\tesseract_3.png",
>> > >> > > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ],
>> > >> > > > "content": "
>> > >> > > > \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
>> > >> > > > \n \n \n
>> > >> > > > \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
>> > >> > > > \n \n ",
>> > >> > > > "_version_": 1499213034586898400 }
>> > >> > > >
>> > >> > > > -----------
>> > >> > > >
>> > >> > > > Since I am a Solr newbie, I do not know where to look. Can anyone
>> > >> > > > advise me on where to look for errors, or which settings would make
>> > >> > > > it work?
>> > >> > > > Thanks in advance.
>> > >> > > >
>> > >> > > > Trung.
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>>
>>
>
