Re: TIKA OCR not working

Erick Erickson Wed, 29 Apr 2015 06:56:14 -0700

Yes, the critical bit for knowing what release a JIRA is in is the
"Fix Version/s" entry.
You have to be a little careful though to only read that when the
Resolution is "Fixed",
as the "fix version" is sometimes set while the JIRA is still open.
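
For anyone who wants to apply that rule mechanically, here is a minimal sketch in Python. It assumes the issue has already been fetched as JSON in the shape the JIRA REST API returns (a "fields" object holding "resolution" and "fixVersions"). The function name and the sample issues are made up for illustration and are not real JIRA data.

```python
def released_fix_versions(issue):
    """Return the fix version names, but only when the resolution is "Fixed".

    `issue` is assumed to be a dict shaped like a JIRA REST API issue.
    """
    fields = issue.get("fields", {})
    # An unresolved issue has a null resolution, so fall back to an empty dict.
    resolution = fields.get("resolution") or {}
    if resolution.get("name") != "Fixed":
        # The fix version may already be set on an open issue; do not trust it.
        return []
    return [version["name"] for version in fields.get("fixVersions", [])]


# Hypothetical sample issues, for illustration only.
resolved = {
    "fields": {
        "resolution": {"name": "Fixed"},
        "fixVersions": [{"name": "5.1"}],
    }
}
still_open = {
    "fields": {
        "resolution": None,
        "fixVersions": [{"name": "5.1"}],
    }
}

print(released_fix_versions(resolved))
print(released_fix_versions(still_open))
```

The second sample is the case to watch for: a fix version is present, but since the issue is not resolved the function reports nothing.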


On Tue, Apr 28, 2015 at 8:52 PM, trung.ht <trung...@anlab.vn> wrote:
> Hi Uwe,
>
> Today, I downloaded Solr 5.1 and it worked fine. It seems that this bug fix
> SOLR-7139 is only included in 5.1, not 5.0.
>
> Thanks, everyone, for your support.
>
> Trung.
>
> On Tue, Apr 28, 2015 at 10:21 AM, trung.ht <trung...@anlab.vn> wrote:
>
>> Hi Uwe,
>>
>> Thanks for the answer, but it looks like it does not work on my machine.
>>
>> I use Mac OS 10.10.3. tesseract is installed through homebrew, and I tested it
>> with the same file I posted to solr.
>> I think tesseract is on the path, since I ran this command successfully:
>> "tesseract test_tesseract.png output"
>>
>> On the command line, I got the correct result (the output is the correct content of
>> the image), but when I upload to solr, the content is only some newline
>> characters. (I used
>>
>> As for the logs, I did not see anything abnormal in the solr log file (nothing
>> unusual after my POST request). Am I missing another log file?
>>
>> With best regards,
>> Trung.
>>
>>
>> On Mon, Apr 27, 2015 at 9:34 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>>
>>> Hi,
>>> TIKA OCR is definitely working automatically with Solr 5.x.
>>>
>>> It is just important to install TesseractOCR on the path (it is the native
>>> tool that does the actual work). On Ubuntu Linux, this should be quite
>>> simple ("apt-get install tesseract-ocr" or something like that). You may also
>>> need to install additional languages for better results.
>>>
>>> As long as the native tools are installed and you are not on a Turkish
>>> localized machine (which triggers a bug in the JDK when spawning external
>>> processes), it should work out of the box, with no configuration needed.
>>> Please also check the log files.
>>>
>>> Uwe
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>>
>>>
>>> > -----Original Message-----
>>> > From: Allison, Timothy B. [mailto:talli...@mitre.org]
>>> > Sent: Monday, April 27, 2015 4:27 PM
>>> > To: u...@tika.apache.org
>>> > Cc: trung...@anlab.vn; solr-user@lucene.apache.org
>>> > Subject: FW: TIKA OCR not working
>>> >
>>> > Trung,
>>> >
>>> > I haven't experimented with our OCR parser yet, but this should give a good
>>> > start: https://wiki.apache.org/tika/TikaOCR .
>>> >
>>> > Have you installed tesseract?
>>> >
>>> > Tika colleagues,
>>> >   Any other tips?  What else has to be configured and how?
>>> >
>>> > -----Original Message-----
>>> > From: trung.ht [mailto:trung...@anlab.vn]
>>> > Sent: Friday, April 24, 2015 11:22 PM
>>> > To: solr-user@lucene.apache.org
>>> > Subject: Re: TIKA OCR not working
>>> >
>>> > Hi everyone,
>>> >
>>> > Does anyone have an answer to this problem? :)
>>> >
>>> >
>>> > > I read the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses
>>> > > Tika 1.7, but it looks like it does not work. Does anyone know whether
>>> > > TIKA OCR works automatically with Solr, or do I have to change some settings?
>>> > >
>>> > >>
>>> > Trung.
>>> >
>>> >
>>> > >> It's not clear if OCR would happen automatically in Solr Cell, or if
>>> > >> changes to Solr would be needed.
>>> > >>
>>> > >> For Tika OCR info, see:
>>> > >>
>>> > >> https://issues.apache.org/jira/browse/TIKA-93
>>> > >> https://wiki.apache.org/tika/TikaOCR
>>> > >>
>>> > >>
>>> > >>
>>> > >> -- Jack Krupansky
>>> > >>
>>> > >> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>> > >>
>>> > >> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen it
>>> > >> > in use yet.
>>> > >> >
>>> > >> > Regards,
>>> > >> >     Alex
>>> > >> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <iori...@yahoo.com.invalid> wrote:
>>> > >> >
>>> > >> > > Hi Trung,
>>> > >> > >
>>> > >> > > I didn't know about OCR capabilities of tika.
>>> > >> > > Someone who is familiar with solr-cell can inform us whether this
>>> > >> > > functionality is added to solr or not.
>>> > >> > >
>>> > >> > > Ahmet
>>> > >> > >
>>> > >> > >
>>> > >> > >
>>> > >> > > On Thursday, April 23, 2015 2:06 PM, trung.ht <trung...@anlab.vn> wrote:
>>> > >> > > Hi Ahmet,
>>> > >> > >
>>> > >> > > I used a png file, not a pdf file. From the documentation, I understand that
>>> > >> > > solr will post the file to tika, and since tika 1.7, OCR is included. Is
>>> > >> > > there something I misunderstood?
>>> > >> > >
>>> > >> > > Trung.
>>> > >> > >
>>> > >> > >
>>> > >> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan <iori...@yahoo.com.invalid> wrote:
>>> > >> > >
>>> > >> > > > Hi Trung,
>>> > >> > > >
>>> > >> > > > solr-cell (tika) does not do OCR. It cannot extract text from image-based
>>> > >> > > > pdfs.
>>> > >> > > >
>>> > >> > > > Ahmet
>>> > >> > > >
>>> > >> > > >
>>> > >> > > >
>>> > >> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht <trung...@anlab.vn> wrote:
>>> > >> > > >
>>> > >> > > >
>>> > >> > > >
>>> > >> > > > Hi,
>>> > >> > > >
>>> > >> > > > I want to use solr to index some scanned documents. After setting up the
>>> > >> > > > solr document with two fields, "content" and "filename", I tried to upload
>>> > >> > > > the attached file, but it seems that the indexed content of the file is
>>> > >> > > > only "\n \n \n....".
>>> > >> > > > But if I use tesseract from the command line, I get the correct result.
>>> > >> > > >
>>> > >> > > > The log when solr receives my request:
>>> > >> > > > -----------
>>> > >> > > > INFO  - 2015-04-23 03:49:25.941;
>>> > >> > > > org.apache.solr.update.processor.LogUpdateProcessor;
>>> > >> > > > [collection1] webapp=/solr path=/update/extract
>>> > >> > > > params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
>>> > >> > > >
>>> > >> > > > ------------
>>> > >> > > >
>>> > >> > > > The document when I check on solr admin page:
>>> > >> > > > -------------
>>> > >> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
>>> > >> > > > "createddate": "2015-04-22T15:00:00Z",
>>> > >> > > > "filename": "\\\\trunght\\test\\tesseract_3.png",
>>> > >> > > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ],
>>> > >> > > > "content": " \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n ",
>>> > >> > > > "_version_": 1499213034586898400 }
>>> > >> > > >
>>> > >> > > > -----------
>>> > >> > > >
>>> > >> > > > Since I am a solr newbie, I do not know where to look. Can anyone
>>> > >> > > > advise me on where to look for errors, or which settings to change to
>>> > >> > > > make it work?
>>> > >> > > > Thanks in advance.
>>> > >> > > >
>>> > >> > > > Trung.
>>> > >> > > >
>>> > >> > >
>>> > >> >
>>> > >>
>>> > >
>>> > >
>>>
>>>
>>
