Hi Uwe,

Today I downloaded Solr 5.1 and it worked fine. It seems that the bug fix SOLR-7139 is only included in 5.1, not in 5.0.
Thank you, everyone, for your support.

Trung.

On Tue, Apr 28, 2015 at 10:21 AM, trung.ht <trung...@anlab.vn> wrote:

> Hi Uwe,
>
> Thanks for the answer, but it looks like it does not work on my machine.
>
> I use Mac OS 10.10.3, tesseract is installed through homebrew, and I tested
> with the same file I post to solr.
> I think tesseract is on the path, since I can run this command successfully:
> "tesseract test_tesseract.png output"
>
> On the command line I got the correct result (the output is the correct
> content of the image), but when I upload to solr, the content is only some
> newline characters. (I used
>
> About the log file: I did not see anything abnormal in the solr log file
> (nothing abnormal after my POST request). Am I missing another log file?
>
> With best regards,
> Trung.
>
>
> On Mon, Apr 27, 2015 at 9:34 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>
>> Hi,
>> TIKA OCR is definitely working automatically with Solr 5.x.
>>
>> It is just important to install TesseractOCR on the path (it is a native
>> tool that does the actual work). On Ubuntu Linux, this should be quite
>> simple ("apt-get install tesseract-ocr" or like that). You may also need to
>> install additional languages for better results.
>>
>> As long as the native tools are installed, and unless you are on a Turkish
>> localized machine (which triggers a bug in the JDK when spawning external
>> processes), it should work OOB, no configuration needed. Please also check
>> the log files.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>> > -----Original Message-----
>> > From: Allison, Timothy B. [mailto:talli...@mitre.org]
>> > Sent: Monday, April 27, 2015 4:27 PM
>> > To: u...@tika.apache.org
>> > Cc: trung...@anlab.vn; solr-user@lucene.apache.org
>> > Subject: FW: TIKA OCR not working
>> >
>> > Trung,
>> >
>> > I haven't experimented with our OCR parser yet, but this should give a
>> > good start: https://wiki.apache.org/tika/TikaOCR .
>> >
>> > Have you installed tesseract?
>> >
>> > Tika colleagues,
>> > Any other tips? What else has to be configured and how?
>> >
>> > -----Original Message-----
>> > From: trung.ht [mailto:trung...@anlab.vn]
>> > Sent: Friday, April 24, 2015 11:22 PM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: TIKA OCR not working
>> >
>> > Hi everyone,
>> >
>> > Does anyone have the answer for this problem :)?
>> >
>> > > I saw the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses
>> > > Tika 1.7, but it looks like it does not work. Does anyone know whether
>> > > TIKA OCR works automatically with Solr, or do I have to change some
>> > > settings?
>> > >
>> > Trung.
>> >
>> >
>> > >> It's not clear if OCR would happen automatically in Solr Cell, or if
>> > >> changes to Solr would be needed.
>> > >>
>> > >> For Tika OCR info, see:
>> > >>
>> > >> https://issues.apache.org/jira/browse/TIKA-93
>> > >> https://wiki.apache.org/tika/TikaOCR
>> > >>
>> > >> -- Jack Krupansky
>> > >>
>> > >> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch
>> > >> <arafa...@gmail.com> wrote:
>> > >>
>> > >> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't
>> > >> > seen it in use yet.
>> > >> >
>> > >> > Regards,
>> > >> > Alex
>> > >> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <iori...@yahoo.com.invalid>
>> > >> > wrote:
>> > >> >
>> > >> > > Hi Trung,
>> > >> > >
>> > >> > > I didn't know about the OCR capabilities of tika.
>> > >> > > Someone who is familiar with solr-cell can inform us whether this
>> > >> > > functionality is added to solr or not.
>> > >> > >
>> > >> > > Ahmet
>> > >> > >
>> > >> > >
>> > >> > > On Thursday, April 23, 2015 2:06 PM, trung.ht <trung...@anlab.vn>
>> > >> > > wrote:
>> > >> > > Hi Ahmet,
>> > >> > >
>> > >> > > I used a png file, not a pdf file. From the documentation, I
>> > >> > > understand that solr will post the file to tika, and since tika
>> > >> > > 1.7, OCR is included. Is there something I misunderstood?
>> > >> > >
>> > >> > > Trung.
>> > >> > >
>> > >> > >
>> > >> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan
>> > >> > > <iori...@yahoo.com.invalid> wrote:
>> > >> > >
>> > >> > > > Hi Trung,
>> > >> > > >
>> > >> > > > solr-cell (tika) does not do OCR. It cannot extract text from
>> > >> > > > image based pdfs.
>> > >> > > >
>> > >> > > > Ahmet
>> > >> > > >
>> > >> > > >
>> > >> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht
>> > >> > > > <trung...@anlab.vn> wrote:
>> > >> > > >
>> > >> > > >
>> > >> > > > Hi,
>> > >> > > >
>> > >> > > > I want to use solr to index some scanned documents. After setting
>> > >> > > > up a solr document with the two fields "content" and "filename",
>> > >> > > > I tried to upload the attached file, but it seems that the
>> > >> > > > content of the file is only "\n \n \n....".
>> > >> > > > But if I use tesseract from the command line, I get the correct
>> > >> > > > result.
>> > >> > > >
>> > >> > > > The log when solr receives my request:
>> > >> > > > -----------
>> > >> > > > INFO - 2015-04-23 03:49:25.941;
>> > >> > > > org.apache.solr.update.processor.LogUpdateProcessor;
>> > >> > > > [collection1] webapp=/solr path=/update/extract
>> > >> > > > params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
>> > >> > > >
>> > >> > > > ------------
>> > >> > > >
>> > >> > > > The document when I check on the solr admin page:
>> > >> > > > -------------
>> > >> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
>> > >> > > > "createddate": "2015-04-22T15:00:00Z",
>> > >> > > > "filename": "\\\\trunght\\test\\tesseract_3.png",
>> > >> > > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ],
>> > >> > > > "content": "
>> > >> > > > \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
>> > >> > > > \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ",
>> > >> > > > "_version_": 1499213034586898400 }
>> > >> > > >
>> > >> > > > -----------
>> > >> > > >
>> > >> > > > Since I am a solr newbie I do not know where to look. Can anyone
>> > >> > > > give me advice on where to look for errors, or on the settings
>> > >> > > > needed to make it work?
>> > >> > > > Thanks in advance.
>> > >> > > >
>> > >> > > > Trung.
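
For anyone who finds this thread with the same symptom (only "\n" characters in "content" after posting an image to /update/extract), two points from the messages above can be checked mechanically: whether a tesseract executable is visible on the PATH, and what the extract request URL looks like. The following is a minimal Python sketch, not something the original posters used. The helper names are hypothetical, the base URL (localhost, port 8983) is an assumption, and only the parameter names and the core name "collection1" are taken from the logged request.

```python
import shutil
from urllib.parse import urlencode

# Assumed default; adjust host, port, and core name to your own setup.
ASSUMED_SOLR_BASE = "http://localhost:8983/solr/collection1"


def find_tesseract(path_env=None):
    """Return the full path of the tesseract executable, or None if not found.

    With path_env=None, the PATH of the current process is searched.
    """
    return shutil.which("tesseract", path=path_env)


def build_extract_url(base_url, params):
    """Build a /update/extract URL from a dict of request parameters."""
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("tesseract is NOT on PATH for this process")
    else:
        print("tesseract found at", location)

    # Parameter names as they appear in the logged request in this thread.
    params = {
        "literal.id": "4",
        "fmap.content": "content",
        "commit": "true",
        "wt": "json",
    }
    print(build_extract_url(ASSUMED_SOLR_BASE, params))
```

Run it from the same shell or service environment that starts Solr, since a PATH set only in an interactive shell may differ. As this thread concludes, tesseract being found was not sufficient on Solr 5.0 for the original poster; moving to Solr 5.1 resolved it.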