Hi everyone,

Does anyone have the answer to this problem? :)
> I saw the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses Tika 1.7,
> but it looks like it does not work. Does anyone know whether Tika OCR works
> automatically with Solr, or do I have to change some settings?
>
> Trung.
>
>> It's not clear if OCR would happen automatically in Solr Cell, or if
>> changes to Solr would be needed.
>>
>> For Tika OCR info, see:
>>
>> https://issues.apache.org/jira/browse/TIKA-93
>> https://wiki.apache.org/tika/TikaOCR
>>
>> -- Jack Krupansky
>>
>> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <arafa...@gmail.com>
>> wrote:
>>
>> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen it
>> > in use yet.
>> >
>> > Regards,
>> > Alex
>> >
>> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <iori...@yahoo.com.invalid> wrote:
>> >
>> > > Hi Trung,
>> > >
>> > > I didn't know about the OCR capabilities of tika.
>> > > Someone who is familiar with solr-cell can tell us whether this
>> > > functionality has been added to solr or not.
>> > >
>> > > Ahmet
>> > >
>> > > On Thursday, April 23, 2015 2:06 PM, trung.ht <trung...@anlab.vn> wrote:
>> > > Hi Ahmet,
>> > >
>> > > I used a png file, not a pdf file. From the documentation, I understand that
>> > > solr will post the file to tika, and since tika 1.7, OCR is included. Is
>> > > there something I misunderstood?
>> > >
>> > > Trung.
>> > >
>> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
>> > > wrote:
>> > >
>> > > > Hi Trung,
>> > > >
>> > > > solr-cell (tika) does not do OCR. It cannot extract text from image-based
>> > > > pdfs.
>> > > >
>> > > > Ahmet
>> > > >
>> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht <trung...@anlab.vn> wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > I want to use solr to index some scanned documents. After setting up a solr
>> > > > document with two fields, "content" and "filename", I tried to upload the
>> > > > attached file, but it seems that the content of the file is only "\n \n
>> > > > \n....".
>> > > > But if I use tesseract from the command line, I get the correct result.
>> > > >
>> > > > The log when solr receives my request:
>> > > > -----------
>> > > > INFO - 2015-04-23 03:49:25.941;
>> > > > org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
>> > > > webapp=/solr path=/update/extract params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
>> > > > ------------
>> > > >
>> > > > The document when I check on the solr admin page:
>> > > > -------------
>> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3, "createddate":
>> > > > "2015-04-22T15:00:00Z", "filename": "\\\\trunght\\test\\tesseract_3.png",
>> > > > "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ], "content": "
>> > > > \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
>> > > > \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ",
>> > > > "_version_": 1499213034586898400 }
>> > > > -----------
>> > > >
>> > > > Since I am a solr newbie I do not know where to look. Can anyone give me
>> > > > advice on where to look for errors or settings to make it work?
>> > > > Thanks in advance.
>> > > >
>> > > > Trung.
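
One way to narrow this down might be to repeat the same /update/extract request with extractOnly=true, so that Solr returns whatever Tika extracted instead of indexing it. The sketch below only builds that URL from parameters that appear in the log above. The host, port and the helper name build_extract_url are assumptions for illustration, not something confirmed in this thread.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, collection, params):
    """Return a Solr Cell /update/extract URL carrying the given query parameters."""
    return f"{base_url}/{collection}/update/extract?{urlencode(params)}"


# Parameters taken from the logged request, except extractOnly, which is
# switched from "false" to "true" so nothing is written to the index.
params = {
    "literal.id": "4",
    "fmap.content": "content",
    "extractOnly": "true",
    "wt": "json",
}

# Assumption: a local Solr on the default port; "collection1" comes from the log.
url = build_extract_url("http://localhost:8983/solr", "collection1", params)
print(url)
```

The PNG would then be sent as the body of a POST to this URL. If the returned content is still only whitespace, the problem is probably on the Tika/OCR side rather than in the field mapping.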