I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen it in use yet.
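
One way to narrow this down, as a rough sketch only: as far as I know, Tika's OCR parser calls an external tesseract binary, so it helps to check that the binary is visible on the Solr host, then ask /update/extract to return the extracted text without indexing it (extractOnly=true). The host, port and helper names below are my assumptions, not taken from your setup; the core name and file name come from the log in the original mail. Whether the Tika bundled with a given Solr release actually runs OCR is exactly what this would help find out.

```python
import shutil
from urllib.parse import urlencode


def tesseract_available(binary="tesseract"):
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(binary) is not None


def build_extract_only_url(base_url, core, resource_name):
    """Build an /update/extract URL that returns extracted text instead of indexing it.

    base_url and core are assumptions about the local Solr install.
    """
    params = {
        "extractOnly": "true",           # return the Tika output, do not index
        "wt": "json",
        "resource.name": resource_name,  # file name hint for type detection
    }
    return f"{base_url}/{core}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    print("tesseract on PATH:", tesseract_available())
    print(build_extract_only_url(
        "http://localhost:8983/solr", "collection1", "tesseract_3.png"))
```

POSTing the png to the printed URL (for example with curl) should show what Tika extracted. If the response is still only whitespace, the problem is on the Tika/OCR side rather than in the field mapping.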
Regards,
    Alex

On 23 Apr 2015 10:24 pm, "Ahmet Arslan" <iori...@yahoo.com.invalid> wrote:

> Hi Trung,
>
> I didn't know about the OCR capabilities of tika.
> Someone who is familiar with solr-cell can tell us whether this
> functionality has been added to solr or not.
>
> Ahmet
>
>
> On Thursday, April 23, 2015 2:06 PM, trung.ht <trung...@anlab.vn> wrote:
> Hi Ahmet,
>
> I used a png file, not a pdf file. From the document, I understand that
> solr will post the file to tika, and since tika 1.7, OCR is included. Is
> there something I misunderstood?
>
> Trung.
>
>
> On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
> wrote:
>
> > Hi Trung,
> >
> > solr-cell (tika) does not do OCR. It cannot extract text from image-based
> > pdfs.
> >
> > Ahmet
> >
> >
> > On Thursday, April 23, 2015 7:33 AM, trung.ht <trung...@anlab.vn> wrote:
> >
> > Hi,
> >
> > I want to use solr to index some scanned documents. After setting up a solr
> > document with two fields, "content" and "filename", I tried to upload the
> > attached file, but it seems that the content of the file is only "\n \n
> > \n....".
> > But when I ran tesseract from the command line I got the correct result.
> >
> > The log when solr receives my request:
> > -----------
> > INFO - 2015-04-23 03:49:25.941;
> > org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
> > webapp=/solr path=/update/extract params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
> > ------------
> >
> > The document when I check on the solr admin page:
> > -------------
> > {
> >   "groupid": 2,
> >   "id": "4",
> >   "historyid": 4,
> >   "userid": 3,
> >   "createddate": "2015-04-22T15:00:00Z",
> >   "filename": "\\\\trunght\\test\\tesseract_3.png",
> >   "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ],
> >   "content": "
> > \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
> > \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ",
> >   "_version_": 1499213034586898400
> > }
> > -----------
> >
> > Since I am a solr newbie I do not know where to look. Can anyone give me
> > advice on where to look for errors or which settings to change to make it work?
> > Thanks in advance.
> >
> > Trung.