See also: http://stackoverflow.com/a/39792337/6281268
This includes jai.

Most importantly: be aware of the licensing implications of using levigo and jai. If they had been Apache 2.0 compatible, we would have included them.

Finally, there's a new option (coming out in Tika 1.15) that renders each PDF page as a single image before running OCR on it. We found a couple of crazy PDFs that had 1000s of images where a single image was used to represent one line in a table (and I don't mean row, I mean a literal line in a table). That "new" option is documented on our wiki:

https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR

Finally (I mean it this time), I've updated our wiki to mention the two optional dependencies. Thank you.

Cheers,

Tim

-----Original Message-----
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
Sent: Monday, March 27, 2017 11:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Index scanned documents

I tried this solution from Tim Allison, and it works.

http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files

Regards,
Edwin

On 27 March 2017 at 20:07, Allison, Timothy B. <talli...@mitre.org> wrote:

> Please also see:
>
> https://wiki.apache.org/tika/TikaOCR
>
> and
>
> https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
>
> If you have any other questions about Apache Tika and OCR, please feel
> free to ask on our users list as well: u...@tika.apache.org
>
> Cheers,
>
> Tim
>
> -----Original Message-----
> From: Arian Pasquali [mailto:arianpasqu...@gmail.com]
> Sent: Sunday, March 26, 2017 11:44 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Index scanned documents
>
> Hi Waleed,
>
> I've never done that with Solr, but you would probably need to use
> some OCR preprocessing before indexing.
> The most popular library I know for the job is tesseract-ocr
> <https://github.com/tesseract-ocr>.
>
> If you want to do that inside Solr, I've found that Tika has some
> support for that too.
> Take a look at Vijay Mhaskar's post on how to do this using TikaOCR:
>
> http://blog.thedigitalgroup.com/vijaym/using-solr-and-tikaocr-to-search-text-inside-an-image/
>
> I hope that guides you.
>
> On Sun, 26 Mar 2017 at 16:09, Waleed Raza
> <waleed.raza.parhi...@gmail.com> wrote:
>
> > Hello
> > I want to ask how we can extract text in Solr from images
> > that are inside PDF and MS Office documents.
> > I found many websites but did not get an answer. Please guide me.
> >
> > On Sun, Mar 26, 2017 at 2:57 PM, Waleed Raza
> > <waleed.raza.parhi...@gmail.com> wrote:
> >
> > > Hello
> > > I want to ask how we can extract text in Solr from images
> > > that are inside PDF and MS Office documents.
> > > I found many websites but did not get an answer. Please guide me.
> > >
>
> --
>
> *Arian Rodrigo Pasquali*
> Laboratório de Inteligência Artificial e Apoio à Decisão
> Laboratory of Artificial Intelligence and Decision Support
>
> *INESC TEC*
> Campus da FEUP
> Rua Dr Roberto Frias
> 4200-465 Porto
> Portugal
>
> T +351 22 040 2963
> F +351 22 209 4050
> arian.r.pasqu...@inesctec.pt
> www.inesctec.pt