RE: Index scanned documents

Allison, Timothy B. Mon, 27 Mar 2017 09:55:38 -0700

See also:

http://stackoverflow.com/a/39792337/6281268


That answer also covers the jai dependency.

Most importantly: be aware of the licensing implications of using levigo and 
jai.  If they had been Apache 2.0 compatible, we would have included them.

Finally, there's a new option (coming out in Tika 1.15) that renders each PDF 
page as a single image before running OCR on it.  We found a couple of crazy 
PDFs that had 1000s of images where a single image was used to represent one 
line in a table (and I don't mean row, I mean a literal line in a table).

That "new" option is documented on our wiki:

https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
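
For anyone sending PDFs to a standalone tika-server rather than calling Tika 
in-process, here is a rough Python sketch of asking for that page-rendering 
OCR behaviour on a single request. It assumes a tika-server listening on 
localhost:9998 and a release that honours the X-Tika-PDFOcrStrategy request 
header. The helper names (build_ocr_headers, extract_text) are made up for 
illustration, so please check the wiki page above for what your version 
actually supports.

```python
import http.client

# Strategy names mirror the PDF parser's OCR strategy settings.
ALLOWED_STRATEGIES = {"no_ocr", "ocr_only", "ocr_and_text_extraction"}


def build_ocr_headers(strategy="ocr_only"):
    """Build request headers asking tika-server to apply an OCR strategy to a PDF.

    Assumption: the server maps the X-Tika-PDFOcrStrategy header onto its
    PDF parser configuration for this request only.
    """
    if strategy not in ALLOWED_STRATEGIES:
        raise ValueError("unknown OCR strategy: %r" % (strategy,))
    return {
        "Content-Type": "application/pdf",
        "Accept": "text/plain",
        "X-Tika-PDFOcrStrategy": strategy,
    }


def extract_text(pdf_bytes, host="localhost", port=9998, strategy="ocr_only"):
    """PUT the PDF to the (assumed) /tika endpoint and return the extracted text."""
    conn = http.client.HTTPConnection(host, port, timeout=300)
    try:
        conn.request("PUT", "/tika", body=pdf_bytes,
                     headers=build_ocr_headers(strategy))
        response = conn.getresponse()
        body = response.read()
        if response.status != 200:
            raise RuntimeError("tika-server returned HTTP %d" % response.status)
        return body.decode("utf-8", errors="replace")
    finally:
        conn.close()
```

Usage would be something like extract_text(open("scan.pdf", "rb").read()). 
OCR on rendered pages can be slow, hence the generous timeout.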

Finally (I mean it this time), I've updated our wiki to mention the two 
optional dependencies.  Thank you.

Cheers,

              Tim

-----Original Message-----
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com] 
Sent: Monday, March 27, 2017 11:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Index scanned documents

I tried this solution from Tim Allison, and it works.

http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files

Regards,
Edwin

On 27 March 2017 at 20:07, Allison, Timothy B. <talli...@mitre.org> wrote:

> Please also see:
>
> https://wiki.apache.org/tika/TikaOCR
>
> and
>
> https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
>
> If you have any other questions about Apache Tika and OCR, please feel 
> free to ask on our users list as well: u...@tika.apache.org
>
> Cheers,
>
>            Tim
>
> -----Original Message-----
> From: Arian Pasquali [mailto:arianpasqu...@gmail.com]
> Sent: Sunday, March 26, 2017 11:44 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Index scanned documents
>
> Hi Waleed,
>
> I've never done that with Solr, but you would probably need to run 
> some OCR preprocessing before indexing.
> The most popular library I know for the job is tesseract-ocr
> <https://github.com/tesseract-ocr>.
>
> If you want to do that inside Solr, I've found that Tika has some 
> support for that too.
> Take a look at Vijay Mhaskar's post on how to do this using TikaOCR:
>
> http://blog.thedigitalgroup.com/vijaym/using-solr-and-tikaocr-to-search-text-inside-an-image/
>
> I hope that helps.
>
> On Sun, 26 Mar 2017 at 16:09, Waleed Raza <
> waleed.raza.parhi...@gmail.com> wrote:
>
> > Hello
> > I would like to ask how we can extract text in Solr from images 
> > that are inside PDF and MS Office documents.
> > I searched many websites but did not find an answer. Please guide me.
> >
> > On Sun, Mar 26, 2017 at 2:57 PM, Waleed Raza < 
> > waleed.raza.parhi...@gmail.com
> > > wrote:
> >
> > > Hello
> > > I would like to ask how we can extract text in Solr from images 
> > > that are inside PDF and MS Office documents.
> > > I searched many websites but did not find an answer. Please guide me.
> > >
> > >
> >
> --
>
> *Arian Rodrigo Pasquali*
> Laboratório de Inteligência Artificial e Apoio à Decisão
> Laboratory of Artificial Intelligence and Decision Support
>
> *INESC TEC*
> Campus da FEUP
> Rua Dr Roberto Frias
> 4200-465 Porto
> Portugal
>
> T +351 22 040 2963
> F +351 22 209 4050
> arian.r.pasqu...@inesctec.pt
> www.inesctec.pt
>
