RE: Index scanned documents

2017-03-27 Thread Allison, Timothy B.
Sent: Monday, March 27, 2017 11:48 AM To: solr-user@lucene.apache.org Subject: Re: Index scanned documents I tried this solution from Tim Allison, and it works. http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files Regards, Edwin On 27 March 2017 at 20:07, A

Re: Index scanned documents

2017-03-27 Thread Zheng Lin Edwin Yeo
e- > From: Arian Pasquali [mailto:arianpasqu...@gmail.com] > Sent: Sunday, March 26, 2017 11:44 AM > To: solr-user@lucene.apache.org > Subject: Re: Index scanned documents > > Hi Walled, > > I've never done that with solr, but you would probably need to use some > OCR

RE: Index scanned documents

2017-03-27 Thread Allison, Timothy B.
-----Original Message----- From: Arian Pasquali [mailto:arianpasqu...@gmail.com] Sent: Sunday, March 26, 2017 11:44 AM To: solr-user@lucene.apache.org Subject: Re: Index scanned documents Hi Walled, I've never done that with solr, but you would probably need to use some OCR preprocessing before ind

RE: Index scanned documents

2017-03-26 Thread Phil Scadden
While building directly into Solr might be appealing, I would argue that it is best to use OCR software first, outside of Solr, to convert the PDF into "searchable" PDF format. That way when the document is retrieved, it is a lot more useful to the searcher - making it easy to find the text with
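
As a rough illustration of the OCR-first approach described above, the sketch below builds a Tesseract command line that writes a searchable PDF from a page image. It assumes the `tesseract` command-line tool is installed and that each scanned page is already available as an image file (Tesseract does not read PDF input directly). The file names and the helper names `searchable_pdf_command` and `make_searchable_pdf` are hypothetical.

```python
import subprocess
from pathlib import Path


def searchable_pdf_command(image_path, output_base, language="eng"):
    """Return the tesseract argv that should write <output_base>.pdf with a text layer."""
    # "pdf" is Tesseract's built-in output config for searchable PDFs;
    # "-l" selects the recognition language.
    return ["tesseract", str(image_path), str(output_base), "-l", language, "pdf"]


def make_searchable_pdf(image_path, output_base, language="eng"):
    """Run tesseract (assumed to be on PATH) and return the expected PDF path."""
    subprocess.run(
        searchable_pdf_command(image_path, output_base, language),
        check=True,  # raise if tesseract exits with an error
    )
    return Path(str(output_base) + ".pdf")


# Hypothetical usage (requires tesseract and a real image file):
#   pdf_path = make_searchable_pdf("scan_page1.tif", "scan_page1")
```

The resulting PDF can then be indexed like any text PDF, and the retrieved document stays searchable for the reader.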

Re: Index scanned documents

2017-03-26 Thread Arian Pasquali
Hi Walled, I've never done that with solr, but you would probably need to use some OCR preprocessing before indexing. The most popular library I know for the job is tesseract-ocr. If you want to do that inside solr I've found that Tika has some support for that
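
A minimal sketch of the "OCR preprocessing before indexing" idea, under stated assumptions: the `tesseract` CLI is installed, Solr is reachable at a URL you supply, and the collection has `id` and `content_txt` fields. The collection name, field names, and helper names are hypothetical and not taken from the thread.

```python
import json
import subprocess
import urllib.request


def ocr_command(image_path, language="eng"):
    """Return the tesseract argv that prints recognized text to standard output."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing a file.
    return ["tesseract", str(image_path), "stdout", "-l", language]


def ocr_image(image_path, language="eng"):
    """Run tesseract (assumed to be on PATH) and return the recognized text."""
    result = subprocess.run(
        ocr_command(image_path, language),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def solr_add_request(solr_url, collection, doc_id, text):
    """Build a JSON update request for Solr; the field names are assumptions."""
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    url = f"{solr_url.rstrip('/')}/{collection}/update?commit=true"
    return urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical usage (requires tesseract and a running Solr):
#   text = ocr_image("scan_page1.png")
#   req = solr_add_request("http://localhost:8983/solr", "scans", "scan-1", text)
#   urllib.request.urlopen(req)
```

Keeping the OCR step separate from the indexing request makes it easy to swap in a different OCR tool without touching the Solr side.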

Re: Index scanned documents

2017-03-26 Thread Zheng Lin Edwin Yeo
I'm also working on this issue right now, to extract the text in the scanned image in PDF files. From what I know, we can use Tesseract OCR to extract the text in the image through Apache Tika, and it comes bundled with Solr. By the way, which Solr version are you using? Regards, Edwin
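
One way to try Tesseract through Tika without touching Solr is a standalone Tika server, sketched below. This swaps in tika-server for the Tika that ships with Solr, and it assumes a server is running on the default port 9998 with Tesseract installed on the same machine. The two `X-Tika-...` headers are assumptions about what your tika-server version accepts, and the helper names are hypothetical.

```python
import urllib.request


def tika_ocr_request(pdf_bytes, tika_url="http://localhost:9998", language="eng"):
    """Build a PUT request asking a Tika server for plain text from a PDF."""
    headers = {
        "Accept": "text/plain",
        "Content-Type": "application/pdf",
        # Assumed tika-server headers: extract embedded page images for OCR
        # and pick the Tesseract language.
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-OCRLanguage": language,
    }
    return urllib.request.Request(
        tika_url.rstrip("/") + "/tika",
        data=pdf_bytes, headers=headers, method="PUT",
    )


def extract_text(pdf_path, **kwargs):
    """Send the PDF to the Tika server and return the extracted text."""
    with open(pdf_path, "rb") as fh:
        request = tika_ocr_request(fh.read(), **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Hypothetical usage (requires a running Tika server):
#   text = extract_text("scanned.pdf")
```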

Index scanned documents

2017-03-26 Thread Waleed Raza
Hello, I want to ask how we can extract text in Solr from images that are inside PDF and MS Office documents. I searched many websites but did not find an answer. Please guide me.

Re: Index scanned documents

2017-03-26 Thread Waleed Raza
Hello, I want to ask how we can extract text in Solr from images that are inside PDF and MS Office documents. I searched many websites but did not find an answer. Please guide me. On Sun, Mar 26, 2017 at 2:57 PM, Waleed Raza wrote: > Hello > I want to ask you that how can we extract in