Hi all,


I am looking to index an entire directory of PDF files. We have a large 
volume of PDFs (3000+, possibly many more), so adding them manually would be 
cumbersome.



I have seen more than a couple of dozen links explaining how to index PDFs 
using Solr, but none were detailed enough to help me get started.

I understand that indexing a Word or PDF document requires the use of the 
ExtractingRequestHandler, which uses Apache Tika.
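
From what I have pieced together so far, I assume a bulk load would look 
roughly like the sketch below: walk the directory and POST each PDF to the 
handler's /update/extract endpoint. The base URL (default port, no core 
name), the choice of the file path as literal.id, and the helper names are 
all my own assumptions, not something I have verified against a working 
setup.

```python
import os
import urllib.parse
import urllib.request


def find_pdfs(root):
    """Yield paths of all .pdf files under root, recursively, in sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # makes the traversal order deterministic
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)


def build_extract_url(base_url, doc_id, commit=False):
    """Build the request URL; literal.id sets the document's unique key."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return base_url + "?" + urllib.parse.urlencode(params)


def post_pdf(base_url, path, commit=False):
    """Send one PDF as the raw request body and return the HTTP status."""
    with open(path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        build_extract_url(base_url, path, commit=commit),
        data=body,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_directory(base_url, root):
    """Post every PDF under root, committing only with the last one."""
    paths = list(find_pdfs(root))
    for i, path in enumerate(paths):
        post_pdf(base_url, path, commit=(i == len(paths) - 1))
    return len(paths)
```

I would then call something like 
index_directory("http://localhost:8983/solr/update/extract", "/data/pdfs"), 
where both the URL and the directory are placeholders. Committing once at the 
end rather than per file seems sensible for thousands of documents, but I may 
be wrong about that.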



My question is: how do I configure the handler so that it can extract the 
required information from bulk loads of PDFs?

I know I am asking a broad question, but I am struggling to find good 
guidance, something that would give me a step-by-step approach.



There is an example configuration in the following link: 
http://wiki.apache.org/solr/ExtractingRequestHandler

I have also seen these threads:

http://stackoverflow.com/questions/5947157/index-search-pdf-content-with-solr

http://www.gossamer-threads.com/lists/lucene/general/158117



I am still trying to understand the configuration process, so any concrete help 
would be welcome.



Thanks,

Sadika Amreen

Data Scientist

PYA Analytics


