Hi all,
I am looking to index an entire directory of PDF files. We have a very large volume of PDFs (3000+, possibly many more), so adding them manually would be cumbersome. I have seen more than a couple of dozen links explaining how to index PDFs using Solr, but none were detailed enough to help me get started.

I understand that indexing a Word or PDF document requires the ExtractingRequestHandler, which uses Apache Tika. My question is: how do I configure the handler so that it can extract the required information from bulk loads of PDFs? I know this is a broad question, but I am struggling to find good guidance that would give me a step-by-step approach.

There is an example configuration at the following link:
http://wiki.apache.org/solr/ExtractingRequestHandler

I have also seen these threads:
http://stackoverflow.com/questions/5947157/index-search-pdf-content-with-solr
http://www.gossamer-threads.com/lists/lucene/general/158117

I am still trying to understand the configuration process, so any concrete help would be welcome.

Thanks,
Sadika Amreen
Data Scientist
PYA Analytics
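
P.S. To make the question more concrete, here is roughly the kind of bulk loop I have in mind, written as a Python sketch. It rests on assumptions I have not verified against our setup: that the handler is mapped to /update/extract in solrconfig.xml, that Solr is reachable at the URL below (the host, port and path are placeholders, and the path would normally include a core or collection name), and that the schema's unique key field is "id". The function names are my own and do not come from any Solr client library. The sketch sends each PDF's raw bytes with a literal.id parameter, using the file's path relative to the directory as the id.

```python
import os
import urllib.parse
import urllib.request

# Assumption: placeholder URL; /update/extract must be enabled in solrconfig.xml.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def find_pdfs(root):
    """Return the sorted paths of all .pdf files under root (recursive)."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def extract_url(base_url, doc_id, commit=False):
    """Build the request URL; literal.id supplies the document's unique key."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return base_url + "?" + urllib.parse.urlencode(params)


def index_pdf(base_url, path, doc_id):
    """POST one PDF's raw bytes to the handler and return the HTTP status."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = urllib.request.Request(
        extract_url(base_url, doc_id),
        data=body,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_directory(base_url, root):
    """Send every PDF under root, using its relative path as the id."""
    count = 0
    for path in find_pdfs(root):
        index_pdf(base_url, path, os.path.relpath(path, root))
        count += 1
    return count
```

As far as I can tell, nothing here commits, so a separate commit (or an autoCommit setting) would still be needed before the documents become searchable. Is this the right general shape, or is there a better-supported way to do the bulk load?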