Consider using a SolrJ program, perhaps multiple ones running in parallel. See: http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
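To make the "several programs in parallel" idea concrete, here is a minimal sketch of a bulk loader. It is not the SolrJ code from the article above: to stay dependency-free it uses the JDK's own HTTP client to POST each PDF to the ExtractingRequestHandler at /update/extract, so Tika runs inside Solr rather than in the client. The class name PdfBulkIndexer, the use of the file path as literal.id, the base URL http://localhost:8983/solr, and the thread count are all assumptions for illustration; a SolrJ client would replace postToSolr.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PdfBulkIndexer {

    // Recursively collect every *.pdf file under root (case-insensitive), sorted.
    static List<Path> findPdfs(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Build the extract URL. Assumption: the file path serves as the unique key.
    static String extractUrl(String solrBase, Path pdf) {
        String id = URLEncoder.encode(pdf.toString(), StandardCharsets.UTF_8);
        return solrBase + "/update/extract?literal.id=" + id;
    }

    // Send one PDF as the raw request body; Solr hands it to Tika for extraction.
    static void postToSolr(HttpClient client, String solrBase, Path pdf) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(extractUrl(solrBase, pdf)))
                    .header("Content-Type", "application/pdf")
                    .POST(HttpRequest.BodyPublishers.ofFile(pdf))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                throw new IllegalStateException(pdf + ": HTTP " + response.statusCode());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Run indexOne over all files on a fixed-size pool; return how many succeeded.
    static int indexAll(List<Path> pdfs, int threads, Consumer<Path> indexOne)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (Path pdf : pdfs) {
                futures.add(pool.submit(() -> indexOne.accept(pdf)));
            }
            int succeeded = 0;
            for (Future<?> future : futures) {
                try {
                    future.get();
                    succeeded++;
                } catch (ExecutionException e) {
                    // One bad PDF should not stop the rest of the batch.
                    System.err.println("Failed: " + e.getCause());
                }
            }
            return succeeded;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path root = Paths.get(args[0]);   // directory holding the PDFs
        String solrBase = args[1];        // e.g. http://localhost:8983/solr (assumed)
        HttpClient client = HttpClient.newHttpClient();
        List<Path> pdfs = findPdfs(root);
        int ok = indexAll(pdfs, 4, pdf -> postToSolr(client, solrBase, pdf));
        // Commit once at the end instead of once per document.
        client.send(HttpRequest.newBuilder(URI.create(solrBase + "/update?commit=true"))
                .GET().build(), HttpResponse.BodyHandlers.ofString());
        System.out.println("Indexed " + ok + " of " + pdfs.size() + " PDFs");
    }
}
```

Run it with the PDF directory and the Solr base URL as arguments. Committing once at the end rather than per file keeps a load of a few thousand documents reasonably fast, and failures are logged per file so that one unreadable PDF does not abort the batch.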
Best,
Erick

On Mon, Sep 23, 2013 at 3:31 PM, Sadika Amreen <samr...@pyaanalytics.com> wrote:
> Hi all,
>
> I am looking to index an entire directory of PDF files. We have a very large
> volume of PDFs (3000+, possibly much more), so adding them manually would be
> cumbersome.
>
> I have seen more than a couple of dozen links explaining how to index PDFs
> using Solr, but none was detailed enough to help me get started.
>
> I understand that indexing a Word or PDF document requires the
> ExtractingRequestHandler, which uses Apache Tika.
>
> My question is: how do I configure the handler so that it can extract the
> required information from bulk loads of PDFs?
>
> I know I am asking a broad question, but I am struggling to find good
> guidance that would give me a step-by-step approach.
>
> There is an example configuration at the following link:
> http://wiki.apache.org/solr/ExtractingRequestHandler
>
> I have also seen these threads:
> http://stackoverflow.com/questions/5947157/index-search-pdf-content-with-solr
> http://www.gossamer-threads.com/lists/lucene/general/158117
>
> I am still trying to understand the configuration process, so any concrete
> help would be welcome.
>
> Thanks,
> Sadika Amreen
> Data Scientist
> PYA Analytics