Consider using a SolrJ program, perhaps several
running in parallel.

See: http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
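To make the idea of a client program that walks a directory concrete, here is a minimal sketch. Note that it is not SolrJ: to stay dependency-free it uses the JDK's built-in HTTP client (Java 11+) and posts each PDF to the ExtractingRequestHandler. It assumes the handler is mapped at /update/extract (as in the example configuration), that the unique key field is named id, and that the Solr base URL is something like http://localhost:8983/solr. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: walks a directory and sends every PDF to Solr's extract handler.
class PdfDirectoryIndexer {

    // Recursively collect every regular file whose name ends in .pdf (case-insensitive).
    static List<Path> findPdfs(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // Build the handler URL; literal.id supplies the document's unique key.
    // Assumes the handler path /update/extract and a unique key field named "id".
    static URI extractUri(String solrBase, String id) {
        String encoded = URLEncoder.encode(id, StandardCharsets.UTF_8);
        return URI.create(solrBase + "/update/extract?literal.id=" + encoded);
    }

    // POST the raw PDF bytes; Tika on the Solr side extracts the text.
    static int postPdf(HttpClient client, String solrBase, Path pdf)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(extractUri(solrBase, pdf.toString()))
            .header("Content-Type", "application/pdf")
            .POST(HttpRequest.BodyPublishers.ofFile(pdf))
            .build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }

    public static void main(String[] args) throws Exception {
        String solrBase = args[0];      // assumed form: http://localhost:8983/solr
        Path root = Path.of(args[1]);   // directory containing the PDFs
        HttpClient client = HttpClient.newHttpClient();
        for (Path pdf : findPdfs(root)) {
            System.out.println(postPdf(client, solrBase, pdf) + " " + pdf);
        }
        // Commit once at the end rather than after every document.
        HttpRequest commit = HttpRequest.newBuilder(
            URI.create(solrBase + "/update?commit=true")).GET().build();
        client.send(commit, HttpResponse.BodyHandlers.discarding());
    }
}
```

It could be run as: java PdfDirectoryIndexer.java http://localhost:8983/solr /path/to/pdfs. For parallelism, one simple option is to start several copies, each pointed at a different subdirectory.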

Best,
Erick

On Mon, Sep 23, 2013 at 3:31 PM, Sadika Amreen <samr...@pyaanalytics.com> wrote:
> Hi all,
>
>
>
> I am looking to index an entire directory of PDF files. We have a very large 
> volume of PDFs (3000+, possibly many more), so adding them manually would be 
> cumbersome.
>
>
>
> I have seen more than a couple of dozen links explaining how to index PDFs 
> using Solr, but none were detailed enough to help me get started.
>
> I understand that indexing a Word or PDF document requires the use of the 
> ExtractingRequestHandler, which uses Apache Tika.
>
>
>
> My question is: How do I configure the handler so that it can extract the 
> required information from bulk loads of PDFs?
>
> I know I am asking a broad question, but I am struggling to find good 
> guidance, something that would give me a step-by-step approach.
>
>
>
> There is an example configuration in the following link: 
> http://wiki.apache.org/solr/ExtractingRequestHandler
>
> I have also seen these threads:
>
> http://stackoverflow.com/questions/5947157/index-search-pdf-content-with-solr
>
> http://www.gossamer-threads.com/lists/lucene/general/158117
>
>
>
> I am still trying to understand the configuration process, so any concrete 
> help would be welcome.
>
>
>
> Thanks,
>
> Sadika Amreen
>
> Data Scientist
>
> PYA Analytics
