Re: Indexing PDF on SOLR 8.5

Jörn Franke Sun, 07 Jun 2020 11:07:20 -0700

You have to write an external application that creates multiple threads, parses 
the PDFs, and indexes them in Solr. Ideally, you parse the PDFs once, store the 
resulting text on a file system, and then index from that stored text. The 
reason is that if you upgrade across two major versions of Solr, you might need 
to reindex. You then save time because you don’t need to parse the PDFs again. 
It is also useful if you are not yet sure about the final schema and need to 
index several times with different schemas.
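
A rough sketch of this parse-once, index-many approach in Python could look like 
the following. Everything named here is an assumption for illustration: the 
function names, the `id` and `content` fields, and the `extract` callable, which 
stands in for whatever PDF parser you choose (Apache Tika, for example). The 
update URL you pass in depends on your own host and collection.

```python
import hashlib
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def cache_text(pdf_path, cache_dir, extract):
    """Parse one PDF unless its text is already cached; return the cache file.

    `extract` is a placeholder callable (path -> str) for a real PDF parser.
    """
    # Hash the full path so equal file names in different folders do not clash.
    key = hashlib.sha256(str(pdf_path).encode("utf-8")).hexdigest()
    out = Path(cache_dir) / f"{key}.json"
    if not out.exists():
        # Field names "id" and "content" are hypothetical; match your schema.
        doc = {"id": str(pdf_path), "content": extract(pdf_path)}
        out.write_text(json.dumps(doc), encoding="utf-8")
    return out


def parse_all(pdf_paths, cache_dir, extract, workers=8):
    """Parse many PDFs with a thread pool, skipping those already cached."""
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: cache_text(p, cache_dir, extract), pdf_paths))


def load_docs(cache_dir):
    """Read the cached documents back without touching the PDFs."""
    files = sorted(Path(cache_dir).glob("*.json"))
    return [json.loads(f.read_text(encoding="utf-8")) for f in files]


def post_to_solr(docs, update_url, batch_size=500):
    """Send documents to a Solr JSON update endpoint in batches.

    `update_url` is supplied by the caller, for example the /update handler
    of your collection with commit=true appended.
    """
    for i in range(0, len(docs), batch_size):
        body = json.dumps(docs[i:i + batch_size]).encode("utf-8")
        req = urllib.request.Request(
            update_url, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            resp.read()
```

To reindex after a schema change or an upgrade, you would call only 
`load_docs` and `post_to_solr`; the PDFs themselves are not parsed again. If 
the parser is pure Python and CPU-bound, a process pool may scale better than 
threads.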


You can also use Apache ManifoldCF.



> On 07.06.2020 at 19:19, Fiz N <fiznewy...@gmail.com> wrote:
> 
> Hello SOLR Experts,
> 
> I am working on a POC to index millions of PDF documents located in
> multiple folders on a file share.
> 
> Could you please let me know the best practices and steps to implement it?
> 
> Thanks
> Fiz Nadiyal.
