Re: Indexing PDF on SOLR 8.5

Fiz N Sun, 07 Jun 2020 12:23:20 -0700

Thanks Jorn and Erick.

Hi Erick, it looks like the skeletal SolrJ program attachment is missing.


Thanks
Fiz

On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson <[email protected]>
wrote:

> Here’s a skeletal SolrJ program using Tika as another alternative.
>
> Best,
> Erick
>
> > On Jun 7, 2020, at 2:06 PM, Jörn Franke <[email protected]> wrote:
> >
> > You have to write an external application that creates multiple threads,
> > parses the PDFs, and indexes them in Solr. Ideally you parse the PDFs once,
> > store the resulting text on some file system, and then index it. The reason
> > is that if you upgrade across two major versions of Solr you might need to
> > reindex. Then you can save time because you don’t need to parse the PDFs
> > again.
> > It can also be useful in case you are not sure yet about the final
> > schema and need to index several times with different schemas, etc.
> >
> > You can also use Apache ManifoldCF.
> >
> >
> >
> >> Am 07.06.2020 um 19:19 schrieb Fiz N <[email protected]>:
> >>
> >> Hello SOLR Experts,
> >>
> >> I am working on a POC to index millions of PDF documents stored in
> >> multiple folders on a file share.
> >>
> >> Could you please let me know the best practices and steps to implement it?
> >>
> >> Thanks
> >> Fiz Nadiyal.
>
>
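
As a rough illustration of Jörn's suggestion quoted above (parse each PDF
once, keep the extracted text on a file system, and index from that using
several threads), here is a minimal Python sketch using only the standard
library. It is not the SolrJ/Tika program Erick mentioned. Everything named
here is an assumption for illustration: the `extract` callable stands in for
whatever PDF parser you use (for example Tika), the field names `id` and
`content_txt` are hypothetical and must match your schema, and `update_url`
is a placeholder for your collection's update handler.

```python
import hashlib
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List


def cached_text(pdf: Path, cache_dir: Path, extract: Callable[[Path], str]) -> str:
    """Return the text of `pdf`, calling `extract` only if no cached copy exists."""
    # Cache key is derived from the absolute path (an assumption; a content
    # hash would also detect files that change in place).
    key = hashlib.sha1(str(pdf.resolve()).encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = extract(pdf)  # the expensive step: runs once per PDF
    cache_file.write_text(text, encoding="utf-8")
    return text


def build_docs(pdfs: Iterable[Path], cache_dir: Path,
               extract: Callable[[Path], str], workers: int = 4) -> List[dict]:
    """Extract (or load cached) text for all PDFs using a thread pool."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    pdfs = list(pdfs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(lambda p: cached_text(p, cache_dir, extract), pdfs))
    # Hypothetical field names; adjust to your schema.
    return [{"id": str(p), "content_txt": t} for p, t in zip(pdfs, texts)]


def post_to_solr(docs: List[dict], update_url: str) -> int:
    """POST a batch of documents as a JSON array to a Solr update handler."""
    # update_url is a placeholder, e.g.
    # "http://localhost:8983/solr/<collection>/update?commit=true"
    req = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

With this layout, a reindex after a schema change or a Solr upgrade only
repeats `build_docs` (which reads from the cache) and `post_to_solr`; the
PDFs themselves are not parsed again. For millions of files you would
likely send the documents in batches rather than in a single request.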
