First, running multiple threads with PDF files to a Solr running 4G of
JVM is...ambitious. You say it crashes; how? OOMs?

Second while the extracting request handler is a fine way to get up
and running, any problems with Tika will affect Solr. Tika does a
great job of extraction, but there are so many variants of so many
file formats that this scenario isn' recommended for production.
Consider extracting the PDF on a client and sending the docs to Solr.
Tika can run as a server also so you aren't coupling Solr and Tika.

For a sample SolrJ program, see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

Best,
Erick

On Fri, Mar 31, 2017 at 10:44 AM, tstusr <ulfrhe...@gmail.com> wrote:
> Hi there.
>
> We are currently indexing some PDF files, the main handler to index is
> /extract where we perform simple processing (extract relevant fields and
> store on some fields).
>
> The PDF files are about 10M~100M size and we have to have available the text
> extracted. So, everything works correct on test stages, but when we try to
> index all the 14K files (around 120Gb) on a client application that only
> sends http curls through 3-4 concurrent threads to /extract handler it
> crashes. I can't find some relevant information about on solr logs (We
> checked in server/logs & in core_dir/tlog).
>
> My question is about performance. I think it is a small amount of info we
> are processing, the deploy scenario is in a docker container with 4gb of JVM
> Memory and ~50gb of physical memory (reported through dashboard) we are
> using a single instance.
>
> I don't think is a normal behaviour that handler crashes. So, what are some
> general tips about improving performance for this scenario?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Reply via email to