First, running multiple threads feeding PDF files to a Solr instance with a 4G JVM heap is... ambitious. You say it crashes; how? OOMs?
Second, while the extracting request handler is a fine way to get up and running, any problems with Tika will affect Solr. Tika does a great job of extraction, but there are so many variants of so many file formats that this scenario isn't recommended for production. Consider extracting the PDFs on a client and sending the resulting docs to Solr; a rough sketch of that approach follows below the quoted message. Tika can also run as a standalone server, so you aren't coupling Solr and Tika. For a sample SolrJ program, see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

Best,
Erick

On Fri, Mar 31, 2017 at 10:44 AM, tstusr <ulfrhe...@gmail.com> wrote:
> Hi there.
>
> We are currently indexing some PDF files. The main handler we use is
> /extract, where we perform simple processing (extract relevant fields and
> store them).
>
> The PDF files are about 10M~100M in size, and we need the extracted text
> to be available. Everything works correctly in the test stages, but when
> we try to index all 14K files (around 120GB) from a client application
> that just sends HTTP requests via curl through 3-4 concurrent threads to
> the /extract handler, it crashes. I can't find any relevant information
> in the Solr logs (we checked server/logs and core_dir/tlog).
>
> My question is about performance. I think this is a small amount of data
> to process: the deployment is a single instance in a Docker container
> with 4GB of JVM memory and ~50GB of physical memory (as reported by the
> dashboard).
>
> I don't think it is normal behaviour for the handler to crash. So, what
> are some general tips for improving performance in this scenario?
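Below is a minimal sketch of the client-side approach Erick describes: Tika parses each PDF in the client JVM and SolrJ sends plain documents to Solr, so a Tika failure can't take down the Solr node. It follows the pattern in the Lucidworks post linked above; the Solr URL, core name, and the "id"/"text" field names are assumptions, so adjust them to your schema.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class IndexPdfs {
  public static void main(String[] args) throws Exception {
    // Assumed Solr URL and core name -- change to match your deployment.
    SolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
    AutoDetectParser parser = new AutoDetectParser();

    for (File pdf : new File(args[0]).listFiles()) {
      // -1 lifts Tika's default 100K-character cap on extracted text;
      // 10M-100M PDFs will blow past the default.
      BodyContentHandler text = new BodyContentHandler(-1);
      Metadata metadata = new Metadata();
      try (InputStream in = new FileInputStream(pdf)) {
        parser.parse(in, text, metadata); // extraction happens here, on the client
      }

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", pdf.getAbsolutePath());
      doc.addField("text", text.toString()); // assumes a "text" field in your schema
      solr.add(doc);
    }
    solr.commit();
    solr.close();
  }
}

If you'd rather not run Tika inside your own client, the tika-server jar gives you the decoupling Erick mentions: start it with "java -jar tika-server-x.y.jar" and it listens on port 9998 by default; a PUT of a file to its /tika endpoint returns the extracted text, which your client can then send to Solr as a normal document.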