I'm assuming you're using the ExtractingRequestHandler (ERH). Putting all of the extraction work on a Solr box that is also serving queries and indexing is not going to scale well.
Consider using Tika/SolrJ (Tika is what the ERH uses anyway) to spread the PDF parsing across as many clients as you can afford. Here's a way to get started:
http://searchhub.org/2012/02/14/indexing-with-solrj/

Best,
Erick

On Wed, Nov 27, 2013 at 10:00 AM, Marcello Lorenzi <mlore...@sorint.it> wrote:
> Hi All,
> on our test environment we have implemented a new search engine based on
> Solr 4.3 with 2 instances hosted on different servers and 1 shard present
> on each servlet container.
>
> During some stress test we noticed a bottleneck into crawling of large PDF
> file that blocks the serving of results from queries to the collections.
>
> Is it possible to boost or mitigate the overhead created by PDFBOX during
> the crawling?
>
> Thanks,
> Marcello
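For what it's worth, here is a rough sketch of the shape such a client could take: parse the files in a worker pool on the client machine and send Solr only small batches of already-extracted text. It uses nothing but the JDK. `TextExtractor`, `BatchSink`, `ParsedDoc`, and `indexAll` are hypothetical names made up for this sketch; in a real client the extractor would wrap Tika and the sink would wrap a SolrJ client, as in the article linked above.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch only: parse on the client, ship extracted text to the server in batches. */
final class ClientSideIndexer {

    /** Hypothetical stand-in for the parser (Tika in a real client). */
    public interface TextExtractor {
        String extract(Path file) throws Exception;
    }

    /** Hypothetical stand-in for the sender (SolrJ in a real client). */
    public interface BatchSink {
        void send(List<ParsedDoc> batch) throws Exception;
    }

    /** Extracted text plus an id; the id here is simply the file path. */
    public static final class ParsedDoc {
        public final String id;
        public final String text;

        public ParsedDoc(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /**
     * Parses every file on a fixed-size pool, then sends the results from the
     * calling thread in batches of at most batchSize. Returns the number of
     * documents sent.
     */
    public static int indexAll(List<Path> files, TextExtractor extractor, BatchSink sink,
                               int threads, int batchSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // The expensive parsing happens here, in the client's worker threads.
            List<Future<ParsedDoc>> pending = new ArrayList<>();
            for (Path file : files) {
                Callable<ParsedDoc> task =
                        () -> new ParsedDoc(file.toString(), extractor.extract(file));
                pending.add(pool.submit(task));
            }

            // Collect results in submission order and flush whenever a batch fills up.
            List<ParsedDoc> batch = new ArrayList<>();
            int sent = 0;
            for (Future<ParsedDoc> result : pending) {
                batch.add(result.get());
                if (batch.size() == batchSize) {
                    sink.send(batch);
                    sent += batch.size();
                    batch = new ArrayList<>();
                }
            }
            if (!batch.isEmpty()) {
                sink.send(batch);
                sent += batch.size();
            }
            return sent;
        } finally {
            pool.shutdown();
        }
    }
}
```

Run one of these per client machine, each over its own slice of the files, and size the thread pool and batch size to whatever the clients can handle. The Solr servers then only see plain add requests and never run the PDF parsing themselves.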