Re: SolR vs large PDF

2013-11-27 Thread Marcello Lorenzi
Hi Erick, in our architecture we use Apache ManifoldCF: scheduling is triggered from Manifold-web, and the Manifold-agent takes the PDF files from the filesystem and sends them to the Solr instances. Is it possible to redirect the Manifold scheduling to the SolrJ instance for specific schedules? Tha

Re: SolR vs large PDF

2013-11-27 Thread Erick Erickson
I'm assuming you're using the ExtractingRequestHandler. Offloading the entire work onto your Solr box that is also serving queries and indexing is not going to scale well. Consider using Tika/SolrJ (Tika is what the ERH uses anyway) to offload the PDF parsing amongst as many clients as you can aff
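A rough sketch of the idea of parsing on the client and sending only the extracted text to Solr, so the Solr box never sees the raw PDF. This is not SolrJ or Tika code: `extract_text` is a hypothetical placeholder for whatever parser the client runs (Tika in the setup described above), and the URL, core name, and the `id`/`content` field names are assumptions that would have to match your own schema.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: hypothetical endpoint; adjust host, port and core name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def extract_text(path):
    """Hypothetical placeholder for client-side parsing.

    A real client would call a PDF parser such as Tika here; this stub
    only reads the file as text so the sketch stays self-contained.
    """
    with open(path, "rb") as fh:
        return fh.read().decode("utf-8", errors="replace")


def build_doc(path, extractor=extract_text):
    # Field names are assumptions; they must exist in your schema.
    return {"id": path, "content": extractor(path)}


def build_payload(paths, extractor=extract_text, workers=4):
    """Parse files on the client (several at a time) and return a JSON body."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order of paths.
        docs = list(pool.map(lambda p: build_doc(p, extractor), paths))
    return json.dumps(docs).encode("utf-8")


def post_to_solr(payload, url=SOLR_UPDATE_URL):
    """Send already-extracted documents to Solr's update handler."""
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The point of the split is that `build_payload` can run on any number of client machines, while Solr only receives plain text through `post_to_solr`. The thread pool here is just for illustration; a heavy parser may be better served by separate processes or separate hosts.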

SolR vs large PDF

2013-11-27 Thread Marcello Lorenzi
Hi All, in our test environment we have implemented a new search engine based on Solr 4.3, with 2 instances hosted on different servers and 1 shard on each servlet container. During some stress tests we noticed a bottleneck in the crawling of large PDF files that blocks the serving of resu