Brian: Yeah, if you can share the PDF that would be great. Parsing via Tika should not bring down Solr, although I suppose there could be something in Tika that is pathologically bad.
You could also try using Tika itself in SolrJ and indexing from a client. That might let you
1> handle this more gracefully, without shutting down Solr
2> use different versions of Tika.

Personally I like offloading the document parsing to clients anyway, since it lessens the load on the Solr server and scales much better, but YMMV. It's not actually very difficult; here's a skeleton (rip out the DB parts):
http://searchhub.org/2012/02/14/indexing-with-solrj/

Best,
Erick

On Sun, May 25, 2014 at 2:07 AM, Siegfried Goeschl <sgoes...@gmx.at> wrote:
> Sorry typo :- can you send me the PDF by email directly :-)
>
> Siegfried Goeschl
>
> On 25 May 2014, at 10:06, Siegfried Goeschl <sgoes...@gmx.at> wrote:
>
>> Hi Brian,
>>
>> can you send me the email? I would like to play around :-)
>>
>> Have you opened a JIRA for PdfBox? If not I will open one if I can
>> reproduce the issue …
>>
>> Thanks in advance
>>
>> Siegfried Goeschl
>>
>>
>> On 25 May 2014, at 04:18, Brian McDowell <brianmc...@gmail.com> wrote:
>>
>>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>>> getting some really bad pdfs. There are levels of pdf "badness." Some just
>>> will not parse and that's fine, but others are more problematic in that our
>>> Operations team has to restart Solr because it just hangs and accepts no
>>> more documents. I actually have identified a pdf that will bring down Solr
>>> every time. Does anyone think that doing pre-validation using the pdfbox
>>> jar will work? Or, will trying to validate just hang as well? Any help is
>>> appreciated.
>>>
>>>
>>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky
>>> <j...@basetechnology.com> wrote:
>>>
>>>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>>>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> -----Original Message----- From: Siegfried Goeschl
>>>> Sent: Thursday, May 22, 2014 4:35 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: pdfs
>>>>
>>>>
>>>> Hi folks,
>>>>
>>>> for a small customer project I'm running SOLR with embedded Tika.
>>>>
>>>> * memory consumption is an issue but can be handled
>>>> * there is an issue with PDFBox hitting an infinite loop which causes
>>>> excessive CPU usage - requires SOLR restart but happens only once
>>>> within 400.000 documents (PDF, Word, etc.) but it seems a little bit
>>>> erratic since I was never able to track the problem back to a particular
>>>> PDF document
>>>>
>>>> Having said that we wire SOLR with Nagios to get an alarm when CPU
>>>> consumption goes through the roof
>>>>
>>>> If you are doing really serious stuff I would recommend
>>>> * moving the document extraction stuff out of SOLR
>>>> * providing monitoring and recovery for stuck document extractions
>>>> ** killing worker threads
>>>> ** using external processes and killing them when they spin out of control
>>>>
>>>> Cheers,
>>>>
>>>> Siegfried Goeschl
>>>>
>>>> On 22.05.14 06:46, Jack Krupansky wrote:
>>>>
>>>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>>>> has improved over the years, but still not likely to be perfect.
>>>>>
>>>>> That said, I'm not aware of any specific PDF extraction issue that would
>>>>> bring down Solr - as opposed to causing a 500 status with an exception
>>>>> in PDF extraction, with the exception of memory usage. Some PDF
>>>>> documents, especially those which are graphic-intense can require a lot
>>>>> of memory. The rest of Solr could be adversely affected if all available
>>>>> JVM heap is consumed. The solution is to give the JVM more heap space.
>>>>>
>>>>> So, what is your specific symptom?
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> -----Original Message----- From: Brian McDowell
>>>>> Sent: Thursday, May 22, 2014 12:24 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: pdfs
>>>>>
>>>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>>>>> Solr completely so that it actually needs to be manually restarted. We are
>>>>> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>>>> problem because the release notes associated with the new tika version and
>>>>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>>>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>>>> would be greatly appreciated. Thank you!
>>>>>
>>>>
>>>>
>>
>
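
The advice in this thread (parse on the client, and recover from extractions that get stuck) could look roughly like the sketch below. It is only an illustration under assumptions, not code from the linked skeleton: the class name GuardedExtractor, the extract method and the timeout are hypothetical, and the Callable stands in for whatever Tika call the client really makes. It uses only the JDK, so it does not depend on any particular Tika or SolrJ version.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: bounds how long a single document extraction may run. */
final class GuardedExtractor {

    // Daemon threads, so a parser stuck in an endless loop cannot keep the client JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "extract-worker");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs the given extraction (assumed to wrap the real parser call) and returns
     * its text, or Optional.empty() if it throws or exceeds timeoutMillis.
     */
    static Optional<String> extract(Callable<String> extraction, long timeoutMillis)
            throws InterruptedException {
        Future<String> result = POOL.submit(extraction);
        try {
            return Optional.ofNullable(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // cancel(true) only requests an interrupt; a parser spinning on CPU may ignore it.
            result.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            // The parser threw: treat the document as "will not parse" and skip it.
            return Optional.empty();
        }
    }
}
```

A client would wrap its parsing call in the Callable and send to Solr only the documents for which text comes back, logging the rest for inspection. The caveat in the comments matters: a thread stuck in a CPU-bound loop may never react to the interrupt, so this keeps the indexing client moving but can leave a spinning worker behind. Running the extraction in an external process that can be killed, as Siegfried suggests, is the stronger guarantee.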