Sorry, typo: can you send me the PDF by email directly? :-)

Siegfried Goeschl

On 25 May 2014, at 10:06, Siegfried Goeschl <sgoes...@gmx.at> wrote:

> Hi Brian,
>
> can you send me the email? I would like to play around :-)
>
> Have you opened a JIRA for PdfBox? If not I will open one if I can reproduce
> the issue …
>
> Thanks in advance
>
> Siegfried Goeschl
>
>
> On 25 May 2014, at 04:18, Brian McDowell <brianmc...@gmail.com> wrote:
>
>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>> getting some really bad pdfs. There are levels of pdf "badness." Some just
>> will not parse and that's fine, but others are more problematic in that our
>> Operations team has to restart Solr because it just hangs and accepts no
>> more documents. I actually have identified a pdf that will bring down Solr
>> every time. Does anyone think that doing pre-validation using the pdfbox
>> jar will work? Or, will trying to validate just hang as well? Any help is
>> appreciated.
>>
>>
>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky
>> <j...@basetechnology.com> wrote:
>>
>>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Siegfried Goeschl
>>> Sent: Thursday, May 22, 2014 4:35 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: pdfs
>>>
>>>
>>> Hi folks,
>>>
>>> for a small customer project I'm running SOLR with embedded Tika.
>>>
>>> * memory consumption is an issue but can be handled
>>> * there is an issue with PDFBox hitting an infinite loop which causes
>>> excessive CPU usage - requires SOLR restart but happens only once
>>> within 400.000 documents (PDF, Word, etc.) but it seems a little bit
>>> erratic since I was never able to track the problem back to a particular
>>> PDF document
>>>
>>> Having said that we wire SOLR with Nagios to get an alarm when CPU
>>> consumption goes through the roof
>>>
>>> If you are doing really serious stuff I would recommend
>>> * moving the document extraction stuff out of SOLR
>>> * provide monitoring and recovery for stuck document extractions
>>> ** killing worker threads
>>> ** using external processes and kill them when spinning out of control
>>>
>>> Cheers,
>>>
>>> Siegfried Goeschl
>>>
>>> On 22.05.14 06:46, Jack Krupansky wrote:
>>>
>>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>>> has improved over the years, but still not likely to be perfect.
>>>>
>>>> That said, I'm not aware of any specific PDF extraction issue that would
>>>> bring down Solr - as opposed to causing a 500 status with an exception
>>>> in PDF extraction, with the exception of memory usage. Some PDF
>>>> documents, especially those which are graphic-intense can require a lot
>>>> of memory. The rest of Solr could be adversely affected if all available
>>>> JVM heap is consumed. The solution is to give the JVM more heap space.
>>>>
>>>> So, what is your specific symptom?
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> -----Original Message----- From: Brian McDowell
>>>> Sent: Thursday, May 22, 2014 12:24 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: pdfs
>>>>
>>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>>>> Solr completely so that it actually needs to be manually restarted.
>>>> We are using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>>> problem because the release notes associated with the new tika version and
>>>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>>> would be greatly appreciated. Thank you!
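
The recommendation in this thread to move document extraction out of Solr, run it in external processes, and kill those processes when they spin out of control could be sketched roughly as below. This is a minimal Python sketch under stated assumptions, not something posted by the participants: the function name extract_with_timeout, the timeout values, and the extractor command are all hypothetical. The demo uses small Python one-liners as stand-ins for a real extractor so that it runs without any PDF tooling installed.

```python
import subprocess
import sys


def extract_with_timeout(command, timeout_seconds):
    """Run an extraction command in a child process.

    Returns the child's stdout as text, or None if the child exceeded
    the timeout or exited with a non-zero status. `command` is whatever
    extractor invocation you use (hypothetical here).
    """
    try:
        completed = subprocess.run(
            command,
            capture_output=True,   # collect stdout/stderr instead of inheriting them
            text=True,             # decode output as text
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so a hung extraction cannot keep burning CPU in the background.
        return None
    if completed.returncode != 0:
        # The extractor failed cleanly; treat the document as unparseable.
        return None
    return completed.stdout


if __name__ == "__main__":
    # Stand-in for a well-behaved extractor.
    good = [sys.executable, "-c", "print('extracted text')"]
    # Stand-in for an extractor stuck in an infinite loop.
    stuck = [sys.executable, "-c", "while True: pass"]

    print(extract_with_timeout(good, timeout_seconds=10))
    print(extract_with_timeout(stuck, timeout_seconds=1))
```

Because the hang is confined to the child process, the indexing tool can skip or quarantine the offending document and only send successfully extracted text to Solr, rather than requiring a Solr restart.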