Siegfried: Thanks! That pretty well nails the issue as being in Tika, it's nice to know!
Erick

On Mon, Jun 2, 2014 at 10:14 AM, Siegfried Goeschl <sgoes...@gmx.at> wrote:

> Hi folks,
>
> Brian was so kind and sent me the troublesome PDF document
>
> I gave it a try with PDFBox directly in order to extract the text (PDFBox
> is used by Tika to extract the textual content of a PDF document)
>
> * hitting an infinite loop with PDFBox 1.8.3
> * no problems with PDFBox 1.8.4 & 1.8.5
> * PDFBox 1.8.4 is part of Apache Tika 1.5 (see
>   http://www.apache.org/dist/tika/CHANGES-1.5.txt)
> * Apache SOLR 4.8 uses Tika 1.5 (see
>   https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika)
>
> In short the problem with this particular PDF is solved by
>
> * Apache PDFBox 1.8.4 onwards
> * Apache Tika 1.5
> * Apache SOLR 4.8
>
> Cheers,
>
> Siegfried Goeschl
>
> On 26.05.14 18:20, Erick Erickson wrote:
>
>> Brian:
>>
>> Yeah, if you can share the PDF that would be great. Parsing via Tika should
>> not bring down Solr, although I suppose there could be something in Tika
>> that is pathologically bad.
>>
>> You could also try using Tika itself in SolrJ and indexing from a client.
>> That might let you
>> 1> more gracefully handle this without shutting down Solr
>> 2> use different versions of Tika.
>>
>> Personally I like offloading the document parsing to clients anyway since it
>> lessens the load on the Solr server and scales much better, but YMMV.
>>
>> It's not actually very difficult, here's a skeleton (rip out the DB parts)
>> http://searchhub.org/2012/02/14/indexing-with-solrj/
>>
>> Best,
>> Erick
>>
>> On Sun, May 25, 2014 at 2:07 AM, Siegfried Goeschl <sgoes...@gmx.at> wrote:
>>
>>> Sorry typo :- can you send me the PDF by email directly :-)
>>>
>>> Siegfried Goeschl
>>>
>>> On 25 May 2014, at 10:06, Siegfried Goeschl <sgoes...@gmx.at> wrote:
>>>
>>>> Hi Brian,
>>>>
>>>> can you send me the email? I would like to play around :-)
>>>>
>>>> Have you opened a JIRA for PDFBox?
>>>> If not I will open one if I can reproduce the issue …
>>>>
>>>> Thanks in advance
>>>>
>>>> Siegfried Goeschl
>>>>
>>>> On 25 May 2014, at 04:18, Brian McDowell <brianmc...@gmail.com> wrote:
>>>>
>>>>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>>>>> getting some really bad pdfs. There are levels of pdf "badness." Some just
>>>>> will not parse and that's fine, but others are more problematic in that our
>>>>> Operations team has to restart Solr because it just hangs and accepts no
>>>>> more documents. I actually have identified a pdf that will bring down Solr
>>>>> every time. Does anyone think that doing pre-validation using the pdfbox
>>>>> jar will work? Or, will trying to validate just hang as well? Any help is
>>>>> appreciated.
>>>>>
>>>>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>>>>>
>>>>>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>>>>>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Siegfried Goeschl
>>>>>> Sent: Thursday, May 22, 2014 4:35 AM
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Subject: Re: pdfs
>>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> for a small customer project I'm running SOLR with embedded Tika.
>>>>>>
>>>>>> * memory consumption is an issue but can be handled
>>>>>> * there is an issue with PDFBox hitting an infinite loop which causes
>>>>>>   excessive CPU usage - requires SOLR restart but happens only once
>>>>>>   within 400.000 documents (PDF, Word, etc.) but it seems a little bit
>>>>>>   erratic since I was never able to track the problem back to a particular
>>>>>>   PDF document
>>>>>>
>>>>>> Having said that we wire SOLR with Nagios to get an alarm when CPU
>>>>>> consumption goes through the roof
>>>>>>
>>>>>> If you are doing really serious stuff I would recommend
>>>>>> * moving the document extraction stuff out of SOLR
>>>>>> * provide monitoring and recovery for stuck document extractions
>>>>>> ** killing worker threads
>>>>>> ** using external processes and killing them when spinning out of control
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Siegfried Goeschl
>>>>>>
>>>>>> On 22.05.14 06:46, Jack Krupansky wrote:
>>>>>>
>>>>>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>>>>>> has improved over the years, but still not likely to be perfect.
>>>>>>>
>>>>>>> That said, I'm not aware of any specific PDF extraction issue that would
>>>>>>> bring down Solr - as opposed to causing a 500 status with an exception
>>>>>>> in PDF extraction, with the exception of memory usage. Some PDF
>>>>>>> documents, especially those which are graphic-intense, can require a lot
>>>>>>> of memory. The rest of Solr could be adversely affected if all available
>>>>>>> JVM heap is consumed. The solution is to give the JVM more heap space.
>>>>>>>
>>>>>>> So, what is your specific symptom?
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Brian McDowell
>>>>>>> Sent: Thursday, May 22, 2014 12:24 AM
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Subject: pdfs
>>>>>>>
>>>>>>> Has anyone had issues with indexing pdf files?
>>>>>>> Some pdfs are bringing down
>>>>>>> Solr completely so that it actually needs to be manually restarted. We are
>>>>>>> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>>>>>> problem because the release notes associated with the new tika version and
>>>>>>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>>>>>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>>>>>> would be greatly appreciated. Thank you!
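
For anyone finding this thread later: the advice above to move extraction out of Solr and kill external processes when they spin out of control could look roughly like the sketch below. It is only an illustration, not what anyone in the thread actually ran. The jar name `tika-app.jar`, the helper names and the 60 second limit are all assumptions; the only Tika-specific detail relied on is the `--text` option of the Tika command-line app. A stuck parser is killed together with its JVM, so Solr itself never sees the bad document.

```python
import subprocess
import sys

# Hypothetical location of the Tika command-line jar; adjust to your setup.
TIKA_APP_JAR = "tika-app.jar"


def tika_text_command(pdf_path, jar=TIKA_APP_JAR):
    """Build the command line that asks the Tika app for plain text of one file."""
    return ["java", "-jar", jar, "--text", pdf_path]


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a child process and kill it if it exceeds timeout_s seconds.

    Returns (ok, stdout); ok is False on timeout, launch failure,
    or a non-zero exit code.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child when this is raised.
        return False, ""
    except OSError:
        # The executable (for example java) could not be started.
        return False, ""
    return done.returncode == 0, done.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    ok, text = run_with_timeout(tika_text_command(sys.argv[1]), timeout_s=60)
    if ok:
        print(text)  # hand the text to your indexing client from here
    else:
        print("extraction failed or timed out: " + sys.argv[1], file=sys.stderr)
```

Documents that fail or time out can be logged and skipped, and the extracted text for the rest sent to Solr from the client, along the lines of the SolrJ approach suggested earlier in the thread.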