Siegfried:

Thanks! That pretty well nails the issue as being in Tika; it's nice to
know!

Erick


On Mon, Jun 2, 2014 at 10:14 AM, Siegfried Goeschl <sgoes...@gmx.at> wrote:

> Hi folks,
>
> Brian was kind enough to send me the troublesome PDF document.
>
> I gave it a try with PDFBox directly in order to extract the text (PDFBox
> is used by Tika to extract the textual content of a PDF document):
>
> * hitting an infinite loop with PDFBox 1.8.3
> * no problems with PDFBox 1.8.4 & 1.8.5
> * PDFBox 1.8.4 is part of Apache Tika 1.5 (see
>   http://www.apache.org/dist/tika/CHANGES-1.5.txt)
> * Apache SOLR 4.8 uses Tika 1.5 (see
>   https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika)
>
> In short, the problem with this particular PDF is solved by
>
> * Apache PDFBox 1.8.4 onwards
> * Apache Tika 1.5
> * Apache SOLR 4.8
>
> Cheers,
>
> Siegfried Goeschl
>
>
>
> On 26.05.14 18:20, Erick Erickson wrote:
>
>> Brian:
>>
>> Yeah, if you can share the PDF that would be great. Parsing via Tika
>> should not bring down Solr, although I suppose there could be something
>> in Tika that is pathologically bad.
>>
>> You could also try using Tika itself in SolrJ and indexing from a client.
>> That might let you
>> 1> more gracefully handle this without shutting down Solr
>> 2> use different versions of Tika.
>>
>> Personally I like offloading the document parsing to clients anyway,
>> since it lessens the load on the Solr server and scales much better, but
>> YMMV.
>>
>> It's not actually very difficult; here's a skeleton (rip out the DB parts):
>> http://searchhub.org/2012/02/14/indexing-with-solrj/
>>
>> Best,
>> Erick
>>
>> On Sun, May 25, 2014 at 2:07 AM, Siegfried Goeschl <sgoes...@gmx.at> wrote:
>>
>>> Sorry typo :- can you send me the PDF by email directly :-)
>>>
>>> Siegfried Goeschl
>>>
>>> On 25 May 2014, at 10:06, Siegfried Goeschl <sgoes...@gmx.at> wrote:
>>>
>>>> Hi Brian,
>>>>
>>>> can you send me the email? I would like to play around :-)
>>>>
>>>> Have you opened a JIRA for PDFBox? If not, I will open one if I can
>>>> reproduce the issue …
>>>>
>>>> Thanks in advance
>>>>
>>>> Siegfried Goeschl
>>>>
>>>>
>>>> On 25 May 2014, at 04:18, Brian McDowell <brianmc...@gmail.com> wrote:
>>>>
>>>>> Our feeding (indexing) tool halts because Solr becomes unresponsive
>>>>> after getting some really bad PDFs. There are levels of PDF "badness."
>>>>> Some just will not parse and that's fine, but others are more
>>>>> problematic in that our Operations team has to restart Solr because it
>>>>> just hangs and accepts no more documents. I have actually identified a
>>>>> PDF that will bring down Solr every time. Does anyone think that doing
>>>>> pre-validation using the pdfbox jar will work? Or will trying to
>>>>> validate just hang as well? Any help is appreciated.
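
On the pre-validation question: a check that runs PDFBox inside the same
JVM can hit the same loop, so one option is to run the extraction in a
child process with a time limit and kill it if it overruns. What follows is
only a minimal Java sketch under stated assumptions, not a tested recipe.
The class name ExternalExtraction, the tika-app.jar location and the
30-second limit are all illustrative and do not come from this thread.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

class ExternalExtraction {

    enum Outcome { OK, FAILED, TIMED_OUT }

    /**
     * Runs a command in a child process, sends its output to a file,
     * and kills the child if it has not finished within timeoutMillis.
     */
    static Outcome run(List<String> command, File output, long timeoutMillis) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);
        // Writing to a file means a full pipe can never block the child.
        builder.redirectOutput(output);
        Process process = null;
        try {
            process = builder.start();
            if (!process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                // Still running after the limit: assume it is stuck.
                process.destroyForcibly().waitFor();
                return Outcome.TIMED_OUT;
            }
            return process.exitValue() == 0 ? Outcome.OK : Outcome.FAILED;
        } catch (IOException e) {
            return Outcome.FAILED; // the command could not be started
        } catch (InterruptedException e) {
            if (process != null) {
                process.destroyForcibly();
            }
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: ExternalExtraction <file.pdf>");
            return;
        }
        // Assumed command line: the jar path and the limit are placeholders.
        List<String> command =
                List.of("java", "-jar", "tika-app.jar", "--text", args[0]);
        System.out.println(run(command, new File(args[0] + ".txt"), 30_000));
    }
}
```

A feeder could skip (and log) any document that comes back as TIMED_OUT or
FAILED and only send the rest to Solr, so a single bad PDF costs one child
process rather than a Solr restart.
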
>>>>>
>>>>>
>>>>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky
>>>>> <j...@basetechnology.com> wrote:
>>>>>
>>>>>> Yeah, I recall running into infinite loop issues with PDFBox in Solr
>>>>>> years ago. They keep fixing these issues, but they keep popping up
>>>>>> again. Sigh.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> -----Original Message----- From: Siegfried Goeschl
>>>>>> Sent: Thursday, May 22, 2014 4:35 AM
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Subject: Re: pdfs
>>>>>>
>>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> for a small customer project I'm running SOLR with embedded Tika.
>>>>>>
>>>>>> * memory consumption is an issue but can be handled
>>>>>> * there is an issue with PDFBox hitting an infinite loop, which causes
>>>>>> excessive CPU usage. It requires a SOLR restart but happens only once
>>>>>> within 400,000 documents (PDF, Word, etc.). It seems a little bit
>>>>>> erratic, since I was never able to track the problem back to a
>>>>>> particular PDF document
>>>>>>
>>>>>> Having said that we wire SOLR with Nagios to get an alarm when CPU
>>>>>> consumption goes through the roof
>>>>>>
>>>>>> If you are doing really serious stuff I would recommend
>>>>>> * moving the document extraction stuff out of SOLR
>>>>>> * providing monitoring and recovery for stuck document extractions
>>>>>> ** killing worker threads
>>>>>> ** using external processes and killing them when they spin out of
>>>>>> control
>>>>>> Cheers,
>>>>>>
>>>>>> Siegfried Goeschl
>>>>>>
>>>>>> On 22.05.14 06:46, Jack Krupansky wrote:
>>>>>>
>>>>>>> Yeah, PDF extraction has always been at least somewhat problematic.
>>>>>>> It has improved over the years, but is still not likely to be perfect.
>>>>>>>
>>>>>>> That said, I'm not aware of any specific PDF extraction issue that
>>>>>>> would bring down Solr - as opposed to causing a 500 status with an
>>>>>>> exception in PDF extraction - with the exception of memory usage.
>>>>>>> Some PDF documents, especially those which are graphics-intensive,
>>>>>>> can require a lot of memory. The rest of Solr could be adversely
>>>>>>> affected if all available JVM heap is consumed. The solution is to
>>>>>>> give the JVM more heap space.
>>>>>>>
>>>>>>> So, what is your specific symptom?
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>> -----Original Message----- From: Brian McDowell
>>>>>>> Sent: Thursday, May 22, 2014 12:24 AM
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Subject: pdfs
>>>>>>>
>>>>>>> Has anyone had issues with indexing PDF files? Some PDFs are bringing
>>>>>>> down Solr completely so that it actually needs to be manually
>>>>>>> restarted. We are using Solr 4.4 and thought that upgrading to Solr
>>>>>>> 4.8 would solve the problem because the release notes associated with
>>>>>>> the new Tika version and also the new PDFBox indicate fixes for PDF
>>>>>>> issues. It didn't work and now this issue is causing us to reevaluate
>>>>>>> using Solr. Any help on this matter would be greatly appreciated.
>>>>>>> Thank you!
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>
