Re: pdfs

Siegfried Goeschl Sun, 25 May 2014 02:08:27 -0700

Sorry, typo: can you send me the PDF by email directly? :-)

Siegfried Goeschl


On 25 May 2014, at 10:06, Siegfried Goeschl <sgoes...@gmx.at> wrote:

> Hi Brian,
> 
> can you send me the email? I would like to play around :-)
> 
> Have you opened a JIRA for PDFBox? If not, I will open one if I can reproduce
> the issue …
> 
> Thanks in advance
> 
> Siegfried Goeschl
> 
> 
> On 25 May 2014, at 04:18, Brian McDowell <brianmc...@gmail.com> wrote:
> 
>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>> getting some really bad pdfs. There are levels of pdf "badness." Some just
>> will not parse and that's fine, but others are more problematic in that our
>> Operations team has to restart Solr because it just hangs and accepts no
>> more documents. I actually have identified a pdf that will bring down Solr
>> every time. Does anyone think that doing pre-validation using the pdfbox
>> jar will work? Or, will trying to validate just hang as well? Any help is
>> appreciated.
>> 
>> 
>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky 
>> <j...@basetechnology.com> wrote:
>> 
>>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -----Original Message----- From: Siegfried Goeschl
>>> Sent: Thursday, May 22, 2014 4:35 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: pdfs
>>> 
>>> 
>>> Hi folks,
>>> 
>>> for a small customer project I'm running SOLR with embedded Tika.
>>> 
>>> * memory consumption is an issue but can be handled
>>> * there is an issue with PDFBox hitting an infinite loop which causes
>>> excessive CPU usage - it requires a SOLR restart but happens only once
>>> within 400.000 documents (PDF, Word, etc.) and it seems a little bit
>>> erratic since I was never able to track the problem back to a particular
>>> PDF document
>>> 
>>> Having said that, we wire SOLR with Nagios to get an alarm when CPU
>>> consumption goes through the roof.
>>> 
>>> If you are doing really serious stuff I would recommend
>>> * moving the document extraction stuff out of SOLR
>>> * providing monitoring and recovery for stuck document extractions
>>> ** killing worker threads
>>> ** using external processes and killing them when they spin out of control
>>> 
>>> Cheers,
>>> 
>>> Siegfried Goeschl
>>> 
>>> On 22.05.14 06:46, Jack Krupansky wrote:
>>> 
>>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>>> has improved over the years, but still not likely to be perfect.
>>>> 
>>>> That said, I'm not aware of any specific PDF extraction issue that would
>>>> bring down Solr - as opposed to causing a 500 status with an exception
>>>> in PDF extraction, with the exception of memory usage. Some PDF
>>>> documents, especially those which are graphic-intense can require a lot
>>>> of memory. The rest of Solr could be adversely affected if all available
>>>> JVM heap is consumed. The solution is to give the JVM more heap space.
>>>> 
>>>> So, what is your specific symptom?
>>>> 
>>>> -- Jack Krupansky
>>>> 
>>>> -----Original Message----- From: Brian McDowell
>>>> Sent: Thursday, May 22, 2014 12:24 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: pdfs
>>>> 
>>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>>>> Solr completely so that it actually needs to be manually restarted. We are
>>>> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>>> problem because the release notes associated with the new tika version and
>>>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>>> would be greatly appreciated. Thank you!
>>>> 
>>> 
>>> 
>
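
The recommendation quoted above (move extraction out of Solr, run it in external processes, and kill them when they spin out of control) could look roughly like the following. This is only a minimal sketch using the standard Java process API, not what anyone in the thread actually runs. The class name ExtractionWatchdog, the method runWithTimeout, and the 60 second limit are all made up for illustration, and the extractor command is whatever you pass in.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs one extraction job in its own OS process
// so that a hanging parser cannot take the indexing JVM down with it.
final class ExtractionWatchdog {

    /**
     * Starts the given command, writes its stdout to the output file and
     * waits at most timeoutMillis. Returns true only if the process exited
     * with code 0 in time; a stuck process is killed and reported as false.
     */
    static boolean runWithTimeout(List<String> command, File output, long timeoutMillis)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Send stdout to a file so a full pipe buffer cannot block the child.
        builder.redirectOutput(output);
        // Let the child's error messages show up on our own stderr.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        if (!process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            // Still running after the deadline: assume it is stuck and kill it.
            process.destroyForcibly();
            process.waitFor(); // wait until the OS has actually reaped it
            return false;
        }
        return process.exitValue() == 0;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ExtractionWatchdog <output-file> <command> [args...]");
            return;
        }
        File output = new File(args[0]);
        List<String> command = Arrays.asList(args).subList(1, args.length);
        // 60 seconds is an arbitrary example limit, tune it for your documents.
        boolean ok = runWithTimeout(command, output, 60_000);
        System.out.println(ok ? "extracted" : "failed or timed out, skipping document");
    }
}
```

The feeder would then send only the extracted text to Solr and log or quarantine any document for which this returns false. A separate process is used here rather than a worker thread because Thread.interrupt() cannot stop a loop that never checks the interrupt flag, whereas the operating system can always kill a process.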
