Hi folks,

Brian was kind enough to send me the troublesome PDF document.

I tried extracting the text with PDFBox directly (Tika uses PDFBox to extract the textual content of a PDF document):

* hitting an infinite loop with PDFBox 1.8.3
* no problems with PDFBox 1.8.4 & 1.8.5
* PDFBox 1.8.4 is part of Apache Tika 1.5 (see http://www.apache.org/dist/tika/CHANGES-1.5.txt)
* Apache SOLR 4.8 uses Tika 1.5 (see https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika)

In short, the problem with this particular PDF is solved by

* Apache PDFBox 1.8.4 onwards
* Apache Tika 1.5
* Apache SOLR 4.8
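
A feeding tool could check which PDFBox jar a deployment actually ships before sending it PDFs. The following is only a rough sketch: it assumes the jar follows the usual `pdfbox-<version>.jar` naming, and the function names are made up for illustration. The 1.8.4 threshold comes from the test above and is only known to hold for this particular PDF.

```python
import re

# First PDFBox release that handled the troublesome PDF in the test above.
FIXED_IN = (1, 8, 4)


def pdfbox_version(jar_name):
    """Return the version of a 'pdfbox-<version>.jar' file name as a tuple.

    Returns None for names that do not match, including snapshot builds
    and other jars such as fontbox.
    """
    match = re.search(r"pdfbox-(\d+(?:\.\d+)*)\.jar$", jar_name)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def has_fix(jar_name):
    """True if the jar name indicates PDFBox 1.8.4 or later."""
    version = pdfbox_version(jar_name)
    return version is not None and version >= FIXED_IN
```

Comparing tuples of integers rather than strings keeps a version such as 1.10.0 ordered after 1.8.4.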

Cheers,

Siegfried Goeschl



On 26.05.14 18:20, Erick Erickson wrote:
Brian:

Yeah, if you can share the PDF that would be great. Parsing via Tika should
not bring down Solr, although I suppose there could be something in Tika
that is pathologically bad.

You could also try using Tika itself in SolrJ and indexing from a client. That
might let you
1> more gracefully handle this without shutting down Solr
2> use different versions of Tika.

Personally I like offloading the document parsing to clients anyway since it
lessens the load on the Solr server and scales much better, but YMMV.

It's not actually very difficult; here's a skeleton (rip out the DB parts):
http://searchhub.org/2012/02/14/indexing-with-solrj/
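
Independent of the SolrJ skeleton linked above, the client-side idea can be sketched in a few lines: the text is extracted outside Solr, and Solr only ever receives plain fields. This is a minimal sketch, not the linked article's code. The core URL and the field names `id` and `content` are assumptions that must match your own schema, and the function name is made up for illustration. It relies on Solr's update handler accepting a JSON array of documents.

```python
import json
import urllib.request


def build_update_request(solr_url, doc_id, text):
    """Build (but do not send) an HTTP request that adds one document to Solr.

    'id' and 'content' are assumed field names; adjust them to your schema.
    """
    payload = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url.rstrip("/") + "/update",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a matter of `urllib.request.urlopen(request, timeout=30)`. Commits are left out on purpose here; they can come from autoCommit settings or a separate request. If the extraction step fails or hangs, only the client is affected and Solr never sees the document.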

Best,
Erick

On Sun, May 25, 2014 at 2:07 AM, Siegfried Goeschl <sgoes...@gmx.at> wrote:
Sorry typo :- can you send me the PDF by email directly :-)

Siegfried Goeschl

On 25 May 2014, at 10:06, Siegfried Goeschl <sgoes...@gmx.at> wrote:

Hi Brian,

can you send me the email? I would like to play around :-)

Have you opened a JIRA for PDFBox? If not, I will open one if I can reproduce
the issue …

Thanks in advance

Siegfried Goeschl


On 25 May 2014, at 04:18, Brian McDowell <brianmc...@gmail.com> wrote:

Our feeding (indexing) tool halts because Solr becomes unresponsive after
getting some really bad pdfs. There are levels of pdf "badness." Some just
will not parse and that's fine, but others are more problematic in that our
Operations team has to restart Solr because it just hangs and accepts no
more documents. I actually have identified a pdf that will bring down Solr
every time. Does anyone think that doing pre-validation using the pdfbox
jar will work? Or, will trying to validate just hang as well? Any help is
appreciated.


On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky <j...@basetechnology.com> wrote:

Yeah, I recall running into infinite loop issues with PDFBox in Solr years
ago. They keep fixing these issues, but they keep popping up again. Sigh.

-- Jack Krupansky

-----Original Message----- From: Siegfried Goeschl
Sent: Thursday, May 22, 2014 4:35 AM
To: solr-user@lucene.apache.org
Subject: Re: pdfs


Hi folks,

for a small customer project I'm running SOLR with embedded Tika.

* memory consumption is an issue but can be handled
* there is an issue with PDFBox hitting an infinite loop which causes
excessive CPU usage - it requires a SOLR restart but happens only once
within 400,000 documents (PDF, Word, etc.), and it seems a little bit
erratic since I was never able to track the problem back to a particular
PDF document

Having said that, we wire SOLR with Nagios to get an alarm when CPU
consumption goes through the roof.

If you are doing really serious stuff I would recommend
* moving the document extraction out of SOLR
* providing monitoring and recovery for stuck document extractions
** killing worker threads
** using external processes and killing them when they spin out of control
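
The external-process variant can be sketched roughly as follows. This is a minimal sketch under assumptions, not a description of any existing setup: the function name and the status labels are made up, and the demo commands are stand-ins for a real extractor invocation (for example a `java -jar ...` call to whatever extraction tool you use).

```python
import subprocess
import sys


def extract_with_timeout(command, timeout_seconds):
    """Run an extraction command in a child process with a hard time limit.

    Returns (status, output) where status is "ok", "failed" or "timeout".
    """
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before re-raising,
        # so a spinning extractor does not keep burning CPU.
        return ("timeout", "")
    if completed.returncode != 0:
        return ("failed", completed.stderr)
    return ("ok", completed.stdout)


if __name__ == "__main__":
    # Stand-ins for a real extractor: one well-behaved, one stuck in a loop.
    good = [sys.executable, "-c", "print('extracted text')"]
    stuck = [sys.executable, "-c", "while True: pass"]
    print(extract_with_timeout(good, 10))
    print(extract_with_timeout(stuck, 1))
```

Because the extractor runs in its own process, a document that triggers an infinite loop costs at most the timeout and can be logged and skipped, while the indexing server keeps running.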

Cheers,

Siegfried Goeschl

On 22.05.14 06:46, Jack Krupansky wrote:

Yeah, PDF extraction has always been at least somewhat problematic. It
has improved over the years, but still not likely to be perfect.

That said, I'm not aware of any specific PDF extraction issue that would
bring down Solr - as opposed to causing a 500 status with an exception
in PDF extraction, with the exception of memory usage. Some PDF
documents, especially those which are graphics-intensive, can require a lot
of memory. The rest of Solr could be adversely affected if all available
JVM heap is consumed. The solution is to give the JVM more heap space.

So, what is your specific symptom?

-- Jack Krupansky

-----Original Message----- From: Brian McDowell
Sent: Thursday, May 22, 2014 12:24 AM
To: solr-user@lucene.apache.org
Subject: pdfs

Has anyone had issues with indexing pdf files? Some pdfs are bringing down
Solr completely so that it actually needs to be manually restarted. We are
using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
problem because the release notes associated with the new tika version and
also the new pdfbox indicate fixes for pdf issues. It didn't work and now
this issue is causing us to reevaluate using Solr. Any help on this matter
would be greatly appreciated. Thank you!





