Re: iText hitting infinite loop - Was Re: pdfs

2014-06-02 Thread Erick Erickson
every time. Does anyone think that doing pre-validation using the pdfbox jar will work? Or, will trying to validate just hang as well? Any help is appreciated.

iText hitting infinite loop - Was Re: pdfs

2014-06-02 Thread Siegfried Goeschl
: Thursday, May 22, 2014 4:35 AM To: solr-user@lucene.apache.org Subject: Re: pdfs Hi folks, for a small customer project I'm running SOLR with embedded Tika. * memory consumption is an issue but can be handled * there is an issue with PDFBox hitting an infinite loop which causes excessi

Re: pdfs

2014-05-26 Thread Erick Erickson
loop issues with PDFBox in Solr years ago. They keep fixing these issues, but they keep popping up again. Sigh. -- Jack Krupansky -----Original Message----- From: Siegfried Goeschl Sent: Thursday,

Re: pdfs

2014-05-25 Thread Siegfried Goeschl
Yeah, I recall running into infinite loop issues with PDFBox in Solr years ago. They keep fixing these issues, but they keep popping up again. Sigh. -- Jack Krupansky -----Original Message----- From: Siegfried Goeschl

Re: pdfs

2014-05-25 Thread Siegfried Goeschl
Krupansky wrote: Yeah, I recall running into infinite loop issues with PDFBox in Solr years ago. They keep fixing these issues, but they keep popping up again. Sigh. -- Jack Krupansky -----Original Message----- From: Siegfried Goeschl

Re: pdfs

2014-05-24 Thread Brian McDowell
AM To: solr-user@lucene.apache.org Subject: Re: pdfs Hi folks, for a small customer project I'm running SOLR with embedded Tika. * memory consumption is an issue but can be handled * there is an issue with PDFBox hitting an infinite loop which cau

Re: pdfs

2014-05-22 Thread Jack Krupansky
Subject: Re: pdfs Hi folks, for a small customer project I'm running SOLR with embedded Tika. * memory consumption is an issue but can be handled * there is an issue with PDFBox hitting an infinite loop which causes excessive CPU usage - requires SOLR restart but happens only once within 40

Re: pdfs

2014-05-22 Thread Siegfried Goeschl
Hi folks, for a small customer project I'm running SOLR with embedded Tika. * memory consumption is an issue but can be handled * there is an issue with PDFBox hitting an infinite loop which causes excessive CPU usage - requires SOLR restart but happens only once within 400.000 documents (PD

Re: pdfs

2014-05-21 Thread Jack Krupansky
Yeah, PDF extraction has always been at least somewhat problematic. It has improved over the years, but still not likely to be perfect. That said, I'm not aware of any specific PDF extraction issue that would bring down Solr - as opposed to causing a 500 status with an exception in PDF extract

Re: pdfs

2014-05-21 Thread Alexandre Rafalovitch
Run Tika in a client instead? Or as a standalone server (listening over a TCP socket). Ship only extractions to Solr. This is more efficient as well. I suspect there would always be PDFs that cause strange behaviour, even if just based on memory requirements (e.g. embedded images). If that becomes a
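
One way to act on the "extract outside Solr" suggestion is to run the extractor as a separate process with a time limit, so a PDF that sends the parser into an infinite loop costs one killed process rather than a Solr restart. The following is a minimal sketch, not something posted in the thread: `extract_with_timeout` is a hypothetical helper name, and the command it runs is whatever extraction command line you actually use. The usage lines use the Python interpreter itself as a stand-in extractor.

```python
import subprocess
import sys


def extract_with_timeout(command, timeout_seconds):
    """Run an external extraction command in its own process.

    Returns the command's stdout as text, or None if the command
    exceeds the time limit or exits with a non-zero status.
    `command` is a list of arguments; which extractor it invokes is
    up to the caller (assumption: it prints extracted text to stdout).
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout


# Stand-in "extractors": one that finishes, one that loops forever.
good = extract_with_timeout([sys.executable, "-c", "print('extracted text')"], 30)
hung = extract_with_timeout([sys.executable, "-c", "while True: pass"], 1)
```

Only documents for which the helper returns text would then be sent to Solr; the ones that return None can be logged and skipped or retried with a different parser.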