>>>>> every time. Does anyone think that doing pre-validation using the
>>>>> pdfbox jar will work? Or, will trying to validate just hang as well?
>>>>> Any help is appreciated.
>>>>>
>>>>>
>>>
: Thursday, May 22, 2014 4:35 AM
To: solr-user@lucene.apache.org
Subject: Re: pdfs
Hi folks,
for a small customer project I'm running SOLR with embedded Tika.
* memory consumption is an issue but can be handled
* there is an issue with PDFBox hitting an infinite loop which causes
excessi
Krupansky
> wrote:
>
>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Siegfried Goeschl
AM
> To: solr-user@lucene.apache.org
> Subject: Re: pdfs
>
>
> Hi folks,
>
> for a small customer project I'm running SOLR with embedded Tika.
>
> * memory consumption is an issue but can be handled
> * there is an issue with PDFBox hitting an infinite loop which cau
Hi folks,
for a small customer project I'm running SOLR with embedded Tika.
* memory consumption is an issue but can be handled
* there is an issue with PDFBox hitting an infinite loop which causes
excessive CPU usage - requires SOLR restart but happens only once
within 400.000 documents (PD
Yeah, PDF extraction has always been at least somewhat problematic. It has
improved over the years, but still not likely to be perfect.
That said, I'm not aware of any specific PDF extraction issue that would
bring down Solr - as opposed to causing a 500 status with an exception in
PDF extract
Run Tika in a client instead? (Or as a standalone server listening over
a TCP socket.) Ship only extractions to Solr. This is more efficient as
well.
I suspect there will always be PDFs that cause strange behaviour,
even if just based on memory requirements (e.g. embedded images). If
that becomes a
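
To make the "extract in a client" idea concrete, here is a rough sketch of running the extractor as a separate process with a hard timeout, so a PDF that sends the parser into an infinite loop only costs one killed child process rather than a Solr restart. The function name `extract_with_timeout` and the timeout values are made up for illustration, and the extractor command is whatever you run on your side; nothing here comes from the thread itself.

```python
import subprocess
import sys


def extract_with_timeout(command, timeout_seconds):
    """Run an external extractor command (hypothetical helper).

    Returns the command's stdout as text, or None if the command
    hangs past the timeout or exits with a non-zero status.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,   # collect stdout/stderr instead of inheriting
            text=True,             # decode output as text
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a looping
        # parser does not keep burning CPU after we give up on it.
        return None
    if result.returncode != 0:
        # Extractor failed; the caller can log the file and skip it.
        return None
    return result.stdout


# Stand-in commands so the sketch can be tried without Java installed:
# one well-behaved "extractor" and one that never returns.
good = [sys.executable, "-c", "print('extracted text')"]
stuck = [sys.executable, "-c", "while True: pass"]

text = extract_with_timeout(good, timeout_seconds=30)
hung = extract_with_timeout(stuck, timeout_seconds=1)
```

With a real setup the command would be something like `["java", "-jar", "tika-app.jar", "--text", "some.pdf"]`, where the jar location is an assumption about your install. Only the returned text would then be posted to Solr as an ordinary document, and files that come back as None can be set aside for a closer look.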