Re: Solr dih extract text from inline images in pdf

Erick Erickson Wed, 07 Mar 2018 07:41:20 -0800

You're missing Charlie's point, and if you read the blog I pointed you
to that point is reiterated.

DIH does the Tika processing on the Solr node that is _also_ indexing
documents and satisfying queries. Parsing a semi-structured document
(PDF in this case) consumes CPU cycles and memory, all _within_ the
Solr process. You can easily create an OOM problem on the Solr node if
someone drops, say, a 2G file in your directory structure and you
blithely send it to Solr via DIH.

Additionally, there are so many variants of, say, the PDF "standard"
that some edge case somewhere can cause (and has caused) Tika to blow its
brains out. The Tika folks have done a marvelous job of fixing these
when they come up, but it's a never-ending battle.

If you do the Tika processing in your own Java process, you isolate
your Solr nodes from these issues.
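
To make that concrete, here is a minimal sketch of the client-side loop,
not a definitive implementation. It uses only the JDK so the shape is
visible: the class name ClientSideExtractor, the TextExtractor interface,
and the maxBytes limit are all hypothetical. In a real client the extractor
would wrap your Tika parsing call and the indexer would send the extracted
text to Solr (for example via SolrJ); neither is shown here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

/** Runs extraction in the client JVM so parse failures never touch Solr. */
final class ClientSideExtractor {

    /** Hypothetical stand-in for the Tika call; may throw on a malformed file. */
    public interface TextExtractor {
        String extract(Path file) throws Exception;
    }

    private final long maxBytes;                       // assumed size cutoff
    private final TextExtractor extractor;             // e.g. a Tika wrapper
    private final BiConsumer<String, String> indexer;  // (id, text), e.g. a SolrJ wrapper
    private final List<String> skipped = new ArrayList<>();

    ClientSideExtractor(long maxBytes, TextExtractor extractor,
                        BiConsumer<String, String> indexer) {
        this.maxBytes = maxBytes;
        this.extractor = extractor;
        this.indexer = indexer;
    }

    /** Returns true if the file was extracted and handed to the indexer. */
    boolean process(Path file) {
        try {
            // Refuse oversized files before any parsing starts.
            if (Files.size(file) > maxBytes) {
                skipped.add(file + ": larger than " + maxBytes + " bytes");
                return false;
            }
            String text = extractor.extract(file);
            indexer.accept(file.toString(), text);
            return true;
        } catch (Exception e) {
            // A parser failure is recorded here; nothing was sent to Solr.
            skipped.add(file + ": " + e);
            return false;
        }
    }

    /** Files that were not indexed, with the reason for each. */
    List<String> skipped() {
        return skipped;
    }
}
```

If a file is bad enough to take this process down with an OOM, only the
client dies; Solr keeps indexing and serving queries, and you can restart
the client and move on to the next file.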

Up to you of course.
Erick

On Wed, Mar 7, 2018 at 5:39 AM, lala <labisha...@gmail.com> wrote:
> I don't know what the problem is; when posting the message, the XML format
> inside the   is not correct, it should contain ["<"param
> name="extractInlineImages" type="bool">true] AND ["<"param
> name="sortByPosition" type="bool">true]...
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
