You're missing Charlie's point, which the blog I pointed you to also reiterates.
DIH does the Tika processing on the Solr node that is _also_ indexing documents and satisfying queries. Parsing a semi-structured document (PDF in this case) consumes CPU cycles and memory, all _within_ the Solr process. You can easily create an OOM problem on the Solr node if someone drops, say, a 2GB file in your directory structure and you blithely send it to Solr via DIH.

Additionally, there are so many variants of, say, the PDF "standard" that some edge case somewhere can cause (and has caused) Tika to blow its brains out. The Tika folks have done a marvelous job of fixing these when they come up, but it's a never-ending battle.

If you do the Tika processing in your own Java process, you isolate your Solr nodes from these issues.

Up to you of course.

Erick

On Wed, Mar 7, 2018 at 5:39 AM, lala <labisha...@gmail.com> wrote:
> I dont' know what is the problem, when posting the message, the xml format
> inside the is not correct, it should contain ["<"param
> name="extractInlineImages" type="bool">true] AND ["<"param
> name="sortByPosition" type="bool">true]...
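
To make the isolation point in the reply above concrete, here is a minimal sketch of the kind of guard a separate client process can put around parsing. Everything in it is an assumption for illustration: the class name GuardedExtractor, the size cap, and the timeout are hypothetical, and the parser is passed in as a plain Function standing in for whatever Tika call you would actually make. It is not how DIH or Tika work internally, just one way to keep a bad file from hurting anything but the client.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical helper: runs a text extractor outside Solr with a size cap and a timeout.
final class GuardedExtractor {
    private final long maxBytes;                  // assumed cap; files above it are skipped
    private final long timeoutSeconds;            // assumed per-file time budget
    private final Function<Path, String> parser;  // stand-in for the real Tika call

    GuardedExtractor(long maxBytes, long timeoutSeconds, Function<Path, String> parser) {
        this.maxBytes = maxBytes;
        this.timeoutSeconds = timeoutSeconds;
        this.parser = parser;
    }

    // Returns the extracted text, or empty if the file was skipped or parsing failed.
    Optional<String> extract(Path file) {
        try {
            if (Files.size(file) > maxBytes) {
                return Optional.empty();          // e.g. the 2GB file never gets parsed
            }
        } catch (IOException e) {
            return Optional.empty();              // unreadable file: skip it
        }

        // Daemon thread so a parser that ignores interruption cannot keep the JVM alive.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extract");
            t.setDaemon(true);
            return t;
        });
        Callable<String> task = () -> parser.apply(file);
        Future<String> result = pool.submit(task);
        try {
            return Optional.ofNullable(result.get(timeoutSeconds, TimeUnit.SECONDS));
        } catch (TimeoutException | ExecutionException e) {
            // Parser hung or threw (ExecutionException also wraps Errors such as
            // OutOfMemoryError). The damage stays in this client JVM, not in Solr.
            result.cancel(true);
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result.cancel(true);
            return Optional.empty();
        } finally {
            pool.shutdownNow();
        }
    }
}
```

In a real client you would plug the Tika parse in as the parser function and send whatever text comes back to Solr with SolrJ; a file that returns empty gets logged and skipped, and Solr never sees it.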