On 8/2/2018 5:30 AM, Thomas Scheffler wrote:
My final verdict is that the upgrade to Tika 1.17 is the cause. If I downgrade 
just the Tika libraries back to 1.16 and keep the rest of Solr 7.4.0, the heap 
usage after about 85% of the indexing process and a manually triggered garbage 
collection is about 60-70 MB (that low!).

My problem now is that we have several setups that trigger this reliably, but 
there is no simple test case that „fails“ if Tika 1.17 or 1.18 is used. I also 
do not know whether the error is inside Tika or inside the glue code that makes 
Tika usable in Solr.

If downgrading Tika fixes the issue, then it doesn't seem (to me) very likely that Solr's glue code for the Extracting Request Handler (ERH) has a problem. If the problem isn't in Solr's code, there is nothing we can do about it other than change the Tika version included with Solr.

Before filing an issue, you should discuss this with the Tika project on their mailing list.  They'll want to make sure that they can fix the problem in a future version.  It might not be an actual memory leak ... it could just be that one of the documents you're trying to index is one that Tika requires a huge amount of memory to handle.  But it could be a memory leak.

If you know which document is being worked on when it runs out of memory, can you try not including that document in your indexing, to see if it still has a problem?

Please note that it is strongly recommended that you do not use the Extracting Request Handler in production.  Tika is prone to many problems, and those problems will generally affect Solr if Tika is being run inside Solr.  Because of this, it is recommended that you write a separate program using Tika that handles extracting information from documents and sending that data to Solr.  If that program crashes, Solr remains operational.
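
As a rough illustration of that separation, here is a minimal sketch in Python. It is only one possible shape for such a program, and it rests on assumptions that are not from this thread: a standalone tika-server listening on localhost:9998, a Solr collection named "mycollection", and a text field named "content_txt". The function names are made up for the example. Extraction happens in a different process from Solr, and a document that fails is recorded and skipped rather than stopping the run.

```python
import json
import sys
import urllib.request

# Assumed endpoints (hypothetical; adjust for your own setup).
TIKA_URL = "http://localhost:9998/tika"  # standalone tika-server
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commitWithin=10000"


def extract_text(path, timeout=60):
    """Send the raw file to tika-server and return the extracted plain text."""
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        TIKA_URL, data=data, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def build_update_payload(doc_id, text):
    """Build the JSON body for Solr's update handler: a list of documents."""
    # "content_txt" is a hypothetical field name; use one from your schema.
    return json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")


def send_to_solr(payload, timeout=60):
    """POST the JSON documents to Solr and return the HTTP status code."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def index_files(paths):
    """Extract and index each file; return a list of (path, error) failures."""
    failed = []
    for path in paths:
        try:
            text = extract_text(path)
            send_to_solr(build_update_payload(path, text))
        except Exception as exc:
            # A problem document is recorded and skipped; Solr is unaffected.
            failed.append((path, exc))
    return failed


if __name__ == "__main__":
    for path, exc in index_files(sys.argv[1:]):
        print(f"failed: {path}: {exc}", file=sys.stderr)
```

Run it with the files to index as arguments. The list of failures it prints is also a cheap way to find out which documents Tika has trouble with.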

There is already an issue to upgrade Tika to the latest version in Solr, but you've said that you tried 1.18 already with no change to the problem.  So whatever the problem is, it will need to be solved in 1.19 or later.

Thanks,
Shawn
