On 8/2/2018 5:30 AM, Thomas Scheffler wrote:
> my final verdict is that the upgrade to Tika 1.17 is the cause. If I
> downgrade just the Tika libraries back to 1.16 and keep the rest of
> Solr 7.4.0, the heap usage after about 85% of the index process and a
> manual trigger of the garbage collector is about 60-70 MB (that low!).
> My problem now is that we have several setups that trigger this
> reliably, but there is no simple test case that "fails" when Tika 1.17
> or 1.18 is used. I also do not know whether the error is inside Tika
> or inside the glue code that makes Tika usable in Solr.
If downgrading Tika fixes the issue, then it doesn't seem (to me) very
likely that Solr's glue code for the Extracting Request Handler (ERH)
has a problem. If it's not Solr's code that has the problem, there will
be nothing we can do about it other than change the Tika library
included with Solr.
Before filing an issue, you should discuss this with the Tika project on
their mailing list. They'll want to make sure that they can fix the
problem in a future version. It might not be an actual memory leak ...
it could just be that one of the documents you're trying to index is one
that Tika requires a huge amount of memory to handle. But it could be a
memory leak.
If you know which document is being worked on when it runs out of
memory, can you try excluding that document from your indexing run, to
see if the problem still occurs?
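It may also help to run Tika by itself, outside Solr, on the suspect
file. Here's a minimal standalone sketch of that idea; the class name
is mine, and the -1 write limit is an assumption meant to mimic full
extraction the way Solr would do it:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaIsolationTest {
    public static void main(String[] args) throws Exception {
        // args[0] is the path to the suspect document.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            AutoDetectParser parser = new AutoDetectParser();
            // -1 disables the default character write limit so the whole
            // document is extracted, closer to what happens inside Solr.
            BodyContentHandler handler = new BodyContentHandler(-1);
            parser.parse(in, handler, new Metadata(), new ParseContext());
            System.out.println("Extracted " + handler.toString().length()
                    + " characters");
        }
    }
}

Run it with a deliberately small heap (for example, java -Xmx512m)
against Tika 1.16 and then 1.17 on the same file; if only 1.17 blows
up, you have a self-contained reproduction to hand to the Tika project.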
Please note that it is strongly recommended that you do not use the
Extracting Request Handler in production. Tika is prone to many
problems, and those problems will generally affect Solr if Tika is being
run inside Solr. Because of this, it is recommended that you write a
separate program using Tika that handles extracting information from
documents and sending that data to Solr. If that program crashes, Solr
remains operational.
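As a rough illustration of that approach, a standalone program along
these lines (an untested sketch; the Solr URL, collection name, and
field names are placeholders for whatever your setup actually uses)
would combine Tika with SolrJ:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ExternalExtractor {
    public static void main(String[] args) throws Exception {
        // URL and field names are examples; adjust for your schema.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            AutoDetectParser parser = new AutoDetectParser();
            for (String arg : args) {
                Path file = Paths.get(arg);
                BodyContentHandler handler = new BodyContentHandler(-1);
                try (InputStream in = Files.newInputStream(file)) {
                    parser.parse(in, handler, new Metadata(),
                            new ParseContext());
                } catch (Exception e) {
                    // A Tika failure here costs one document, not the
                    // whole Solr JVM.
                    System.err.println("Extraction failed for " + file
                            + ": " + e);
                    continue;
                }
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file.toString());
                doc.addField("content", handler.toString());
                solr.add(doc);
            }
            solr.commit();
        }
    }
}

The key point of the design is the catch around the parse call: if Tika
hangs or crashes on a bad document, only this program is affected, and
Solr itself keeps serving queries.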
There is already an issue to upgrade Tika to the latest version in Solr,
but you've said that you already tried 1.18 with no change to the
problem. So whatever the problem is, it will need to be solved in Tika
1.19 or later.
Thanks,
Shawn