Does this script also save a memory dump of the JVM?

Ciao,
Vincenzo
--
mobile: 3498513251
skype: free.dev

> On 2 Aug 2018, at 17:53, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Thomas:
>
> You've obviously done a lot of work to track this, but maybe you can
> do even more ;).
>
> Here's a link to a program that uses Tika to parse docs _on the client_:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> If you take out all the DB and Solr parts, you're left with something
> that just parses docs with Tika. My idea here is to feed it your docs
> and see if there are these noticeable memory differences between the
> versions of Tika. A heap dump if there are would help the Tika folks
> enormously in tracking this down.
>
> And if there's no memory creep, that points toward the glue code in Solr.
>
> I also have to add that this kind of thing is one of the reasons we
> generally recommend that production systems do not use
> ExtractingRequestHandler. There are other reasons outlined in the link
> above....
>
> Best,
> Erick
>
> On Thu, Aug 2, 2018 at 4:30 AM, Thomas Scheffler
> <thomas.scheff...@uni-jena.de> wrote:
>> Hi,
>>
>> my final verdict is the upgrade to Tika 1.17. If I downgrade just the
>> Tika libraries back to 1.16 and keep the rest of SOLR 7.4.0, the heap usage
>> after about 85 % of the index process and a manual trigger of the garbage
>> collector is about 60-70 MB (That low!!!)
>>
>> My problem now is that we have several setups that trigger this reliably
>> but there is no simple test case that „fails“ if Tika 1.17 or 1.18 is used.
>> I also do not know if the error is inside Tika or inside the glue code that
>> makes Tika usable in SOLR.
>>
>> Should I file an issue for this?
>>
>> kind regards,
>>
>> Thomas
>>
>>
>>> On 02.08.2018 at 12:06, Thomas Scheffler
>>> <thomas.scheff...@uni-jena.de> wrote:
>>>
>>> Hi,
>>>
>>> we noticed a memory leak in a rather small setup: 40,000 metadata documents
>>> with nearly as many files that have „literal.*“ fields with them. While 7.2.1
>>> brought some Tika issues (due to a beta version), the real problems
>>> started to appear with version 7.3.0 and are still unresolved in
>>> 7.4.0. Memory consumption is through the roof. Where previously 512MB heap was
>>> enough, now 6G isn't enough to index all files.
>>> I am now at a point where I can track this down to the libraries in
>>> solr-7.4.0/contrib/extraction/lib/. If I replace them all with the libraries
>>> shipped with 7.2.1, the problem disappears. As most files are PDF documents,
>>> I tried updating pdfbox to 2.0.11 and tika to 1.18, which did not solve the
>>> problem. I will next try to downgrade these single libraries back to 2.0.6
>>> and 1.16 to see if these are the source of the memory leak.
>>>
>>> In the meantime I would like to know if anybody else has experienced the same
>>> problems?
>>>
>>> kind regards,
>>>
>>> Thomas
>>
>>
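
On the heap dump question: the measurement Thomas describes (heap usage after a manually triggered GC) and the heap dump Erick asks for could be taken from inside a standalone test program roughly as below. This is only a sketch using JDK classes. The class name HeapProbe, its method names and the default file name tika-heap.hprof are made up for illustration, the Tika parsing loop is left as a comment, HotSpotDiagnosticMXBean is specific to HotSpot-based JVMs, and System.gc() is only a request to the JVM.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Solr or Tika.
final class HeapProbe {

    /** Used heap in bytes after requesting a GC (System.gc() is only a hint). */
    static long usedHeapAfterGc() {
        System.gc();
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Writes a heap dump of live objects to the given .hprof file (HotSpot JVMs only). */
    static void dumpHeap(Path target) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // dumpHeap fails if the file already exists, so remove an old dump first.
        Files.deleteIfExists(target);
        bean.dumpHeap(target.toString(), true); // true = dump only reachable objects
    }

    public static void main(String[] args) throws IOException {
        long before = usedHeapAfterGc();
        // Placeholder: run the Tika-only parsing loop over the documents here.
        long after = usedHeapAfterGc();
        System.out.printf("used heap: %d MB before, %d MB after%n",
                before >> 20, after >> 20);
        // Assumed default file name; recent JDKs require the .hprof extension.
        dumpHeap(Paths.get(args.length > 0 ? args[0] : "tika-heap.hprof"));
    }
}
```

For a JVM that is already running, such as Solr itself, a dump can instead be taken from outside with jmap -dump:live,format=b,file=heap.hprof <pid>, or written automatically on failure by starting the JVM with -XX:+HeapDumpOnOutOfMemoryError.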