Does this script also save a memory dump of the JVM?

Ciao,
Vincenzo

--
mobile: 3498513251
skype: free.dev

> On 2 Aug 2018, at 17:53, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> Thomas:
> 
> You've obviously done a lot of work to track this, but maybe you can
> do even more ;).
> 
> Here's a link to a program that uses Tika to parse docs _on the client_:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
> 
> If you take out all the DB and Solr parts, you're left with something
> that just parses docs with Tika. My idea here is to feed it your docs
> and see whether the same memory differences show up between the
> versions of Tika. If they do, a heap dump would help the Tika folks
> enormously in tracking this down.
> 
> And if there's no memory creep, that points toward the glue code in Solr.
> 
> I also have to add that this kind of thing is one of the reasons we
> generally recommend that production systems do not use
> ExtractingRequestHandler. There are other reasons outlined in the link
> above....
> 
> Best,
> Erick
> 
> On Thu, Aug 2, 2018 at 4:30 AM, Thomas Scheffler
> <thomas.scheff...@uni-jena.de> wrote:
>> Hi,
>> 
>> My final verdict: the culprit is the upgrade to Tika 1.17. If I downgrade
>> only the Tika libraries to 1.16 and keep the rest of Solr 7.4.0, heap usage
>> after about 85 % of the indexing run and a manually triggered garbage
>> collection is about 60-70 MB (that low!).
>> 
>> My problem now is that we have several setups that trigger this reliably,
>> but there is no simple test case that "fails" if Tika 1.17 or 1.18 is used.
>> I also do not know whether the error is inside Tika or inside the glue code
>> that makes Tika usable in Solr.
>> 
>> Should I file an issue for this?
>> 
>> kind regards,
>> 
>> Thomas
>> 
>> 
>>> On 02.08.2018 at 12:06, Thomas Scheffler
>>> <thomas.scheff...@uni-jena.de> wrote:
>>> 
>>> Hi,
>>> 
>>> We noticed a memory leak in a rather small setup: 40,000 metadata documents
>>> with nearly as many files, which carry "literal.*" fields. While 7.2.1
>>> brought some Tika issues (due to a beta version), the real problems
>>> started to appear with version 7.3.0 and are still unresolved in
>>> 7.4.0. Memory consumption is through the roof. Where a 512 MB heap was
>>> previously enough, 6 GB is now not enough to index all files.
>>> I am now at a point where I can track this down to the libraries in
>>> solr-7.4.0/contrib/extraction/lib/. If I replace them all with the libraries
>>> shipped with 7.2.1, the problem disappears. As most files are PDF documents,
>>> I tried updating pdfbox to 2.0.11 and tika to 1.18, but that did not solve
>>> the problem. Next I will try to downgrade just these libraries back to 2.0.6
>>> and 1.16 to see whether they are the source of the memory leak.
>>> 
>>> In the meantime, I would like to know if anybody else has experienced the
>>> same problems?
>>> 
>>> kind regards,
>>> 
>>> Thomas
>> 
>> 
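
On the heap-dump question at the top of the thread: the following is a minimal sketch, not taken from the linked program, of how a standalone Java test harness could report heap usage after a requested GC and write a heap dump itself. It uses only JDK classes. The class name HeapProbe and the default file name are made up for illustration, the Tika parsing loop is deliberately left as a comment because it needs the Tika jars, and HotSpotDiagnosticMXBean is only available on HotSpot-based JVMs.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; the name is an assumption, not part of Solr or Tika.
final class HeapProbe {

    /** Returns used heap in bytes after requesting a GC (the JVM may ignore the request). */
    static long usedHeapAfterGc() {
        System.gc();
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /**
     * Writes a heap dump of live objects to the given file.
     * The file must not exist yet and should end in ".hprof".
     */
    static void dumpHeap(Path target) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(target.toString(), true);
    }

    public static void main(String[] args) throws IOException {
        // Default name includes a timestamp so repeated runs do not collide.
        Path target = Paths.get(args.length > 0
                ? args[0]
                : "heap-" + System.currentTimeMillis() + ".hprof");

        // ... run the Tika-only parsing loop over the documents here (not shown) ...

        System.out.printf("used heap after GC: %d MB%n",
                usedHeapAfterGc() / (1024 * 1024));
        dumpHeap(target);
        System.out.println("heap dump written to " + target.toAbsolutePath());
    }
}
```

Running the same harness once per Tika version and comparing the reported numbers and the resulting .hprof files would be one way to separate Tika itself from the Solr glue code. Without code changes, starting the JVM with -XX:+HeapDumpOnOutOfMemoryError also produces a dump when the heap is exhausted.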
