I completely agree on the impulse, and for the vast majority of the time 
(regular catchable exceptions), that'll work.  And by vast majority, I mean 
that aside from OOMs on very large files, we aren't seeing these problems 
any more in our 3 million doc corpus (yes, I know, small by today's 
standards) from govdocs1 and Common Crawl over on our Rackspace VM.

Given my focus on Tika, I'm overly sensitive to the worst-case scenarios.  I 
find it encouraging, Erick, that you haven't seen these types of problems, that 
users aren't complaining too often about catastrophic failures of Tika within 
Solr Cell, and that this thread is not yet swamped with integrators agreeing 
with me. :)

However, because an OOM can leave memory in a corrupted state (right?), 
because you can't actually kill a thread that is permanently hung, and 
because Tika is a kitchen sink and we can't prevent memory leaks in our 
dependencies, one needs to be aware that bad things can happen... if only 
very, very rarely.  For a fellow traveler who has run into these issues on 
massive data sets, see also [0].
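
To make the hang problem concrete, below is a rough, untested sketch of the 
in-process approach: run the parse on a worker thread and give up after a 
timeout (the 60 seconds and the single-threaded pool are arbitrary choices 
for illustration).  The catch is that cancel(true) only sends an interrupt; 
a parser spinning in a loop that never checks the interrupt flag keeps 
running and keeps holding its memory.

import java.io.File;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.tika.Tika;

public class TimeoutParseSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Tika tika = new Tika();
        // Parse on a worker thread so the caller can enforce a time budget.
        Future<String> result =
                pool.submit(() -> tika.parseToString(new File(args[0])));
        try {
            String text = result.get(60, TimeUnit.SECONDS);
            System.out.println(text.length() + " chars extracted");
        } catch (TimeoutException e) {
            // Best effort only: this interrupts the worker, but it cannot
            // actually kill a thread that ignores the interrupt.
            result.cancel(true);
        } catch (ExecutionException e) {
            // The "regular catchable exception" case: log it and move on.
        } finally {
            pool.shutdownNow();
        }
    }
}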

Configuring Hadoop to work around these types of problems is not too 
difficult -- it has to be done with some thought, though.  On conventional 
single-box setups, the ForkParser within Tika is one option (rough sketch 
below), and tika-batch is another.  Hand-rolling your own parent/child 
process is non-trivial and is not necessary for the vast majority of use 
cases.
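
For anyone curious what the ForkParser route looks like, here's a rough, 
untested sketch.  The pool size, the AutoDetectParser and the unlimited 
write limit are placeholder choices; adjust them for your own setup.  The 
win over the timeout wrapper above is that a truly wedged or OOMing parse 
is contained in the child JVM instead of taking down your crawler.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ForkParserSketch {
    public static void main(String[] args) throws Exception {
        // The actual parsing runs in child JVMs managed by ForkParser.
        ForkParser parser = new ForkParser(
                ForkParserSketch.class.getClassLoader(), new AutoDetectParser());
        parser.setPoolSize(4); // number of child processes to keep around
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            parser.parse(stream, handler, new Metadata(), new ParseContext());
            System.out.println(handler.toString());
        } finally {
            parser.close(); // shuts down the child processes
        }
    }
}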


[0] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
 



-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, February 09, 2016 10:05 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: How is Tika used with Solr

My impulse would be to _not_ run Tika in its own JVM, just catch any exceptions 
in my code and "do the right thing". I'm not sure I see any real benefit in yet 
another JVM.

FWIW,
Erick

On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
> I have one answer here [0], but I'd be interested to hear what Solr 
> users/devs/integrators have experienced on this topic.
>
> [0] 
> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1P
> R09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlo
> ok.com%3E
>
> -----Original Message-----
> From: Steven White [mailto:swhite4...@gmail.com]
> Sent: Tuesday, February 09, 2016 6:33 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How is Tika used with Solr
>
> Thank you Erick and Alex.
>
> My main question is about a long-running process using Tika in the same JVM 
> as my application.  I'm running my file-system-crawler in its own JVM (not 
> Solr's).  On the Tika mailing list, it is suggested to run Tika's code in 
> its own JVM and invoke it from my file-system-crawler using 
> Runtime.getRuntime().exec().
>
> I fully understand from Alex's suggestion and the link provided by Erick to 
> use Tika outside Solr.  But what about using Tika within the same JVM as my 
> file-system-crawler application, or should I be making a system call to 
> invoke another JAR that runs in its own JVM to extract the raw text?  Are 
> there known issues with Tika when used in a long-running process?
>
> Steve
>
>
