Y, and you can't actually kill a thread. You can ask nicely via Thread.interrupt(), but some of our dependencies don't bother to listen for that. So, you're pretty much left with a separate process as the only robust solution.
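To illustrate the "ask nicely" point: interrupt-based cancellation only works when the code being cancelled cooperates by checking its interrupt flag. A minimal standalone sketch (not Tika code; the class and method names are made up for illustration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class InterruptDemo {
    // Returns true if the worker observed the interrupt and stopped.
    static boolean stopCooperativeWorker() throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> f = pool.submit(() -> {
            // A well-behaved "parser": it polls its interrupt flag.
            while (!Thread.currentThread().isInterrupted()) { /* work */ }
        });
        f.cancel(true);          // asks nicely: delivers Thread.interrupt()
        pool.shutdown();
        return pool.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("worker stopped: " + stopCooperativeWorker());
        // If the loop above never checked isInterrupted() -- as some
        // parser dependencies don't -- cancel(true) would be a no-op and
        // the thread would spin forever; only killing the whole process
        // (not the thread) gets you out.
    }
}
```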
So, we did the parent-child process thing for directory -> directory processing in tika-app via tika-batch. The next step is to harden tika-server and to kick that off in a child process in a similar way.

For those who want to test their Tika harnesses (whether on a single box, Hadoop/Spark, etc.), we added a MockParser that will do whatever you tell it when it hits an "application/xml+mock" file. Full set of options:

<mock>
  <!-- action can be "add" or "set" -->
  <metadata action="add" name="author">Nikolai Lobachevsky</metadata>
  <!-- element is the name of the SAX event to write, p=paragraph;
       if the element is not specified, the default is <p> -->
  <write element="p">some content</write>
  <!-- write something to System.out -->
  <print_out>writing to System.out</print_out>
  <!-- write something to System.err -->
  <print_err>writing to System.err</print_err>
  <!-- hang
       millis: how many milliseconds to pause. The actual hang time will
               probably be a bit longer than the value specified.
       heavy: whether or not the hang should do something computationally
              expensive. If the value is false, this just does a
              Thread.sleep(millis). This attribute is optional, with a
              default of heavy=false.
       pulse_millis: (required if "heavy" is true) how often to check
                     whether the thread was interrupted or the total hang
                     time exceeded millis.
       interruptible: whether or not the parser will check to see if its
                      thread has been interrupted; this attribute is
                      optional, with a default of true. -->
  <hang millis="100" heavy="true" pulse_millis="10" interruptible="true" />
  <!-- throw an exception or error; optionally include a message or not -->
  <throw class="java.io.IOException">not another IOException</throw>
  <!-- trigger a genuine OutOfMemoryError -->
  <oom/>
</mock>

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, February 11, 2016 7:46 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: How is Tika used with Solr

Well, I'd imagine you could spawn threads and monitor/kill them as necessary, although that doesn't deal with OOM errors....

FWIW,
Erick

On Thu, Feb 11, 2016 at 3:08 PM, xavi jmlucjav <jmluc...@gmail.com> wrote:
> For sure, if I need heavy-duty text extraction again, Tika would be
> the obvious choice if it covers dealing with hangs. I never used
> tika-server myself (not sure if it existed at the time), just used Tika
> from my own JVM.
>
> On Thu, Feb 11, 2016 at 8:45 PM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
>> x-post to Tika user's
>>
>> Y and n. If you run tika-app as:
>>
>> java -jar tika-app.jar <input_dir> <output_dir>
>>
>> it runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).
>> This creates a parent and a child process; if the child process notices
>> a hung thread, it dies, and the parent restarts it. Or if your OS gets
>> upset with the child process and kills it out of self-preservation,
>> the parent restarts the child; or if there's an OOM... And you can
>> configure how often the child shuts itself down (with parental
>> restarting) to mitigate memory leaks.
>>
>> So, y, if your use case allows <input_dir> <output_dir>, then we now
>> have that in Tika.
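The hung-thread detection described above can be approximated in user code by bounding each parse with a `Future` timeout. A sketch with a stand-in task, not tika-batch's actual implementation:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseWatchdog {
    // Bound a single "parse" with a timeout; returns null on a hang.
    // parseTask is a stand-in for a real Tika parse call.
    static String parseWithTimeout(Callable<String> parseTask, long timeoutMs)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> f = pool.submit(parseTask);
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null;          // hung document: give up on it
        } finally {
            pool.shutdownNow();   // interrupts the worker -- which only
                                  // helps if the parser is interruptible
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseWithTimeout(() -> "extracted text", 1_000));
        System.out.println(parseWithTimeout(() -> {
            Thread.sleep(60_000); // simulate a permanently hung parse
            return "never";
        }, 100));
    }
}
```

Note that the timeout path only frees the *caller*; if the worker thread ignores the interrupt, it leaks, which is the motivation for the child-process design.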
>>
>> I've been wanting to add a similar watchdog to tika-server ... any
>> interest in that?
>>
>>
>> -----Original Message-----
>> From: xavi jmlucjav [mailto:jmluc...@gmail.com]
>> Sent: Thursday, February 11, 2016 2:16 PM
>> To: solr-user <solr-user@lucene.apache.org>
>> Subject: Re: How is Tika used with Solr
>>
>> I have found that when you deal with large amounts of all sorts of
>> files, in the end you find stuff (PDFs are typically nasty) that will
>> hang Tika. That is even worse than a crash or OOM.
>> We used Aperture instead of Tika because at the time it provided a
>> watchdog feature to kill what seemed like a hung extracting thread.
>> That feature is super important for a robust text-extracting
>> pipeline. Has Tika gained such a feature already?
>>
>> xavier
>>
>> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson
>> <erickerick...@gmail.com> wrote:
>>
>> > Timothy's points are absolutely spot-on. In production scenarios,
>> > if you use the simple "run Tika in a SolrJ program" approach you
>> > _must_ abort the program on OOM errors and the like and figure out
>> > what's going on with the offending document(s). Or record the name
>> > somewhere and skip it next time 'round. Or........
>> >
>> > How much you have to build in here really depends on your use case.
>> > For "small enough" sets of documents or one-time indexing, you can
>> > get by with dealing with errors one at a time.
>> > For robust systems where you have to have indexing available at all
>> > times and _especially_ where you don't control the document corpus,
>> > you have to build something far more tolerant, as per Tim's comments.
>> >
>> > FWIW,
>> > Erick
>> >
>> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
>> > <talli...@mitre.org> wrote:
>> > > I completely agree on the impulse, and for the vast majority of
>> > > the time (regular catchable exceptions), that'll work.
>> > > And, by vast majority, aside from OOM on very large files, we
>> > > aren't seeing these problems any more in our 3 million doc corpus
>> > > (y, I know, small by today's standards) from govdocs1 and Common
>> > > Crawl over on our Rackspace VM.
>> > >
>> > > Given my focus on Tika, I'm overly sensitive to the worst-case
>> > > scenarios. I find it encouraging, Erick, that you haven't seen
>> > > these types of problems, that users aren't complaining too often
>> > > about catastrophic failures of Tika within Solr Cell, and that
>> > > this thread is not yet swamped with integrators agreeing with
>> > > me. :)
>> > >
>> > > However, because OOM can leave memory in a corrupted state
>> > > (right?), because you can't actually kill a thread for a permanent
>> > > hang, and because Tika is a kitchen sink and we can't prevent
>> > > memory leaks in our dependencies, one needs to be aware that bad
>> > > things can happen...if only very, very rarely. For a fellow
>> > > traveler who has run into these issues on massive data sets, see
>> > > also [0].
>> > >
>> > > Configuring Hadoop to work around these types of problems is not
>> > > too difficult -- it has to be done with some thought, though. On
>> > > conventional single-box setups, the ForkParser within Tika is one
>> > > option, tika-batch is another. Hand-rolling your own parent/child
>> > > process is non-trivial and is not necessary for the vast majority
>> > > of use cases.
>> > >
>> > > [0]
>> > > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
>> > >
>> > > -----Original Message-----
>> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
>> > > Sent: Tuesday, February 09, 2016 10:05 PM
>> > > To: solr-user <solr-user@lucene.apache.org>
>> > > Subject: Re: How is Tika used with Solr
>> > >
>> > > My impulse would be to _not_ run Tika in its own JVM, just catch
>> > > any exceptions in my code and "do the right thing". I'm not sure I
>> > > see any real benefit in yet another JVM.
>> > >
>> > > FWIW,
>> > > Erick
>> > >
>> > > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B.
>> > > <talli...@mitre.org> wrote:
>> > >> I have one answer here [0], but I'd be interested to hear what
>> > >> Solr users/devs/integrators have experienced on this topic.
>> > >>
>> > >> [0]
>> > >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1PR09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlook.com%3E
>> > >>
>> > >> -----Original Message-----
>> > >> From: Steven White [mailto:swhite4...@gmail.com]
>> > >> Sent: Tuesday, February 09, 2016 6:33 PM
>> > >> To: solr-user@lucene.apache.org
>> > >> Subject: Re: How is Tika used with Solr
>> > >>
>> > >> Thank you Erick and Alex.
>> > >>
>> > >> My main question is with a long-running process using Tika in the
>> > >> same JVM as my application. I'm running my file-system crawler in
>> > >> its own JVM (not Solr's). On the Tika mailing list, it was
>> > >> suggested to run Tika's code in its own JVM and invoke it from my
>> > >> file-system crawler using Runtime.getRuntime().exec().
>> > >>
>> > >> I fully understand from Alex's suggestion and the link provided
>> > >> by Erick to use Tika outside Solr. But what about using Tika
>> > >> within the same JVM as my file-system-crawler application, or
>> > >> should I be making a system call to invoke another JAR that runs
>> > >> in its own JVM to extract the raw text? Are there known issues
>> > >> with Tika when used in a long-running process?
>> > >>
>> > >> Steve
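For the separate-JVM route discussed in this thread, the parent side can be as simple as launching the extractor JVM and restarting it on abnormal exit. A hedged sketch using ProcessBuilder rather than Runtime.exec (the class name is invented, and "tika-app.jar", "input", and "output" are placeholder paths, not something this sketch ships with):

```java
import java.io.IOException;

public class ParentWatchdog {
    // Minimal sketch of the parent/child pattern tika-batch uses: run the
    // extractor in a separate JVM and restart it whenever it exits
    // abnormally (the child's own hang/OOM detection is what makes it exit).
    static int runUntilClean(String[] cmd, int maxRestarts)
            throws IOException, InterruptedException {
        int exit = -1;
        for (int i = 0; i <= maxRestarts; i++) {
            Process child = new ProcessBuilder(cmd).inheritIO().start();
            exit = child.waitFor();   // block until the child JVM finishes or dies
            if (exit == 0) break;     // clean finish; otherwise loop = restart
        }
        return exit;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder invocation; adjust paths for a real setup.
        runUntilClean(new String[]{
                "java", "-jar", "tika-app.jar", "input", "output"}, 3);
    }
}
```

The key property is that a hang, OOM, or JVM-level crash in the child can never take the parent (here, the crawler) down with it.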