Tim,

In my case, I have to use Tika as follows:

    java -jar tika-app.jar -t <input_file>

I will be invoking the above command from my Java app using
Runtime.getRuntime().exec(). I will capture stdout and stderr to get back
the raw text I need. My app's use case will not allow me to use
<input_dir> <output_dir>; that is out of the question.

Reading your summary, it looks like I won't get this watchdog monitoring
and thus I have to implement my own. Can you confirm?

Thanks

Steve

On Thu, Feb 11, 2016 at 2:45 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> x-post to Tika user's
>
> Y and n. If you run tika app as:
>
> java -jar tika-app.jar <input_dir> <output_dir>
>
> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302). This
> creates a parent and a child process. If the child process notices a hung
> thread, it dies, and the parent restarts it. If your OS gets upset with
> the child process and kills it out of self preservation, the parent
> restarts the child; the same goes if there's an OOM. You can also
> configure how often the child shuts itself down (with parental
> restarting) to mitigate memory leaks.
>
> So, y, if your use case allows <input_dir> <output_dir>, then we now have
> that in Tika.
>
> I've been wanting to add a similar watchdog to tika-server ... any
> interest in that?
>
>
> -----Original Message-----
> From: xavi jmlucjav [mailto:jmluc...@gmail.com]
> Sent: Thursday, February 11, 2016 2:16 PM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: How is Tika used with Solr
>
> I have found that when you deal with large amounts of all sorts of files,
> in the end you find stuff (pdfs are typically nasty) that will hang tika.
> That is even worse than a crash or OOM.
> We used aperture instead of tika because at the time it provided a
> watchdog feature to kill what seemed like a hung extraction thread. That
> feature is super important for a robust text extraction pipeline. Has
> Tika gained such a feature already?
>
> xavier
>
> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Timothy's points are absolutely spot-on. In production scenarios, if
> > you use the simple "run Tika in a SolrJ program" approach you _must_
> > abort the program on OOM errors and the like and figure out what's
> > going on with the offending document(s). Or record the name somewhere
> > and skip it next time 'round. Or........
> >
> > How much you have to build in here really depends on your use case.
> > For "small enough" sets of documents or one-time indexing, you can get
> > by with dealing with errors one at a time.
> > For robust systems where you have to have indexing available at all
> > times and _especially_ where you don't control the document corpus,
> > you have to build something far more tolerant as per Tim's comments.
> >
> > FWIW,
> > Erick
> >
> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
> > <talli...@mitre.org> wrote:
> > > I completely agree on the impulse, and for the vast majority of the
> > > time (regular catchable exceptions), that'll work. And, by vast
> > > majority, aside from OOM on very large files, we aren't seeing these
> > > problems any more in our 3 million doc corpus (y, I know, small by
> > > today's standards) from govdocs1 and Common Crawl over on our
> > > Rackspace vm.
> > >
> > > Given my focus on Tika, I'm overly sensitive to the worst case
> > > scenarios. I find it encouraging, Erick, that you haven't seen these
> > > types of problems, that users aren't complaining too often about
> > > catastrophic failures of Tika within Solr Cell, and that this thread
> > > is not yet swamped with integrators agreeing with me.
> > > :)
> > >
> > > However, because OOM can leave memory in a corrupted state (right?),
> > > because you can't actually kill a thread for a permanent hang and
> > > because Tika is a kitchen sink and we can't prevent memory leaks in
> > > our dependencies, one needs to be aware that bad things can
> > > happen...if only very, very rarely. For a fellow traveler who has run
> > > into these issues on massive data sets, see also [0].
> > >
> > > Configuring Hadoop to work around these types of problems is not too
> > > difficult -- it has to be done with some thought, though. On
> > > conventional single box setups, the ForkParser within Tika is one
> > > option, tika-batch is another. Hand rolling your own parent/child
> > > process is non-trivial and is not necessary for the vast majority of
> > > use cases.
> > >
> > > [0]
> > > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > >
> > > -----Original Message-----
> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > Sent: Tuesday, February 09, 2016 10:05 PM
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Re: How is Tika used with Solr
> > >
> > > My impulse would be to _not_ run Tika in its own JVM, just catch any
> > > exceptions in my code and "do the right thing". I'm not sure I see
> > > any real benefit in yet another JVM.
> > >
> > > FWIW,
> > > Erick
> > >
> > > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B.
> > > <talli...@mitre.org> wrote:
> > >> I have one answer here [0], but I'd be interested to hear what Solr
> > >> users/devs/integrators have experienced on this topic.
> > >>
> > >> [0]
> > >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1PR09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlook.com%3E
> > >>
> > >> -----Original Message-----
> > >> From: Steven White [mailto:swhite4...@gmail.com]
> > >> Sent: Tuesday, February 09, 2016 6:33 PM
> > >> To: solr-user@lucene.apache.org
> > >> Subject: Re: How is Tika used with Solr
> > >>
> > >> Thank you Erick and Alex.
> > >>
> > >> My main question is with a long running process using Tika in the
> > >> same JVM as my application. I'm running my file-system-crawler in
> > >> its own JVM (not Solr's). On the Tika mailing list, it is suggested
> > >> to run Tika's code in its own JVM and invoke it from my
> > >> file-system-crawler using Runtime.getRuntime().exec().
> > >>
> > >> I fully understand from Alex's suggestion and the link provided by
> > >> Erick to use Tika outside Solr. But what about using Tika within the
> > >> same JVM as my file-system-crawler application, or should I be
> > >> making a system call to invoke another JAR that runs in its own JVM
> > >> to extract the raw text? Are there known issues with Tika when used
> > >> in a long running process?
> > >>
> > >> Steve
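
For the single-file invocation discussed at the top of this thread
(java -jar tika-app.jar -t <input_file>), a hand-rolled watchdog might look
roughly like the sketch below. It is an illustration under stated
assumptions, not anything shipped with Tika: the class name TikaWatchdog,
the Result holder, the 60-second limit and the relative path to
tika-app.jar are all hypothetical, and decoding the output as UTF-8 is an
assumption about the child process. It uses ProcessBuilder instead of
Runtime.getRuntime().exec() so that stdout and stderr can be redirected to
temporary files, which avoids blocking on a full pipe buffer, and it kills
the child if it has not exited within the time limit.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a command in a child process with a time limit.
public class TikaWatchdog {

    /** Outcome of one child-process run. */
    public static final class Result {
        public final boolean timedOut;
        public final int exitCode; // -1 when the child was killed on timeout
        public final String stdout;
        public final String stderr;

        Result(boolean timedOut, int exitCode, String stdout, String stderr) {
            this.timedOut = timedOut;
            this.exitCode = exitCode;
            this.stdout = stdout;
            this.stderr = stderr;
        }
    }

    public static Result run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Redirect both streams to temp files so the child never blocks
        // on a full pipe while the parent is waiting for it.
        Path out = Files.createTempFile("tika-out", ".txt");
        Path err = Files.createTempFile("tika-err", ".txt");
        try {
            Process child = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(err.toFile())
                    .start();

            // The watchdog: wait up to the limit, then kill the child.
            boolean finished = child.waitFor(timeoutSeconds, TimeUnit.SECONDS);
            if (!finished) {
                child.destroyForcibly();
                child.waitFor(); // wait until the child is really gone
            }

            // Assumption: the child writes UTF-8; adjust to your setup.
            String stdout = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
            String stderr = new String(Files.readAllBytes(err), StandardCharsets.UTF_8);
            int exitCode = finished ? child.exitValue() : -1;
            return new Result(!finished, exitCode, stdout, stderr);
        } finally {
            Files.deleteIfExists(out);
            Files.deleteIfExists(err);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TikaWatchdog <input_file>");
            return;
        }
        // Assumes tika-app.jar is in the working directory; 60 s is arbitrary.
        Result r = run(Arrays.asList("java", "-jar", "tika-app.jar", "-t", args[0]), 60);
        if (r.timedOut) {
            System.err.println("Extraction timed out, skipping: " + args[0]);
        } else if (r.exitCode != 0) {
            System.err.println("Extraction failed (" + r.exitCode + "): " + r.stderr);
        } else {
            System.out.print(r.stdout);
        }
    }
}
```

A run that times out or exits non-zero can be logged and the file skipped
on the next pass, along the lines Erick suggests above. Because each
document gets a fresh JVM, an OOM or a memory leak in the child does not
carry over to the next document, at the cost of JVM start-up time per file.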