There is a low-level memory "leak" (really an unfortunate retention) in Lucene that can cause out-of-memory errors (OOMs) when using the Tika tools on large files such as PDFs. A patch will be in trunk sometime soon.
http://markmail.org/thread/lhr7wodw4ctsekik
https://issues.apache.org/jira/browse/LUCENE-2387

-- 
Lance Norskog
goks...@gmail.com