Thank you Erick and Alex. My main question is about using Tika in a long-running process, in the same JVM as my application. I'm running my file-system-crawler in its own JVM (not Solr's). On the Tika mailing list, it was suggested that I run Tika's code in its own JVM and invoke it from my file-system-crawler using Runtime.getRuntime().exec().
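To make the out-of-process option concrete, here is a minimal sketch of what I have in mind, not a tested implementation. It uses ProcessBuilder instead of Runtime.getRuntime().exec() (my substitution, since it makes output redirection easier), and it assumes the tika-app JAR with its --text option is what gets launched. The class name, method names, and JAR location are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs text extraction in a child JVM so that a parser
// hang or OutOfMemoryError cannot take down the long-running crawler.
final class OutOfProcessExtractor {

    /** Builds the command line for a child JVM running tika-app in plain-text mode. */
    static List<String> tikaCommand(Path tikaAppJar, Path input) {
        // Reuse the java binary of the current JVM so the child matches it.
        String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        return Arrays.asList(javaBin, "-jar", tikaAppJar.toString(), "--text", input.toString());
    }

    /**
     * Runs the command, waits at most timeoutSeconds, and returns its stdout.
     * Throws IOException if the child times out or exits with a non-zero code.
     */
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Send stdout to a temp file so a large document cannot fill the
        // pipe buffer and block the child.
        Path out = Files.createTempFile("extract-", ".txt");
        try {
            Process child = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                child.destroyForcibly(); // kill a hung extraction
                throw new IOException("extraction timed out: " + command);
            }
            if (child.exitValue() != 0) {
                throw new IOException("extraction failed with exit code " + child.exitValue());
            }
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

The crawler would call something like run(tikaCommand(jar, file), 60) for each new or updated file and treat an IOException as "skip this file and log it". The obvious cost is one JVM startup per file.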
I fully understand, from Alex's suggestion and the link Erick provided, that Tika should be used outside Solr. But what about using Tika within the same JVM as my file-system-crawler application? Or should I make a system call to invoke another JAR, running in its own JVM, to extract the raw text? Are there known issues with Tika when it is used in a long-running process?

Steve

On Tue, Feb 9, 2016 at 5:53 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Here's a writeup that should help....
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> On Tue, Feb 9, 2016 at 2:49 PM, Alexandre Rafalovitch
> <arafa...@gmail.com> wrote:
> > Solr uses Tika directly. And not in the most efficient way. It is
> > there mostly for convenience rather than performance.
> >
> > So, for performance, Solr recommendation is also to run Tika
> > separately and only send Solr the processed documents.
> >
> > Regards,
> > Alex.
> > ----
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> > On 10 February 2016 at 09:46, Steven White <swhite4...@gmail.com> wrote:
> >> Hi folks,
> >>
> >> I'm writing a file-system-crawler that will index files. The file system
> >> is going to be very busy and I anticipate on average 10 new updates per
> >> min. My application checks for new or updated files once every 1 min. I
> >> use Tika to extract the raw-text off those files and send them over to Solr
> >> for indexing. My application will be running 24x7xN-days. It will not
> >> recycle unless if the OS is restarted.
> >>
> >> Over at Tika mailing list, I was told the following:
> >>
> >> "As a side note, if you are handling a bunch of files from the wild in a
> >> production environment, I encourage separating Tika into a separate jvm vs
> >> tying it into any post processing – consider tika-batch and writing
> >> separate text files for each file processed (not so efficient, but
> >> exceedingly robust). If this is demo code or you know your document set
> >> well enough, you should be good to go with keeping Tika and your
> >> postprocessing steps in the same jvm."
> >>
> >> My question is, how does Solr utilize Tika? Does it run Tika in its own
> >> JVM as an out-of-process application or does it link with Tika JARs
> >> directly? If it links in directly, are there known issues with Solr
> >> integrated with Tika because of Tika issues?
> >>
> >> Thanks
> >>
> >> Steve