Hi folks, I'm writing a file-system crawler that will index files. The file system is going to be very busy, and I anticipate on average 10 updates per minute. My application checks for new or updated files once every minute. I use Tika to extract the raw text from those files and send it to Solr for indexing. My application will be running 24x7 for N days. It will not restart unless the OS is restarted.
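To make the polling step concrete, here is a minimal sketch of how a "what changed since the last check" scan might look, assuming plain `java.nio` and last-modified timestamps. The class name `ChangedFileScanner` and the method `changedSince` are hypothetical names used only for illustration, and the Tika extraction and Solr submission are deliberately left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists files modified after a given instant.
final class ChangedFileScanner {

    private ChangedFileScanner() {
    }

    // Walks the tree under root and returns every regular file whose
    // last-modified time is strictly later than 'since'.
    static List<Path> changedSince(Path root, Instant since) throws IOException {
        FileTime threshold = FileTime.from(since);

        // Collect the paths first so the directory stream is closed promptly.
        List<Path> all;
        try (Stream<Path> paths = Files.walk(root)) {
            all = paths.collect(Collectors.toList());
        }

        List<Path> changed = new ArrayList<>();
        for (Path p : all) {
            if (Files.isRegularFile(p)
                    && Files.getLastModifiedTime(p).compareTo(threshold) > 0) {
                changed.add(p);
            }
        }
        return changed;
    }
}
```

A once-a-minute loop would record the time just before each scan, call `changedSince(root, previousScanTime)`, and hand the returned paths to the extraction step. Files modified while a scan is in progress may be reported again on the next pass, so the indexing step should tolerate re-indexing the same file.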
Over on the Tika mailing list, I was told the following:

"As a side note, if you are handling a bunch of files from the wild in a production environment, I encourage separating Tika into a separate jvm vs tying it into any post processing – consider tika-batch and writing separate text files for each file processed (not so efficient, but exceedingly robust). If this is demo code or you know your document set well enough, you should be good to go with keeping Tika and your postprocessing steps in the same jvm."

My question is: how does Solr utilize Tika? Does it run Tika in its own JVM as an out-of-process application, or does it link with the Tika JARs directly? If it links them in directly, are there known issues with Solr caused by problems in Tika?

Thanks

Steve