Hi folks,

I'm writing a file-system crawler that will index files.  The file system
is going to be very busy, and I anticipate about 10 new or updated files
per minute on average.  My application checks for new or updated files once
a minute.  I use Tika to extract the raw text from those files and send it
to Solr for indexing.  My application will run 24x7 for N days.  It will
not be recycled unless the OS is restarted.
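
To make the setup concrete, here is a rough sketch of the polling step.  It
is illustrative only: the names find_changed_files and poll_forever are
hypothetical, and comparing modification times is an assumption about how
changes get detected.  The Tika extraction and the Solr update are left out
and stand behind the handle callback.

```python
import os
import time


def find_changed_files(root, since):
    """Return paths under root whose modification time is later than `since`."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) > since:
                    changed.append(path)
            except OSError:
                # The file vanished between listing and stat; skip it.
                continue
    return sorted(changed)


def poll_forever(root, handle, interval_seconds=60):
    """Call handle(path) for each new or updated file, once per interval."""
    last_check = 0.0
    while True:
        # Record the time before scanning so files modified during the scan
        # are seen again on the next pass (at-least-once, never missed).
        started = time.time()
        for path in find_changed_files(root, last_check):
            handle(path)  # placeholder: extract text, then send it to the indexer
        last_check = started
        time.sleep(interval_seconds)
```

Because this loop runs for days, anything that leaks or hangs inside handle
stays with the process, which is why the question below matters to me.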

Over on the Tika mailing list, I was told the following:

"As a side note, if you are handling a bunch of files from the wild in a
production environment, I encourage separating Tika into a separate jvm vs
tying it into any post processing – consider tika-batch and writing
separate text files for each file processed (not so efficient, but
exceedingly robust).  If this is demo code or you know your document set
well enough, you should be good to go with keeping Tika and your
postprocessing steps in the same jvm."

My question is: how does Solr use Tika?  Does it run Tika in its own JVM
as an out-of-process application, or does it link against the Tika JARs
directly?  If it links them in directly, are there known issues with Solr's
Tika integration that stem from problems in Tika itself?

Thanks

Steve
