On 09/02/2016 22:49, Alexandre Rafalovitch wrote:
Solr uses Tika directly. And not in the most efficient way. It is
there mostly for convenience rather than performance.
So, for performance, the Solr recommendation is also to run Tika
separately and only send Solr the processed documents.
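
A minimal sketch of the "only send Solr the processed documents" half, in Python. It builds a JSON update request for text that has already been extracted somewhere else. The host, the collection name `files`, and the field names `id` and `content_txt` are assumptions for illustration; none of them come from this thread.

```python
import json
import urllib.request

# Assumed values for illustration: adjust to your own Solr host and collection.
SOLR_UPDATE_URL = "http://localhost:8983/solr/files/update?commitWithin=10000"


def build_update_request(doc_id, text, url=SOLR_UPDATE_URL):
    """Build (but do not send) a Solr JSON update request for one document."""
    # "id" and "content_txt" are hypothetical field names; use your schema's.
    payload = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With this split, Solr only ever sees plain text and field values, so a document that breaks the extractor never reaches the Solr JVM.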
Absolutely. It's entirely possible to kill Tika with a bad PDF or
something, bringing down your Solr instance.
Here's something a colleague wrote to wrap Tika in a server; maybe you
can use it:
https://github.com/mattflax/dropwizard-tika-server
Cheers
Charlie
Regards,
Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/
On 10 February 2016 at 09:46, Steven White <swhite4...@gmail.com> wrote:
Hi folks,
I'm writing a file-system crawler that will index files. The file system
is going to be very busy and I anticipate on average 10 new updates per
minute. My application checks for new or updated files once every minute. I
use Tika to extract the raw text from those files and send it over to Solr
for indexing. My application will be running 24x7xN-days. It will not
recycle unless the OS is restarted.
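
The once-a-minute check described above could be sketched roughly as follows, in Python. This is only an illustration based on file modification times; the function name `changed_since` is hypothetical, and the real crawler may detect changes differently.

```python
import os


def changed_since(root, last_check):
    """Return paths under root whose modification time is later than last_check."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime > last_check:
                    changed.append(path)
            except OSError:
                # File vanished between listing and stat; skip it.
                continue
    return sorted(changed)
```

A polling loop would record the time before each scan, call `changed_since(root, previous_time)`, hand the results to the extractor, and sleep for 60 seconds. Note that a modification-time scan does not notice deleted files.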
Over at the Tika mailing list, I was told the following:
"As a side note, if you are handling a bunch of files from the wild in a
production environment, I encourage separating Tika into a separate jvm vs
tying it into any post processing – consider tika-batch and writing
separate text files for each file processed (not so efficient, but
exceedingly robust). If this is demo code or you know your document set
well enough, you should be good to go with keeping Tika and your
postprocessing steps in the same jvm."
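
One way to picture the "separate JVM" advice quoted above, as a hedged Python sketch: each file is handed to a child process with a timeout, so a crash or hang in the parser costs one file rather than the whole crawler. The helper names are hypothetical, and the jar location `tika-app.jar` is an assumption; `--text` is the Tika app's plain-text output option.

```python
import subprocess


def extract_text(command, timeout=60):
    """Run an extractor command in a child process; return its stdout or None."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Hung parser or missing executable: the caller survives and moves on.
        return None
    if result.returncode != 0:
        # A crash or out-of-memory error in the child is reported as a failed file.
        return None
    return result.stdout.decode("utf-8", errors="replace")


def tika_command(path, jar="tika-app.jar"):
    """Assumed invocation of the Tika app jar in plain-text mode."""
    return ["java", "-jar", jar, "--text", path]
```

Calling `extract_text(tika_command(path))` starts a fresh JVM per file, which is slow but matches the "not so efficient, but exceedingly robust" trade-off in the quote. A long-running Tika server process would cut the startup cost while keeping the isolation.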
My question is, how does Solr utilize Tika? Does it run Tika in its own
JVM as an out-of-process application or does it link with Tika JARs
directly? If it links in directly, are there known issues with Solr
integrated with Tika because of Tika issues?
Thanks
Steve
--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk