On 08/02/2018 11:47, Frederik Van Hoyweghen wrote:
> Hey everyone,
> What are your experiences with using Solr's ExtractingRequestHandler
> in production?
> I've been reading some mixed remarks, so I was wondering what your actual
> experiences with it are.
> Personally, I feel that setting up a separate service that is solely
> responsible for parsing file contents with Tika (to be indexed by Solr
> later on in the process) is a safer approach, so we can use whatever Tika
> version we want, along with anything else we might want to add.
Yes, do this. It's entirely possible to bring down Tika with a nasty
PDF, or for the extraction step to consume so many resources that it
impacts your Solr server. Run Tika separately and you can monitor it
and kill it if necessary.
You might like my colleague Matt Pearce's DropWizard wrapper for Tika:
https://github.com/mattflax/dropwizard-tika-server
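
To illustrate the "run it separately and kill it if necessary" idea, here is a minimal Python sketch, not a production setup. It assumes a Tika app jar is available locally (the path `tika-app.jar` and the `-Xmx512m` heap cap are placeholders), and the function name `extract_text` is made up for this example. Each document is parsed in its own child process with a timeout, so a bad file can only take down that child process and never the Solr JVM.

```python
import subprocess

# Assumption: the Tika app jar sits at this (hypothetical) path.
# -Xmx512m caps the heap of the extraction JVM; tune or drop as needed.
TIKA_CMD = ["java", "-Xmx512m", "-jar", "tika-app.jar", "--text"]


def extract_text(path, cmd=TIKA_CMD, timeout_s=30):
    """Run the extractor in its own process.

    Returns the extracted text, or None if the process failed or
    exceeded the timeout.
    """
    try:
        result = subprocess.run(
            cmd + [path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process when the timeout expires.
        return None
    if result.returncode != 0:
        # The extractor crashed or rejected the file; skip this document.
        return None
    return result.stdout
```

The text returned here would then be sent to Solr as an ordinary field value in a normal update request, which keeps the Tika version independent of the one bundled with Solr.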
Cheers
Charlie
> Looking forward to your response!
> Kind regards,
> Frederik
--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk