Ha. Spoke too soon about this thread not getting swamped. Will add the dropwizard-tika-server to our wiki page. Thank you for the link!
As a side note, I'll submit a pull request to update the AbstractTikaResource to avoid a potential NPE if the mime type can't be parsed...we just fixed this over in our tika-server.

-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk]
Sent: Wednesday, February 10, 2016 3:55 AM
To: solr-user@lucene.apache.org
Subject: Re: How is Tika used with Solr

On 09/02/2016 22:49, Alexandre Rafalovitch wrote:
> Solr uses Tika directly. And not in the most efficient way. It is
> there mostly for convenience rather than performance.
>
> So, for performance, the Solr recommendation is also to run Tika
> separately and only send Solr the processed documents.

Absolutely. It's entirely possible to kill Tika with a bad PDF or
something, bringing down your Solr instance. Here's something a
colleague wrote to wrap Tika in a server, maybe you can use it:
https://github.com/mattflax/dropwizard-tika-server

Cheers

Charlie

>
> Regards,
> Alex.
> ----
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 10 February 2016 at 09:46, Steven White <swhite4...@gmail.com> wrote:
>> Hi folks,
>>
>> I'm writing a file-system-crawler that will index files. The file
>> system is going to be very busy and I anticipate on average 10 new
>> updates per min. My application checks for new or updated files once
>> every 1 min. I use Tika to extract the raw-text off those files and
>> send them over to Solr for indexing. My application will be running
>> 24x7xN-days. It will not recycle unless the OS is restarted.
>>
>> Over at Tika mailing list, I was told the following:
>>
>> "As a side note, if you are handling a bunch of files from the wild
>> in a production environment, I encourage separating Tika into a
>> separate jvm vs tying it into any post processing – consider
>> tika-batch and writing separate text files for each file processed
>> (not so efficient, but exceedingly robust). If this is demo code or
>> you know your document set well enough, you should be good to go with
>> keeping Tika and your postprocessing steps in the same jvm."
>>
>> My question is, how does Solr utilize Tika? Does it run Tika in its
>> own JVM as an out-of-process application or does it link with Tika
>> JARs directly? If it links in directly, are there known issues with
>> Solr integrated with Tika because of Tika issues?
>>
>> Thanks
>>
>> Steve

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
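
P.S. For anyone landing on this thread later, here is a rough sketch of the "run Tika separately and only send Solr the processed documents" setup discussed above. It is not taken from anyone's production code. It assumes a stock tika-server listening on its default port 9998 (text extraction via PUT to /tika), a Solr core that I have called "mycore" purely for illustration, and a text field named "content_txt" that may or may not exist in your schema. The class and method names are made up for this example, and it needs Java 11+ for java.net.http.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical helper: extract text in a separate Tika JVM, then index it in Solr.
public class TikaThenSolr {

    // Escape a string so it can sit inside a JSON string literal.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build a JSON array holding one document for Solr's update handler.
    // "content_txt" is an assumed field name; adjust to your schema.
    static String toSolrJson(String id, String text) {
        return "[{\"id\":\"" + jsonEscape(id)
             + "\",\"content_txt\":\"" + jsonEscape(text) + "\"}]";
    }

    // Ask the separately running tika-server for the plain text of a file.
    // If a bad document takes the Tika JVM down, this call fails with an
    // IOException, but the caller and Solr keep running.
    static String extractText(HttpClient http, URI tikaBase, Path file)
            throws IOException, InterruptedException {
        HttpRequest req = HttpRequest.newBuilder(tikaBase.resolve("/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() != 200) {
            throw new IOException("Tika returned HTTP " + resp.statusCode());
        }
        return resp.body();
    }

    // Send the already extracted text to Solr as a JSON document.
    static void index(HttpClient http, URI solrUpdate, String id, String text)
            throws IOException, InterruptedException {
        HttpRequest req = HttpRequest.newBuilder(solrUpdate)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(toSolrJson(id, text)))
                .build();
        HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() != 200) {
            throw new IOException("Solr returned HTTP " + resp.statusCode());
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: TikaThenSolr <file>");
            return;
        }
        HttpClient http = HttpClient.newHttpClient();
        URI tika = URI.create("http://localhost:9998");                      // assumed default
        URI solr = URI.create("http://localhost:8983/solr/mycore/update?commit=true"); // "mycore" is hypothetical
        Path file = Path.of(args[0]);
        String text = extractText(http, tika, file);
        index(http, solr, file.toAbsolutePath().toString(), text);
    }
}
```

Committing on every document, as the URL above does, keeps the sketch short but is probably not what you want for a crawler running around the clock; batching documents and committing periodically would likely be kinder to Solr.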