Ha.  Spoke too soon about this thread not getting swamped.

Will add the dropwizard-tika-server to our wiki page.  Thank you for the link!

As a side note, I'll submit a pull request to update the AbstractTikaResource 
to avoid a potential NPE if the mime type can't be parsed...we just fixed this 
over in our tika-server.
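
For anyone wondering what that kind of guard looks like, here is a minimal sketch in Python. It is not the actual AbstractTikaResource change; the function name safe_mime_type and the application/octet-stream fallback are assumptions made for illustration. The idea is just to return a default instead of failing when the content type is missing or malformed.

```python
from typing import Optional

# Assumed fallback for illustration; not taken from the Tika code base.
DEFAULT_MIME = "application/octet-stream"


def safe_mime_type(content_type: Optional[str]) -> str:
    """Return the bare 'type/subtype' from a Content-Type header value,
    or DEFAULT_MIME when the header is missing or malformed."""
    if not content_type:
        return DEFAULT_MIME
    # Drop parameters such as '; charset=utf-8'.
    base = content_type.split(";", 1)[0].strip().lower()
    major, sep, minor = base.partition("/")
    # Reject values with no slash, an empty half, or embedded spaces.
    if not sep or not major or not minor or " " in base:
        return DEFAULT_MIME
    return base
```

Callers then always get a usable string back and never have to check for a missing value themselves.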

-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Wednesday, February 10, 2016 3:55 AM
To: solr-user@lucene.apache.org
Subject: Re: How is Tika used with Solr

On 09/02/2016 22:49, Alexandre Rafalovitch wrote:
> Solr uses Tika directly. And not in the most efficient way. It is 
> there mostly for convenience rather than performance.
>
> So, for performance, the Solr recommendation is also to run Tika 
> separately and only send Solr the processed documents.

Absolutely. It's entirely possible to kill Tika with a bad PDF or something, 
bringing down your Solr instance.
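
To make the isolation idea concrete, here is a rough sketch in Python of running an extractor as a child process with a timeout, so a hang or crash on one file only costs you that file. The function name extract_isolated and the stand-in commands are hypothetical; in a real setup the command would be whatever launches Tika in its own JVM.

```python
import subprocess
import sys
from typing import List, Optional


def extract_isolated(cmd: List[str], timeout_s: float = 60.0) -> Optional[str]:
    """Run an extractor command in a child process and return its stdout,
    or None if it cannot start, exits non-zero, or exceeds the timeout."""
    try:
        # subprocess.run kills the child if the timeout expires.
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if done.returncode != 0:
        return None
    return done.stdout


if __name__ == "__main__":
    # Stand-in commands for illustration; these are not Tika invocations.
    ok = [sys.executable, "-c", "print('extracted text')"]
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(extract_isolated(ok))
    print(extract_isolated(hung, timeout_s=0.5))
```

The parent process keeps running whatever the child does, which is the whole point of keeping the parser out of the JVM that serves your index.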

Here's something a colleague wrote to wrap Tika in a server; maybe you can use 
it:
https://github.com/mattflax/dropwizard-tika-server

Cheers

Charlie
>
> Regards,
>      Alex.
> ----
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 10 February 2016 at 09:46, Steven White <swhite4...@gmail.com> wrote:
>> Hi folks,
>>
>> I'm writing a file-system crawler that will index files.  The file 
>> system is going to be very busy, and I anticipate on average 10 new 
>> updates per minute.  My application checks for new or updated files once 
>> every minute.  I use Tika to extract the raw text from those files and 
>> send it over to Solr for indexing.  My application will be running 
>> 24x7xN-days.  It will not recycle unless the OS is restarted.
>>
>> Over at Tika mailing list, I was told the following:
>>
>> "As a side note, if you are handling a bunch of files from the wild 
>> in a production environment, I encourage separating Tika into a 
>> separate jvm vs tying it into any post processing – consider 
>> tika-batch and writing separate text files for each file processed 
>> (not so efficient, but exceedingly robust).  If this is demo code or 
>> you know your document set well enough, you should be good to go with 
>> keeping Tika and your postprocessing steps in the same jvm."
>>
>> My question is, how does Solr utilize Tika?  Does it run Tika in its 
>> own JVM as an out-of-process application or does it link with Tika 
>> JARs directly?  If it links in directly, are there known issues with 
>> Solr integrated with Tika because of Tika issues?
>>
>> Thanks
>>
>> Steve


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
