Thank you Erick and Alex.

My main question is about using Tika in a long-running process, in the same
JVM as my application.  I'm running my file-system-crawler in its own JVM (not
Solr's).  On the Tika mailing list, it was suggested that I run Tika's code in
its own JVM and invoke it from my file-system-crawler using
Runtime.getRuntime().exec().

I fully understand, from Alex's suggestion and the link Erick provided, that I
should use Tika outside Solr.  But what about using Tika within the same JVM as
my file-system-crawler application?  Or should I be making a system call to
invoke another JAR that runs in its own JVM to extract the raw text?  Are
there known issues with Tika when used in a long-running process?
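
To make the second option concrete, here is a rough sketch of the
out-of-process call I have in mind.  It is only an illustration under
assumptions: the class and method names are made up, the JAR path is a
placeholder, and it assumes the tika-app JAR with its --text and --encoding
options.  I used ProcessBuilder rather than Runtime.getRuntime().exec() so the
child's output can be redirected to a temp file and the child killed on a
timeout.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika or Solr.
class OutOfProcessTika {

    // Builds the child-JVM command line. tikaAppJar is a placeholder path
    // to a tika-app JAR; --text requests plain-text output.
    static List<String> buildCommand(String tikaAppJar, String inputFile) {
        return Arrays.asList("java", "-jar", tikaAppJar,
                "--text", "--encoding=UTF-8", inputFile);
    }

    // Runs the command in a separate process and returns what it wrote to
    // stdout. The child is killed if it runs longer than timeoutSeconds, so
    // a hung or runaway parse cannot take the crawler down with it.
    static String extract(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        File out = File.createTempFile("tika-out", ".txt");
        try {
            Process p = new ProcessBuilder(command)
                    .redirectOutput(out)  // avoids blocking on a full pipe
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                throw new IOException("Timed out: " + command);
            }
            if (p.exitValue() != 0) {
                throw new IOException("Exit code " + p.exitValue()
                        + " from " + command);
            }
            return new String(Files.readAllBytes(out.toPath()),
                    StandardCharsets.UTF_8);
        } finally {
            out.delete();
        }
    }
}
```

The crawler would then call something like
OutOfProcessTika.extract(OutOfProcessTika.buildCommand(jar, file), 60) per
file, at the cost of starting a JVM each time.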

Steve


On Tue, Feb 9, 2016 at 5:53 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Here's a writeup that should help....
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> On Tue, Feb 9, 2016 at 2:49 PM, Alexandre Rafalovitch
> <arafa...@gmail.com> wrote:
> > Solr uses Tika directly. And not in the most efficient way. It is
> > there mostly for convenience rather than performance.
> >
> > So, for performance, Solr recommendation is also to run Tika
> > separately and only send Solr the processed documents.
> >
> > Regards,
> >     Alex.
> > ----
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> > On 10 February 2016 at 09:46, Steven White <swhite4...@gmail.com> wrote:
> >> Hi folks,
> >>
> >> I'm writing a file-system-crawler that will index files.  The file system
> >> is going to be very busy and I anticipate on average 10 new updates per
> >> min.  My application checks for new or updated files once every 1 min.  I
> >> use Tika to extract the raw-text off those files and send them over to Solr
> >> for indexing.  My application will be running 24x7xN-days.  It will not
> >> recycle unless the OS is restarted.
> >>
> >> Over at Tika mailing list, I was told the following:
> >>
> >> "As a side note, if you are handling a bunch of files from the wild in a
> >> production environment, I encourage separating Tika into a separate jvm vs
> >> tying it into any post processing – consider tika-batch and writing
> >> separate text files for each file processed (not so efficient, but
> >> exceedingly robust).  If this is demo code or you know your document set
> >> well enough, you should be good to go with keeping Tika and your
> >> postprocessing steps in the same jvm."
> >>
> >> My question is, how does Solr utilize Tika?  Does it run Tika in its own
> >> JVM as an out-of-process application or does it link with Tika JARs
> >> directly?  If it links in directly, are there known issues with Solr
> >> integrated with Tika because of Tika issues?
> >>
> >> Thanks
> >>
> >> Steve
>
