> I can tell you that Tika is quite the resource hog. It is likely chewing up
> CPU and memory
> resources at an incredible rate, slowing down your Solr server. You
> would probably see better performance than ERH if you incorporate Tika
> and SolrJ into a client indexing program that runs on a different machine
> than Solr.
+1
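To make the quoted suggestion concrete, here's a minimal sketch of a client-side indexer that runs Tika and SolrJ on a separate machine and pushes only the extracted text to Solr. The Solr URL, collection name, and field names ("id", "content", "content_type") are placeholders, and it assumes tika-core, tika-parsers, and solr-solrj are on the classpath.

```java
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ClientSideIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder Solr URL/collection; batches updates over a few threads.
        ConcurrentUpdateSolrClient solr = new ConcurrentUpdateSolrClient.Builder(
                "http://solr-host:8983/solr/docs")
                .withQueueSize(100)
                .withThreadCount(4)
                .build();
        AutoDetectParser parser = new AutoDetectParser();

        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get(args[0]))) {
            for (Path file : files) {
                // -1 disables the default 100k character write limit.
                BodyContentHandler handler = new BodyContentHandler(-1);
                Metadata metadata = new Metadata();
                try (InputStream in = Files.newInputStream(file)) {
                    // The CPU-heavy extraction now happens here, not on the Solr server.
                    parser.parse(in, handler, metadata);
                }
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file.toString());
                doc.addField("content", handler.toString());
                doc.addField("content_type", metadata.get(Metadata.CONTENT_TYPE));
                solr.add(doc);
            }
        }
        solr.commit();
        solr.close();
    }
}
```

With this split, Solr only receives already-extracted text, so extraction load can be scaled out across client machines independently of the Solr server.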
It'd be interesting to see how standalone tika-batch performs:
java -jar tika-app.jar -i <input_dir> -o <output_dir>
and if you're feeling adventurous:
java -jar tika-app.jar -i <input_dir> -o <output_dir> -J -t
You can specify the number of threads with -numConsumers 5 (don't use many more
than the number of CPUs!).
Content extraction with Tika is usually slower (sometimes far slower) than the
indexing step. If you have any crazily slow docs, open an issue on Tika's JIRA.
Cheers,
Tim
-----Original Message-----
From: Zheng Lin Edwin Yeo [mailto:[email protected]]
Sent: Thursday, April 21, 2016 12:13 AM
To: [email protected]
Subject: Re: Overall large size in Solr across collections
Hi Shawn,
Yes, I'm using the Extracting Request Handler.
The 0.7GB/hr figure is the rate at which the original documents are ingested
into Solr: every hour, only 0.7GB of my documents gets indexed, so it would
take 10 hours just to index 7GB of documents.
Regards,
Edwin