Hi Parmeshwor,
2 hours for 3 GB of data seems too slow. We scale up to PBs this way:
1) Ignore all commits from clients
via IgnoreCommitOptimizeUpdateProcessorFactory.
2) Heavy processing is done on an external Tika server instead of Solr Cell
with its embedded Tika.
3) Adjust autocommit, sof
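
For point 1, a minimal solrconfig.xml sketch, along the lines of the example in
the Solr Reference Guide. The chain name is arbitrary, and statusCode 200 is an
assumption chosen so that clients sending explicit commits do not see errors:

```xml
<!-- Sketch only: silently drop commit/optimize requests sent by clients. -->
<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <!-- Assumed setting: answer 200 so existing clients keep working. -->
    <int name="statusCode">200</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With client commits ignored, visibility of new documents depends entirely on
the server-side autocommit settings.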
Here’s some sample SolrJ code using Tika outside of Solr’s Extracting Request
Handler, along with some info about why loading Solr with the job of extracting
text is not optimal speed-wise:
https://lucidworks.com/post/indexing-with-solrj/
> On Aug 13, 2019, at 12:15 PM, Jan Høydahl wrote:
>
>
You may want to review
https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-SlowIndexing
for some hints.
Make sure to index with multiple parallel threads. Also remember that using
/extract on the Solr side is resource-intensive and may make your clus
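
A rough sketch of indexing with multiple parallel threads, in plain Python
rather than SolrJ. It assumes the text has already been extracted outside Solr
and that each document is a dict of field values. The names batched,
post_batch and index_parallel, the batch size, the worker count and the URL in
the usage comment are all illustrative, not from this thread:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def post_batch(batch, update_url):
    """POST one batch as a JSON array to a Solr update handler.

    No commit is requested here; that is left to server-side autocommit.
    """
    req = urllib.request.Request(
        update_url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_parallel(docs, send, batch_size=500, workers=4):
    """Send batches through `send` from several threads at once.

    Results come back in batch order. Note that Executor.map submits every
    batch up front, so very large inputs should be fed in chunks.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, batched(docs, batch_size)))


# Usage (hypothetical host and collection name):
# url = "http://localhost:8983/solr/mycollection/update"
# statuses = index_parallel(docs, lambda b: post_batch(b, url))
```

Passing the sender in as a callable keeps the threading separate from the HTTP
call, so the same loop works with a SolrJ-style client or a dry run.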