I'd seriously consider a SolrJ program rather than posting files. The post tool is really intended as a simple way to get started; it's not very efficient when it comes to indexing large volumes.
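Here's a minimal sketch of the kind of SolrJ program I mean. The Solr URL, collection name, batch size, and field mapping are all placeholders to adjust for your setup, and I'm assuming Jackson as the JSON parser; any parser works:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonIndexer {

        // Placeholders: point these at your own Solr URL and collection.
        static final String SOLR_URL = "http://localhost:8983/solr/mycollection";
        static final int BATCH_SIZE = 1000;   // my usual starting point

        public static void main(String[] args) throws Exception {
            // args[0] is a directory tree full of .json files.
            List<Path> files;
            try (Stream<Path> s = Files.walk(Paths.get(args[0]))) {
                files = s.filter(p -> p.toString().endsWith(".json"))
                         .collect(Collectors.toList());
            }

            ObjectMapper mapper = new ObjectMapper();
            List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);

            try (SolrClient client = new HttpSolrClient.Builder(SOLR_URL).build()) {
                for (Path p : files) {
                    JsonNode json = mapper.readTree(p.toFile());

                    SolrInputDocument doc = new SolrInputDocument();
                    // Hypothetical mapping: copy each top-level JSON field
                    // into a Solr field of the same name. Adjust to your schema.
                    json.fields().forEachRemaining(f ->
                        doc.addField(f.getKey(), f.getValue().asText()));

                    batch.add(doc);
                    if (batch.size() >= BATCH_SIZE) {
                        client.add(batch);   // one round trip per 1,000 docs
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    client.add(batch);       // flush the final partial batch
                }
                client.commit();
            }
        }
    }

The batch list is just a buffer: client.add(batch) ships 1,000 documents in one request instead of 1,000 separate requests, which is where most of the speedup over post.jar comes from.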
As a comparison point, I index 3-4K docs/second (a Wikipedia dump) on my MacBook Pro. Note that if each of your business flows has that many documents, you're talking 12 billion total; I hope you're sharding!

There's a longer SolrJ writeup here: https://lucidworks.com/2012/02/14/indexing-with-solrj/ Note that you'll pretty much throw out the Tika and RDBMS parts of that article in favor of constructing the SolrInputDocuments by parsing your data with your favorite JSON parser, as in the sketch above.

Then you can run N of these SolrJ programs in parallel, each working on a separate subset of the data, to get your indexing speed up to what you need (there's a sketch of that fan-out after the quoted message below).

95% of the time, slow indexing is because of the ETL pipeline. One key check is the CPU usage on your Solr server: if it isn't running hot, you aren't feeding documents to Solr fast enough. And do batch documents together as in the sketch; I typically start with batches of 1,000 docs.

Best,
Erick

On Tue, May 8, 2018 at 8:25 PM, Raymond Xie <xie3208...@gmail.com> wrote:
> I have a huge number of JSON files to be indexed in Solr. It took 22
> minutes to index 300,000 JSON files generated from a single bz2 file,
> and that is only 0.25% of the total data from the same business flow;
> there are 100+ business flows to be indexed.
>
> I absolutely need a good solution for this. At the moment I run post.jar
> against the folder, in a single thread.
>
> What is the best practice for multi-threaded indexing? Can anyone
> provide a detailed example?
>
> ------------------------------------------------
> Sincerely yours,
>
> Raymond
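Here's the fan-out sketch mentioned above, for the multi-threading question. It's only an outline, with the same placeholder Solr URL; the worker count, queue size, and client thread count all need tuning against your hardware, and args[0] is again a directory of .json files. Each worker takes a disjoint slice of the file list, and the shared ConcurrentUpdateSolrClient batches the adds and streams them to Solr from its own background threads:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class ParallelJsonIndexer {

        // Placeholders: tune the URL and worker count for your setup.
        static final String SOLR_URL = "http://localhost:8983/solr/mycollection";
        static final int WORKERS = 8;

        public static void main(String[] args) throws Exception {
            List<Path> files;
            try (Stream<Path> s = Files.walk(Paths.get(args[0]))) {
                files = s.filter(p -> p.toString().endsWith(".json"))
                         .collect(Collectors.toList());
            }

            // ConcurrentUpdateSolrClient buffers adds in a queue and sends
            // them from its own background threads, so the workers can
            // share one instance and still get batching for free.
            try (ConcurrentUpdateSolrClient client =
                     new ConcurrentUpdateSolrClient.Builder(SOLR_URL)
                         .withQueueSize(10000)
                         .withThreadCount(4)
                         .build()) {

                ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
                for (int w = 0; w < WORKERS; w++) {
                    final int offset = w;
                    pool.submit(() -> {
                        ObjectMapper mapper = new ObjectMapper();
                        // Each worker takes every WORKERS-th file, so the
                        // subsets are disjoint and need no coordination.
                        for (int i = offset; i < files.size(); i += WORKERS) {
                            try {
                                JsonNode json = mapper.readTree(files.get(i).toFile());
                                SolrInputDocument doc = new SolrInputDocument();
                                json.fields().forEachRemaining(f ->
                                    doc.addField(f.getKey(), f.getValue().asText()));
                                client.add(doc);   // the client batches internally
                            } catch (Exception e) {
                                e.printStackTrace();   // real code wants real handling
                            }
                        }
                    });
                }
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.DAYS);
                client.commit();
            }
        }
    }

Raise the worker count until the Solr server's CPU runs hot; if it never does, the feeding side is still the bottleneck. One caveat with ConcurrentUpdateSolrClient: indexing errors surface asynchronously rather than as exceptions from add(), so keep an eye on the logs.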