Indexing over a WAN will be slow, limited by the bandwidth of the pipe. I think you will be better served by moving the data in bulk to the same LAN as your target Solr instances. I would suggest ZIP+scp ... or your favorite file system replication/synchronization tool.
It's true that if you are using blocking I/O over a high-latency link, a few threads will let you make use of all the available bandwidth. Typically, though, it takes very few threads to keep the pipe full, and beyond that point more threads do no good. This is the general sort of thing that scp (or your favorite tool) will handle for you. No need to roll your own.

Further, I don't think threading in the client buys you all that much compared to bulk updates. If you load 1000 documents at a time using SolrJ, it will do a good job of spreading out the load over the shards.

If you find it takes a bit of time to build each update request document (with no indexing happening meanwhile), then you might prepare these in a background thread and place them into a request queue. That way, the foreground thread is always fetching the next request, sending it, or waiting for a response. The synchronization cost on a request queue will be negligible. If you find the foreground thread is waiting too much, make the batch size bigger. If you find the queue length growing too large, put the background thread to sleep until the queue length drops down to a reasonable length.

All of this complexity may buy you a few % improvement in indexing speed. Probably not worth the development cost ...

-----Original Message-----
From: Ali Nazemian [mailto:alinazem...@gmail.com]
Sent: Wednesday, July 22, 2015 2:21 AM
To: solr-user@lucene.apache.org
Subject: Optimizing Solr indexing over WAN

Dears,

Hi, I know that there are lots of tips about how to make Solr indexing faster. Probably some of the most important ones on the client side are batch indexing and multi-threaded indexing. There are other important server-side factors which I don't want to mention here. Anyway, my question is: is there any best practice for the number of client threads and the batch size over a WAN?
Since the client and servers are connected over a WAN, some of the performance conditions, such as network latency and bandwidth, are probably different from those on a LAN. Another thing that matters to me is that document sizes may differ across scenarios. For example, when you index web pages, the document size might range from 1KB to 200KB. In such a case, choosing the batch size by number of documents is probably not the best way to optimize indexing performance. Choosing it by the size of the batch in KB/MB would probably be better from the network point of view. However, from the Solr side, the number of documents matters.

To summarize, here is what I am looking for:

1- Is there any best practice for Solr client-side performance tuning over a WAN for the purpose of indexing/reindexing/updating? Does it differ from a LAN?
2- Which one matters: the number of documents or the total size of documents in a batch?

Best regards.
--
A.Nazemian
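For anyone who still wants to try the background-thread arrangement from the reply, together with the size-based batching raised in the question, here is a minimal sketch in Python. Everything named in it is an assumption for illustration: `batches`, `index_all`, the default caps, and especially `send`, which stands in for whatever call actually issues the update request (for example a bulk add through SolrJ) and is not a real Solr API. Documents are assumed to be already-serialized bytes, so `len` gives their payload size. A bounded queue is used in place of explicitly putting the background thread to sleep: its `put()` simply blocks while the queue is full.

```python
import queue
import threading


def batches(docs, max_docs=1000, max_bytes=1_000_000, size_of=len):
    """Group docs into batches capped by document count and by total size.

    Assumes each doc is a bytes payload, so size_of=len measures bytes.
    A single doc larger than max_bytes is emitted as a batch of its own.
    """
    batch, batch_bytes = [], 0
    for doc in docs:
        n = size_of(doc)
        # Flush first if adding this doc would break either cap.
        if batch and (len(batch) >= max_docs or batch_bytes + n > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += n
    if batch:
        yield batch


_DONE = object()  # sentinel telling the sender that no more batches will come


def index_all(docs, send, max_docs=1000, max_bytes=1_000_000, queue_depth=4):
    """Build batches in a background thread; send them from the calling thread.

    `send` is a hypothetical callable that performs one update request.
    Returns the number of documents handed to `send`.
    """
    pending = queue.Queue(maxsize=queue_depth)  # bounded: put() blocks when full

    def producer():
        try:
            for batch in batches(docs, max_docs, max_bytes):
                pending.put(batch)
        finally:
            pending.put(_DONE)  # always release the sender, even on error

    threading.Thread(target=producer, daemon=True).start()

    sent = 0
    while True:
        batch = pending.get()  # foreground thread waits only if nothing is ready
        if batch is _DONE:
            break
        send(batch)
        sent += len(batch)
    return sent
```

Usage would look like `index_all(serialized_docs, send=post_to_solr)`, where `post_to_solr` is your own function. If the sending thread spends most of its time blocked in `pending.get()`, batch building is the bottleneck; if it is rarely blocked, raising `max_docs` or `max_bytes` is the simpler knob to turn, as suggested above.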