Indexing over a WAN will be slow, limited by the bandwidth of the pipe.

I think you will be better served by moving the data in bulk to the same LAN as
your target Solr instances. I would suggest ZIP+scp ... or your favorite
file system replication/synchronization tool.

It's true that if you are using blocking I/O over a high-latency WAN, a few
threads will let you make use of all the available bandwidth. Typically,
though, it takes very few threads to keep the pipe full, and after that point
more threads do no good. This is a general sort of thing that scp (or your
favorite tool) will handle for you. No need to roll your own.

Further, I don't think threading in the client buys you all that much compared 
to bulk updates.  If you load 1000 documents at a time using SolrJ, it will do 
a good job of spreading out the load over the shards.   

If you find it takes a bit of time to build each update request document (with
no indexing happening meanwhile), then you might prepare these in a background
thread and place them in a request queue. The foreground thread is then always
fetching the next request, sending it, or waiting for a response. The
synchronization cost on a request queue will be negligible. If you find the
foreground thread is waiting too much, make the batch size bigger. If you
find the queue growing too long, put the background thread to sleep
until the queue drops back to a reasonable length. All of this
complexity may buy you a few percent improvement in indexing speed. Probably not
worth the development cost ...
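
For what it's worth, here is a minimal sketch of that background-thread
arrangement in Python, just to show the shape of it. It is not SolrJ code:
send_batch is a hypothetical callable standing in for whatever actually posts
the update request, and index_all, build_batches and max_queued are names made
up for the illustration. A bounded queue gives you the "put the background
thread to sleep" behavior for free, since put() blocks while the queue is full.

```python
import queue
import threading

_DONE = object()  # sentinel: the background thread has no more batches


def build_batches(docs, batch_size):
    """Yield lists of at most batch_size documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def index_all(docs, send_batch, batch_size=1000, max_queued=4):
    """Build batches in a background thread and send them from this thread.

    send_batch is a hypothetical stand-in for the real update request;
    it is an assumption of this sketch, not a Solr API.
    """
    pending = queue.Queue(maxsize=max_queued)  # put() blocks when full

    def producer():
        try:
            for batch in build_batches(docs, batch_size):
                pending.put(batch)  # waits here if the sender falls behind
        finally:
            pending.put(_DONE)  # always signal the end, even on error

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    sent = 0
    while True:
        batch = pending.get()  # waits here if batches are slow to build
        if batch is _DONE:
            break
        send_batch(batch)
        sent += len(batch)
    worker.join()
    return sent
```

Calling index_all(my_docs, my_sender) returns the number of documents handed
to the sender. The two knobs correspond to the tuning described above: raise
batch_size if the sending thread spends its time waiting, and keep max_queued
small so prepared batches do not pile up in memory.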

-----Original Message-----
From: Ali Nazemian [mailto:alinazem...@gmail.com] 
Sent: Wednesday, July 22, 2015 2:21 AM
To: solr-user@lucene.apache.org
Subject: Optimizing Solr indexing over WAN

Hi all,
I know that there are lots of tips about how to make Solr indexing faster.
Probably some of the most important ones on the client side are batch
indexing and multi-threaded indexing. There are other important factors on
the server side which I don't want to mention here.
Anyway, my question is: is there any best practice for the number of client
threads and the batch size over a WAN?
Since the client and servers are connected over a WAN, some of the
performance conditions, such as network latency and bandwidth, are probably
different from a LAN. Another thing that matters to me is the fact that
document sizes might differ across scenarios. For example, when you want to
index web pages, the size of a document might be anywhere from 1KB to 200KB.
In such a case, choosing the batch size according to the number of documents
is probably not the best way of optimizing indexing performance.
Choosing the batch size in KB/MB would probably be better from the network
point of view. However, from the Solr side, the number of documents matters.
So, to summarize, here is what I am looking for:
1- Is there any best practice available for Solr client-side performance
tuning over a WAN for the purpose of indexing/reindexing/updating?
Does it differ from a LAN?
2- Which one matters: the number of documents or the total size of the
documents in a batch?

Best regards.

--
A.Nazemian
