Hello - Depending on the size difference between your source data and the
indexed data, you can gzip/bzip2 your source JSON/XML, transfer it over the
WAN, and index it locally. This has been the fastest method in every case we
have encountered.
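
If it helps, a minimal Java sketch of the compression step (the file paths
are placeholders; plain gzip/bzip2 from the command line does the same job):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.GZIPOutputStream;

    public class CompressExport {
        public static void main(String[] args) throws Exception {
            // Gzip the source export before shipping it over the WAN;
            // "export.json" is a stand-in for your real source file.
            byte[] buf = new byte[8192];
            try (InputStream in = new FileInputStream("export.json");
                 OutputStream out = new GZIPOutputStream(
                         new FileOutputStream("export.json.gz"))) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
        }
    }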
 
-----Original message-----
> From:Reitzel, Charles <charles.reit...@tiaa-cref.org>
> Sent: Wednesday 22nd July 2015 17:43
> To: solr-user@lucene.apache.org
> Subject: RE: Optimizing Solr indexing over WAN
> 
> Indexing over a WAN will be slow, limited by the bandwidth of the pipe.
> 
> I think you will be better served to move the data in bulk to the same LAN
> as your target Solr instances. I would suggest ZIP+scp ... or your favorite
> file system replication/synchronization tool.
> 
> It's true, if you are using blocking I/O over a high-latency WAN, then a few
> threads will let you make use of all the available bandwidth. But typically
> it takes very few threads to keep the pipe full, and past that point more
> threads do no good. This is the general sort of thing that scp (or your
> favorite tool) will handle for you. No need to roll your own.
> 
> Further, I don't think threading in the client buys you all that much 
> compared to bulk updates.  If you load 1000 documents at a time using SolrJ, 
> it will do a good job of spreading out the load over the shards.   
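> 
> In SolrJ terms, a rough sketch of batched adds might look like this (the
> ZooKeeper address, collection name, and field names are made-up
> placeholders, not anything from your setup):
> 
>     import java.util.ArrayList;
>     import java.util.List;
>     import org.apache.solr.client.solrj.impl.CloudSolrClient;
>     import org.apache.solr.common.SolrInputDocument;
> 
>     public class BatchIndexer {
>         public static void main(String[] args) throws Exception {
>             try (CloudSolrClient solr = new CloudSolrClient("zk-host:2181")) {
>                 solr.setDefaultCollection("webpages");
>                 List<SolrInputDocument> batch = new ArrayList<>();
>                 for (int i = 0; i < 100000; i++) {
>                     SolrInputDocument doc = new SolrInputDocument();
>                     doc.addField("id", Integer.toString(i));
>                     doc.addField("body_t", "document body " + i);
>                     batch.add(doc);
>                     if (batch.size() == 1000) {
>                         solr.add(batch);  // one request per 1000 docs,
>                         batch.clear();    // routed to the shard leaders
>                     }
>                 }
>                 if (!batch.isEmpty()) solr.add(batch);
>                 solr.commit();
>             }
>         }
>     }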
> 
> If you find it takes a bit of time to build each update request document
> (with no indexing happening meanwhile), then you might prepare these in a
> background thread and place them into a request queue. That way the
> foreground thread is always either sending the next request or waiting for a
> response. The synchronization cost on a request queue will be negligible.
> If you find the foreground thread is waiting too much, make the batch size
> bigger. If you find the queue growing too long, put the background thread to
> sleep until the queue drains to a reasonable length. All of this complexity
> may buy you a few % improvement in indexing speed. Probably not worth the
> development cost ...
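> 
> For what it's worth, a rough sketch of that producer/consumer queue, with
> the same placeholder names as above. A bounded queue gives you the "sleep
> until it drains" behavior for free, because put() blocks the background
> thread whenever the queue is full:
> 
>     import java.util.ArrayList;
>     import java.util.Collections;
>     import java.util.List;
>     import java.util.concurrent.ArrayBlockingQueue;
>     import java.util.concurrent.BlockingQueue;
>     import org.apache.solr.client.solrj.impl.CloudSolrClient;
>     import org.apache.solr.common.SolrInputDocument;
> 
>     public class QueuedIndexer {
>         public static void main(String[] args) throws Exception {
>             // At most 10 prepared batches wait in memory at any time.
>             BlockingQueue<List<SolrInputDocument>> queue =
>                     new ArrayBlockingQueue<>(10);
> 
>             // Background thread: builds batches (the slow part).
>             Thread producer = new Thread(() -> {
>                 try {
>                     for (int b = 0; b < 100; b++) {
>                         List<SolrInputDocument> batch = new ArrayList<>();
>                         for (int i = 0; i < 1000; i++) {
>                             SolrInputDocument doc = new SolrInputDocument();
>                             doc.addField("id", b + "-" + i);
>                             batch.add(doc);
>                         }
>                         queue.put(batch);  // blocks while the queue is full
>                     }
>                     // An empty list marks end-of-stream.
>                     queue.put(Collections.<SolrInputDocument>emptyList());
>                 } catch (InterruptedException e) {
>                     Thread.currentThread().interrupt();
>                 }
>             });
>             producer.start();
> 
>             // Foreground thread: always sending or awaiting a response.
>             try (CloudSolrClient solr = new CloudSolrClient("zk-host:2181")) {
>                 solr.setDefaultCollection("webpages");
>                 List<SolrInputDocument> batch;
>                 while (!(batch = queue.take()).isEmpty()) {
>                     solr.add(batch);
>                 }
>                 solr.commit();
>             }
>         }
>     }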
> 
> -----Original Message-----
> From: Ali Nazemian [mailto:alinazem...@gmail.com] 
> Sent: Wednesday, July 22, 2015 2:21 AM
> To: solr-user@lucene.apache.org
> Subject: Optimizing Solr indexing over WAN
> 
> Dear all,
> I know there are lots of tips on how to make Solr indexing faster. Probably
> the most important client-side ones are batch indexing and multi-threaded
> indexing. There are other important server-side factors that I don't want to
> mention here. Anyway, my question is: is there any best practice for the
> number of client threads and the batch size over a WAN?
> Since the client and servers are connected over a WAN, performance
> conditions such as network latency, bandwidth, etc. are probably different
> from a LAN. Another thing that matters to me is that document sizes can
> differ widely across scenarios. For example, when indexing web pages the
> document size might range from 1KB to 200KB. In such a case, choosing the
> batch size by number of documents is probably not the best way to optimize
> indexing performance. Choosing the batch size in KB/MB would probably be
> better from the network point of view. However, from the Solr side, the
> number of documents matters.
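> 
> (To make the idea concrete, a tiny sketch of byte-capped batching, with
> made-up names and thresholds: flush whenever either the document count or
> the estimated batch size crosses a cap.)
> 
>     import java.util.ArrayList;
>     import java.util.List;
>     import org.apache.solr.client.solrj.impl.CloudSolrClient;
>     import org.apache.solr.common.SolrInputDocument;
> 
>     public class SizeCappedBatcher {
>         public static void main(String[] args) throws Exception {
>             try (CloudSolrClient solr = new CloudSolrClient("zk-host:2181")) {
>                 solr.setDefaultCollection("webpages");
>                 List<SolrInputDocument> batch = new ArrayList<>();
>                 long batchBytes = 0;
>                 for (int i = 0; i < 100000; i++) {
>                     String body = fetchPage(i);  // hypothetical source
>                     SolrInputDocument doc = new SolrInputDocument();
>                     doc.addField("id", Integer.toString(i));
>                     doc.addField("body_t", body);
>                     batch.add(doc);
>                     batchBytes += body.length();  // rough size estimate
>                     // Flush on whichever cap is hit first.
>                     if (batch.size() >= 1000 || batchBytes >= 5000000) {
>                         solr.add(batch);
>                         batch.clear();
>                         batchBytes = 0;
>                     }
>                 }
>                 if (!batch.isEmpty()) solr.add(batch);
>                 solr.commit();
>             }
>         }
> 
>         private static String fetchPage(int i) {
>             return "page body " + i;  // stand-in for real page content
>         }
>     }
> 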
> To summarize, what I am looking for is:
> 1- Is there any best practice for Solr client-side performance tuning over a
> WAN for indexing/reindexing/updating? Is it different from a LAN?
> 2- Which matters more: the number of documents in a batch, or their total
> size?
> 
> Best regards.
> 
> --
> A.Nazemian
> 
