Hello - Depending on the size difference between your source data and the indexed
data, you can gzip/bzip2 your source JSON/XML, transfer it over the WAN, and index
it locally. This has been the fastest method in every case we have encountered.
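
For illustration, a minimal Java sketch of the compression step (the file names
are hypothetical; command-line gzip or any stream-copy utility works just as well):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.GZIPOutputStream;

    public class CompressSource {
        public static void main(String[] args) throws IOException {
            // Hypothetical paths; substitute your own export and archive files.
            try (InputStream in = new FileInputStream("docs.json");
                 OutputStream out = new GZIPOutputStream(new FileOutputStream("docs.json.gz"))) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
            // Copy docs.json.gz over the WAN (scp, rsync, etc.), decompress, and index on the LAN.
        }
    }
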
-----Original message-----
> From:Reitzel, Charles <charles.reit...@tiaa-cref.org>
> Sent: Wednesday 22nd July 2015 17:43
> To: solr-user@lucene.apache.org
> Subject: RE: Optimizing Solr indexing over WAN
>
> Indexing over a WAN will be slow, limited by the bandwidth of the pipe.
>
> I think you will be better served to move the data in bulk to the same LAN as
> your target solr instances. I would suggest ZIP+scp ... or your favorite
> file system replication/synchronization tool.
>
> It's true that if you are using blocking I/O over a high-latency WAN link, a few
> threads will let you make use of all the available bandwidth. But typically it
> takes very few threads to keep the pipe full, and beyond that point more threads
> do no good. This is a general sort of thing that scp (or your favorite tool)
> will handle for you. No need to roll your own.
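>
> As a minimal SolrJ sketch of that idea (the constructor form below is from
> SolrJ 5.x, newer versions use a Builder, and the URL, queue size, and thread
> count are placeholders rather than recommendations):
>
>     import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
>     import org.apache.solr.common.SolrInputDocument;
>
>     public class FewThreads {
>         public static void main(String[] args) throws Exception {
>             // A bounded internal queue plus a handful of sender threads is usually
>             // enough to keep a high-latency pipe full; more threads rarely help.
>             try (ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient(
>                     "http://solr.example.com:8983/solr/mycollection", 10000, 4)) {
>                 SolrInputDocument doc = new SolrInputDocument();
>                 doc.addField("id", "1");
>                 client.add(doc);              // buffered; sent by the client's background threads
>                 client.blockUntilFinished();  // wait for queued updates to be flushed
>                 client.commit();
>             }
>         }
>     }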
>
> Further, I don't think threading in the client buys you all that much
> compared to bulk updates. If you load 1000 documents at a time using SolrJ,
> it will do a good job of spreading out the load over the shards.
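>
> For instance, a minimal sketch of that batching pattern (the batch size is
> whatever the caller passes in; nothing here is tuned):
>
>     import java.util.ArrayList;
>     import java.util.List;
>     import org.apache.solr.client.solrj.SolrClient;
>     import org.apache.solr.common.SolrInputDocument;
>
>     public class BatchIndexer {
>         // Send documents in fixed-size batches, e.g. 1000 at a time.
>         public static void index(SolrClient client, Iterable<SolrInputDocument> docs,
>                                  int batchSize) throws Exception {
>             List<SolrInputDocument> batch = new ArrayList<>(batchSize);
>             for (SolrInputDocument doc : docs) {
>                 batch.add(doc);
>                 if (batch.size() == batchSize) {
>                     client.add(batch);
>                     batch.clear();
>                 }
>             }
>             if (!batch.isEmpty()) {
>                 client.add(batch);
>             }
>             client.commit();
>         }
>     }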
>
> If you find it takes a bit of time to build each update request document
> (with no indexing happening meanwhile), then you might prepare these in a
> background thread and place them into a request queue. That way the foreground
> thread is always fetching the next request, sending it, or waiting for a
> response. The synchronization cost on a request queue will be negligible.
> If you find the foreground thread is waiting too much, make the batch size
> bigger. If you find the queue length growing too large, put the background
> thread to sleep until the queue length drops down to a reasonable length.
> All of this complexity may buy you a few % improvement in indexing speed.
> Probably not worth the development cost ...
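>
> A rough sketch of that producer/consumer arrangement (the queue capacity and
> the empty-batch stop signal are illustrative choices, not part of any standard
> recipe):
>
>     import java.util.List;
>     import java.util.concurrent.BlockingQueue;
>     import java.util.concurrent.LinkedBlockingQueue;
>     import org.apache.solr.client.solrj.SolrClient;
>     import org.apache.solr.common.SolrInputDocument;
>
>     public class QueuedIndexer {
>         // Bounded queue: the background builder blocks when it gets too far ahead.
>         private final BlockingQueue<List<SolrInputDocument>> queue = new LinkedBlockingQueue<>(10);
>
>         // Background thread: build a batch, then enqueue it.
>         public void produce(List<SolrInputDocument> batch) throws InterruptedException {
>             queue.put(batch);   // blocks if the queue is full
>         }
>
>         // Foreground thread: always sending a request or waiting for the next one.
>         public void consume(SolrClient client) throws Exception {
>             while (true) {
>                 List<SolrInputDocument> batch = queue.take();   // blocks if the queue is empty
>                 if (batch.isEmpty()) {
>                     break;      // an empty batch is used here as the stop signal
>                 }
>                 client.add(batch);
>             }
>             client.commit();
>         }
>     }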
>
> -----Original Message-----
> From: Ali Nazemian [mailto:alinazem...@gmail.com]
> Sent: Wednesday, July 22, 2015 2:21 AM
> To: solr-user@lucene.apache.org
> Subject: Optimizing Solr indexing over WAN
>
> Dear all,
> I know that there are lots of tips on how to make Solr indexing faster.
> Probably the most important ones on the client side are batch indexing and
> multi-threaded indexing. There are other important server-side factors that I
> don't want to mention here. Anyway, my question is: is there any best practice
> for the number of client threads and the batch size over a WAN?
> Since the client and servers are connected over a WAN, performance conditions
> such as network latency and bandwidth are different from a LAN. Another thing
> that matters to me is that document sizes can vary widely between scenarios.
> For example, when indexing web pages, the size of a document might be anywhere
> from 1 KB to 200 KB. In such a case, choosing the batch size by number of
> documents is probably not the best way to optimize indexing performance.
> Choosing by the total batch size in KB/MB would probably be better from the
> network point of view; however, from the Solr side, the number of documents
> matters.
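>
> One way to express that trade-off, purely for illustration, is to flush a
> batch when either a document-count cap or an approximate byte cap is reached
> (both caps and the size estimate below are assumptions, not recommendations):
>
>     import java.util.ArrayList;
>     import java.util.List;
>     import org.apache.solr.client.solrj.SolrClient;
>     import org.apache.solr.common.SolrInputDocument;
>
>     public class SizeAwareBatcher {
>         // Flush when either cap is hit, so 1 KB pages and 200 KB pages both
>         // end up in reasonably sized requests over the WAN.
>         public static void index(SolrClient client, Iterable<SolrInputDocument> docs,
>                                  int maxDocs, long maxBytes) throws Exception {
>             List<SolrInputDocument> batch = new ArrayList<>();
>             long batchBytes = 0;
>             for (SolrInputDocument doc : docs) {
>                 batch.add(doc);
>                 batchBytes += approximateSize(doc);
>                 if (batch.size() >= maxDocs || batchBytes >= maxBytes) {
>                     client.add(batch);
>                     batch.clear();
>                     batchBytes = 0;
>                 }
>             }
>             if (!batch.isEmpty()) {
>                 client.add(batch);
>             }
>             client.commit();
>         }
>
>         // Crude size estimate based on the string form of each field value.
>         private static long approximateSize(SolrInputDocument doc) {
>             long size = 0;
>             for (String name : doc.getFieldNames()) {
>                 Object value = doc.getFieldValue(name);
>                 size += name.length() + (value == null ? 0 : value.toString().length());
>             }
>             return size;
>         }
>     }
>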
> To summarize, this is what I am looking for:
> 1- Is there any best practice for client-side Solr performance tuning over a
> WAN for the purpose of indexing/reindexing/updating? Does it differ from a LAN?
> 2- Which matters more: the number of documents or the total size of the
> documents in a batch?
>
> Best regards.
>
> --
> A.Nazemian