Well, coalesce does require a shuffle and network traffic; however, in most cases it is 
less than with repartition, as it moves the data (through the network) to already 
existing executors.
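For illustration, a minimal Spark (Scala) sketch of the difference; the input path
and the partition counts are just placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("solr-ingest").getOrCreate()

    // Placeholder input; assume the documents were already prepared upstream.
    val df = spark.read.parquet("/path/to/prepared_docs")

    // coalesce merges existing partitions onto already running executors,
    // so it avoids a full shuffle stage when reducing parallelism ...
    val narrow = df.coalesce(8)

    // ... whereas repartition redistributes every row across the network,
    // which is only worth it if you also need to rebalance the data.
    val reshuffled = df.repartition(8)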
However, as you see and as others confirm: for high performance you don't need high 
parallelism on the ingestion side; you can load the data in batches with low 
parallelism. Tuning a few parameters (commit interval, merge segment size) can, but 
only if needed, deliver even more performance. If you then still need more 
performance, you can increase the number of Solr nodes and shards.
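Continuing the sketch above, a batched, tuned write with the spark-solr DataFrame
writer could look roughly like this; the option names (commit_within, batch_size)
and all values are assumptions to verify against the spark-solr version in use:

    // Narrow the write parallelism, then push to SolrCloud in batches.
    df.coalesce(8).write
      .format("solr")
      .option("zkhost", "zk1:2181,zk2:2181/solr")   // placeholder ZooKeeper ensemble
      .option("collection", "mycollection")          // placeholder collection name
      .option("commit_within", "60000")              // assumed option: commit interval in ms
      .option("batch_size", "5000")                  // assumed option: docs per update request
      .mode("overwrite")                             // save mode; check what your version expects
      .save()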

> On 23.10.2019 at 22:01, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
> 
> 
>> 
>> Spark-Solr brings additional complexity. You could have too many
>> executors for your Solr instance(s), i.e. too high a parallelism.
> 
> I have been reducing the parallelism of the spark-solr part by a factor of 5. I had 40
> executors loading 4 shards; right now only 8 executors are loading 4 shards.
> As a result, I can see a 10 times update improvement, and I suspect the
> update process had been overwhelmed by Spark.
> 
> I have been able to keep 40 executors for document preprocessing and
> reduce to 8 executors within the same Spark job by using the
> "dataframe.coalesce" feature, which does not shuffle the data at all and
> keeps both the Spark cluster and Solr quiet in terms of network.
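That pattern (wide preprocessing, narrow write) could look roughly like the sketch
below; the input DataFrame, the UDF, and the column names are placeholders:

    import org.apache.spark.sql.functions.{col, udf}

    // "raw" stands for whatever DataFrame the documents are read into.
    // The preprocessing stage keeps the full parallelism of the job (e.g. 40 tasks).
    val clean = udf((s: String) => if (s == null) "" else s.trim.toLowerCase)
    val prepared = raw.withColumn("text_clean", clean(col("text")))

    // Only the Solr write is narrowed, so the indexing pressure stays low.
    prepared.coalesce(8).write.format("solr")
      .option("zkhost", "zk1:2181/solr")        // placeholder
      .option("collection", "mycollection")     // placeholder
      .save()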
> 
> Thanks
> 
>> On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
>> Maybe you need to give more details. I always recommend trying and testing
>> yourself, as you know your own solution best. Depending on your Spark process,
>> atomic updates could be faster.
>> 
>> Spark-Solr brings additional complexity. You could have too many
>> executors for your Solr instance(s), i.e. too high a parallelism.
>> 
>> Probably the most important question is:
>> What performance does your use case need, and what is your current performance?
>> 
>> Once this is clear, further architecture aspects can be derived, such as the
>> number of Spark executors, the number of Solr instances, sharding, replication,
>> commit timing, etc.
>> 
>>>> On 19.10.2019 at 21:52, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
>>> 
>>> Hi community,
>>> 
>>> Any advice to speed up updates?
>>> Is there any advice on commits, memory, docValues, stored fields, or any tips to
>>> speed things up?
>>> 
>>> Thanks
>>> 
>>> 
>>>> On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
>>>> Hi
>>>> 
>>>> I am looking for a way to speed up the update of documents.
>>>> 
>>>> In my context, the update replaces one of the many existing indexed
>>>> fields and keeps the others as is.
>>>> 
>>>> Right now, I am building the whole document, and replacing the existing
>>>> one by id.
>>>> 
>>>> I am wondering if the **atomic update feature** would speed up the process.
>>>> 
>>>> On the one hand, using this feature would save network because only a
>>>> small subset of the document would be sent from the client to the
>>>> server.
>>>> On the other hand, the server will have to collect the values from
>>>> disk and reindex them. In addition, this implies storing the values for
>>>> every field (I am not storing every field) and using more space.
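For illustration, an atomic update with plain SolrJ could look roughly like this;
the ZooKeeper host, the collection, and the field names are placeholders:

    import java.util.{Collections, Optional}
    import org.apache.solr.client.solrj.impl.CloudSolrClient
    import org.apache.solr.common.SolrInputDocument

    // Placeholder ZooKeeper ensemble and collection.
    val client = new CloudSolrClient.Builder(
      Collections.singletonList("zk1:2181"), Optional.of("/solr")).build()
    client.setDefaultCollection("mycollection")

    // Only the id and the modified field travel over the wire; Solr rebuilds
    // the rest of the document from stored / docValues fields on the server.
    val doc = new SolrInputDocument()
    doc.addField("id", "doc-42")
    doc.addField("category", Collections.singletonMap("set", "new-value"))

    client.add(doc, 10000)   // commitWithin 10 seconds
    client.close()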
>>>> 
>>>> Also, I have read that the ConcurrentUpdateSolrServer class might be an
>>>> optimized way of updating documents.
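In recent SolrJ versions that class is called ConcurrentUpdateSolrClient; a minimal
sketch, with the base URL, queue size, and thread count as placeholder values:

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient

    // Buffers documents client-side and streams them to Solr from background
    // threads, which mainly helps a single client sending many small updates.
    val streaming = new ConcurrentUpdateSolrClient.Builder(
        "http://solr1:8983/solr/mycollection")   // placeholder base URL
      .withQueueSize(10000)
      .withThreadCount(4)
      .build()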
>>>> 
>>>> I am using the spark-solr library to deal with SolrCloud. If something
>>>> exists to speed up the process, I would be glad to implement it in that
>>>> library.
>>>> Also, I have split the collection over multiple shards, and I admit this
>>>> speeds up the update process, but who knows?
>>>> 
>>>> Thoughts ?
>>>> 
>>>> -- 
>>>> nicolas
>>>> 
>>> 
>>> -- 
>>> nicolas
>> 
> 
> -- 
> nicolas
