On 2/25/2019 11:15 AM, Ganesh Sethuraman wrote:
We are using Solr Cloud 7.2.1. We are using the Solr CSV update handler to do bulk updates (several million docs) into multiple collections. When we call the CSV update handler from the curl command line (as below), we point at a single Solr server. If that server goes down, this approach fails. Is there any way to send the write to the leader, the way SolrJ does, using simple curl commands?
The SolrJ client named CloudSolrClient is able to do this because it is a full ZooKeeper client that has instant access to the clusterstate maintained by your Solr servers.
To get that capability in any other client would require that the client is aware of the ZooKeeper ensemble in the same way. Curl cannot do this.
In the request below, if SOLR1-SERVER is down for some reason, the request will fail, even though the new leader, say SOLR2-SERVER, is up.

curl 'http://<<SOLR1-SERVER>>:8983/solr/my_collection/update?commit=true' --data-binary @example/exampledocs/books.csv -H 'Content-type:application/csv'

1. I can create a load balancer / ALB in front of Solr, but that still may not identify the leader for efficiency.
A load balancer won't be able to identify the leader unless it is capable of talking to ZooKeeper and knows how Solr represents data in ZK. Have you measured the efficiency improvement that comes from sending to the leader? If that improvement is small, it's probably not worth implementing something that talks to ZooKeeper. I know there are people who achieve very fast indexing rates without trying to send to leaders ... I suspect that the improvement obtained by sending to leaders is relatively small.
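If leader awareness turns out not to matter much, plain failover across nodes may be enough, because any live SolrCloud node forwards an update to the correct shard leader. Below is a minimal sketch of that idea in Python, not something I have run against your cluster. The function names, the node list, and the injectable `send` parameter are all made up for illustration; only the URL shape and the Content-type header come from the curl command above.

```python
import urllib.error
import urllib.request


def _http_post(url, body, timeout):
    """POST a CSV body to a Solr update URL (hypothetical helper)."""
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-type": "application/csv"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()


def post_csv_with_failover(nodes, collection, csv_bytes, send=None, timeout=30):
    """Try each node in order; return the base URL that accepted the update.

    `nodes` is a list of base URLs such as "http://host:8983" (assumed).
    `send` can be replaced for testing; it defaults to a real HTTP POST.
    """
    send = send or _http_post
    errors = []
    for base in nodes:
        url = f"{base}/solr/{collection}/update?commit=true"
        try:
            send(url, csv_bytes, timeout)
            return base
        except urllib.error.HTTPError as exc:
            # A 4xx response means the request itself was rejected,
            # so trying another node will not help.
            if exc.code < 500:
                raise
            errors.append((base, exc))
        except OSError as exc:
            # Connection refused, DNS failure, timeout: try the next node.
            errors.append((base, exc))
    raise RuntimeError(f"all nodes failed: {errors}")
```

This does not find the leader. It only avoids the single point of failure, which is roughly what a load balancer in front of Solr would give you.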
2. I can write a SolrJ client to do the update, but I am not sure whether I will get the efficiency of a bulk update. I am also not sure it will be as simple as curl.
SolrJ is probably more efficient than something like curl, because it uses a compact binary format called javabin for data transfer in both directions. With curl, you would most likely be using a text format like JSON, XML, or CSV.
SolrJ clients are fully thread-safe, which means you can use a single instance to send updates in parallel from multiple threads. That is the best way to achieve good indexing performance with Solr.
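The same "send in parallel" pattern can be sketched without SolrJ. The Python below is only an illustration of the idea, not SolrJ code: it splits one large CSV into chunks that each repeat the header line, then hands the chunks to a thread pool. The names `split_csv`, `index_in_parallel`, and `send_chunk` are hypothetical, and the naive line splitting assumes no quoted field contains a newline.

```python
from concurrent.futures import ThreadPoolExecutor


def split_csv(csv_text, rows_per_chunk):
    """Split CSV text into chunks that each start with the header line.

    Assumes the text is non-empty, the first line is the header, and no
    quoted field contains an embedded newline.
    """
    lines = csv_text.splitlines()
    header, rows = lines[0], lines[1:]
    return [
        "\n".join([header] + rows[i:i + rows_per_chunk]) + "\n"
        for i in range(0, len(rows), rows_per_chunk)
    ]


def index_in_parallel(chunks, send_chunk, workers=4):
    """Call send_chunk on every chunk from a pool of worker threads.

    Results come back in the same order as `chunks`. Any exception raised
    by send_chunk is re-raised here when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send_chunk, chunks))
```

Here `send_chunk` would be whatever function posts one chunk to the CSV update handler. With many small requests you would probably drop commit=true from each one and issue a single commit at the end, but measure before assuming that.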
Thanks, Shawn