On 12/8/2017 2:40 AM, Sabeer Hussain wrote:
> I am using Solr 7.1 version and deployed it in standalone mode. I have
> created a scheduler in my application itself to perform delta-import
> operation based on a pre-configured frequency. I have used the following
> lines of code (in java) to invoke delta-import operation
When the language is Java, I would use SolrJ. The code tends to be easier
to write and easier to read than code like yours that uses the HTTP
functionality built into Java. The response objects have a lot of sugar
methods, and the entire response is available as a Java object that's easy
to use in code -- you don't have to worry about parsing the response into
Java objects yourself.

> Now, I want to deploy the application in SolrCloud mode and for each core,
> there will be 2 more replicas.

Most things in SolrCloud should be done at the collection level --
replacing "corename" with "collectionname" in the URL you have in your
code. But DIH (the dataimport handler) is not one of them. Using DIH at
the collection level is possible, but you'll find that the requests are
load-balanced across the cloud, so you are likely to get a status from a
different replica than the one you sent the import to.

So, even though most of the time I would recommend using CloudSolrClient
from SolrJ when running SolrCloud, for the dataimport handler you should
actually use HttpSolrClient.

If the index has only one shard, or you are using the compositeId router
for automatic distribution of data between multiple shards, then running
an import on *ANY* core in the collection will distribute and replicate
data as you would expect across the entire collection. If you're using
the implicit router and there are multiple shards, then things get a lot
trickier, but SolrCloud will still do all the replication for you. I'm
not going to go into detail about shards in this message.

Here's some SolrJ code to start an import and print the response. The
example code includes a possible core name for the "foo" collection in
SolrCloud. A specific core should be used for DIH so that you can be sure
that all requests are sent to the same place. The query I've built in the
example doesn't have all the parameters you included, but you should be
able to see how to add anything you need.
One thing I'm not clear on is whether the distrib=false parameter is
required to disable the load balancing.

import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

/*
 * By using a base URL without a core/collection name, one client
 * object can be used for requests to multiple indexes hosted on the
 * server side.
 */
String baseUrl = "http://host:port/solr";
String coreName = "foo_shard1_replica_n1";
SolrClient client = new HttpSolrClient.Builder(baseUrl).build();

// Build the delta-import request against the /dataimport handler.
SolrQuery startQuery = new SolrQuery();
startQuery.setRequestHandler("/dataimport");
startQuery.set("command", "delta-import");
startQuery.set("clean", "false");

try {
  // Send the request to the specific core and print the raw response.
  QueryResponse response = client.query(coreName, startQuery);
  System.out.println(response.getResponse().toString());
} catch (SolrServerException | IOException e) {
  e.printStackTrace();
}

As I mentioned above, for most types of requests against SolrCloud (other
than DIH), you should use CloudSolrClient, not HttpSolrClient, and send
requests to the collection instead of a specific core. The cloud client
is initialized using ZooKeeper info rather than a URL. It is fully aware
of the entire cloud at all times. For DIH, though, you don't want to send
things to the collection, because of SolrCloud's inherent load balancing.
The difficulties of getting a program to deal with a DIH status response
are a whole separate discussion.

Thanks,
Shawn
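For reference, a minimal sketch of the CloudSolrClient setup I described
for non-DIH requests might look like the following. The ZooKeeper host
string and the collection name "foo" are placeholders for your own
values, and the builder method shown is the one in the SolrJ 7.x line --
check the javadocs for your exact version, since the Builder API has
changed between releases.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudQueryExample {
    public static void main(String[] args) throws Exception {
        // The cloud client is built from ZooKeeper connection info, not a
        // Solr URL; it reads the cluster state from ZK and always knows
        // which nodes host which replicas.  Placeholder hosts/chroot:
        String zkHost = "zkhost1:2181,zkhost2:2181,zkhost3:2181/solr";
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder().withZkHost(zkHost).build()) {
            // Send requests to the collection, not an individual core.
            client.setDefaultCollection("foo");
            SolrQuery query = new SolrQuery("*:*");
            QueryResponse response = client.query(query);
            System.out.println(response.getResults().getNumFound());
        }
    }
}
```

Because the client routes requests itself using the cluster state, you
get automatic load balancing and failover for queries and updates -- the
same behavior that makes it the wrong choice for DIH commands.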