I'd suggest using CloudSolrClient. It uses LBHttpSolrClient under the hood
and is ZooKeeper-aware, so it routes documents from the client straight to
the right Solr nodes, saving you an extra hop.
Another thing to remember here is to create the Solr client once and reuse
it, as it is thread-safe.

Reading up about commits would also be useful and this blog by Erick
Erickson is a good place to learn about that:
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

In terms of running SolrJ on each node, you could just run a single
multi-threaded indexer that gets data from your database and injects it
into Solr. This process would run outside of Solr and could potentially run
anywhere.
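
A minimal sketch of such an indexer, using only the JDK so it stays
self-contained. BatchSink, ParallelIndexer, and indexAll are hypothetical names,
not SolrJ API; in a real indexer the sink would presumably wrap one shared
CloudSolrClient and pass each batch to its add() method, and the rows would be
SolrInputDocument objects built from the database result set rather than strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the single shared, thread-safe Solr client.
interface BatchSink {
    void add(List<String> batch) throws Exception;
}

class ParallelIndexer {
    // Splits the rows into batches and sends them from a fixed pool of worker
    // threads. Every worker writes through the same sink instance.
    // Returns the number of rows whose batch was sent without an exception.
    static int indexAll(List<String> rows, int batchSize, int threads, BatchSink sink) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger sent = new AtomicInteger();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            // Copy the slice so each task owns its batch.
            List<String> batch = new ArrayList<>(rows.subList(start, end));
            pool.submit(() -> {
                try {
                    sink.add(batch);
                    sent.addAndGet(batch.size());
                } catch (Exception e) {
                    // A real indexer would log this and retry or record the batch.
                    System.err.println("batch failed: " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        try {
            // Wait for the queued batches to finish before reporting the count.
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sent.get();
    }
}
```

Batch size and thread count are tuning knobs to experiment with; the values
that work best depend on the document size and on the cluster.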

As far as routing goes, I suggest you just try the default compositeId
router unless you hit issues there. If you do, you can read up on how
routing in SolrCloud works here:
https://lucidworks.com/blog/2013/06/13/solr-cloud-document-routing/

and also about advanced concepts here:
https://lucidworks.com/blog/2014/01/06/multi-level-composite-id-routing-solrcloud/
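
For illustration, here is a small sketch of the ID convention the compositeId
router relies on. An ID without a "!" is hashed as a whole, which spreads
documents evenly across shards. An ID of the form "shardKey!docId" uses the
prefix to pick the shard, so documents sharing a prefix land together. The
class and method names are hypothetical helpers, not part of SolrJ.

```java
class CompositeIds {
    // Builds an ID of the form "shardKey!docId". With the compositeId router,
    // documents that share the same shardKey are placed on the same shard.
    static String compositeId(String shardKey, String docId) {
        if (shardKey.isEmpty() || shardKey.contains("!")) {
            throw new IllegalArgumentException("shardKey must be non-empty and must not contain '!'");
        }
        return shardKey + "!" + docId;
    }
}
```

If there is no natural routing key, using the plain document ID with no prefix
keeps the even distribution you get by default.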



On Thu, Feb 18, 2016 at 2:08 PM, Colin Freas <cfr...@stsci.edu> wrote:

>
> Thanks for the info, Anshum.
>
> Writing up a SolrJ program to do this is entirely within my wheelhouse.
>
> Read through some of the SolrJ docs and found some examples to start.
>
> A handful of questions if anyone has some pointers.
>
> 1. From a performance perspective, is it worth it to use
> ConcurrentUpdateSolrServer? Also, documentation says best for updates;
> does that include adding documents?
>
> 2. When I run the importer via my SolrJ program to distribute the
> indexing, I'll create some kind of Solr client within SolrJ and point them
> at zookeeper.  But the communication with the SQL Server db is independent
> of the communication with zookeeper, right?  In that case, is it
> possible/does it make sense to run the SolrJ program on each node, so that
> each node communicates with the DB but they're both communicating with zk?
>
> One more question: for document routing to specific shards, the particular
> documents I have don't really have a natural way for routing.  Even if
> they did, my intuition is that I want the documents randomly and evenly
> distributed across all the machines in the cluster that will perform the
> querying.  Or is that intuition wrong, and it's better to have documents
> that fit a search criteria sorted in some way and placed near each other
> on a single or small number of machines?
>
> Any insights much appreciated!
>
> -Colin
>
>
>
> On 2/18/16, 2:01 AM, "Anshum Gupta" <ans...@anshumgupta.net> wrote:
>
> >Hi Colin,
> >
> >As of when I last checked, DIH works with SolrCloud but has its
> >limitations. It was designed for the non-cloud mode and is single
> >threaded.
> >It runs on whatever node you set it up on and that node might not host the
> >leader for the shard a document belongs to, adding an extra hop for those
> >documents.
> >
> >SolrCloud is designed for multi-threaded indexing and I'd highly recommend
> >using SolrJ to speed up your indexing. Yes, that would involve writing
> >some code but it would speed things up considerably.
> >
> >
> >On Wed, Feb 17, 2016 at 10:51 PM, Colin Freas <cfr...@stsci.edu> wrote:
> >
> >>
> >> I just set up a SolrCloud instance with 2 Solr nodes & another machine
> >> running zookeeper.
> >>
> >> I've imported 200M records from a SQL Server database, and those records
> >> are split nicely between the 2 nodes.  Everything seems ok.
> >>
> >> I did the data import via the admin ui.  It took not quite 8 hours,
> >> which I guess is fine.  So, in the middle of the import I checked to
> >> see what was connected to the SQL Server machine.  It turned out that
> >> only the node that I had started the import on was actually connected
> >> to my database server.
> >>
> >> Is that the expected behavior?  Is there any way to have all nodes of a
> >> SolrCloud index communicate with the database during the indexing?
> >> Would that speed up indexing?  Maybe this isn't a bottleneck I should
> >> be worried about.
> >>
> >> Thanks,
> >> -Colin
> >>
> >
> >
> >
> >--
> >Anshum Gupta
>
>


-- 
Anshum Gupta
