Re: Do all SolrCloud nodes communicate with the database when indexing a collection?

Joel Bernstein Sat, 20 Feb 2016 09:23:54 -0800

In Solr 6 there is also the JdbcStream which could be an interesting tool
for loading data from a relational database. Something to keep in mind if
you plan to upgrade to Solr 6, which is coming out very soon.


The JdbcStream queries a relational database and abstracts the results as a
TupleStream. This allows you to wrap it in an UpdateStream and send the
records to a Solr collection. The syntax would look something like this:

update(collectionName, batchSize="1000",
       jdbc(connection="...", sql="select ..", sort="..."))

If you send this expression to the /stream handler, Solr will execute it
for you.
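
As a rough illustration of sending an expression to the /stream handler:
the handler reads the expression from the "expr" request parameter, so the
text has to be URL-encoded first. The sketch below only builds the request
URL. The host, port, and collection name are assumptions, and StreamRequest
and streamUri are hypothetical names, not part of Solr.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class StreamRequest {
    // Builds the GET URL for a collection's /stream handler. The streaming
    // expression is passed in the "expr" parameter and is URL-encoded here.
    static URI streamUri(String solrBaseUrl, String collection, String expr) {
        String encoded = URLEncoder.encode(expr, StandardCharsets.UTF_8);
        return URI.create(solrBaseUrl + "/" + collection + "/stream?expr=" + encoded);
    }

    public static void main(String[] args) {
        // Placeholders ("...") are kept exactly as elided in the message above.
        String expr = "update(collectionName, batchSize=\"1000\", "
                + "jdbc(connection=\"...\", sql=\"select ..\", sort=\"...\"))";
        // Assumed base URL of a local Solr node.
        System.out.println(streamUri("http://localhost:8983/solr", "collectionName", expr));
    }
}
```

The resulting URL can be requested with any HTTP client; the response is a
stream of tuples in JSON.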

The UpdateStream uses CloudSolrClient to send the documents to Solr.



Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Feb 18, 2016 at 9:32 PM, Anshum Gupta <ans...@anshumgupta.net>
wrote:

> I'd suggest using CloudSolrClient. It uses LBHttpSolrClient under the
> hood and is zk aware, so it would route the documents from the client to
> the right Solr nodes, saving you an extra hop.
> Another thing to remember here is to reuse the Solr client as it is
> thread-safe.
>
> Reading up about commits would also be useful and this blog by Erick
> Erickson is a good place to learn about that:
>
> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> In terms of running SolrJ on each node, you could just run a single
> multi-threaded indexer that gets data from your database and injects it
> into Solr. This process would run outside of Solr and could potentially run
> anywhere.
>
> As far as routing goes, I suggest you just try the default composite id
> router unless you hit issues there. If you do you could read up about how
> routing in SolrCloud works here:
> https://lucidworks.com/blog/2013/06/13/solr-cloud-document-routing/
>
> and also about advanced concepts here:
>
> https://lucidworks.com/blog/2014/01/06/multi-level-composite-id-routing-solrcloud/
>
>
>
> On Thu, Feb 18, 2016 at 2:08 PM, Colin Freas <cfr...@stsci.edu> wrote:
>
> >
> > Thanks for the info, Anshum.
> >
> > Writing up a SolrJ program to do this is entirely within my wheelhouse.
> >
> > Read through some of the SolrJ docs and found some examples to start.
> >
> > A handful of questions if anyone has some pointers.
> >
> > 1. From a performance perspective, is it worth it to use
> > ConcurrentUpdateSolrServer? Also, documentation says best for updates;
> > does that include adding documents?
> >
> > 2. When I run the importer via my SolrJ program to distribute the
> > indexing, I'll create some kind of Solr client within SolrJ and point
> > them at zookeeper.  But the communication with the SQL Server db is
> > independent of the communication with zookeeper, right?  In that case,
> > is it possible/does it make sense to run the SolrJ program on each node,
> > so that each node communicates with the DB but they're both communicating
> > with zk?
> >
> > One more question: for document routing to specific shards, the
> > particular documents I have don't really have a natural way for routing.
> > Even if they did, my intuition is that I want the documents randomly and
> > evenly distributed across all the machines in the cluster that will
> > perform the querying.  Or is that intuition wrong, and it's better to
> > have documents that fit a search criteria sorted in some way and placed
> > near each other on a single or small number of machines?
> >
> > Any insights much appreciated!
> >
> > -Colin
> >
> >
> >
> > On 2/18/16, 2:01 AM, "Anshum Gupta" <ans...@anshumgupta.net> wrote:
> >
> > >Hi Colin,
> > >
> > >As per when I last checked, DIH works with SolrCloud but has its
> > >limitations. It was designed for the non-cloud mode and is single
> > >threaded.
> > >It runs on whatever node you set it up on and that node might not host
> > >the leader for the shard a document belongs to, adding an extra hop for
> > >those documents.
> > >
> > >SolrCloud is designed for multi-threaded indexing and I'd highly
> > >recommend you use SolrJ to speed up your indexing. Yes, that would
> > >involve writing some code but it would speed things up considerably.
> > >
> > >
> > >On Wed, Feb 17, 2016 at 10:51 PM, Colin Freas <cfr...@stsci.edu> wrote:
> > >
> > >>
> > >> I just set up a SolrCloud instance with 2 Solr nodes & another machine
> > >> running zookeeper.
> > >>
> > >> I've imported 200M records from a SQL Server database, and those
> > >> records are split nicely between the 2 nodes.  Everything seems ok.
> > >>
> > >> I did the data import via the admin ui.  It took not quite 8 hours,
> > >> which I guess is fine.  So, in the middle of the import I checked to
> > >> see what was connected to the SQL Server machine.  It turned out that
> > >> only the node that I had started the import on was actually connected
> > >> to my database server.
> > >>
> > >> Is that the expected behavior?  Is there any way to have all nodes of
> > >> a SolrCloud index communicate with the database during the indexing?
> > >> Would that speed up indexing?  Maybe this isn't a bottleneck I should
> > >> be worried about.
> > >>
> > >> Thanks,
> > >> -Colin
> > >>
> > >
> > >
> > >
> > >--
> > >Anshum Gupta
> >
> >
>
>
> --
> Anshum Gupta
>
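
The single multi-threaded indexer suggested in the quoted reply above could
be sketched roughly as follows. This is a minimal outline, not a SolrJ
program: BatchIndexer, DocumentSink, and index are hypothetical names, and
DocumentSink stands in for the one shared, thread-safe Solr client. In a
real indexer the rows would presumably be read from a JDBC ResultSet
instead of an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class BatchIndexer {
    // Hypothetical stand-in for a single shared, thread-safe Solr client.
    interface DocumentSink {
        void add(List<Map<String, Object>> batch) throws Exception;
    }

    // Splits the rows into batches and sends them from a fixed pool of
    // worker threads that all share one sink. Returns the number of rows sent.
    static int index(List<Map<String, Object>> rows, int batchSize, int threads,
                     DocumentSink sink) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int start = 0; start < rows.size(); start += batchSize) {
                int end = Math.min(start + batchSize, rows.size());
                List<Map<String, Object>> batch = rows.subList(start, end);
                pending.add(pool.submit(() -> {
                    sink.add(batch);
                    return batch.size();
                }));
            }
            int sent = 0;
            for (Future<Integer> f : pending) {
                sent += f.get(); // rethrows any failure from a worker thread
            }
            return sent;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Map<String, Object>> rows = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            rows.add(Map.<String, Object>of("id", "doc-" + i));
        }
        AtomicInteger received = new AtomicInteger();
        // The sink here only counts documents; a real one would forward
        // each batch to Solr.
        int sent = index(rows, 1000, 4, batch -> received.addAndGet(batch.size()));
        System.out.println(sent + " sent, " + received.get() + " received");
    }
}
```

Because the process runs outside Solr, the thread count and batch size can
be tuned independently of the cluster.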
