In Solr 6 there is also the JdbcStream, which could be an interesting tool for loading data from a relational database. It is something to keep in mind if you plan to upgrade to Solr 6, which is coming out very soon.
The JdbcStream queries a relational database and abstracts the results as a TupleStream. This allows you to wrap it in an UpdateStream and send the records to a Solr collection. The syntax would look something like this:

update(collectionName, batchSize="1000", jdbc(connection="...", sql="select ..", sort="..."))

If you send this expression to the /stream handler it will execute the expression for you. The UpdateStream uses CloudSolrClient to send the documents to Solr.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Feb 18, 2016 at 9:32 PM, Anshum Gupta <ans...@anshumgupta.net> wrote:

> I'd suggest using CloudSolrClient. It uses ConcurrentUpdateSolrClient
> under the hood and is zk aware so it would route the documents from the
> Client to your Solr nodes correctly, saving you an extra hop.
> Another thing to remember here is to reuse the Solr client as it is
> thread-safe.
>
> Reading up about commits would also be useful and this blog by Erick
> Erickson is a good place to learn about that:
>
> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> In terms of running SolrJ on each node, you could just run a single
> multi-threaded indexer that gets data from your database and injects it
> into Solr. This process would run outside of Solr and could potentially
> run anywhere.
>
> As far as routing goes, I suggest you just try the default composite id
> router unless you hit issues there. If you do you could read up about
> how routing in SolrCloud works here:
> https://lucidworks.com/blog/2013/06/13/solr-cloud-document-routing/
>
> and also about advanced concepts here:
>
> https://lucidworks.com/blog/2014/01/06/multi-level-composite-id-routing-solrcloud/
>
> On Thu, Feb 18, 2016 at 2:08 PM, Colin Freas <cfr...@stsci.edu> wrote:
>
> > Thanks for the info, Anshum.
> >
> > Writing up a SolrJ program to do this is entirely within my wheelhouse.
> >
> > Read through some of the SolrJ docs and found some examples to start.
> >
> > A handful of questions if anyone has some pointers.
> >
> > 1. From a performance perspective, is it worth it to use
> > ConcurrentUpdateSolrServer? Also, documentation says best for updates;
> > does that include adding documents?
> >
> > 2. When I run the importer via my SolrJ program to distribute the
> > indexing, I'll create some kind of Solr client within SolrJ and point
> > them at zookeeper. But the communication with the SQL Server db is
> > independent of the communication with zookeeper, right? In that case,
> > is it possible/does it make sense to run the SolrJ program on each
> > node, so that each node communicates with the DB but they're both
> > communicating with zk?
> >
> > One more question: for document routing to specific shards, the
> > particular documents I have don't really have a natural way for
> > routing. Even if they did, my intuition is that I want the documents
> > randomly and evenly distributed across all the machines in the cluster
> > that will perform the querying. Or is that intuition wrong, and it's
> > better to have documents that fit a search criteria sorted in some way
> > and placed near each other on a single or small number of machines?
> >
> > Any insights much appreciated!
> >
> > -Colin
> >
> > On 2/18/16, 2:01 AM, "Anshum Gupta" <ans...@anshumgupta.net> wrote:
> >
> > > Hi Colin,
> > >
> > > As per when I last checked, DIH works with SolrCloud but has its
> > > limitations. It was designed for the non-cloud mode and is single
> > > threaded. It runs on whatever node you set it up on and that node
> > > might not host the leader for the shard a document belongs to,
> > > adding an extra hop for those documents.
> > >
> > > SolrCloud is designed for multi-threaded indexing and I'd highly
> > > recommend you to use SolrJ to speed up your indexing.
> > > Yes, that would involve writing some code but it would speed things
> > > up considerably.
> > >
> > > On Wed, Feb 17, 2016 at 10:51 PM, Colin Freas <cfr...@stsci.edu> wrote:
> > >
> > > > I just set up a SolrCloud instance with 2 Solr nodes & another
> > > > machine running zookeeper.
> > > >
> > > > I've imported 200M records from a SQL Server database, and those
> > > > records are split nicely between the 2 nodes. Everything seems ok.
> > > >
> > > > I did the data import via the admin ui. It took not quite 8 hours,
> > > > which I guess is fine. So, in the middle of the import I checked
> > > > to see what was connected to the SQL Server machine. It turned out
> > > > that only the node that I had started the import on was actually
> > > > connected to my database server.
> > > >
> > > > Is that the expected behavior? Is there any way to have all nodes
> > > > of a SolrCloud index communicate with the database during the
> > > > indexing? Would that speed up indexing? Maybe this isn't a
> > > > bottleneck I should be worried about.
> > > >
> > > > Thanks,
> > > > -Colin
> > >
> > > --
> > > Anshum Gupta
>
> --
> Anshum Gupta
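
A minimal sketch of how the update(...) expression at the top of this message might be assembled and addressed to the /stream handler, using only the Java standard library. It only builds strings and sends nothing. The host, port, JDBC connection string, SQL and sort values are placeholders invented for illustration, not values from this thread, and the class and method names are hypothetical. It assumes the handler takes the expression in an "expr" request parameter and does no escaping of quotes inside the values.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch only: builds the update(... jdbc(...)) expression text and a
// request URL for the /stream handler. No network call is made.
final class StreamExprSketch {

    // Assemble the expression in the same shape as the one shown in the
    // message. Values are inserted verbatim (no quote escaping), which is
    // an assumption that holds only for simple inputs.
    static String buildUpdateExpr(String collection, int batchSize,
                                  String connection, String sql, String sort) {
        return "update(" + collection
                + ", batchSize=\"" + batchSize + "\""
                + ", jdbc(connection=\"" + connection + "\""
                + ", sql=\"" + sql + "\""
                + ", sort=\"" + sort + "\"))";
    }

    // URL-encode the expression and pass it as the "expr" parameter of
    // <solrBaseUrl>/<collection>/stream.
    static String buildStreamUrl(String solrBaseUrl, String collection, String expr) {
        return solrBaseUrl + "/" + collection + "/stream?expr="
                + URLEncoder.encode(expr, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // All values below are hypothetical placeholders.
        String expr = buildUpdateExpr("collectionName", 1000,
                "jdbc:example://dbhost/mydb", "select id, title from docs", "id asc");
        System.out.println(expr);
        System.out.println(buildStreamUrl("http://localhost:8983/solr", "collectionName", expr));
    }
}
```

The resulting URL could then be requested with any HTTP client; whether the JDBC driver is available to the Solr nodes is a separate matter this sketch does not address.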