DIH is also not designed to multi-thread very well. One way I've handled this is to have a DIH XML that breaks up a database query into multiple processes by taking the modulo of the row number, as follows:
<entity name="medsite" dataSource="oltp01_prod" rootEntity="true"
        query="SELECT * FROM (SELECT t.*, mod(RowNum, 4) threadid FROM your_table t) WHERE threadid = 0"
        transformer="TemplateTransformer,LogTransformer"
        logTemplate="topic thread 0" logLevel="debug">

This allows me to do sub-queries within the entity, but it is often better to just write a small program to get this data from the database, and ETL processors such as Pentaho DI (Kettle) and Talend DI do this quite well.

If you can express what you want in a database view, even a complicated one, then your best way to get it into Solr, IMO, is to use Logstash with the jdbc input plugin. It can do some transformation, but you'll need your database view to process the data.

> -----Original Message-----
> From: Shawn Heisey <elyog...@elyograg.org>
> Sent: Friday, January 4, 2019 12:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: [solr-solrcloud] How does DIH work when there are multiple
> nodes?
>
> On 1/4/2019 1:04 AM, 유정인 wrote:
> > The reader was looking for a way to do 'DIH' automatically.
> >
> > The reason was for HA configuration.
>
> If you send a DIH request to the collection (as opposed to a specific
> core), that request will be load balanced across the cloud. You won't
> know which replica/core actually handles it. This means that an import
> command may be handled by a different host than a status command. In
> that situation, the status command will not know about the import,
> because it will be running on a different Solr core.
>
> When doing DIH on SolrCloud, you should send your requests directly to a
> specific core on a specific node. It's the only way to be sure what's
> happening. High availability would have to be handled in your
> application.
>
> Thanks,
> Shawn
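P.S. For anyone following along: the modulo trick above generalizes to one entity per partition, each started by its own import request so the partitions run in parallel. A sketch only — the `_t0`..`_t3` entity names and the choice of 4 partitions are illustrative, not a tested config:

```xml
<!-- Sketch: one entity per partition of mod(RowNum, 4).
     Entity names are made up; dataSource/query follow the example above.
     Each entity is driven by a separate import request so the four
     partitions can be pulled from the database concurrently. -->
<entity name="medsite_t0" dataSource="oltp01_prod" rootEntity="true"
        query="SELECT * FROM (SELECT t.*, mod(RowNum, 4) threadid FROM your_table t) WHERE threadid = 0"/>
<entity name="medsite_t1" dataSource="oltp01_prod" rootEntity="true"
        query="SELECT * FROM (SELECT t.*, mod(RowNum, 4) threadid FROM your_table t) WHERE threadid = 1"/>
<entity name="medsite_t2" dataSource="oltp01_prod" rootEntity="true"
        query="SELECT * FROM (SELECT t.*, mod(RowNum, 4) threadid FROM your_table t) WHERE threadid = 2"/>
<entity name="medsite_t3" dataSource="oltp01_prod" rootEntity="true"
        query="SELECT * FROM (SELECT t.*, mod(RowNum, 4) threadid FROM your_table t) WHERE threadid = 3"/>
```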
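And to illustrate Shawn's point about collection-level vs. core-level requests — a sketch with hypothetical host, collection, and core names (the replica core name depends on your cluster; check the Cores screen in the admin UI):

```shell
# Collection-level: load balanced across the cloud, so a later status
# request may land on a different core than the import did.
curl "http://solr1:8983/solr/mycollection/dataimport?command=full-import"

# Core-level: import and status are guaranteed to hit the same core.
curl "http://solr1:8983/solr/mycollection_shard1_replica_n1/dataimport?command=full-import"
curl "http://solr1:8983/solr/mycollection_shard1_replica_n1/dataimport?command=status"
```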