Yangrui,

Let me clarify - to have multiple data imports run concurrently, my impression is that you must have different requestHandlers declared in your solrconfig.xml. By default, Data Import Handler is not multi-threaded; having multiple requestHandlers for it is a workaround for this, not a fix.
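For illustration, a minimal sketch of two such handlers in solrconfig.xml might look like the following. The handler names are hypothetical; the config file names are the ones mentioned further down this thread.

```xml
<!-- Hypothetical handler names; each handler points at its own DIH config file. -->
<requestHandler name="/dataimport-health-topics"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">health-topics-conf.xml</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport-encyclopedia"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">encyclopedia-conf.xml</str>
  </lst>
</requestHandler>
```

Each handler is then invoked on its own path with command=full-import, so a script can start several of them and let them run side by side.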
I should also say that in newer projects I'm trying Talend OpenStudio to run the database queries and push data to Solr. Talend OpenStudio allows the same sort of transformations that are possible in Data Import Handler, and it seems to me more independent of SolrCloud than Data Import Handler is. There are many different ways to do it.

-----Original Message-----
From: Davis, Daniel (NIH/NLM) [C]
Sent: Tuesday, April 05, 2016 5:40 PM
To: solr-user@lucene.apache.org
Subject: RE: Multiple data-config.xml in one collection?

Yangrui,

Solr will just do one data import, the one for the handler you invoke. You can have a script invoke more than one, and they will run concurrently. There are some risks with that, depending on what you are doing. If it's just pulling from a database, I think you are all right. I've even had 4 run concurrently to make Data Import Handler be "multi-threaded". My query in one case looks like this:

    SELECT * FROM
      (SELECT t.*, Mod(RowNum, 4) threadid
         FROM medplus.public_topic_sites_us_v t)
    WHERE threadid = 0

And then I have 3 other queries in other DIH configurations for threadid 1, 2 and 3.

You also have to be careful with the clean parameter - unless a specific delete query is specified using "preImportDeleteQuery" or "postImportDeleteQuery", the clean parameter will cause DIH to remove the index data loaded by all data import handlers, even though you are only refreshing one. If you configure it carefully, however, it all works.

These are the use cases for the "source" field I use:

- Filter only on documents from one source for the user, by specifying fq=source:health-topics in the query to Solr.
- Filter only documents from one source in backend processing, for instance for the preImportDeleteQuery.
- Do something different in the application that front-ends Solr depending on the "source" field value.
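To make the point about clean and the delete query concrete, here is a hedged sketch of the entity part of one DIH config. The entity name, view name and "source" value are taken from this thread; the plain SELECT and the omitted dataSource element are simplifications for illustration, not the actual config.

```xml
<dataConfig>
  <!-- dataSource element omitted here; it depends on your database and driver. -->
  <document>
    <!-- preImportDeleteQuery is honored on a top-level entity: with clean=true,
         only documents matching it are deleted before the import, not the whole index. -->
    <entity name="topic"
            query="SELECT * FROM medplus.public_topic_sites_us_v"
            preImportDeleteQuery="source:health-topics"
            transformer="TemplateTransformer">
      <!-- Stamp every document from this config with a constant "source" value. -->
      <field column="source" template="health-topics" />
    </entity>
  </document>
</dataConfig>
```

With something like this in place, a full-import with clean=true on this handler should remove only the documents matching source:health-topics before re-importing, and leave documents from the other configs alone.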
There are some impacts on relevancy from combining them into one collection. When you combine multiple sources into one collection, whether using DIH or some other mechanism, you have to remember that Solr's relevancy calculations include documents from every source. Even if documents with different "source" values are queried independently (through filter queries such as fq=source:health-topics), the frequency of a word in the entire collection is still a factor. On the other hand, you can query them together, though you may have to carefully tune the weighting of the documents so that a large corpus doesn't dwarf a small one (unless that is appropriate). As always, relevancy gets pretty tricky.

Hope this helps,

Dan Davis

-----Original Message-----
From: Yangrui Guo [mailto:guoyang...@gmail.com]
Sent: Tuesday, April 05, 2016 3:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Multiple data-config.xml in one collection?

Hi Daniel,

So if I implement multiple DataImportHandlers and do a full import, does Solr run the import for all handlers at once, or can I specify which handler to import?

Thank you
Yangrui

On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote:

> If Shawn is correct, and you are using DIH, then I have done this by
> implementing multiple requestHandlers, each of them using Data Import
> Handler, and having each specify a different XML file for the data config.
> Instead of using data-config.xml, I've used a large number of files such as:
>     health-topics-conf.xml
>     encyclopedia-conf.xml
>     ...
> I tend to index a single-valued, required field named "source" that I
> can use in the delete query, and I use the TemplateTransformer to make this
> easy:
>
>   <entity name="topic"
>           ...
>           transformer="TemplateTransformer">
>     <field column="source" template="health-topics" />
>     ...
>
> Hope this helps,
>
> -Dan
>
> -----Original Message-----
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Tuesday, April 05, 2016 10:50 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Multiple data-config.xml in one collection?
>
> On 4/5/2016 8:12 AM, Yangrui Guo wrote:
> > I'm using Solr Cloud to index a number of databases. The problem is
> > there is an unknown number of databases and each database has its own
> > configuration.
> > If I create a single collection for every database the query would
> > eventually become insanely long. Is it possible to upload different
> > config to zookeeper for each node in a single collection?
>
> Every shard replica (core) in a collection shares the same
> configuration, which it gets from zookeeper. This is one of
> SolrCloud's guarantees, to prevent problems found with old-style
> sharding when the configuration is different on each machine.
>
> If you're using the dataimport handler, which you probably are since
> you mentioned databases, you can parameterize pretty much everything
> in the DIH config file so it comes from URL parameters on the
> full-import or delta-import command.
>
> Below is a link to the DIH config that I'm using, redacted slightly.
> I'm not running SolrCloud, but the same thing should work in cloud.
> It should give you some idea of how to use variables in your config,
> set by parameters on the URL.
>
> http://apaste.info/jtq
>
> Thanks,
> Shawn