Thanks man. I'd love to learn more about the Talend OpenStudio project you're working on. Is it based on Lucene/Solr or a different project?
On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote:

> Yangrui,
>
> Let me clarify: to have multiple data imports run concurrently, my
> impression is that you must declare different requestHandlers in your
> solrconfig.xml. By default, Data Import Handler is not multi-threaded;
> having multiple requestHandlers for it is a workaround, not a fix.
>
> I should also say that in newer projects I am trying to use Talend
> OpenStudio to do the database queries and push data to Solr. Talend
> OpenStudio allows the same sort of transformations as Data Import
> Handler, and seems to me more independent of SolrCloud than Data
> Import Handler. There are many different ways to do it.
>
> -----Original Message-----
> From: Davis, Daniel (NIH/NLM) [C]
> Sent: Tuesday, April 05, 2016 5:40 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Multiple data-config.xml in one collection?
>
> Yangrui,
>
> Solr will run only the one data import you invoke. You can have a script
> invoke more than one, and they will run concurrently. There are some
> risks with that, depending on what you are doing. If it's just pulling
> from a database, I think you are all right. I've even had 4 run
> concurrently to make Data Import Handler "multi-threaded". My query in
> one case looks like this:
>
> SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid FROM
> medplus.public_topic_sites_us_v t) WHERE threadid = 0
>
> And then I have 3 other queries in other DIH configurations for
> threadid 1, 2, 3.
>
> You also have to be careful with the clean parameter: unless a specific
> delete query is specified using "preImportDeleteQuery" or
> "postImportDeleteQuery", the clean parameter will cause DIH to remove
> the index data from all data import handlers even though you are only
> refreshing one. If you configure it carefully, however, it all works.
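A rough sketch of the "script that invokes more than one import" idea described above, assuming four request handlers named /dataimport-0 through /dataimport-3 on a collection called "topics" (both names are made up for illustration, as is the host). It only builds the partitioned SQL and the full-import URLs; command, clean and commit are standard DIH request parameters, and clean=false is used so that one import does not wipe what the others loaded.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical values: adjust to your own Solr host, collection, and handlers.
SOLR_BASE = "http://localhost:8983/solr/topics"
HANDLERS = [f"/dataimport-{i}" for i in range(4)]


def partition_query(view, threads, thread_id):
    """Return the SQL for one partition, following the Mod(RowNum, n) pattern."""
    return (
        f"SELECT * FROM (SELECT t.*, Mod(RowNum, {threads}) threadid "
        f"FROM {view} t) WHERE threadid = {thread_id}"
    )


def full_import_url(base, handler, clean=False):
    """Build a DIH full-import URL.

    clean defaults to False here so that refreshing one handler does not
    delete documents loaded by the other handlers.
    """
    params = {
        "command": "full-import",
        "clean": "true" if clean else "false",
        "commit": "true",
    }
    return f"{base}{handler}?{urlencode(params)}"


def start_imports(base, handlers):
    """Send one full-import request per handler and print the HTTP status."""
    for handler in handlers:
        with urlopen(full_import_url(base, handler)) as response:
            print(handler, response.status)
```

Calling start_imports(SOLR_BASE, HANDLERS) against a running Solr would send the four requests; each handler's data config would carry the query returned by partition_query for its own threadid.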
>
> These are the use cases for the "source" field I use:
>
> - Filter only on documents from one source for the user, by specifying
>   fq=source:health-topics in the query to Solr.
> - Filter only documents from one source in backend processing, for
>   instance for the preImportDeleteQuery.
> - Do something different in the application that front-ends Solr,
>   depending on the "source" field value.
>
> Combining them into one collection has some impact on relevancy:
>
> When you combine multiple sources into one collection, whether using DIH
> or some other mechanism, you have to remember that Solr's relevancy
> calculations include documents from all sources. Even if documents with
> different "source" values are queried independently (through filter
> queries, such as fq=source:health-topics), the frequency of a word in
> the entire collection is a factor.
>
> However, you can query them together, even if you have to carefully tune
> the weighting of the documents so that a large corpus doesn't dwarf a
> small one (unless that is appropriate). As always, relevancy gets pretty
> tricky.
>
> Hope this helps,
>
> Dan Davis
>
> -----Original Message-----
> From: Yangrui Guo [mailto:guoyang...@gmail.com]
> Sent: Tuesday, April 05, 2016 3:16 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Multiple data-config.xml in one collection?
>
> Hi Daniel,
>
> So if I implement multiple DataImportHandlers and do a full import, does
> Solr run the import for all handlers at once, or can I specify which
> handler to import? Thank you
>
> Yangrui
>
> On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C]
> <daniel.da...@nih.gov> wrote:
>
> > If Shawn is correct, and you are using DIH, then I have done this by
> > implementing multiple requestHandlers, each of them using Data Import
> > Handler, and having each specify a different XML file for the data config.
> > Instead of using data-config.xml, I've used a large number of files such as:
> >
> > health-topics-conf.xml
> > encyclopedia-conf.xml
> > ...
> >
> > I tend to index a single-valued, required field named "source" that I
> > can use in the delete query, and I use the TemplateTransformer to make
> > this easy:
> >
> > <entity name="topic"
> >         ...
> >         transformer="TemplateTransformer">
> >   <field column="source" template="health-topics" />
> >   ...
> >
> > Hope this helps,
> >
> > -Dan
> >
> > -----Original Message-----
> > From: Shawn Heisey [mailto:apa...@elyograg.org]
> > Sent: Tuesday, April 05, 2016 10:50 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Multiple data-config.xml in one collection?
> >
> > On 4/5/2016 8:12 AM, Yangrui Guo wrote:
> > > I'm using Solr Cloud to index a number of databases. The problem is
> > > there is an unknown number of databases and each database has its own
> > > configuration.
> > > If I create a single collection for every database the query would
> > > eventually become insanely long. Is it possible to upload a different
> > > config to zookeeper for each node in a single collection?
> >
> > Every shard replica (core) in a collection shares the same
> > configuration, which it gets from zookeeper. This is one of
> > SolrCloud's guarantees, to prevent problems found with old-style
> > sharding when the configuration is different on each machine.
> >
> > If you're using the dataimport handler, which you probably are since
> > you mentioned databases, you can parameterize pretty much everything
> > in the DIH config file so it comes from URL parameters on the
> > full-import or delta-import command.
> >
> > Below is a link to the DIH config that I'm using, redacted slightly.
> > I'm not running SolrCloud, but the same thing should work in cloud.
> > It should give you some idea of how to use variables in your config,
> > set by parameters on the URL.
> >
> > http://apaste.info/jtq
> >
> > Thanks,
> > Shawn
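For the URL-parameter approach Shawn describes, here is a minimal sketch of what the calling side might look like. It is not taken from his linked config: the parameter names dbHost and dbName, the handler path, and the host are all invented for illustration. The idea is that the DIH config would reference such values as ${dataimporter.request.dbHost} and ${dataimporter.request.dbName}, and the import command supplies them on the URL.

```python
from urllib.parse import urlencode


def dih_command_url(base, handler, command, **request_params):
    """Build a DIH command URL that carries extra request parameters.

    Each extra parameter can be referenced in the DIH config as
    ${dataimporter.request.<name>}. The names used by callers are up to
    them; nothing here is specific to any particular config file.
    """
    params = {"command": command, **request_params}
    return f"{base}{handler}?{urlencode(params)}"


# Hypothetical example: one handler, one config, a different database per call.
example = dih_command_url(
    "http://localhost:8983/solr/topics",
    "/dataimport",
    "full-import",
    dbHost="db1.example.com",
    dbName="topics",
)
print(example)
```

With this pattern a single data config can serve many databases, since only the URL changes between imports.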