Yangrui,

Let me clarify - to have multiple data imports run concurrently, my impression 
is that you must declare a separate requestHandler for each one in your 
solrconfig.xml.  By default, Data Import Handler is not multi-threaded; having 
multiple requestHandlers for it is a workaround for this, not a fix.

I should also say that in newer projects I am trying Talend Open Studio to run 
the database queries and push the data to Solr.  Talend Open Studio allows the 
same sort of transformations that are possible in Data Import Handler, and it 
seems to me more independent of SolrCloud than Data Import Handler is.  There 
are many different ways to do it.

-----Original Message-----
From: Davis, Daniel (NIH/NLM) [C] 
Sent: Tuesday, April 05, 2016 5:40 PM
To: solr-user@lucene.apache.org
Subject: RE: Multiple data-config.xml in one collection?

Yangrui,

Solr will run only the data import you request.  You can have a script invoke 
more than one, and they will run concurrently.  There are some risks with that, 
depending on what you are doing.  If it's just pulling from a database, I think 
you are all right.  I've even had 4 run concurrently to make Data Import 
Handler "multi-threaded".  My query in one case looks like this:

        SELECT * FROM (
            SELECT t.*, Mod(RowNum, 4) threadid
            FROM medplus.public_topic_sites_us_v t
        ) WHERE threadid = 0

And then I have 3 other queries in other DIH configurations for threadid 1,2,3.
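
As a rough sketch of how the four queries relate, the snippet below generates 
one query per partition from the query shown above.  The function name 
partition_queries is hypothetical and only for illustration; each generated 
query would still go into its own DIH configuration file.

```python
from typing import List


def partition_queries(table: str, partitions: int) -> List[str]:
    """Return one SELECT per partition, split on Mod(RowNum, partitions)."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    template = (
        "SELECT * FROM (SELECT t.*, Mod(RowNum, {n}) threadid "
        "FROM {table} t) WHERE threadid = {i}"
    )
    return [
        template.format(n=partitions, table=table, i=i)
        for i in range(partitions)
    ]


if __name__ == "__main__":
    # Prints the queries for threadid 0, 1, 2 and 3.
    for query in partition_queries("medplus.public_topic_sites_us_v", 4):
        print(query)
```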

You also have to be careful with the clean parameter - unless a specific delete 
query is specified using "preImportDeleteQuery" or "postImportDeleteQuery", the 
clean parameter will cause DIH to remove the index data for all data import 
handlers even though you are only refreshing one.  If you configure it 
carefully, however, it all works.
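
Here is a minimal sketch of the kind of script I mean, assuming a collection 
named "medplus" and handlers registered at /dataimport-health-topics and 
/dataimport-encyclopedia (all hypothetical names, as is the host).  It builds a 
full-import URL for each handler with the command, clean, and commit 
parameters, and sends the requests from a thread pool.  The fetch argument is 
injected so the URL handling can be tried without a running Solr.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen


def import_url(base: str, collection: str, handler: str, clean: bool) -> str:
    """Build a DIH full-import URL for one request handler."""
    params = urlencode({
        "command": "full-import",
        "clean": "true" if clean else "false",
        "commit": "true",
    })
    return f"{base}/{collection}/{handler.lstrip('/')}?{params}"


def http_get(url: str) -> str:
    """Default fetcher: plain HTTP GET that returns the response body."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def start_imports(urls: List[str],
                  fetch: Callable[[str], str] = http_get) -> Dict[str, str]:
    """Send all import requests concurrently; map each URL to its response."""
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))


if __name__ == "__main__":
    # Hypothetical host, collection, and handler names.
    handlers = ["/dataimport-health-topics", "/dataimport-encyclopedia"]
    urls = [import_url("http://localhost:8983/solr", "medplus", h, clean=True)
            for h in handlers]
    for url in urls:
        print(url)
    # start_imports(urls)  # uncomment against a real Solr instance
```

Passing clean=True like this is only safe if each configuration restricts its 
delete with a preImportDeleteQuery, for the reason given above.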

These are the use cases for the "source" field I use:

- Filter only on documents from one source for the user, by specifying 
fq=source:health-topics in the query to Solr.
- Filter only documents from one source in backend processing, for instance for 
the preImportDeleteQuery.
- Do something different in the application that front-ends Solr depending on 
the "source" field value.
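
For the first use case, the filter is just an extra fq parameter on the select 
request.  A small sketch, assuming a hypothetical host and a collection named 
"medplus"; the helper name select_url is also made up for illustration:

```python
from urllib.parse import urlencode


def select_url(base: str, collection: str, query: str, source: str) -> str:
    """Build a /select URL restricted to one value of the "source" field."""
    params = urlencode({"q": query, "fq": f"source:{source}"})
    return f"{base}/{collection}/select?{params}"


if __name__ == "__main__":
    # Hypothetical host and collection name.
    print(select_url("http://localhost:8983/solr", "medplus",
                     "diabetes", "health-topics"))
```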

There are some impacts on relevancy from combining them into one collection:

When you combine multiple sources into one collection, whether using DIH or 
some other mechanism, you have to remember that the relevancy calculations of 
Solr include documents from all of the sources.  Even if documents with 
different "source" values are queried independently (through filter queries 
such as fq=source:health-topics), the frequency of a word in the entire 
collection is still a factor.

However, you can query them together, even if you have to carefully tune 
weighting of the documents so that a large corpus doesn't dwarf a small one 
(unless it is appropriate).   As always, relevancy gets pretty tricky.
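
To make the point about collection-wide frequency concrete, here is a small 
worked example with made-up document counts.  It uses a BM25-style inverse 
document frequency, ln(1 + (N - df + 0.5) / (df + 0.5)); which similarity your 
Solr actually applies depends on the version and schema, so treat this as an 
illustration of the effect rather than Solr's exact scoring.

```python
import math


def idf(doc_count: int, doc_freq: int) -> float:
    """BM25-style inverse document frequency for one term."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


if __name__ == "__main__":
    # Made-up numbers: a small source where the term is common,
    # and a large source where the same term is rare.
    small_docs, small_df = 1_000, 500
    large_docs, large_df = 99_000, 100

    alone = idf(small_docs, small_df)
    combined = idf(small_docs + large_docs, small_df + large_df)

    print(f"small source indexed alone: idf = {alone:.2f}")
    print(f"both sources in one index:  idf = {combined:.2f}")
```

With these numbers the term looks far rarer in the combined index than it does 
in the small source alone, so it is weighted more heavily even when a filter 
query restricts the results to the small source.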

Hope this helps,

Dan Davis

-----Original Message-----
From: Yangrui Guo [mailto:guoyang...@gmail.com]
Sent: Tuesday, April 05, 2016 3:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Multiple data-config.xml in one collection?

Hi Daniel,

So if I implement multiple dataimporthandlers and do a full import, does Solr 
run the import for all handlers at once, or can I specify which handler to 
import? Thank you

Yangrui

On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov>
wrote:

> If Shawn is correct, and you are using DIH, then I have done this by 
> implementing multiple requestHandlers each of them using Data Import 
> Handler, and have each specify a different XML file for the data config.
> Instead of using data-config.xml, I've used a large number of files such as:
>         health-topics-conf.xml
>         encyclopedia-conf.xml
>         ...
> I tend to index a single-valued, required field named "source" that I 
> can use in the delete query, and I use the TemplateTransformer to make this 
> easy:
>
> <entity name="topic"
>     ...
>    transformer="TemplateTransformer">
>    <field column="source" template="health-topics" />
>    ...
>
> Hope this helps,
>
> -Dan
>
> -----Original Message-----
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Tuesday, April 05, 2016 10:50 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Multiple data-config.xml in one collection?
>
> On 4/5/2016 8:12 AM, Yangrui Guo wrote:
> > I'm using Solr Cloud to index a number of databases. The problem is 
> > there is an unknown number of databases and each database has its own
> > configuration.
> > If I create a single collection for every database the query would 
> > eventually become insanely long. Is it possible to upload different 
> > config to zookeeper for each node in a single collection?
>
> Every shard replica (core) in a collection shares the same 
> configuration, which it gets from zookeeper.  This is one of 
> SolrCloud's guarantees, to prevent problems found with old-style 
> sharding when the configuration is different on each machine.
>
> If you're using the dataimport handler, which you probably are since 
> you mentioned databases, you can parameterize pretty much everything 
> in the DIH config file so it comes from URL parameters on the 
> full-import or delta-import command.
>
> Below is a link to the DIH config that I'm using, redacted slightly.
> I'm not running SolrCloud, but the same thing should work in cloud.  
> It should give you some idea of how to use variables in your config, 
> set by parameters on the URL.
>
> http://apaste.info/jtq
>
> Thanks,
> Shawn
>
>
