Thanks man. I'd love to learn more about the Talend OpenStudio project
you're working on. Is it based on Lucene/Solr or a different project?

On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov>
wrote:

> Yangrui,
>
> Let me clarify - to have multiple data imports run concurrently, my
> impression is that you must have different requestHandlers declared in your
> solrconfig.xml.
> By default, Data Import Handler is not multi-threaded; having multiple
> requestHandlers for it is a workaround for this, not a fix.
>
> I also have to say that I'm trying in newer projects to work with Talend
> OpenStudio to do the database queries and push data to Solr.  Talend
> OpenStudio allows the same sort of transformations as possible in Data
> Import Handler, and seems to me more independent of SolrCloud than Data
> Import Handler.  There are many different ways to do it.
>
> -----Original Message-----
> From: Davis, Daniel (NIH/NLM) [C]
> Sent: Tuesday, April 05, 2016 5:40 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Multiple data-config.xml in one collection?
>
> Yangrui,
>
> Solr will just run the one data import you invoke.    You can have a
> script invoke more
> than one, and they will run concurrently.   There are some risks with that,
> depending on what you are doing.   If it's just pulling from a database, I
> think you are all right.   I've even had 4 run concurrently to make Data
> Import Handler be "multi-threaded".   My query in one case looks like this:
>
>         SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid FROM
> medplus.public_topic_sites_us_v t) WHERE threadid = 0
>
> And then I have 3 other queries in other DIH configurations for threadid
> 1,2,3.
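
A minimal sketch of such a script, assuming four request handlers named /dataimport-0 through /dataimport-3 and a collection called "mycollection" on localhost (all hypothetical names, not from the setup described above). It also shows how the four partition queries could be generated; each one would go into its own DIH configuration file.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical values: adjust to your own Solr host, collection, and handler names.
SOLR_BASE = "http://localhost:8983/solr/mycollection"
HANDLERS = ["/dataimport-0", "/dataimport-1", "/dataimport-2", "/dataimport-3"]


def partition_query(table, thread_id, thread_count):
    """Build the Oracle-style query that selects one modulo slice of the rows."""
    return (
        "SELECT * FROM (SELECT t.*, Mod(RowNum, {n}) threadid FROM {table} t) "
        "WHERE threadid = {i}"
    ).format(n=thread_count, table=table, i=thread_id)


def import_url(handler, clean=False):
    """URL that asks one DIH request handler to start a full import."""
    params = {"command": "full-import", "clean": str(clean).lower(), "commit": "true"}
    return SOLR_BASE + handler + "?" + urlencode(params)


def start_all(handlers=HANDLERS):
    """Send the full-import command to each handler in turn.

    Assumption: each request returns once the import has been started, so the
    imports overlap even though the commands are sent one after another.
    """
    statuses = []
    for handler in handlers:
        with urlopen(import_url(handler)) as response:
            statuses.append(response.status)
    return statuses
```

Calling start_all() needs a running Solr; partition_query() and import_url() can be checked without one. clean defaults to False here so that one handler does not wipe what the others have indexed.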
>
> You also have to be careful with the clean parameter - unless a specific
> delete query is specified using the "preImportDeleteQuery" or
> "postImportDeleteQuery", the clean parameter will cause DIH to
> remove the index data from all data import handlers even though you are
> only refreshing one.   If you configure it carefully, it all works, however.
>
> These are the use cases for the "source" field I use:
>
> - Filter only on documents from one source for the user, by specifying
> fq=source:health-topics in the query to Solr.
> - Filter only documents from one source in backend processing, for
> instance for the preImportDeleteQuery.
> - Do something different in the application that front-ends Solr depending
> on the "source" field value.
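
For the first use case, a small sketch of how an application might add the filter query to a request. The base URL, the helper name select_url, and the example search term are assumptions for illustration; only the fq=source:health-topics filter comes from the list above.

```python
from urllib.parse import urlencode


def select_url(base, user_query, source=None):
    """Build a /select URL, optionally restricted to one value of the "source" field."""
    params = [("q", user_query)]
    if source is not None:
        # The filter query narrows the result set without changing the scoring.
        params.append(("fq", "source:" + source))
    return base + "/select?" + urlencode(params)


# Hypothetical collection URL and search term.
url = select_url("http://localhost:8983/solr/mycollection", "asthma", "health-topics")
print(url)
```

Leaving source as None queries all sources together.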
>
> There are some impacts on relevancy from combining them into one
> collection:
>
> When you combine multiple sources into one collection, whether using DIH
> or some other mechanism, you have to remember that the relevancy
> calculations of Solr include documents from all sources.   Even if
> documents having different "source" values are queried independently
> (through filter queries, such as fq=source:health-topics), the frequency of
> a word in the entire collection is a factor.
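
To make the collection-wide effect concrete, here is a rough worked example. It assumes the classic Lucene TF-IDF inverse document frequency, 1 + ln(numDocs / (docFreq + 1)); the similarity configured in your Solr version may differ, and the document counts are invented for illustration.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Inverse document frequency as in Lucene's classic TF-IDF similarity (assumed here)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Invented numbers: a small source of 1,000 documents in which a term appears in 500,
# combined with a large source of 99,000 documents that never uses the term.
idf_alone = classic_idf(doc_freq=500, num_docs=1000)
idf_combined = classic_idf(doc_freq=500, num_docs=100000)

# The term looks far rarer, and so weighs more, once the large source is in the same index.
print(idf_alone, idf_combined)
```

A filter query on "source" does not change this: it restricts which documents are returned, while the term statistics still come from the whole index.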
>
> However, you can query them together, even if you have to carefully tune
> weighting of the documents so that a large corpus doesn't dwarf a small one
> (unless it is appropriate).   As always, relevancy gets pretty tricky.
>
> Hope this helps,
>
> Dan Davis
>
> -----Original Message-----
> From: Yangrui Guo [mailto:guoyang...@gmail.com]
> Sent: Tuesday, April 05, 2016 3:16 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Multiple data-config.xml in one collection?
>
> Hi Daniel,
>
> So if I implement multiple dataimporthandlers and do a full import, does
> Solr perform the import of all handlers at once, or can I just specify
> which handler to import? Thank you
>
> Yangrui
>
> On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov>
> wrote:
>
> > If Shawn is correct, and you are using DIH, then I have done this by
> > implementing multiple requestHandlers each of them using Data Import
> > Handler, and have each specify a different XML file for the data config.
> > Instead of using data-config.xml, I've used a large number of files such
> > as:
> >         health-topics-conf.xml
> >         encyclopedia-conf.xml
> >         ...
> > I tend to index a single-valued, required field named "source" that I
> > can use in the delete query, and I use the TemplateTransformer to make
> > this easy:
> >
> > <entity name="topic"
> >     ...
> >    transformer="TemplateTransformer">
> >    <field column="source" template="health-topics" />
> >    ...
> >
> > Hope this helps,
> >
> > -Dan
> >
> > -----Original Message-----
> > From: Shawn Heisey [mailto:apa...@elyograg.org]
> > Sent: Tuesday, April 05, 2016 10:50 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Multiple data-config.xml in one collection?
> >
> > On 4/5/2016 8:12 AM, Yangrui Guo wrote:
> > > I'm using Solr Cloud to index a number of databases. The problem is
> > > there is an unknown number of databases and each database has its own
> > > configuration.
> > > If I create a single collection for every database the query would
> > > eventually become insanely long. Is it possible to upload different
> > > config to zookeeper for each node in a single collection?
> >
> > Every shard replica (core) in a collection shares the same
> > configuration, which it gets from zookeeper.  This is one of
> > SolrCloud's guarantees, to prevent problems found with old-style
> > sharding when the configuration is different on each machine.
> >
> > If you're using the dataimport handler, which you probably are since
> > you mentioned databases, you can parameterize pretty much everything
> > in the DIH config file so it comes from URL parameters on the
> > full-import or delta-import command.
> >
> > Below is a link to the DIH config that I'm using, redacted slightly.
> > I'm not running SolrCloud, but the same thing should work in cloud.
> > It should give you some idea of how to use variables in your config,
> > set by parameters on the URL.
> >
> > http://apaste.info/jtq
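
A small sketch of the calling side of this approach. It assumes the DIH config refers to request parameters with placeholders of the form ${dataimporter.request.NAME}; the parameter names dbHost and dbName, the core URL, and the helper name are hypothetical and are not taken from the linked config.

```python
from urllib.parse import urlencode


def parameterized_import_url(base, handler, command, **custom):
    """Build a DIH command URL that carries extra parameters for the config to pick up."""
    params = {"command": command}
    # Each custom parameter would be read in the DIH config as
    # ${dataimporter.request.<name>} (hypothetical names here).
    params.update(custom)
    return base + handler + "?" + urlencode(params)


url = parameterized_import_url(
    "http://localhost:8983/solr/mycore",
    "/dataimport",
    "full-import",
    dbHost="db1.example.com",
    dbName="topics",
)
print(url)
```

With this pattern, one shared config can serve many databases because the per-database details travel on the URL rather than living in the config stored in zookeeper.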
> >
> > Thanks,
> > Shawn
> >
> >
>
