There is already an issue open to write to the index in a separate thread: https://issues.apache.org/jira/browse/SOLR-1089
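The general shape of that idea -- produce documents in one thread and hand them off to a dedicated indexing thread -- looks roughly like the sketch below. This is not the SOLR-1089 patch itself; the Indexer interface and the plain Object document type are placeholders for whatever actually performs the analysis and the write.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch only: the producer (e.g. an entity processor) submits
// documents, while a separate thread performs the expensive
// analysis/tokenization and index write.
public class SeparateThreadWriter {

    // Poison pill telling the consumer that no more documents are coming.
    private static final Object EOF = new Object();

    private final BlockingQueue<Object> queue = new ArrayBlockingQueue<Object>(1000);

    // Placeholder for whatever actually writes a document to the index.
    public interface Indexer {
        void index(Object doc) throws Exception;
    }

    public Thread startConsumer(final Indexer indexer) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        Object doc = queue.take();   // blocks until a doc is available
                        if (doc == EOF) break;       // producer is done
                        indexer.index(doc);          // analysis/indexing happens here
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        t.start();
        return t;
    }

    // Called by the producer for each row/document.
    public void submit(Object doc) throws InterruptedException {
        queue.put(doc);
    }

    // Called once the producer has no more documents.
    public void finish() throws InterruptedException {
        queue.put(EOF);
    }
}

A bounded queue keeps a fast producer from running ahead of the indexer and exhausting memory.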
On Tue, Apr 28, 2009 at 4:15 AM, Shalin Shekhar Mangar
<shalinman...@gmail.com> wrote:
> On Tue, Apr 28, 2009 at 3:43 AM, Amit Nithian <anith...@gmail.com> wrote:
>
>> All,
>> I have a few questions regarding the data import handler. We have some
>> pretty gnarly SQL queries to load our indices, and our current loader
>> implementation is extremely fragile. I am looking to migrate over to the
>> DIH; however, I am looking to use SolrJ + EmbeddedSolr + some custom
>> stuff to remotely load the indices so that my index loader and main
>> search engine are separated.
>
> Currently, if you want to use DIH, the Solr master doubles up as the
> index loader as well.
>
>> Currently, unless I am missing something, the data gathering from the
>> entity and the data processing (i.e. conversion to a Solr document) are
>> done sequentially, and I was looking to make this execute in parallel so
>> that I can have multiple threads processing different parts of the
>> result set and loading documents into Solr. Secondly, I need to create
>> temporary tables to store the results of a few queries and use them
>> later for inner joins, and I was wondering how best to go about this.
>>
>> I am thinking to add support in DIH for the following:
>> 1) Temporary tables (maybe call them temporary entities)? -- Specific
>> only to SQL, unless it can be generalized to other sources.
>
> Pretty specific to DBs. However, isn't this something that can be done
> in your database with views?
>
>> 2) Parallel support
>
> Parallelizing the import of root entities might be the easiest to
> attempt. There's also an issue open to write to Solr
> (tokenization/analysis) in a separate thread. Look at
> https://issues.apache.org/jira/browse/SOLR-1089
>
> We actually wrote a multi-threaded DIH during the initial iterations,
> but we discarded it because we found that the bottleneck was usually the
> database (too many queries) or Lucene indexing itself (analysis,
> tokenization, etc.). The improvement was ~10%, but it made the code
> substantially more complex.
>
> The only scenario in which it helped a lot was importing from HTTP or a
> remote database (slow networks). But if you think it can help in your
> scenario, I'd say go for it.
>
>> - Including some mechanism to get the number of records (whether it be
>> a count or MAX(custom_id) - MIN(custom_id))
>
> Not sure what you mean here.
>
>> 3) Support in DIH or Solr to post documents to a remote index (i.e.
>> create a new UpdateHandler instead of DirectUpdateHandler2).
>
> SolrJ integration would be helpful to many, I think. There's an issue
> open. Look at https://issues.apache.org/jira/browse/SOLR-853
>
> --
> Regards,
> Shalin Shekhar Mangar.

--
--Noble Paul
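To make the partitioned-result-set idea from Amit's mail above a bit more concrete: one way to split the work outside of DIH, assuming a numeric key whose MIN/MAX are cheap to fetch, is to divide the id range into slices and give each slice to a worker thread that runs its own range query. RangeProcessor and ParallelRangeLoader are hypothetical names, not existing DIH or Solr classes.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: partition [minId, maxId] into slices and hand each
// slice to a worker that runs its own "WHERE id BETWEEN ? AND ?" query.
public class ParallelRangeLoader {

    // Placeholder for the code that queries one id range and indexes the rows.
    public interface RangeProcessor {
        void process(long fromId, long toId) throws Exception;
    }

    public static void load(long minId, long maxId, int threads,
                            final RangeProcessor processor) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long sliceSize = (maxId - minId) / threads + 1;

        for (long from = minId; from <= maxId; from += sliceSize) {
            final long fromId = from;
            final long toId = Math.min(from + sliceSize - 1, maxId);
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        processor.process(fromId, toId);   // one slice per worker
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

As Shalin notes above, whether this buys anything depends on whether the database or the indexing side is the actual bottleneck.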
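And for the remote-posting discussion (SOLR-853): a minimal SolrJ sketch of pushing documents to a remote Solr over HTTP, assuming the SolrJ API current around Solr 1.3/1.4 (CommonsHttpSolrServer); the URL and field names are examples only, not anything defined by DIH.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical loader: instead of writing through DirectUpdateHandler2 on a
// local core, convert rows to SolrInputDocuments and POST them to a remote master.
public class RemotePoster {
    public static void main(String[] args) throws Exception {
        // URL of the remote Solr master (adjust for your setup).
        SolrServer server = new CommonsHttpSolrServer("http://remote-master:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");          // assumes an "id" field in schema.xml
        doc.addField("name", "example document");

        server.add(doc);     // documents can also be buffered and added in batches
        server.commit();     // make the added documents visible to searchers
    }
}

In practice one would buffer documents and add them in batches, and commit far less frequently than once per document.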