There is already an issue open to write to the index in a separate thread: https://issues.apache.org/jira/browse/SOLR-1089
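The general shape of that idea -- produce documents in one thread and hand them off to a dedicated indexing thread -- looks roughly like the sketch below. This is not the SOLR-1089 patch itself; the Indexer interface and the plain Object document type are placeholders for whatever actually performs the analysis and the write.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch only: the producer (e.g. an entity processor) submits
// documents, while a separate thread performs the expensive
// analysis/tokenization and index write.
public class SeparateThreadWriter {

    // Poison pill telling the consumer that no more documents are coming.
    private static final Object EOF = new Object();

    private final BlockingQueue<Object> queue = new ArrayBlockingQueue<Object>(1000);

    // Placeholder for whatever actually writes a document to the index.
    public interface Indexer {
        void index(Object doc) throws Exception;
    }

    public Thread startConsumer(final Indexer indexer) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        Object doc = queue.take();   // blocks until a doc is available
                        if (doc == EOF) break;       // producer is done
                        indexer.index(doc);          // analysis/indexing happens here
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        t.start();
        return t;
    }

    // Called by the producer for each row/document.
    public void submit(Object doc) throws InterruptedException {
        queue.put(doc);
    }

    // Called once the producer has no more documents.
    public void finish() throws InterruptedException {
        queue.put(EOF);
    }
}

A bounded queue keeps a fast producer from running ahead of the indexer and exhausting memory.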
On Tue, Apr 28, 2009 at 4:15 AM, Shalin Shekhar Mangar
<shalinman...@gmail.com> wrote:
> On Tue, Apr 28, 2009 at 3:43 AM, Amit Nithian <anith...@gmail.com> wrote:
>
>> All,
>> I have a few questions regarding the data import handler. We have some
>> pretty gnarly SQL queries to load our indices, and our current loader
>> implementation is extremely fragile. I am looking to migrate over to the
>> DIH; however, I am looking to use SolrJ + EmbeddedSolr + some custom
>> stuff to remotely load the indices so that my index loader and main
>> search engine are separated.
>
> Currently, if you want to use DIH, the Solr master doubles up as the
> index loader as well.
>
>> Currently, unless I am missing something, the data gathering from the
>> entity and the data processing (i.e. conversion to a Solr document) are
>> done sequentially, and I was looking to make this execute in parallel so
>> that I can have multiple threads processing different parts of the
>> result set and loading documents into Solr. Secondly, I need to create
>> temporary tables to store the results of a few queries and use them
>> later for inner joins, and I was wondering how best to go about this.
>>
>> I am thinking to add support in DIH for the following:
>> 1) Temporary tables (maybe call them temporary entities)? -- Specific
>> only to SQL, unless it can be generalized to other sources.
>
> Pretty specific to DBs. However, isn't this something that can be done
> in your database with views?
>
>> 2) Parallel support
>
> Parallelizing the import of root entities might be the easiest to
> attempt. There's also an issue open to write to Solr
> (tokenization/analysis) in a separate thread. Look at
> https://issues.apache.org/jira/browse/SOLR-1089
>
> We actually wrote a multi-threaded DIH during the initial iterations,
> but we discarded it because we found that the bottleneck was usually the
> database (too many queries) or Lucene indexing itself (analysis,
> tokenization, etc.). The improvement was ~10%, but it made the code
> substantially more complex.
>
> The only scenario in which it helped a lot was importing from HTTP or a
> remote database (slow networks). But if you think it can help in your
> scenario, I'd say go for it.
>
>> - Including some mechanism to get the number of records (whether it be
>> a count or MAX(custom_id) - MIN(custom_id))
>
> Not sure what you mean here.
>
>> 3) Support in DIH or Solr to post documents to a remote index (i.e.
>> create a new UpdateHandler instead of DirectUpdateHandler2).
>
> SolrJ integration would be helpful to many, I think. There's an issue
> open. Look at https://issues.apache.org/jira/browse/SOLR-853
>
> --
> Regards,
> Shalin Shekhar Mangar.

--
--Noble Paul
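To make the partitioned-result-set idea from Amit's mail above a bit more concrete: one way to split the work outside of DIH, assuming a numeric key whose MIN/MAX are cheap to fetch, is to divide the id range into slices and give each slice to a worker thread that runs its own range query. RangeProcessor and ParallelRangeLoader are hypothetical names, not existing DIH or Solr classes.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: partition [minId, maxId] into slices and hand each
// slice to a worker that runs its own "WHERE id BETWEEN ? AND ?" query.
public class ParallelRangeLoader {

    // Placeholder for the code that queries one id range and indexes the rows.
    public interface RangeProcessor {
        void process(long fromId, long toId) throws Exception;
    }

    public static void load(long minId, long maxId, int threads,
                            final RangeProcessor processor) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long sliceSize = (maxId - minId) / threads + 1;

        for (long from = minId; from <= maxId; from += sliceSize) {
            final long fromId = from;
            final long toId = Math.min(from + sliceSize - 1, maxId);
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        processor.process(fromId, toId);   // one slice per worker
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

As Shalin notes above, whether this buys anything depends on whether the database or the indexing side is the actual bottleneck.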
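And for the remote-posting discussion (SOLR-853): a minimal SolrJ sketch of pushing documents to a remote Solr over HTTP, assuming the SolrJ API current around Solr 1.3/1.4 (CommonsHttpSolrServer); the URL and field names are examples only, not anything defined by DIH.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical loader: instead of writing through DirectUpdateHandler2 on a
// local core, convert rows to SolrInputDocuments and POST them to a remote master.
public class RemotePoster {
    public static void main(String[] args) throws Exception {
        // URL of the remote Solr master (adjust for your setup).
        SolrServer server = new CommonsHttpSolrServer("http://remote-master:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");          // assumes an "id" field in schema.xml
        doc.addField("name", "example document");

        server.add(doc);     // documents can also be buffered and added in batches
        server.commit();     // make the added documents visible to searchers
    }
}

In practice one would buffer documents and add them in batches, and commit far less frequently than once per document.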