Hello,

I'm more than happy to contribute to this effort as well. We are still on Solr 3.5 and never got the Solr 'threads' parameter working properly. I've heard much of this was fixed in 3.6, but that it is still a bit buggy and was deprecated in later versions. Full support in 4.x is a major wish-list item, and I would be willing to sacrifice a few weekends to help write it.

Related to this, it's interesting to see this topic open, because I just started experimenting with simulating a multi-threaded import via the DIH using the following approach, and it appears to be working fine, with one caveat.

Here are my steps:

I created a series of similar entities, partitioning the data I'm targeting by a logical range (i.e. WHERE somefield BETWEEN 'SOME VALUE' AND 'SOME VALUE'). I have a few of these, but depending on your data, you'll need to experiment. (You'll need to be careful not to bring your database to its knees!)

In solrconfig.xml, I created a corresponding Data Import Handler for each of these entities.

When I initiate an import, I call each one, similar to the below (obviously I've stripped out my server and naming conventions):

http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&commit=true
http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2]&commit=true
...etc

The import seems to run fine, with SIGNIFICANT performance gains. The only caveat I haven't figured out yet is that one of the threads doesn't appear to commit its data (although it states that it did). I have turned off auto-commit as well, but I'm still playing with it.

Either way, an out-of-the-box solution would be much preferred.

Cheers!
Mike
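For anyone who wants to script the calls described above rather than issue them by hand, here is a minimal sketch in Python. It is only an illustration, not something from this thread: the function names (`build_import_url`, `http_get`, `start_imports`) are made up, and the base URL, handler names and entity names you pass in are whatever your own setup uses. It sends one full-import request per handler/entity pair from a small thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen


def build_import_url(base_url, handler, entity):
    """Build a full-import URL for one handler/entity pair (names are yours)."""
    query = urlencode({"command": "full-import", "entity": entity, "commit": "true"})
    return f"{base_url.rstrip('/')}/{handler.lstrip('/')}?{query}"


def http_get(url):
    """Default fetcher: issue the GET and return the HTTP status code."""
    with urlopen(url) as response:
        return response.status


def start_imports(base_url, targets, fetch=http_get):
    """Send one full-import request per (handler, entity) pair in parallel.

    Returns a dict mapping each URL to whatever `fetch` returned for it.
    `fetch` is injectable so the logic can be exercised without a server.
    """
    urls = [build_import_url(base_url, handler, entity) for handler, entity in targets]
    # One worker per request; max(1, ...) avoids an error on an empty list.
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        results = list(pool.map(fetch, urls))
    return dict(zip(urls, results))
```

Usage would look something like `start_imports("http://[server]/[solrappname]/[corename]", [("dataimport1", "entity1"), ("dataimport2", "entity2")])`. As far as I know, a DIH full-import request returns as soon as the import has started, so a script like this only kicks the imports off; progress still has to be checked with `command=status` on each handler. Also note that `clean` defaults to true for full-import, so it may be worth checking whether you want `clean=false` when several imports feed the same core.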
From: Mikhail Khludnev <mkhlud...@griddynamics.com>
To: solr-user <solr-user@lucene.apache.org>
Sent: Friday, June 14, 2013 9:15 AM
Subject: Re: The 'threads' parameter in DIH - SOLR 4.3.0

Hello,

Most of the time, users end up coding a multithreaded SolrJ indexer, which I consider a sad thing. As a contributor to the 3.x fix, I want to share my view of the problem. While doing that work I realized that the join operation itself is too hard, perhaps even impossible, to make concurrent. I propose adding concurrency to the outbound and inbound streams instead. My plan is:

1. Add threads to the outbound flow: https://issues.apache.org/jira/browse/SOLR-3585
   It allows DIH to avoid waiting for Solr. I mostly like that code, but recently I realized that it implements the ConcurrentUpdateSolrServer algorithm. Looking forward, I would prefer to unify some core concurrent code between them, or to use something like CUSS inside DIH's SolrWriter.

2. The next problem we faced is SQLEntityProcessor. It has two modes: one gets miserable performance due to the N+1 problem, and the cached version is not production-capable with the default heap cache. Our proposal for it is https://issues.apache.org/jira/browse/SOLR-4799
   Unfortunately I have no time to polish the patch.

3. After that, the only thing DIH waits for is JDBC. It can easily be sped up by implementing a DataSource wrapper with a producer thread and a bounded queue as a buffer.

If we complete this plan, we will never need to code SolrJ indexers. My particular question to you is: what do you need to speed up?

On Thu, Jun 13, 2013 at 11:01 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 6/13/2013 12:08 PM, bbarani wrote:
>
>> I see that the threads parameter has been removed from DIH in all versions
>> starting with SOLR 4.x. Can someone let me know the best way to initiate
>> indexing in multi-threaded mode when using DIH now? Is there a way to do
>> that?
>
> That parameter was removed because it didn't work right, and there was no
> apparent way to fix it. The change that went into a later 3.6 version was
> a bandaid, not a fix. I don't know all the details.
>
> There's no way to get multithreading with DIH directly, but you can do it
> indirectly:
>
> Create multiple request handlers with different names, such as
> /dataimport1, /dataimport2, etc. Configure each handler with settings that
> will pull part of your data source. Start them so they run concurrently.
>
> Depending on your environment, it may be easier to just write a
> multi-threaded indexing application using the Solr API for your language of
> choice.
>
> Thanks,
> Shawn

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
<mkhlud...@griddynamics.com>
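
Step 3 of the plan quoted above (a wrapper with a producer thread and a bounded queue as a buffer) is not spelled out in the thread. The general pattern can be sketched as follows, in Python rather than as actual DIH code. Everything here is an assumption made for illustration: `prefetch` and `maxsize` are made-up names, and a plain iterable stands in for the JDBC row stream.

```python
import queue
import threading


def prefetch(rows, maxsize=1000):
    """Yield items from `rows`, reading ahead on a background thread.

    The bounded queue caps how far the producer may run ahead, so memory
    use stays limited while the consumer rarely waits on the source.
    """
    buf = queue.Queue(maxsize=maxsize)

    def produce():
        try:
            for row in rows:
                buf.put(("row", row))      # blocks while the buffer is full
            buf.put(("done", None))
        except Exception as exc:           # pass source failures to the consumer
            buf.put(("error", exc))

    # Daemon thread: if the consumer stops early, the blocked producer
    # does not keep the process alive.
    threading.Thread(target=produce, daemon=True).start()

    while True:
        kind, value = buf.get()
        if kind == "done":
            return
        if kind == "error":
            raise value
        yield value
```

A consumer would simply iterate, e.g. `for row in prefetch(read_rows()): ...`, where `read_rows` is whatever produces the rows. Order is preserved because there is a single producer and a FIFO queue.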