Hi Mikhail,

Could you please elaborate on what you mean?
My understanding is that DIH has no multi-threading support and, for some 
reason, will not get it. Am I correct?

Regarding Apache Flume, how can it be a DIH replacement? Can I index rich 
documents on my disk using Flume? Can I fetch documents from 
Wikipedia, JIRA, Twitter, Dropbox, RDBMS, RSS, and the file system with it?

Ahmet



On Monday, February 17, 2014 10:41 AM, Mikhail Khludnev 
<mkhlud...@griddynamics.com> wrote:
On Sat, Feb 15, 2014 at 1:07 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 2/14/2014 10:45 PM, William Bell wrote:
> > On virtual cores the DIH handler is really slow. On a 12 core box it only
> > uses 1 core while indexing.
> >
> > Does anyone know how to do Java threading from a SQL query into Solr?
> > Examples?
> >
> > I can use SolrJ to do it, or I might be able to modify DIH to enable
> > threading.
> >
> > At some point in 3.x threading was enabled in DIH, but it was removed
> > since people were having issues with it (we never did).
>
> If you know how to fix DIH so it can do multiple indexing threads
> safely, please open an issue and upload a patch.
>
Please! Don't do it. Never again!
https://issues.apache.org/jira/browse/SOLR-3011

As far as I understand, the general idea is to find a DIH successor:
https://issues.apache.org/jira/browse/SOLR-4799?focusedCommentId=13738424&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13738424



>
> I'm still using DIH for full rebuilds, but I'd actually like to replace
> it with a rebuild routine written in SolrJ.  I currently achieve decent
> speed by running DIH on all my shards at the same time.
>
> I do use SolrJ for once-a-minute index maintenance, but the code that
> I've written to pull data out of SQL and write it to Solr is not able to
> index millions of documents in a single thread as fast as DIH does.  I
> have been building a multithreaded design in my head, but I haven't had
> a chance to write real code and see whether it's actually a good design.
>
> For me, the bottleneck is definitely Solr, not the database.  I recently
> wrote a test program that uses my current SolrJ indexing method.  If I
> skip the "server.add(docs)" line, it can read all 91 million docs from
> the database and build SolrInputDocument objects for them in 2.5 hours
> or less, all with a single thread.  When I do a real rebuild with DIH,
> it takes a little more than 4.5 hours -- and that is inherently
> multithreaded, because it's doing all the shards simultaneously.  I have
> no idea how long it would take with a single-threaded SolrJ program.
>
> Thanks,
> Shawn
>
>
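For the multithreaded SQL-to-Solr idea discussed above, here is a minimal sketch of one possible design, not a tested indexer. One thread reads rows and groups them into batches, and a small pool of workers sends the batches. `BatchSink`, `ParallelIndexerSketch`, and the `threads` and `batchSize` parameters are hypothetical names used only for illustration. In real code the sink would wrap something like the "server.add(docs)" call Shawn mentions, and `source` would iterate over the SQL result set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

final class ParallelIndexerSketch {

    /** Hypothetical stand-in for the Solr client; not part of SolrJ. */
    interface BatchSink<T> {
        void send(List<T> batch) throws Exception;
    }

    /** Reads source on the calling thread, sends batches on worker threads, returns docs sent. */
    static <T> long index(Iterable<T> source, BatchSink<T> sink,
                          int threads, int batchSize) throws Exception {
        // Bounded queue: the reader blocks instead of buffering the whole table in memory.
        BlockingQueue<List<T>> queue = new ArrayBlockingQueue<>(threads * 2);
        List<T> poison = new ArrayList<>();            // sentinel, compared by identity
        AtomicLong sent = new AtomicLong();
        AtomicReference<Exception> failure = new AtomicReference<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                while (true) {
                    List<T> batch = queue.take();
                    if (batch == poison) {
                        return null;                   // no more work for this worker
                    }
                    if (failure.get() != null) {
                        continue;                      // keep draining so the reader never blocks
                    }
                    try {
                        sink.send(batch);
                        sent.addAndGet(batch.size());
                    } catch (Exception e) {
                        failure.compareAndSet(null, e); // remember the first failure only
                    }
                }
            });
        }

        try {
            List<T> batch = new ArrayList<>(batchSize);
            for (T doc : source) {
                batch.add(doc);
                if (batch.size() == batchSize) {
                    queue.put(batch);
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                queue.put(batch);                      // final partial batch
            }
        } finally {
            for (int i = 0; i < threads; i++) {
                queue.put(poison);                     // one sentinel per worker
            }
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        }

        if (failure.get() != null) {
            throw failure.get();
        }
        return sent.get();
    }
}
```

Whether this actually beats DIH depends on where the bottleneck is. If Solr is the limit, as Shawn reports, extra sender threads only help as long as Solr can absorb concurrent updates.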


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
