Re: DIH

Alexandre Rafalovitch Mon, 17 Feb 2014 01:19:14 -0800

I haven't tried Apache Flume but the manual seems to suggest 'yes' to
a large number of your checklist items:
http://flume.apache.org/FlumeUserGuide.html


When you say 'rich document' indexing, the keyword you are looking for
is (Apache) Tika, as that's what is actually doing the job under the
covers.

Whether it can meet your specific requirements is a question only you
can answer for yourself, of course. When you do, maybe let us know, so
we can learn too. :-)

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Mon, Feb 17, 2014 at 8:11 PM, Ahmet Arslan <[email protected]> wrote:
> Hi Mikhail,
>
> Can you please elaborate on what you mean?
> My understanding is that there is no multi-threading support in DIH and, for
> some reason, there won't be. Am I correct?
>
> Regarding Apache Flume, how can it be a DIH replacement? Can I index rich
> documents on my disk using Flume? Can I fetch documents from
> Wikipedia, JIRA, Twitter, Dropbox, RDBMS, RSS, and the file system by using it?
>
> Ahmet
>
>
>
> On Monday, February 17, 2014 10:41 AM, Mikhail Khludnev 
> <[email protected]> wrote:
> On Sat, Feb 15, 2014 at 1:07 PM, Shawn Heisey <[email protected]> wrote:
>
>> On 2/14/2014 10:45 PM, William Bell wrote:
>> > On virtual cores the DIH handler is really slow. On a 12 core box it only
>> > uses 1 core while indexing.
>> >
>> > Does anyone know how to do Java threading from a SQL query into Solr?
>> > Examples?
>> >
>> > I can use SolrJ to do it, or I might be able to modify DIH to enable
>> > threading.
>> >
>> > At some point in 3.x threading was enabled in DIH, but it was removed
>> > since people were having issues with it (we never did).
>>
>> If you know how to fix DIH so it can do multiple indexing threads
>> safely, please open an issue and upload a patch.
>>
> Please! Don't do it. Never again!
> https://issues.apache.org/jira/browse/SOLR-3011
>
> As far as I understand the general idea is to find the DIH successor
> https://issues.apache.org/jira/browse/SOLR-4799?focusedCommentId=13738424&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13738424
>
>
>
>>
>> I'm still using DIH for full rebuilds, but I'd actually like to replace
>> it with a rebuild routine written in SolrJ.  I currently achieve decent
>> speed by running DIH on all my shards at the same time.
>>
>> I do use SolrJ for once-a-minute index maintenance, but the code that
>> I've written to pull data out of SQL and write it to Solr is not able to
>> index millions of documents in a single thread as fast as DIH does.  I
>> have been building a multithreaded design in my head, but I haven't had
>> a chance to write real code and see whether it's actually a good design.
>>
>> For me, the bottleneck is definitely Solr, not the database.  I recently
>> wrote a test program that uses my current SolrJ indexing method.  If I
>> skip the "server.add(docs)" line, it can read all 91 million docs from
>> the database and build SolrInputDocument objects for them in 2.5 hours
>> or less, all with a single thread.  When I do a real rebuild with DIH,
>> it takes a little more than 4.5 hours -- and that is inherently
>> multithreaded, because it's doing all the shards simultaneously.  I have
>> no idea how long it would take with a single-threaded SolrJ program.
>>
>> Thanks,
>> Shawn
>>
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> <[email protected]>
>
