Re: How to do parallel indexing on files (not on HDFS)

Rahul Singh Thu, 24 May 2018 08:23:56 -0700

Resending to the list to help more people.

This is an architectural pattern for a problem that comes up over and over
again. The queue can be anything: a table in a database, or even a Solr
collection.
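
For illustration, here is a minimal sketch of the pattern in Python using only
the standard library. It is not the code from any project mentioned in this
thread. The in-memory queue stands in for Rabbit/Kafka/a table, and
index_file() is a hypothetical placeholder where the real parsing and the call
to the search engine would go; all function names are made up for this example.

```python
import os
import queue
import threading


def enumerate_files(root):
    """Map step: walk the directory tree and yield every file path."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def index_file(path):
    """Hypothetical placeholder: parse the file and send it to the index.

    Here it only returns the file size so the sketch runs on its own.
    """
    return os.path.getsize(path)


def worker(tasks, results, lock):
    """Reduce step: pop paths off the queue until a sentinel arrives."""
    while True:
        path = tasks.get()
        if path is None:  # sentinel: no more work for this worker
            tasks.task_done()
            break
        try:
            outcome = index_file(path)
        except OSError as exc:
            outcome = exc  # record the failure instead of killing the worker
        with lock:
            results[path] = outcome
        tasks.task_done()


def parallel_index(root, num_workers=4):
    """Start the workers, enqueue every file under root, wait for completion."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for path in enumerate_files(root):
        tasks.put(path)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Calling parallel_index("/some/folder", num_workers=8) returns a dict mapping
each path to whatever index_file() produced. With an external queue the
enumerating side and the workers can run as separate processes or containers,
which is the point of persisting the map.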

And yes, I have implemented it. I did it in C# before, using a SQL Server
table-based queue (http://github.com/appleseed/search-stack), and then made the
indexer able to write to Lucene, Elastic, or Solr depending on config. I'm not
actively maintaining this right now, but will consider porting it to a Kafka +
Spark + Kafka Connect based system when I find time.
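
To give a rough idea of what a table-based queue can look like, here is a
small Python sketch. It is an assumption-laden illustration, not the
search-stack implementation: it uses SQLite instead of SQL Server so it runs
anywhere, and the table name, column names, and status values are invented for
the example.

```python
import sqlite3


def create_queue(conn):
    """Create the (hypothetical) queue table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS index_queue ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " path TEXT NOT NULL UNIQUE,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()


def enqueue(conn, paths):
    """Persist the file list; paths already queued are silently skipped."""
    conn.executemany(
        "INSERT OR IGNORE INTO index_queue (path) VALUES (?)",
        [(p,) for p in paths],
    )
    conn.commit()


def claim_next(conn):
    """Claim the oldest pending row and return (id, path), or None if empty."""
    while True:
        row = conn.execute(
            "SELECT id, path FROM index_queue"
            " WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with conn:  # commits on success, rolls back on error
            updated = conn.execute(
                "UPDATE index_queue SET status = 'in_progress'"
                " WHERE id = ? AND status = 'pending'",
                (row[0],),
            )
        if updated.rowcount == 1:
            return row
        # Another worker claimed this row first; try the next one.


def mark_done(conn, item_id):
    """Record that the item was indexed successfully."""
    with conn:
        conn.execute(
            "UPDATE index_queue SET status = 'done' WHERE id = ?", (item_id,)
        )
```

Each worker loops on claim_next(), indexes the file, then calls mark_done().
The guarded UPDATE (status must still be 'pending') is what keeps two workers
from processing the same file, and rows stuck in 'in_progress' show which items
to retry after a crash.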

In Kafka, however, you have a lot of potential with Kafka Connect. Here is an
example using Cassandra..
But the premise is the same: Kafka Connect has libraries of connectors for
different sources / sinks. It may not work for files, but for pure raw data,
Kafka Connect is good.

Here’s a project that may guide you best.

http://saumitra.me/blog/tweet-search-and-analysis-with-kafka-solr-cassandra/

I don't know where this guy's code went, but the content is there with code
samples.

--

On May 23, 2018, 8:37 PM -0500, Raymond Xie <xie3208...@gmail.com>, wrote:
> Thank you Rahul, although that's very high level.
>
> No offense intended, but do you have a successful implementation, or is it
> just an unproven idea? I have never used Rabbit or Kafka before, but would be
> very interested in knowing more detail on the Kafka idea, as Kafka is
> available in my environment.
>
> Thank you again and look forward to hearing more from you or anyone in this 
> Solr community.
>
>
> ------------------------------------------------
> Sincerely yours,
>
>
> Raymond
>
> > On Wed, May 23, 2018 at 8:15 AM, Rahul Singh <rahul.xavier.si...@gmail.com> 
> > wrote:
> > > Enumerate the file locations (map), put them in a queue like Rabbit or
> > > Kafka (persist the map), and have a bunch of threads, workers, containers,
> > > whatever, pop off the queue and process each item (reduce).
> > >
> > >
> > > --
> > > Rahul Singh
> > > rahul.si...@anant.us
> > >
> > > Anant Corporation
> > >
> > > On May 20, 2018, 7:24 AM -0400, Raymond Xie <xie3208...@gmail.com>, wrote:
> > > > I know how to index from the file system, such as a single file or
> > > > folder, but how do I do that in parallel? The data I need to index is
> > > > of huge volume and can't be put on HDFS.
> > > >
> > > > Thank you
> > > >
> > > > *------------------------------------------------*
> > > > *Sincerely yours,*
> > > >
> > > >
> > > > *Raymond*
>
