Resending to the list to help more people. This is an architectural pattern for solving the same issue that arises over and over again. The queue can be anything: a table in a database, or even a Solr collection.
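To make the pattern concrete, here is a minimal sketch in Python, under some loud assumptions: an in-process `queue.Queue` stands in for RabbitMQ, Kafka, or a database table, worker threads stand in for separate processes or containers, and `index_file` is a hypothetical placeholder for whatever call actually sends a document to the index. None of these names come from the project linked in this thread.

```python
import os
import queue
import threading


def enumerate_files(root):
    """Map step: yield the path of every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def index_file(path):
    """Hypothetical placeholder for the real indexing call.

    A real worker would parse the file and send it to the search index.
    This one only returns the path and the file size in bytes.
    """
    return path, os.path.getsize(path)


def worker(work_queue, results, lock):
    """Reduce step: pop paths off the queue until a None sentinel arrives."""
    while True:
        path = work_queue.get()
        if path is None:
            break
        doc = index_file(path)
        with lock:
            results.append(doc)


def index_tree(root, num_workers=4):
    """Enumerate files, queue them, and process them with several workers."""
    work_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(work_queue, results, lock))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    # Persist the map: every file location becomes one queue item.
    for path in enumerate_files(root):
        work_queue.put(path)
    # One sentinel per worker tells it that no more work is coming.
    for _ in threads:
        work_queue.put(None)
    for thread in threads:
        thread.join()
    return results
```

Calling `index_tree("/some/folder", num_workers=8)` returns one `(path, size)` pair per file. With a durable queue in place of `queue.Queue`, the enumerating side and the workers can run on different machines, and a crashed worker does not lose the list of files still to be processed.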
And yes, I have implemented it. I did it previously in C#, using a SQL Server table-based queue (http://github.com/appleseed/search-stack), and then made the indexer able to write to Lucene, Elastic, or Solr depending on config. I'm not actively maintaining it right now, but I will consider porting it to a Kafka + Spark + Kafka Connect based system when I find time.

In Kafka, however, you have a lot of potential with Kafka Connect. The example below uses Cassandra, but the premise is the same: Kafka Connect has libraries of connectors for different sources and sinks. It may not work for files, but for pure raw data Kafka Connect is good.

Here's a project that may guide you best:

http://saumitra.me/blog/tweet-search-and-analysis-with-kafka-solr-cassandra/

I don't know where this guy's code went, but the content is there with code samples.

--
On May 23, 2018, 8:37 PM -0500, Raymond Xie <xie3208...@gmail.com>, wrote:
> Thank you Rahul despite that's very high level.
>
> With no offense, do you have a successful implementation or it is just your
> unproven idea? I never used Rabbit nor Kafka before but would be very
> interested in knowing more detail on the Kafka idea as Kafka is available in
> my environment.
>
> Thank you again and look forward to hearing more from you or anyone in this
> Solr community.
>
>
> ------------------------------------------------
> Sincerely yours,
>
>
> Raymond
>
>
> On Wed, May 23, 2018 at 8:15 AM, Rahul Singh <rahul.xavier.si...@gmail.com>
> > wrote:
> > > Enumerate the file locations (map) , put them in a queue like rabbit or
> > > Kafka (Persist the map), have a bunch of threads , workers, containers,
> > > whatever pop off the queue , process the item (reduce).
> > >
> > >
> > > --
> > > Rahul Singh
> > > rahul.si...@anant.us
> > >
> > > Anant Corporation
> > >
> > > On May 20, 2018, 7:24 AM -0400, Raymond Xie <xie3208...@gmail.com>, wrote:
> > > > I know how to do indexing on file system like single file or folder, but
> > > > how do I do that in a parallel way? The data I need to index is of huge
> > > > volume and can't be put on HDFS.
> > > >
> > > > Thank you
> > > >
> > > > *------------------------------------------------*
> > > > *Sincerely yours,*
> > > >
> > > >
> > > > *Raymond*
>