Re: How to do parallel indexing on files (not on HDFS)

2018-05-24 Thread Rahul Singh
Right, that’s why you need a place to persist the task list / graph. If you use a table, you can set a “processed” / “unprocessed” value on each row; if you use a queue, each item is delivered only once. Otherwise you have to check the indexed date from Solr, and waste a Solr call. -- Rahul Singh rahul.si...@anant.us An
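A minimal sketch of the table-based task list described above, assuming SQLite as a stand-in for whatever database is available. The table name `tasks`, the helper names, and the intermediate "processing" status are illustrative assumptions, not something from the thread.

```python
import sqlite3


def open_task_table(db_path=":memory:"):
    """Open (or create) the task table. Every row starts as 'unprocessed'."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "path TEXT PRIMARY KEY, "
        "status TEXT NOT NULL DEFAULT 'unprocessed')"
    )
    conn.commit()
    return conn


def add_tasks(conn, paths):
    """Persist file paths; paths already in the table are left untouched."""
    conn.executemany(
        "INSERT OR IGNORE INTO tasks (path) VALUES (?)",
        [(p,) for p in paths],
    )
    conn.commit()


def claim_next(conn):
    """Claim one unprocessed path, or return None when none are left.

    The UPDATE only succeeds if the row is still 'unprocessed', so a
    worker that loses the race simply tries the next row.
    """
    while True:
        row = conn.execute(
            "SELECT path FROM tasks WHERE status = 'unprocessed' "
            "ORDER BY path LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with conn:  # commits the UPDATE on success
            cur = conn.execute(
                "UPDATE tasks SET status = 'processing' "
                "WHERE path = ? AND status = 'unprocessed'",
                (row[0],),
            )
        if cur.rowcount == 1:
            return row[0]


def mark_processed(conn, path):
    """Record that a file has been indexed, so it is never picked up again."""
    with conn:
        conn.execute(
            "UPDATE tasks SET status = 'processed' WHERE path = ?", (path,)
        )
```

Each worker would loop on `claim_next`, index the file, then call `mark_processed`. Because the status lives in the table, no Solr call is needed to find out whether a file was already indexed.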

Re: How to do parallel indexing on files (not on HDFS)

2018-05-24 Thread Adhyan Arizki
You will still need to devise a way to partition the data source even if you are scheduling multiple jobs; otherwise, you might end up digesting the same data again and again. On Fri, May 25, 2018 at 12:46 AM, Raymond Xie wrote: > Thank you all for the suggestions. I'm now tending to not using a

Re: How to do parallel indexing on files (not on HDFS)

2018-05-24 Thread Raymond Xie
Thank you all for the suggestions. I'm now leaning toward not using traditional parallel indexing. My data are JSON files with metadata extracted from raw data received and archived into our data server cluster. Those data come in various flows and reside in their respective folders, splitting them m

Re: How to do parallel indexing on files (not on HDFS)

2018-05-24 Thread Rahul Singh
Resending to the list to help more people. This is an architectural pattern for solving an issue that arises over and over again. The queue can be anything: a table in a database, even a collection in Solr. And yes, I have implemented it. I did it in C# before using a SQL Server table based qu

Re: How to do parallel indexing on files (not on HDFS)

2018-05-24 Thread Adhyan Arizki
Raymond, running parallel indexing might be trickier than it looks if the scale is big. For instance, you can easily partition your data (let's say into 5 chunks) and run 5 processes to index them. However, you will need to watch for choke points in the pipeline along the way (e.g. I/O of da
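One way the partitioning step could look, as a sketch only: each file path is hashed to a chunk number, so every file lands in exactly one chunk and no two processes index the same file. The function names and the choice of MD5 (used here only as a stable, non-cryptographic hash) are assumptions for illustration.

```python
import hashlib


def chunk_of(path, num_chunks=5):
    """Map a file path to a chunk index in [0, num_chunks).

    hashlib is used instead of the built-in hash(), whose value for
    strings changes between Python processes.
    """
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_chunks


def partition(paths, num_chunks=5):
    """Split paths into num_chunks disjoint lists, one per indexing process."""
    chunks = [[] for _ in range(num_chunks)]
    for path in paths:
        chunks[chunk_of(path, num_chunks)].append(path)
    return chunks
```

Process number k would then index only the files in `partition(paths)[k]`. This sketch says nothing about the pipeline bottlenecks mentioned above; it only guarantees that the chunks do not overlap.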

Re: How to do parallel indexing on files (not on HDFS)

2018-05-23 Thread Raymond Xie
Thank you Rahul, although that's very high level. No offense intended, but do you have a successful implementation, or is it just an unproven idea? I have never used Rabbit or Kafka before, but would be very interested in knowing more detail on the Kafka idea, as Kafka is available in my environment. Thank you

Re: How to do parallel indexing on files (not on HDFS)

2018-05-23 Thread Rahul Singh
Enumerate the file locations (map), put them in a queue like Rabbit or Kafka (persist the map), and have a bunch of threads, workers, containers, or whatever pop items off the queue and process each one (reduce). -- Rahul Singh rahul.si...@anant.us Anant Corporation On May 20, 2018, 7:24 AM -0400, Raymond
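The pattern above can be sketched with Python's standard library. Note the assumptions: the in-memory `queue.Queue` stands in for RabbitMQ or Kafka and does not persist anything, the `.json` filter is a guess at the file layout, and `handle` is a hypothetical callback that would send one file to Solr.

```python
import os
import queue
import threading


def enumerate_files(root):
    """Map step: yield the path of every .json file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".json"):
                yield os.path.join(dirpath, name)


def run_workers(paths, handle, num_workers=4):
    """Reduce step: worker threads pop paths off a shared queue.

    Each path is taken off the queue by exactly one worker, so no file
    is handled twice. Returns the list of paths that were handled.
    """
    tasks = queue.Queue()
    for path in paths:
        tasks.put(path)

    done = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                path = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is finished
            handle(path)  # hypothetical: index this file
            with lock:
                done.append(path)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

A call such as `run_workers(enumerate_files("/data/archive"), index_file)` would tie the two steps together, where the root path and `index_file` are placeholders. With a real broker, the enumeration would publish messages once and the workers would run as separate consumers.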