Right,
That's why you need a place to persist the task list / graph. If you use a
table, you can set a "processed" / "unprocessed" value … or a queue, then it's
delivered only once .. otherwise you have to check the indexed date from Solr
and waste a Solr call.
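For the table variant, a minimal sketch in Python with SQLite (the table and
column names are made up, and the single-statement claim relies on RETURNING,
which needs SQLite 3.35+):

import sqlite3

conn = sqlite3.connect("index_tasks.db")
conn.execute("""CREATE TABLE IF NOT EXISTS tasks (
    path   TEXT PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'unprocessed')""")
conn.commit()

def claim_next(conn):
    # One statement both claims and returns the row, so two workers
    # can't grab the same file. RETURNING needs SQLite 3.35+.
    with conn:
        row = conn.execute(
            "UPDATE tasks SET status = 'processing' WHERE path = "
            "(SELECT path FROM tasks WHERE status = 'unprocessed' LIMIT 1) "
            "RETURNING path").fetchone()
    return row[0] if row else None

def mark_processed(conn, path):
    with conn:
        conn.execute("UPDATE tasks SET status = 'processed' WHERE path = ?",
                     (path,))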
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
You will still need to devise a way to partition the data source even if
you are scheduling multiple jobs; otherwise, you might end up digesting the
same data again and again.
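One way to do that: derive each file's partition from a stable hash of its
path, so every scheduled job deterministically owns a disjoint slice. A
sketch, assuming file paths are the unit of work:

import hashlib

def partition_of(path, num_workers):
    # Stable across processes (the built-in hash() is randomized per run)
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers

def my_files(all_paths, worker_id, num_workers):
    # Each scheduled job keeps only its own slice, so no file is seen twice
    return [p for p in all_paths if partition_of(p, num_workers) == worker_id]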
On Fri, May 25, 2018 at 12:46 AM, Raymond Xie wrote:
> Thank you all for the suggestions. I'm now tending to not use a …
Thank you all for the suggestions. I'm now tending to not use traditional
parallel indexing. My data are JSON files with metadata extracted from raw
data received and archived into our data server cluster. Those data come in
various flows and reside in their respective folders; splitting them …
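If each flow already lands in its own folder, the folders themselves can
serve as the split boundaries. A small sketch (the root path and .json layout
are assumed from the description above):

from collections import defaultdict
from pathlib import Path

def partitions_by_folder(root):
    # One flow per folder, so each folder becomes an independently
    # indexable chunk
    groups = defaultdict(list)
    for f in Path(root).rglob("*.json"):
        groups[f.parent].append(f)
    return groups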
Resending to the list to help more people..
This is an architectural pattern to solve the same issue that arises over and
over again.. The queue can be anything — a table in a database, even a Solr
collection.
And yes I have implemented it — I did it in C# before using a SQL Server
table-based queue.
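For the "even a Solr collection" variant, a rough sketch against Solr's JSON
update API (the collection name, URL, and field names are made up; note that
unlike a real queue this gives no once-only delivery, since two workers can
read the same pending doc before the status update lands):

import requests

SOLR = "http://localhost:8983/solr/tasks"  # assumed URL and collection

def enqueue(paths):
    docs = [{"id": p, "status_s": "pending"} for p in paths]
    requests.post(SOLR + "/update", params={"commit": "true"},
                  json=docs).raise_for_status()

def next_pending():
    r = requests.get(SOLR + "/select",
                     params={"q": "status_s:pending", "rows": 1}).json()
    docs = r["response"]["docs"]
    return docs[0]["id"] if docs else None

def mark_done(task_id):
    # Solr atomic update: rewrite only the status field
    requests.post(SOLR + "/update", params={"commit": "true"},
                  json=[{"id": task_id, "status_s": {"set": "done"}}]
                  ).raise_for_status()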
Raymond,
Running parallel indexing might be trickier than it looks if the scale is
big. For instance, you can easily partition your data (let's say into 5
chunks) and run 5 processes to index them. However, you will need to watch
for chokepoints in the pipeline along the way (e.g. I/O of data …)
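In sketch form (Python multiprocessing; index_chunk and the paths are
placeholders), the 5-chunk / 5-process setup is just:

from multiprocessing import Pool

def index_chunk(paths):
    # Hypothetical worker: parse each JSON file and send it to Solr
    for p in paths:
        pass  # read, transform, index

if __name__ == "__main__":
    chunks = [["/data/flow%d/a.json" % i] for i in range(5)]  # 5 partitions
    # processes=5 mirrors the example; lower it if I/O saturates before CPU
    with Pool(processes=5) as pool:
        pool.map(index_chunk, chunks)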
Thank you Rahul, though that's very high level.
No offense, but do you have a successful implementation, or is it just an
unproven idea? I have never used RabbitMQ or Kafka before but would be very
interested in knowing more detail on the Kafka idea, as Kafka is available
in my environment.
Thank you
Enumerate the file locations (map), put them in a queue like RabbitMQ or
Kafka (persist the map), then have a bunch of threads, workers, containers,
whatever pop off the queue and process each item (reduce).
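An end-to-end sketch of that pattern with kafka-python (the topic name,
broker address, data path, and index_file are all assumptions; in practice
the producer and the consumers run as separate processes):

from pathlib import Path
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

def index_file(path):
    # Hypothetical reduce step: parse the JSON file and post it to Solr
    pass

# Map side: enumerate file locations and persist them in the queue
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for f in Path("/data").rglob("*.json"):
    producer.send("files-to-index", str(f).encode("utf-8"))
producer.flush()

# Reduce side: workers in one consumer group split the topic's partitions
# between them, so each file is handed to exactly one worker
consumer = KafkaConsumer("files-to-index",
                         bootstrap_servers="localhost:9092",
                         group_id="indexers",
                         auto_offset_reset="earliest")
for msg in consumer:  # blocks and consumes indefinitely
    index_file(msg.value.decode("utf-8"))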
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On May 20, 2018, 7:24 AM -0400, Raymond …