Dear all,

I have many jobs (900k) to run on many machines (4k). All jobs are independent; in particular, they use the same algorithm, but the input is different. If I could build a single cluster with 4k machines, I could simply submit all my jobs using a shell script. Importantly, the jobs would then execute in sequence: later jobs wait until all jobs ahead of them have finished.
Here is the problem. Because these machines are in different datacenters, I cannot build a single cluster and submit all jobs in the above way. To start simple, I built many clusters: each machine is its own cluster running in pseudo-distributed mode (I do not want to use standalone mode because all the machines are multi-core). Now I want to submit jobs to the clusters dynamically from one machine. That is, I first submit 4k jobs to the 4k clusters; then, whenever a cluster finishes its job, I submit a new job to that cluster, and I keep going in this way until all 900k jobs are finished.

I need your advice on the following difficulty: how do I know that a job j(i) on machine m(j) has finished? Note: since I want to submit jobs dynamically, no static job ordering should be assumed.

Thanks in advance!
-Sam
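To make the scheme above concrete, here is a minimal sketch in Python of the dynamic dispatch loop, under one loud assumption: that a job can be submitted through a call that blocks until the job finishes. In that case "job j(i) on machine m(j) has finished" is simply the moment the call returns. The names `dispatch` and `run_job` are hypothetical, not part of any real submission API; `run_job(machine, job)` stands in for whatever blocking submit mechanism is actually available.

```python
import queue
import threading


def dispatch(jobs, machines, run_job):
    """Run every job on some machine and return {job: machine}.

    run_job(machine, job) is a hypothetical callable supplied by the
    caller. It is assumed to block until the job has finished on that
    machine, so its return is the completion signal.
    """
    # Shared pool of jobs that have not been handed out yet.
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    assignment = {}
    lock = threading.Lock()

    def worker(machine):
        # One worker per machine: take a job, run it, take the next one.
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to hand out
            run_job(machine, job)  # returns only when the job is done
            with lock:
                assignment[job] = machine

    threads = [threading.Thread(target=worker, args=(m,)) for m in machines]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return assignment
```

No job ordering is fixed in advance: whichever machine becomes free first takes the next pending job. The sketch leaves out failure handling (a job whose `run_job` call raises is not retried), and one thread per machine may need tuning at the 4k scale.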
