jimmylao commented on issue #46:
URL: https://github.com/apache/incubator-bluemarlin/issues/46#issuecomment-1036274916


   @Bimlesh759-AI
   In the long run, for parallel training of a deep learning model, you may want to make use of every available resource, both parallel (multiple GPUs on one server) and distributed (multiple servers with GPUs) computing. So far, there are two options that support both parallel and distributed training:
   1. TensorFlow 2
   2. Uber's open-source project, Horovod
   
   Since the code for DIN model uses tensorflow 1.x, there are some effort need 
to be done using either TF 2 or Horovod.
   For TF2, current TF 1.x code need to be upgraded to TF 2
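   Once the model is ported, TF2's `tf.distribute` API gives both multi-GPU and multi-server training with essentially the same training loop. A minimal sketch, assuming the DIN model has been rewritten as a tf.keras model; `build_din_model` and `train_dataset` are hypothetical placeholders, not names from this repo:

   ```python
   import tensorflow as tf

   # MirroredStrategy does data-parallel training across all local GPUs;
   # swapping in MultiWorkerMirroredStrategy extends it to multiple servers.
   strategy = tf.distribute.MirroredStrategy()
   print("Replicas in sync:", strategy.num_replicas_in_sync)

   with strategy.scope():
       model = build_din_model()  # hypothetical factory for the ported DIN model
       model.compile(optimizer="adam",
                     loss="binary_crossentropy",
                     metrics=["AUC"])

   # train_dataset is a placeholder tf.data.Dataset; each batch is split
   # across the replicas automatically.
   model.fit(train_dataset, epochs=5)
   ```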
   For Horovod, TF 1.x is supported, but you still need to put in some effort to make it work; this is the option used by Amazon SageMaker.
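   With Horovod, the existing TF 1.x graph can stay largely as-is: each process pins one GPU and gradients are averaged by allreduce. A minimal sketch, assuming the existing DIN graph already defines a `loss` tensor, launched with e.g. `horovodrun -np 4 python train.py`:

   ```python
   import tensorflow as tf
   import horovod.tensorflow as hvd

   hvd.init()

   # Pin this process to a single local GPU.
   config = tf.ConfigProto()
   config.gpu_options.visible_device_list = str(hvd.local_rank())

   # Scale the learning rate with the number of workers, then wrap the
   # optimizer so gradients are averaged across workers via allreduce.
   opt = tf.train.AdamOptimizer(0.001 * hvd.size())
   opt = hvd.DistributedOptimizer(opt)
   train_op = opt.minimize(loss)  # `loss` comes from the existing DIN graph

   # Broadcast rank 0's initial variables so all workers start identical.
   hooks = [hvd.BroadcastGlobalVariablesHook(0)]

   with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
       while not sess.should_stop():
           sess.run(train_op)
   ```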
   
   Most importantly, you should compare how much gain (speed-up) each approach actually achieves. Conceptually, both should work for parallel training; in practice, the speed-up needs to be evaluated quantitatively.
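   One simple way to do that measurement is to time a fixed number of training steps on each setup and compare throughput against the single-GPU baseline. A rough sketch; `run_training` is a hypothetical wrapper around whichever training loop is being benchmarked:

   ```python
   import time

   def measure_throughput(run_training, num_steps, batch_size):
       """Return training throughput in examples per second."""
       start = time.time()
       run_training(num_steps)  # run a fixed number of training steps
       elapsed = time.time() - start
       return num_steps * batch_size / elapsed

   # speed-up on N GPUs = throughput_N / throughput_1; near-linear
   # scaling (speed-up close to N) is the ideal to compare against.
   ```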
   

