Thanks, Zhankun, for the clarification. Also, is my understanding of
*--checkpoint_path* correct, as I mentioned earlier in the thread?
Quoting the comment again:

[There is another argument called *--checkpoint_path*, which acts as
the path where all the outputs (models or datasets) produced by the
worker code running inside the Docker container are written. Hence,
*--input_path* acts as the entry point, which will be localized, and
*--checkpoint_path* acts as the exit point; both are HDFS paths that
live outside the Docker container.]
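To make that entry/exit understanding concrete, this is roughly the
shape of the command I am experimenting with. The script name
(cifar10_main.py), its flags, and the extra dataset path are
illustrative only; I am assuming that %checkpoint_path% is substituted
into the worker command the same way %input_path% is (as Zhankun
describes below), and the second dataset is appended literally, per
his suggestion for multiple inputs:

yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tf-job-002 --docker_image <your docker image> \
  --input_path hdfs://default/dataset/cifar-10-data \
  --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
  --num_workers 2 \
  --worker_resources memory=8G,vcores=2,gpu=1 \
  --worker_launch_cmd "python cifar10_main.py \
      --data-dir=%input_path% \
      --extra-data-dir=hdfs://default/dataset/cifar-10-extra \
      --job-dir=%checkpoint_path%"

If that is right, the worker reads the first dataset through the input
path, reads the second one directly from HDFS, and writes the model
back out through the checkpoint path, with both HDFS locations living
outside the container.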
I will continue my exercise with Submarine and would love to discuss
more. (I have also put a small end-to-end sketch for the
two-application scenario after the quoted thread below.)

On Mon, Feb 25, 2019 at 4:21 PM zhankun tang <[email protected]> wrote:

> Hi Vinay,
>
> IIRC, YARN sets the host's Hadoop environment in the container launch
> script by default. And in the Submarine case, the user's worker
> command is used to generate a worker script, which is invoked from
> the container launch script. If Submarine doesn't override the
> default Hadoop environment variables, HDFS reads/writes in the
> container might fail due to a missing or incorrect Hadoop location.
> So even if a Docker image is built with the correct Hadoop
> environment set, it still needs this override to use the HDFS
> libraries in a container. This is caused by YARN's Docker support,
> and Submarine is doing a workaround here.
>
> Submarine is evolving rapidly; please share your thoughts if anything
> feels uncomfortable to you.
>
> Thanks,
> Zhankun
>
> On Mon, 25 Feb 2019 at 12:22, Vinay Kashyap <[email protected]> wrote:
>
>> Hi Zhankun,
>> Thanks for the reply.
>>
>> Regarding Question 1: Okay, I understand. Let me try configuring
>> multiple input path placeholders and referring to them in the worker
>> launch command.
>>
>> Regarding Question 2:
>> What I did not understand is why YARN has to set anything related to
>> the Hadoop that runs inside the container. The Hadoop environment
>> and the worker code that reads from it are completely isolated
>> within the Docker container. In that case, the worker scripts should
>> know where HADOOP_HOME is inside the container, right? There is
>> another argument called *--checkpoint_path*, which acts as the path
>> where all the outputs (models or datasets) produced by the worker
>> code inside the Docker container are written. Hence, *--input_path*
>> acts as the entry point, which will be localized, and
>> *--checkpoint_path* acts as the exit point; both are HDFS paths that
>> live outside the Docker container. So why should YARN know the
>> Hadoop configuration that is inside the container?
>>
>> Thanks and regards
>> Vinay Kashyap
>>
>> On Fri, Feb 22, 2019 at 7:39 PM zhankun tang <[email protected]>
>> wrote:
>>
>>> Hi Vinay,
>>>
>>> For question one, IIRC, we cannot set multiple *--input_path* flags
>>> at present. "--input_path" was originally designed as a placeholder
>>> that stores a path; that path is then used to replace
>>> "%input_path%" in the worker command, as in
>>> "python worker.py -input %input_path% ..".
>>> So from this perspective, you can directly append the other input
>>> paths to your worker command in your own way.
>>>
>>> For question two, YARN might set a wrong HADOOP_COMMON_HOME by
>>> default, so Submarine provides the environment variables to be set
>>> in the worker's launch script if the worker wants to access HDFS.
>>> And there is no data-plane relation between the outside Hadoop and
>>> the container, except that YARN localizes resources for the
>>> container.
>>>
>>> Hope this answers your questions.
>>>
>>> Best Regards,
>>> Zhankun
>>>
>>> On Fri, 22 Feb 2019 at 15:35, Vinay Kashyap <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am using Hadoop 3.2.0. I am trying a few examples using Submarine
>>>> to run TensorFlow jobs in a Docker container.
>>>> I would like to understand a few details about reading/writing
>>>> HDFS data during/after application launch/execution. I have
>>>> highlighted the questions inline.
>>>>
>>>> When launching an application that reads input from HDFS, we
>>>> configure *--input_path* to an HDFS path, as in the standard
>>>> example:
>>>>
>>>> yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
>>>>   --name tf-job-001 --docker_image <your docker image> \
>>>>   --input_path hdfs://default/dataset/cifar-10-data \
>>>>   --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
>>>>   --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
>>>>   --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
>>>>   --num_workers 2 \
>>>>   --worker_resources memory=8G,vcores=2,gpu=1 \
>>>>   --worker_launch_cmd "cmd for worker ..." \
>>>>   --num_ps 2 \
>>>>   --ps_resources memory=4G,vcores=2,gpu=0 \
>>>>   --ps_launch_cmd "cmd for ps"
>>>>
>>>> *Question 1: What if I have more than one dataset in separate HDFS
>>>> paths? Can --input_path take multiple paths in any fashion, or is
>>>> it expected that all the datasets are kept under one path?*
>>>>
>>>> "DOCKER_JAVA_HOME points to JAVA_HOME inside the Docker image"
>>>> and "DOCKER_HADOOP_HDFS_HOME points to HADOOP_HDFS_HOME inside the
>>>> Docker image".
>>>>
>>>> *Question 2: What is the exact expectation here? That is, is there
>>>> any relation to the Hadoop running outside the Docker container? I
>>>> guess reading HDFS data into the Docker container happens during
>>>> container localization, but how does the output get written back
>>>> to the HDFS running outside the Docker container?*
>>>>
>>>> Assume a scenario where Application 1 creates a model and
>>>> Application 2 performs scoring, with the two applications running
>>>> in separate Docker containers. I would like to understand how data
>>>> is read and written across the applications in this case.
>>>> It would be of great help if anyone could guide me in
>>>> understanding this or point me to a blog or write-up that explains
>>>> the above.
>>>>
>>>> *Thanks and regards*
>>>> *Vinay Kashyap*
>>>
>>
>> --
>> *Thanks and regards*
>> *Vinay Kashyap*
>>
>
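P.S. Putting Zhankun's two answers together with my original
two-application scenario (Application 1 trains, Application 2 scores),
this is the end-to-end flow I intend to verify. It reuses the command
format from my first mail; the image names and the model path are
illustrative, and the assumption that the scoring job can point its
--input_path at the directory the training job used as its
--checkpoint_path is mine, not something confirmed above:

# Application 1: training. The worker writes its model under the
# HDFS checkpoint path, which lives outside the container.
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tf-train-001 --docker_image <train image> \
  --input_path hdfs://default/dataset/cifar-10-data \
  --checkpoint_path hdfs://default/models/cifar-10 \
  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
  ...

# Sanity check from outside any container: the model should be
# visible in plain HDFS once the training job finishes.
hdfs dfs -ls hdfs://default/models/cifar-10

# Application 2: scoring, in a different container, reading the
# model that Application 1 produced.
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tf-score-001 --docker_image <score image> \
  --input_path hdfs://default/models/cifar-10 \
  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
  ...

If I have this right, the only data-plane hand-off between the two
applications is HDFS itself, and the DOCKER_* overrides exist only so
that each worker's HDFS client can find a working Hadoop inside its
own image. Please correct me if that mental model is off.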
--
*Thanks and regards*
*Vinay Kashyap*