Hi everyone, happy new year!
We are trying to set up a Hadoop cluster with two different AD domains. One AD domain contains the SPNs for Hadoop plus the UPNs for one branch of the company (e.g. @BRATISLAVA.TRIVIADATA.COM); the other AD domain contains only UPNs for a different branch of the company (e.g. @WIEN.TRIVIADATA.COM).

The problem is: by normalizing the full UPN - for example "[email protected]" - into the Hadoop username "michalklempa", we run into trouble. There may be another person, working at a different branch of the company, with the same name and surname. There is a distinct UPN for this person: [email protected]. What happens at the Hadoop level is that both of these employees end up with the same Hadoop username: michalklempa. But these are two different persons from different domains. Of course we want to avoid the security issues that have been described at https://blog.samoylenko.me/2015/04/15/hadoop-security-auth_to_local-examples/, so we cannot simply cut off the domain part of the UPN.

At the Linux level we came up with the solution of configuring SSSD to use "name@domain" as the username, with the home directory set in SSSD to /home/$domain/$user. This works well. So the first question is: are there any problems with using the full UPN in the form user@domain (e.g. [email protected]) in Hadoop/YARN?

Now the practical part. After Kerberizing Hadoop with AD, Hadoop uses only the `user` portion of the UPN. For this reason, we cannot properly submit YARN jobs to the YARN cluster as user [email protected]. After obtaining a Kerberos ticket with kinit, we submitted a TeraSort job, and after a moment a ShuffleError occurred. It seems to us that Hadoop/YARN tries to use the `michalklempa` portion as the Linux account on the worker nodes, and SSSD does not recognize an account named `michalklempa` - SSSD is only aware of accounts named in the form user@domain.
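For reference, a minimal sketch of the SSSD setup described above (the section names, providers and domain names here are illustrative placeholders, not our exact production config):

```
# sssd.conf sketch (illustrative): keep the domain in the login name
# and put home directories under /home/<domain>/<user>
[sssd]
services = nss, pam
domains = bratislava.triviadata.com, wien.triviadata.com

[domain/wien.triviadata.com]
id_provider = ad
# users must log in as user@domain, so the two branches cannot collide
use_fully_qualified_names = True
full_name_format = %1$s@%2$s
# expands to e.g. /home/wien.triviadata.com/michalklempa
override_homedir = /home/%d/%u
```

With `use_fully_qualified_names = True`, `getent passwd michalklempa` returns nothing; only the fully qualified `[email protected]` resolves.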
This is our environment: HDFS 2.7.3, YARN 2.7.3, MapReduce2 2.7.3.

```
$ sssd --version
1.13.4
```

```
$ klist -V
Kerberos 5 version 1.13.2
```

```
$ uname -a
Linux master-01.triviadata.local 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
```

```
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
```

The TeraGen part of the job finished properly, without exceptions. After that we checked the logs on a NodeManager, where this IOException occurred:

```
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
	at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:366)
	at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:288)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:354)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.
```

Moreover, while extracting logs from YARN, this error occurs:

```
2017-12-21 12:38:23,938 ERROR ifile.LogAggregationIndexedFileController (LogAggregationIndexedFileController.java:logErrorMessage(1011)) - Error aggregating log file.
Log file : /data/00/yarn/log/application_1513161004473_0023/container_e06_1513161004473_0023_01_000438/prelaunch.out. Owner '[email protected]' for path /data/00/yarn/log/application_1513161004473_0023/container_e06_1513161004473_0023_01_000438/prelaunch.out did not match expected owner 'michalklempa'
java.io.IOException: Owner '[email protected]' for path /data/00/yarn/log/application_1513161004473_0023/container_e06_1513161004473_0023_01_000438/prelaunch.out did not match expected owner 'michalklempa'
	at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:285)
	at org.apache.hadoop.io.SecureIOUtils.forceSecureOpenForRead(SecureIOUtils.java:219)
	at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:204)
	at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.write(LogAggregationIndexedFileController.java:365)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl$ContainerLogAggregator.doContainerLogAggregation(AppLogAggregatorImpl.java:470)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:222)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:328)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:284)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

We also ran tcpdump on port 13562 (the HTTP shuffle port) on a worker node during the job execution.
This is what we found inside the HTTP exchange:

```
GET /mapOutput?job=job_1513161004473_0026&reduce=0&map=attempt_1513161004473_0026_m_000059_0 HTTP/1.1
UrlHash: vb4FT2RJdu+2p+/xWP7FB66SQtw=
name: mapreduce
version: 1.0.0
User-Agent: Java/1.8.0_151
Host: worker-01.triviadata.local:13562
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive

HTTP/1.1 200 OK
ReplyHash: 7isHCuWdJyZhpJtDlaV5oPTBcls=
name: mapreduce
version: 1.0.0

HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=UTF-8
name: mapreduce
version: 1.0.0

Error Reading IndexFile
Owner '[email protected]' for path /data/01/yarn/local/usercache/michalklempa/appcache/application_1513161004473_0026/output/attempt_1513161004473_0026_m_000059_0/file.out.index did not match expected owner 'michalklempa'
```

(Yes, it seems the HTTP headers are sent twice in the reply.) The second reply clearly states that the error is related to our question about the domain part of the Linux username. We use HDP and have not changed the default value of hadoop.security.auth_to_local after Kerberizing the cluster, so we expected '[email protected]' to be mapped simply to 'michalklempa' - but for some reason, the YARN container's data are created under user '[email protected]' (perhaps SSSD is being clever and falls back to the default domain when no matching user is found; that would actually be fine, but Hadoop then crashes when reading the data back).

Does anyone have suggestions on how to solve this issue, so that we can add users from the second domain to our Hadoop cluster? Are there any recommendations on how to set up the UPN-to-Linux-username translation in a case like ours?

Thanks, best regards,
Michal Klempa
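Edit: to make question 1 concrete, this is the kind of auth_to_local rule set we have in mind (an untested sketch, not a working config) - keeping the lower-cased realm in the translated name, so that [email protected] would map to [email protected] and the Bratislava UPNs would map analogously:

```
<!-- core-site.xml sketch (untested): keep the lower-cased realm in the
     translated short name instead of stripping it -->
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
RULE:[1:$1@$0](.*@WIEN\.TRIVIADATA\.COM)s/@WIEN\.TRIVIADATA\.COM/@wien.triviadata.com/
RULE:[1:$1@$0](.*@BRATISLAVA\.TRIVIADATA\.COM)s/@BRATISLAVA\.TRIVIADATA\.COM/@bratislava.triviadata.com/
DEFAULT
  </value>
</property>
```

We are not sure whether Hadoop 2.7.3 accepts a translated name containing '@' at all (newer Hadoop versions expose hadoop.security.auth_to_local.mechanism = MIT to allow such non-simple names) - which is exactly what question 1 above is asking.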
