Hi,

we were able to upgrade our auth_to_local from

RULE:[1:$1@$0](.*@BRATISLAVA.TRIVIADATA.COM)s/@.*//
to
RULE:[1:$1@$0](.*@BRATISLAVA.TRIVIADATA.COM)s/(.*)/\1/L

The former one checks for the correct domain, but its substitution
works the same as DEFAULT: it takes only the first component of the
principal.
The latter one also checks for the correct domain, but it just
lowercases the principal name as a whole.
This fits, as AD is not case sensitive and SSSD uses the same format
(https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/sssd-user-ids).
Ideally we would be lowercasing only the domain part, however.
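The effect of the two substitutions can be approximated with GNU sed (an
illustration of the sed-style semantics only, not how Hadoop actually
evaluates the rules; Hadoop's /L flag lowercases the whole result, much
like GNU sed's \L does here):

```shell
# first rule's substitution: strip everything from the '@' on
echo '[email protected]' | sed -e 's/@.*//'
# -> michal.klempa

# second rule's substitution: keep the principal, lowercase it as a whole
echo '[email protected]' | sed -e 's/\(.*\)/\L\1/'
# -> [email protected]
```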

Using this setup, we were able to submit a YARN job, with an INFO log:
Non-simple name [email protected] after
auth_to_local rule ...

Searching for this yielded
https://issues.apache.org/jira/browse/HADOOP-12751
which is going to be fixed in 2.8.0. We run 2.7.2 (Hortonworks, which
pulled the fix:
https://github.com/hortonworks/hadoop-release/commit/a7c5663096509236eb3b4c05160be90e43005a0b#diff-684c0d8f3fe52fd7cdcaefa69205d51d)

What is troubling our minds now is: will other tools in the ecosystem
handle the @ in usernames properly?

We understand this list is intended only for Hadoop (YARN, HDFS), so we
will raise the question on the other projects' lists.
If we get any news on this, we will post the information back here.
For now, thank you Daryn.

Regards,
Michal Klempa

On 09.01.2018 17:03, Daryn Sharp wrote:
> > So first question is, are there any troubles with using full UPN in
> > form of user@domain (e.g. [email protected]) in Hadoop/YARN?
> Hdfs should be fine.  Yarn should be if the system is fine with a
> realm augmented user for the container executor's setuid.
>
> > We use HDP and after Kerberizing cluster we have not changed default
> > value for hadoop.security.auth_to_local, so we expected
> > '[email protected]' to be mapped just
> > to 'michalklempa' user, but for some reason, YARN container's data are
> > created under user '[email protected]'
> > (maybe SSSD is too clever and uses the default domain if no user is
> > found, that would be fine after all, but Hadoop crashes when reading
> > the data).
>
> You definitely must change the auth_to_local rules on all server
> configs to not strip the realm.  Yes, the job runs using the short
> name w/o realm and linux/SSSD appears to assume the default realm
> during setuid.  The failure is the file stat owner includes the realm
> (which you want) but does not match the default auth_to_local's
> realm-less short user.
>
> Daryn
>
> On Tue, Jan 9, 2018 at 7:22 AM, Michal Klempa
> <[email protected]> wrote:
>
>     Hi everyone, Happy new year,
>
>
>     We are trying to setup a Hadoop cluster with two different AD domains.
>
>     One AD domain consists of SPNs for Hadoop and UPNs for one branch
>     of the company (e.g. @BRATISLAVA.TRIVIADATA.COM).
>
>     The other AD domain contains only UPNs for a different branch of
>     the company (e.g. @WIEN.TRIVIADATA.COM).
>
>
>     The problem is: by normalizing the full UPN - for example
>     "[email protected]" - into a Hadoop
>     username "michalklempa", we run into trouble.
>
>     There may be another person, working at a different branch of the
>     company, with the same name and surname.
>
>     There is a distinct UPN for this person:
>     [email protected].
>
>     What happens at the Hadoop level is that both these employees end
>     up with the same Hadoop username: michalklempa.
>
>     But these are two different persons from different domains. Of
>     course, we want to avoid the security issues that have been
>     mentioned here:
>     https://blog.samoylenko.me/2015/04/15/hadoop-security-auth_to_local-examples/
>     so we cannot just cut off the domain part from the UPN.
>
>     At the Linux level, we came up with a solution: setting up SSSD to
>     use "name@domain" as the user name, with the HOME directory set up
>     in SSSD as /home/$domain/$user.
>
>     This works well.
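>     For reference, a sketch of the SSSD settings involved (illustrative
>     only, not our exact config; the option names are standard SSSD
>     options, the values here are assumptions):
>
>     ```
>     # /etc/sssd/sssd.conf (fragment, illustrative)
>     [sssd]
>     # render user names as user@domain
>     full_name_format = %1$s@%2$s
>
>     [domain/BRATISLAVA.TRIVIADATA.COM]
>     # keep the domain part in the user name
>     use_fully_qualified_names = True
>     # home directory as /home/<domain>/<user>
>     override_homedir = /home/%d/%u
>     ```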
>
>
>     So the first question is: are there any troubles with using the
>     full UPN in the form of user@domain (e.g.
>     [email protected]) in Hadoop/YARN?
>
>
>     Now the practical part.
>
>     After Kerberizing Hadoop with AD, Hadoop uses only the `user`
>     portion of the UPN.
>
>     For this reason, we cannot properly submit YARN jobs to the YARN
>     cluster as user [email protected]. After
>     obtaining a Kerberos ticket using kinit, we submitted a TeraSort
>     job and after a moment a ShuffleError occurred.
>
>     It seems to us that Hadoop/YARN tries to use the `michalklempa`
>     portion as the Linux account on worker nodes, and SSSD does not
>     recognize the account `michalklempa`.
>
>     SSSD is only aware of accounts named in the form user@domain.
>
>
>     This is our environment:
>
>
>     HDFS 2.7.3
>     YARN 2.7.3
>     MapReduce2 2.7.3
>
>     ```
>     $ sssd --version
>     1.13.4
>     ```
>
>     ```
>     $ klist -V
>     Kerberos 5 version 1.13.2
>     ```
>
>     ```
>     $ uname -a
>     Linux master-01.triviadata.local 4.4.0-98-generic #121-Ubuntu SMP
>     Tue Oct 10 14:24:03 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>     ```
>
>     ```
>     $ cat /etc/lsb-release
>     DISTRIB_ID=Ubuntu
>     DISTRIB_RELEASE=16.04
>     DISTRIB_CODENAME=xenial
>     DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
>     ```
>
>
>     The TeraGen part of the job ended properly, without exceptions.
>     After that we checked the logs on the NodeManager, where an
>     IOException occurred:
>
>     ```
>     Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
>         at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
>     Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>         at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:366)
>         at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:288)
>         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:354)
>         at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
>
>     Container killed by the ApplicationMaster.
>     Container killed on request. Exit code is 143
>     Container exited with a non-zero exit code 143.
>     ```
>
>
>     Moreover, while extracting logs from YARN, this error occurs:
>
>     ```
>     2017-12-21 12:38:23,938 ERROR ifile.LogAggregationIndexedFileController (LogAggregationIndexedFileController.java:logErrorMessage(1011)) - Error aggregating log file. Log file : /data/00/yarn/log/application_1513161004473_0023/container_e06_1513161004473_0023_01_000438/prelaunch.out. Owner '[email protected]' for path /data/00/yarn/log/application_1513161004473_0023/container_e06_1513161004473_0023_01_000438/prelaunch.out did not match expected owner 'michalklempa'
>     java.io.IOException: Owner '[email protected]' for path /data/00/yarn/log/application_1513161004473_0023/container_e06_1513161004473_0023_01_000438/prelaunch.out did not match expected owner 'michalklempa'
>         at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:285)
>         at org.apache.hadoop.io.SecureIOUtils.forceSecureOpenForRead(SecureIOUtils.java:219)
>         at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:204)
>         at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.write(LogAggregationIndexedFileController.java:365)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl$ContainerLogAggregator.doContainerLogAggregation(AppLogAggregatorImpl.java:470)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:222)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:328)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:284)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>     ```
>
>
>     We tried to run tcpdump for port 13465 (HTTP shuffle) on a worker
>     node during the job execution. This is what we found inside the
>     HTTP exchange:
>
>     ```
>     GET /mapOutput?job=job_1513161004473_0026&reduce=0&map=attempt_1513161004473_0026_m_000059_0 HTTP/1.1
>     UrlHash: vb4FT2RJdu+2p+/xWP7FB66SQtw=
>     name: mapreduce
>     version: 1.0.0
>     User-Agent: Java/1.8.0_151
>     Host: worker-01.triviadata.local:13562
>     Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
>     Connection: keep-alive
>
>     HTTP/1.1 200 OK
>     ReplyHash: 7isHCuWdJyZhpJtDlaV5oPTBcls=
>     name: mapreduce
>     version: 1.0.0
>
>     HTTP/1.1 500 Internal Server Error
>     Content-Type: text/plain; charset=UTF-8
>     name: mapreduce
>     version: 1.0.0
>
>     Error Reading IndexFileOwner '[email protected]' for path /data/01/yarn/local/usercache/michalklempa/appcache/application_1513161004473_0026/output/attempt_1513161004473_0026_m_000059_0/file.out.index did not match expected owner 'michalklempa'
>     ```
>
>     (Yes, it seems the HTTP headers are sent twice in the reply.)
>     The latter reply clearly states that the error is related to our
>     question regarding the domain part of the username on Linux.
>
>
>     We use HDP, and after Kerberizing the cluster we have not changed
>     the default value for hadoop.security.auth_to_local, so we expected
>     '[email protected]' to be mapped
>     just to the 'michalklempa' user. But for some reason, the YARN
>     container's data are created under user
>     '[email protected]' (maybe SSSD
>     is too clever and uses the default domain if no user is found; that
>     would be fine after all, but Hadoop crashes when reading the data).
>
>
>     Does anyone have any suggestions on how to solve this issue, so
>     that we can add users from the second domain to our Hadoop cluster?
>
>     Are there any recommendations on how to set up UPN translation to a
>     Linux user name in a case like ours?
>
>
>     Thanks,
>
>     Best regards, Michal Klempa
>
>
