Hi everyone, happy new year!
We are trying to set up a Hadoop cluster with two different AD domains. One AD domain contains the SPNs for Hadoop plus the UPNs for one branch of the company (e.g. @BRATISLAVA.TRIVIADATA.COM); the other AD domain contains only UPNs for a different branch of the company (e.g. @WIEN.TRIVIADATA.COM).

The problem is: by normalizing the full UPN - for example "[email protected]" - into the Hadoop username "michalklempa", we run into trouble. There may be another person, working at a different branch of the company, with the same name and surname. There is a distinct UPN for this person: [email protected]. What happens at the Hadoop level is that both of these employees end up with the same Hadoop username: michalklempa. But these are two different persons from different domains. Of course we want to avoid the security issues that have been described at https://blog.samoylenko.me/2015/04/15/hadoop-security-auth_to_local-examples/, so we cannot simply cut off the domain part of the UPN.

At the Linux level we came up with the solution of configuring SSSD to use "name@domain" as the username, with the home directory set in SSSD to /home/$domain/$user. This works well. So the first question is: are there any problems with using the full UPN in the form user@domain (e.g. [email protected]) in Hadoop/YARN?

Now the practical part. After Kerberizing Hadoop with AD, Hadoop uses only the `user` portion of the UPN. For this reason, we cannot properly submit YARN jobs to the YARN cluster as user [email protected]. After obtaining a Kerberos ticket with kinit, we submitted a TeraSort job, and after a moment a ShuffleError occurred. It seems to us that Hadoop/YARN tries to use the `michalklempa` portion as the Linux account on the worker nodes, and SSSD does not recognize an account named `michalklempa` - SSSD is only aware of accounts named in the form user@domain.
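For reference, a minimal sketch of the SSSD setup described above (the section names, providers and domain names here are illustrative placeholders, not our exact production config):

```
# sssd.conf sketch (illustrative): keep the domain in the login name
# and put home directories under /home/<domain>/<user>
[sssd]
services = nss, pam
domains = bratislava.triviadata.com, wien.triviadata.com

[domain/wien.triviadata.com]
id_provider = ad
# users must log in as user@domain, so the two branches cannot collide
use_fully_qualified_names = True
full_name_format = %1$s@%2$s
# expands to e.g. /home/wien.triviadata.com/michalklempa
override_homedir = /home/%d/%u
```

With `use_fully_qualified_names = True`, `getent passwd michalklempa` returns nothing; only the fully qualified `[email protected]` resolves.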
This is our environment: HDFS 2.7.3, YARN 2.7.3, MapReduce2 2.7.3.

```
$ sssd --version
1.13.4
```

```
$ klist -V
Kerberos 5 version 1.13.2
```

```
$ uname -a
Linux master-01.triviadata.local 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
```

```
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
```

The TeraGen part of the job finished properly, without exceptions. After that we checked the logs on a NodeManager, where this IOException occurred:

```
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
	at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:366)
	at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:288)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:354)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.
```

Moreover, while extracting logs from YARN, this error occurs:

```
2017-12-21 12:38:23,938 ERROR ifile.LogAggregationIndexedFileController (LogAggregationIndexedFileController.java:logErrorMessage(1011)) - Error aggregating log file.
Log file : /data/00/yarn/log/application_1513161004473_0023/container_e06_1513161004473_0023_01_000438/prelaunch.out. Owner '[email protected]' for path /data/00/yarn/log/application_1513161004473_0023/container_e06_1513161004473_0023_01_000438/prelaunch.out did not match expected owner 'michalklempa'
java.io.IOException: Owner '[email protected]' for path /data/00/yarn/log/application_1513161004473_0023/container_e06_1513161004473_0023_01_000438/prelaunch.out did not match expected owner 'michalklempa'
	at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:285)
	at org.apache.hadoop.io.SecureIOUtils.forceSecureOpenForRead(SecureIOUtils.java:219)
	at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:204)
	at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.write(LogAggregationIndexedFileController.java:365)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl$ContainerLogAggregator.doContainerLogAggregation(AppLogAggregatorImpl.java:470)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:222)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:328)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:284)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

We also ran tcpdump on port 13562 (the HTTP shuffle port) on a worker node during the job execution.
This is what we found inside the HTTP exchange:

```
GET /mapOutput?job=job_1513161004473_0026&reduce=0&map=attempt_1513161004473_0026_m_000059_0 HTTP/1.1
UrlHash: vb4FT2RJdu+2p+/xWP7FB66SQtw=
name: mapreduce
version: 1.0.0
User-Agent: Java/1.8.0_151
Host: worker-01.triviadata.local:13562
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive

HTTP/1.1 200 OK
ReplyHash: 7isHCuWdJyZhpJtDlaV5oPTBcls=
name: mapreduce
version: 1.0.0

HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=UTF-8
name: mapreduce
version: 1.0.0

Error Reading IndexFile
Owner '[email protected]' for path /data/01/yarn/local/usercache/michalklempa/appcache/application_1513161004473_0026/output/attempt_1513161004473_0026_m_000059_0/file.out.index did not match expected owner 'michalklempa'
```

(Yes, it seems the HTTP headers are sent twice in the reply.) The second reply clearly states that the error is related to our question about the domain part of the Linux username. We use HDP and have not changed the default value of hadoop.security.auth_to_local after Kerberizing the cluster, so we expected '[email protected]' to be mapped simply to 'michalklempa' - but for some reason, the YARN container's data are created under user '[email protected]' (perhaps SSSD is being clever and falls back to the default domain when no matching user is found; that would actually be fine, but Hadoop then crashes when reading the data back).

Does anyone have suggestions on how to solve this issue, so that we can add users from the second domain to our Hadoop cluster? Are there any recommendations on how to set up the UPN-to-Linux-username translation in a case like ours?

Thanks, best regards,
Michal Klempa
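Edit: to make question 1 concrete, this is the kind of auth_to_local rule set we have in mind (an untested sketch, not a working config) - keeping the lower-cased realm in the translated name, so that [email protected] would map to [email protected] and the Bratislava UPNs would map analogously:

```
<!-- core-site.xml sketch (untested): keep the lower-cased realm in the
     translated short name instead of stripping it -->
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
RULE:[1:$1@$0](.*@WIEN\.TRIVIADATA\.COM)s/@WIEN\.TRIVIADATA\.COM/@wien.triviadata.com/
RULE:[1:$1@$0](.*@BRATISLAVA\.TRIVIADATA\.COM)s/@BRATISLAVA\.TRIVIADATA\.COM/@bratislava.triviadata.com/
DEFAULT
  </value>
</property>
```

We are not sure whether Hadoop 2.7.3 accepts a translated name containing '@' at all (newer Hadoop versions expose hadoop.security.auth_to_local.mechanism = MIT to allow such non-simple names) - which is exactly what question 1 above is asking.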
