We have an existing CDH 5.5.1 cluster with simple authentication and no authorization. We are building out a new cluster and plan to move to CDH 5.8.2 with Kerberos-based authentication. We have an existing MIT Kerberos infrastructure which we successfully use for a variety of services (ssh, apache, postfix).
I am very confident that our /etc/krb5.conf and name resolution are working. I have even used HadoopDNSVerifier-1.0.jar to verify that Java sees the same name canonicalization that we see. I have built a test cluster and closely followed the secure Hadoop install instructions from the Cloudera site, making sure that all the conf files are properly edited and that all the Kerberos keytabs contain the correct principals with the correct permissions. We are using HA NameNodes with Quorum-based JournalNodes.

I am running into a persistent problem with many Hadoop components when they need to talk securely to remote servers. The two examples I post here are the NameNode needing to talk to remote JournalNodes, and the command-line hdfs client needing to speak to a remote NameNode. Both give the same error:

Server has invalid Kerberos principal: hdfs/[email protected]; Host Details : local host is: "aw1hdnn001.tnbsound.com/10.132.8.19"; destination host is: "aw1hdnn002.tnbsound.com":8020;

There is not much on the inter-webs about this, and the error that is showing up leads me to believe that the issue is around the Kerberos realm being used in one place and not the other. I just cannot seem to figure out what is going on here, as I know these are valid principals. I have added a snippet at the end where I have enabled Kerberos debugging to see if that helps at all.

The weird part is that this error applies only to remote daemons. The local NameNode and JournalNode do not have the issue. We can "speak" locally but not remotely.
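To make my theory concrete: my (possibly wrong) understanding is that the client builds the server principal it expects by substituting the remote endpoint's canonical hostname into the configured principal pattern, which is why the _HOST placeholder matters for HA setups. A rough sketch of that substitution as I understand it, using hostnames from my cluster:

```python
def expected_server_principal(pattern: str, remote_host: str) -> str:
    """Sketch of how I believe the Hadoop client derives the server
    principal it will accept: the _HOST placeholder in the configured
    principal is replaced with the canonical hostname of the remote
    endpoint. (My assumption, not verified against the Hadoop source.)"""
    return pattern.replace("_HOST", remote_host)

# With a _HOST pattern, each remote daemon gets a host-specific principal:
print(expected_server_principal("hdfs/[email protected]",
                                "aw1hdnn002.tnbsound.com"))
# hdfs/[email protected]

# With a literal hostname in the pattern, nothing is substituted, so the
# same principal would be expected regardless of the destination host:
print(expected_server_principal("hdfs/[email protected]",
                                "aw1hdnn002.tnbsound.com"))
# hdfs/[email protected]
```

If that model is right, the mismatch would only show up when talking to a *remote* daemon, which matches what I am seeing, but I have not been able to confirm it.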
All and any help is greatly appreciated.

#
# This is me with hdfs Kerberos credentials trying to run hdfs dfsadmin -refreshServiceAcl
#
hdfs@aw1hdnn001 /var/lib/hadoop-hdfs 53$ klist
Ticket cache: FILE:/tmp/krb5cc_115
Default principal: hdfs/[email protected]

Valid starting       Expires              Service principal
10/20/2016 15:34:49  10/21/2016 15:34:49  krbtgt/[email protected]
        renew until 10/27/2016 15:34:49

hdfs@aw1hdnn001 /var/lib/hadoop-hdfs 54$ hdfs dfsadmin -refreshServiceAcl
Refresh service acl successful for aw1hdnn001.tnbsound.com/10.132.8.19:8020
refreshServiceAcl: Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/[email protected]; Host Details : local host is: "aw1hdnn001.tnbsound.com/10.132.8.19"; destination host is: "aw1hdnn002.tnbsound.com":8020;

#
# This is the namenode trying to start up and contact an off-server journalnode
#
2016-10-20 16:51:40,703 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs/[email protected] (auth:KERBEROS) cause:java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/[email protected]
10.132.8.21:8485: Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/[email protected]; Host Details : local host is: "aw1hdnn001.tnbsound.com/10.132.8.19"; destination host is: "aw1hdrm001.tnbsound.com":8485;

#
# This is me with hdfs Kerberos credentials trying to run hdfs dfsadmin -refreshServiceAcl with debug info
#
hdfs@aw1hdnn001 /var/lib/hadoop-hdfs 46$ HADOOP_OPTS="-Dsun.security.krb5.debug=true" hdfs dfsadmin -refreshServiceAcl
Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
>>>KinitOptions cache name is /tmp/krb5cc_115
>>>DEBUG <CCacheInputStream> client principal is hdfs/[email protected]
>>>DEBUG <CCacheInputStream> server principal is krbtgt/[email protected]
>>>DEBUG <CCacheInputStream> key type: 18
>>>DEBUG <CCacheInputStream> auth time: Thu Oct 20 16:55:42 UTC 2016
>>>DEBUG <CCacheInputStream> start time: Thu Oct 20 16:55:42 UTC 2016
>>>DEBUG <CCacheInputStream> end time: Fri Oct 21 16:55:42 UTC 2016
>>>DEBUG <CCacheInputStream> renew_till time: Thu Oct 27 16:55:42 UTC 2016
>>> CCacheInputStream: readFlags() FORWARDABLE; PROXIABLE; RENEWABLE; INITIAL; PRE_AUTH;
>>>DEBUG <CCacheInputStream> client principal is hdfs/[email protected]
>>>DEBUG <CCacheInputStream> server principal is X-CACHECONF:/krb5_ccache_conf_data/fast_avail/krbtgt/[email protected]
>>>DEBUG <CCacheInputStream> key type: 0
>>>DEBUG <CCacheInputStream> auth time: Thu Jan 01 00:00:00 UTC 1970
>>>DEBUG <CCacheInputStream> start time: null
>>>DEBUG <CCacheInputStream> end time: Thu Jan 01 00:00:00 UTC 1970
>>>DEBUG <CCacheInputStream> renew_till time: null
>>> CCacheInputStream: readFlags()
>>>DEBUG <CCacheInputStream> client principal is hdfs/[email protected]
>>>DEBUG <CCacheInputStream> server principal is X-CACHECONF:/krb5_ccache_conf_data/pa_type/krbtgt/[email protected]
>>>DEBUG <CCacheInputStream> key type: 0
>>>DEBUG <CCacheInputStream> auth time: Thu Jan 01 00:00:00 UTC 1970
>>>DEBUG <CCacheInputStream> start time: null
>>>DEBUG <CCacheInputStream> end time: Thu Jan 01 00:00:00 UTC 1970
>>>DEBUG <CCacheInputStream> renew_till time: null
>>> CCacheInputStream: readFlags()
Found ticket for hdfs/[email protected] to go to krbtgt/[email protected] expiring on Fri Oct 21 16:55:42 UTC 2016
Entered Krb5Context.initSecContext with state=STATE_NEW
Found ticket for hdfs/[email protected] to go to krbtgt/[email protected] expiring on Fri Oct 21 16:55:42 UTC 2016
Service ticket not found in the subject
>>> Credentials acquireServiceCreds: same realm
Using builtin default etypes for default_tgs_enctypes
default etypes for default_tgs_enctypes: 18 17 16 23 1 3.
>>> CksumType: sun.security.krb5.internal.crypto.RsaMd5CksumType
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
>>> KdcAccessibility: reset
>>> KrbKdcReq send: kdc=dc1util003.tnbsound.com UDP:88, timeout=30000, number of retries =3, #bytes=734
>>> KDCCommunication: kdc=dc1util003.tnbsound.com UDP:88, timeout=30000,Attempt =1, #bytes=734
>>> KrbKdcReq send: #bytes read=721
>>> KdcAccessibility: remove dc1util003.tnbsound.com
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
>>> KrbApReq: APOptions are 00100000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
Krb5Context setting mySeqNumber to: 561537595
Created InitSecContextToken:
0000: 01 00 6E 82 02 7F 30 82 02 7B A0 03 02 01 05 A1  ..n...0.........
0010: 03 02 01 0E A2 07 03 05 00 20 00 00 00 A3 82 01  ......... ......
0020: 7A 61 82 01 76 30 82 01 72 A0 03 02 01 05 A1 0E  za..v0..r.......
0030: 1B 0C 54 4E 42 53 4F 55 4E 44 2E 43 4F 4D A2 2A  ..TNBSOUND.COM.*
0040: 30 28 A0 03 02 01 00 A1 21 30 1F 1B 04 68 64 66  0(......!0...hdf
0050: 73 1B 17 61 77 31 68 64 6E 6E 30 30 31 2E 74 6E  s..aw1hdnn001.tn
0060: 62 73 6F 75 6E 64 2E 63 6F 6D A3 82 01 2D 30 82  bsound.com...-0.
0070: 01 29 A0 03 02 01 12 A1 03 02 01 01 A2 82 01 1B  .)..............
0080: 04 82 01 17 04 6E 26 46 08 EA 9C 61 08 80 B8 4B  .....n&F...a...K
0090: AF 7C D2 CD 5E 47 19 3D A1 FB CD 8D 41 F4 C9 49  ....^G.=....A..I
00A0: 09 95 1C C7 9A D8 1B 92 0F 3C E0 5F 41 BF 99 96  .........<._A...
00B0: 42 A9 2D 17 D6 F0 AB 41 72 3E 7E F7 13 33 E2 0A  B.-....Ar>...3..
00C0: 2D F5 71 AD 97 9A 9D 7F E0 EA 1A 29 7C D4 47 AB  -.q........)..G.
00D0: B4 7E C1 A1 C5 28 DD 46 F1 C4 17 0B FC DB C9 D3  .....(.F........
00E0: F4 4D C2 1F 6C 59 A6 C4 9E 9D FD 56 E3 B0 31 E6  .M..lY.....V..1.
00F0: C6 6E 50 44 2C 07 44 91 40 F7 C8 6E AD 1E FB 26  .nPD,[email protected]...&
0100: EC 6D E4 ED BC F8 15 17 0B 31 B6 4B 68 64 03 E4  .m.......1.Khd..
0110: 28 9B A5 9D AE 2A DF 1B BD 0F B2 AE B3 BB E0 4D  (....*.........M
0120: 14 D1 9C E0 AC 99 59 1B B6 28 22 E2 B5 55 52 58  ......Y..("..URX
0130: D2 61 39 DE 8F C8 3F E6 6F EB 41 5D E1 F2 43 40  .a9...?.o.A]..C@
0140: 8F AC 78 C8 09 35 7B BA 39 6B CD C6 01 7B 90 0B  ..x..5..9k......
0150: 20 0C 49 0D 8B E5 2B F1 E6 6F 38 4E EA DF 5C A9   .I...+..o8N..\.
0160: 40 AE 11 75 AE B2 E2 35 13 A8 CE CF E7 F5 92 CB  @..u...5........
0170: A5 66 53 47 92 5A EF 31 CD 60 CD 67 46 D0 B7 0D  .fSG.Z.1.`.gF...
0180: B6 76 FE 09 B1 03 16 FE B8 57 6E 08 9A E6 DD F8  .v.......Wn.....
0190: D3 AA 00 54 6C D4 70 61 95 08 CF A4 81 E7 30 81  ...Tl.pa......0.
01A0: E4 A0 03 02 01 12 A2 81 DC 04 81 D9 4E 48 9E 35  ............NH.5
01B0: 57 7C 7C 54 1C 9F 41 FE F3 C0 94 07 E2 D8 EE 38  W..T..A........8
01C0: BA 4A DA 97 43 04 B5 96 F6 A9 34 FD 54 FF 7B 96  .J..C.....4.T...
01D0: DA DD A9 6F C4 7B A5 E4 50 9F 9E 1A 62 D3 F3 3C  ...o....P...b..<
01E0: 50 50 E9 02 05 F2 37 52 4D BC 86 D8 2B A4 9F FE  PP....7RM...+...
01F0: 97 4C 01 7F E6 B4 8B 66 1F 6E 63 FD 3F EF 57 E9  .L.....f.nc.?.W.
0200: 04 E9 BE 28 4C 03 BC 26 EB EF EC DC 8C 48 C0 51  ...(L..&.....H.Q
0210: 7B 2B 5B 0F 16 7C 83 D0 73 F9 2A 94 CF 67 F2 F8  .+[.....s.*..g..
0220: 11 CC 2B E9 0D FE 95 F5 7E 2B C4 40 19 FE FE 6F  [email protected]
0230: B7 C4 B8 7E 87 D1 0A 98 8A F2 B0 1A DF FA 27 24  ..............'$
0240: C2 EE 06 FE 3F 36 57 3D 6C B9 F3 18 98 19 D6 A1  ....?6W=l.......
0250: F4 49 57 5D 58 6E 88 C9 2E 1F FA 7D 53 24 B9 67  .IW]Xn......S$.g
0260: 02 85 C2 2C 01 25 18 BA BF 0E 64 A2 C3 06 7D AC  ...,.%....d.....
0270: D6 11 A6 F4 ED 47 71 22 CC D4 E8 54 08 17 51 E6  .....Gq"...T..Q.
0280: EE 6F FE 31 37                                   .o.17
Entered Krb5Context.initSecContext with state=STATE_IN_PROCESS
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
Krb5Context setting peerSeqNumber to: 374605590
Krb5Context.unwrap: token=[05 04 01 ff 00 0c 00 00 00 00 00 00 16 54 07 16 01 01 00 00 c5 67 32 c5 74 d0 68 ef 82 46 a8 85 ]
Krb5Context.unwrap: data=[01 01 00 00 ]
Krb5Context.wrap: data=[01 01 00 00 ]
Krb5Context.wrap: token=[05 04 00 ff 00 0c 00 00 00 00 00 00 21 78 62 3b 01 01 00 00 a1 51 c9 92 95 bd cd 88 66 59 b7 49 ]
Refresh service acl successful for aw1hdnn001.tnbsound.com/10.132.8.19:8020
refreshServiceAcl: Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/[email protected]; Host Details : local host is: "aw1hdnn001.tnbsound.com/10.132.8.19"; destination host is: "aw1hdnn002.tnbsound.com":8020;

#
# hdfs-site.xml
#
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

  <!-- -->
  <!-- HDFS security -->
  <!-- -->
  <property>
    <name>dfs.block.access.token.enable</name>
    <value>true</value>
  </property>

  <!-- -->
  <!-- HA namespace -->
  <!-- -->
  <property>
    <name>dfs.nameservices</name>
    <value>nbs-aw1-test</value>
  </property>

  <!-- -->
  <!-- HA namenodes -->
  <!-- -->
  <property>
    <name>dfs.ha.namenodes.nbs-aw1-test</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nbs-aw1-test.nn1</name>
    <value>aw1hdnn001.tnbsound.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.nbs-aw1-test.nn1</name>
    <value>aw1hdnn001.tnbsound.com:50070</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nbs-aw1-test.nn2</name>
    <value>aw1hdnn002.tnbsound.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.nbs-aw1-test.nn2</name>
    <value>aw1hdnn002.tnbsound.com:50070</value>
  </property>

  <!-- -->
  <!-- FS image dir -->
  <!-- -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/var/lib/hadoop-hdfs/dfs/name</value>
  </property>

  <!-- -->
  <!-- QJM config -->
  <!-- -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://aw1hdnn001.tnbsound.com:8485;aw1hdnn002.tnbsound.com:8485;aw1hdrm001.tnbsound.com:8485/nbs-aw1-test</value>
  </property>
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/var/lib/hadoop-hdfs/dfs/journal</value>
  </property>

  <!-- -->
  <!-- JournalNode security -->
  <!-- -->
  <property>
    <name>dfs.journalnode.keytab.file</name>
    <value>/etc/krb5/hdfs.keytab</value>
  </property>
  <property>
    <name>dfs.journalnode.kerberos.principal</name>
    <value>hdfs/[email protected]</value>
  </property>
  <property>
    <name>dfs.journalnode.kerberos.internal.spnego.principal</name>
    <value>HTTP/[email protected]</value>
  </property>

  <!-- -->
  <!-- Namenode failover -->
  <!-- -->
  <property>
    <name>dfs.client.failover.proxy.provider.nbs-aw1-test</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence
shell(/bin/true)</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/var/lib/hadoop-hdfs/.ssh/id_rsa</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.connect-timeout</name>
    <value>3000</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>aw1zook001.tnbsound.com:2181,aw1zook002.tnbsound.com:2181,aw1zook003.tnbsound.com:2181</value>
  </property>

  <!-- -->
  <!-- NameNode security -->
  <!-- -->
  <property>
    <name>dfs.namenode.keytab.file</name>
    <value>/etc/krb5/hdfs.keytab</value>
  </property>
  <property>
    <name>dfs.namenode.kerberos.principal</name>
    <value>hdfs/[email protected]</value>
  </property>
  <property>
    <name>dfs.namenode.kerberos.internal.spnego.principal</name>
    <value>HTTP/[email protected]</value>
  </property>

  <!-- -->
  <!-- Datanode -->
  <!-- -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data01/hadoop-hdfs/dfs/data,/data02/hadoop-hdfs/dfs/data,/data03/hadoop-hdfs/dfs/data,/data04/hadoop-hdfs/dfs/data</value>
  </property>
  <property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>0</value>
  </property>
  <property>
    <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
    <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
  </property>
  <property>
    <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
    <value>107374182400</value>
  </property>
  <property>
    <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
    <value>0.75</value>
  </property>

  <!-- -->
  <!-- DataNode security -->
  <!-- -->
  <property>
    <name>dfs.datanode.data.dir.perm</name>
    <value>700</value>
  </property>
  <property>
    <name>dfs.datanode.keytab.file</name>
    <value>/etc/krb5/hdfs.keytab</value>
  </property>
  <property>
    <name>dfs.datanode.kerberos.principal</name>
    <value>hdfs/[email protected]</value>
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:1004</value>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:1006</value>
  </property>

  <!-- -->
  <!-- Misc -->
  <!-- -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions.superusergroup</name>
    <value>hadoop</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.hosts.exclude</name>
    <value>/etc/hadoop/conf/hosts.exclude</value>
    <final>true</final>
  </property>

  <!-- From O'Reilly Hadoop Operations:
       A general guideline for setting dfs.namenode.handler.count is to make it
       the natural logarithm of the number of cluster nodes times 20 (as a
       whole number).
       python -c 'import math ; print int(math.log(num_of_nodes) * 20)'
  -->
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>24</value>
  </property>

  <!-- -->
  <!-- Web security -->
  <!-- -->
  <property>
    <name>dfs.web.authentication.kerberos.keytab</name>
    <value>/etc/krb5/hdfs.keytab</value>
  </property>
  <property>
    <name>dfs.web.authentication.kerberos.principal</name>
    <value>HTTP/[email protected]</value>
  </property>
  <property>
    <name>dfs.http.policy</name>
    <value>HTTP_ONLY</value>
  </property>

</configuration>
