Azure HDInsight Premium

This blog post discusses HDInsight Premium, which is currently in preview. HDInsight Premium adds the ability to domain-join HDInsight clusters, and adds Apache Ranger, which can then be used to control access to databases and tables on HDInsight.

At the time of writing, the documentation for HDInsight is very poor and there are a number of limitations and issues with HDInsight Premium, most of which are not documented, so I hope this post will help others.


Overview


HDInsight Premium allows you to join clusters to Azure AD Domain Services (AAD DS) domains. This then allows you to use accounts from your on-premise domain in HDInsight (provided you are synchronising users/groups via AAD Connect and have enabled password hash synchronisation). Furthermore, you can then configure role-based access control for Hive using Apache Ranger.

At the time of writing, HDInsight Premium is in preview and has not reached general availability (GA), which means it is not backed by a full SLA. The Premium SKU is only available for "Hadoop" clusters, which do not come with Spark. However, HDInsight Premium with Spark clusters is available in private preview to a limited number of customers.

The domain-joining feature relies on Azure AD Domain Services (AAD DS), which provisions a Microsoft-managed read-only domain controller. Until recently it was only possible to deploy AAD DS to a classic VNET, which then required a VNET peering connection to the ARM VNET containing your HDInsight cluster (this requires that your VNETs are in the same region).


AD Connect and Password Synchronisation

In order to use accounts in your on-premise domain to authenticate with HDInsight you need two things:
  • Firstly, you must use Azure AD Connect to synchronise users and groups to Azure AD.
  • Secondly, you need to enable password synchronisation.

Since HDInsight Premium implements authentication using Kerberos, Azure AD Domain Services must hold the users' passwords. This in turn requires that we synchronise password hashes from the on-premise domain to our Azure AD directory.

It should be noted that:
  • Password synchronisation will apply to all users that are being synchronised to Azure AD.
  • Synchronisation traffic uses HTTPS.
  • When synchronizing passwords, the plain-text version of your password is not exposed to the password synchronization feature, to Azure AD, or any of the associated services. 
  • The original hash is not transmitted to Azure AD. Instead, the SHA256 hash of the original MD4 hash is transmitted. As a result, if the hash stored in Azure AD is obtained, it cannot be used in an on-premises pass-the-hash attack.

Accounts are synchronised from the on-premise Active Directory to Azure AD, and the AD objects are then synchronised to the Azure AD Domain Services instance. The synchronisation from Azure AD to Azure AD Domain Services is one-way. Your managed domain is largely read-only except for any custom OUs you create, so you cannot make changes to user attributes, user passwords, or group memberships within the managed domain. As a result, there is no reverse synchronisation of changes from your managed domain back to your Azure AD tenant.

  • On-premise to Azure AD synchronisation: this usually runs on an hourly basis unless you have a newer version of Azure AD Connect and have customised the synchronisation interval.
  • Azure AD to AAD DS: the documentation states this takes 20 minutes, but in my experience this usually takes closer to 1 hour.
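
When troubleshooting, it can help to turn the two hops above into a rough "when should this change have arrived?" estimate. The following is only a minimal sketch using the intervals quoted in this post; the constant and function names are my own and the figures are assumptions you should replace with what you observe in your own tenant.

```python
from datetime import datetime, timedelta

# Intervals quoted in this post; treat them as assumptions, not guarantees.
ONPREM_TO_AAD = timedelta(hours=1)                  # on-premise AD -> Azure AD
AAD_TO_AADDS_DOCUMENTED = timedelta(minutes=20)     # Azure AD -> AAD DS (per the docs)
AAD_TO_AADDS_OBSERVED = timedelta(hours=1)          # Azure AD -> AAD DS (my experience)


def latest_expected_arrival(changed_at: datetime, observed: bool = True) -> datetime:
    """Estimate when an on-premise change should be visible in AAD DS."""
    second_hop = AAD_TO_AADDS_OBSERVED if observed else AAD_TO_AADDS_DOCUMENTED
    return changed_at + ONPREM_TO_AAD + second_hop


# A group membership changed on-premise at 09:00 (hypothetical timestamp).
change = datetime(2017, 1, 1, 9, 0)
print(latest_expected_arrival(change))                  # using the observed lag
print(latest_expected_arrival(change, observed=False))  # using the documented lag
```

If a permission change has not taken effect before the later of the two times, I would wait rather than start debugging Ranger policies.
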

What if you don't want to synchronise the password hash (e.g. if your security department objects)? In this case you can use cloud-only users and AD groups instead.

Azure AD Domain Services

Create an Azure AD Domain Services (AAD DS) instance from the Azure portal. Once the AAD DS instance is created you will receive two IP addresses, which are the addresses of the domain controllers.

Note that it may take 10-20 minutes before the AAD DS IP addresses are available. 

VNET DNS


The ARM VNET that contains the HDInsight cluster and the VNET that contains the AAD DS instance both need to be reconfigured to use the two AAD DS IPs as DNS servers. This is required; otherwise cluster creation will fail.
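
If you manage your VNETs through ARM templates, the custom DNS servers are set under the virtual network's dhcpOptions property. Below is a small sketch that builds that fragment and sanity-checks the addresses first. The function name and the two IP addresses are hypothetical; substitute the addresses shown for your own AAD DS instance.

```python
import ipaddress
import json
from typing import Dict, List


def vnet_dns_settings(dns_servers: List[str]) -> Dict:
    """Build the ARM 'properties' fragment that sets custom DNS servers on a VNET."""
    if len(dns_servers) != 2:
        raise ValueError("expected the two AAD DS IP addresses")
    for ip in dns_servers:
        ipaddress.ip_address(ip)  # raises ValueError if this is not a valid IP
    return {"properties": {"dhcpOptions": {"dnsServers": list(dns_servers)}}}


# Hypothetical AAD DS addresses, for illustration only.
fragment = vnet_dns_settings(["10.0.0.4", "10.0.0.5"])
print(json.dumps(fragment, indent=2))
```

The same fragment would need to be applied to both VNETs mentioned above.
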

When you create your Azure AD DS instance the actual domain used will match the domain that you have set as primary in Azure AD. If the primary domain is of the form: <MyAADTenant>.onmicrosoft.com - then this is the domain that will be used. As we will see later this has some implications in terms of LDAPS configuration.

Enabling SSL/TLS for AAD DS

HDInsight requires that you enable LDAPS for AAD DS. If you have a public domain configured as your primary in Azure AD then you can obtain a public certificate from a public CA such as Symantec or DigiCert. However, if your primary is the default Microsoft-provided domain <MyAADTenant>.onmicrosoft.com, then since you don't own onmicrosoft.com you will need to use a self-signed certificate and request an exception by raising a support case with Microsoft.

Next, upload the SSL certificate in PFX format, including the private key (you will also need the PFX password), via the Azure portal and enable Secure LDAP.
Ensure that "Allow secure LDAP access over the internet" is disabled (which is the default).

Management Server


You cannot RDP to the two IP addresses or otherwise log on directly to the domain controllers. So how do you manage AAD DS?

The answer is to create a management VM running Windows Server 2012 R2 within the VNET that contains the AAD DS instance, and then join the server to the domain using an account that is a member of the "AAD DC Administrators" AD group (created when the AAD DS instance is created).

Next, install the RSAT and DNS management tools.

OUs

Although the Microsoft documentation does not mention this, I recommend that you create an HDInsight OU and then an OU under it for each HDInsight cluster. This makes it easy to find the computer, account and SPN objects for each cluster.
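
To make the layout concrete, here is a short sketch that builds the distinguished name (DN) of a per-cluster OU under a parent HDInsight OU. The function names, the cluster name and the domain are all made up for illustration, and the sketch assumes the names contain no characters that need escaping in a DN.

```python
def domain_to_dn(domain: str) -> str:
    """Convert a DNS domain name into its DC=... distinguished name form."""
    return ",".join("DC=" + label for label in domain.split("."))


def cluster_ou_dn(cluster: str, domain: str, parent_ou: str = "HDInsight") -> str:
    """Return the DN of the OU for one cluster, nested under the parent OU."""
    return "OU={0},OU={1},{2}".format(cluster, parent_ou, domain_to_dn(domain))


# Hypothetical cluster and domain names.
print(cluster_ou_dn("dev-cluster", "contoso.onmicrosoft.com"))
```

With one OU per cluster, everything belonging to a cluster sits under a single DN, which also makes delegation (covered next) straightforward.
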

Cluster Domain Join Account

When creating an HDInsight Premium cluster, you must specify a "domain account" which is used by the cluster to join the nodes to the AAD DS instance. The account will require the following permissions:
  • Permissions to join machines to the domain
  • Permissions to place the machines into the OU created for HDInsight clusters
  • Permissions to create service principals within the OU
  • Permissions to create reverse DNS entries

The Microsoft documentation appears to give an example of using an account that is a member of "AAD DC Administrators".

However, given that the account used to domain join the cluster also becomes the cluster admin (e.g. in Ambari), I would strongly advise against doing this, as such an account would have full control over the AAD DS instance. Furthermore, if you have multiple clusters (e.g. dev, test and production, or one per business group) then they would all have admin access to AAD DS.

Therefore a separate account should be used for each cluster since this prevents a compromise of one cluster being used to gain access to another. Using a separate account enables administration of clusters to be delegated to different teams.

The permissions can then be granted as follows:
  • Right-click the OU and select "Delegate Control"
  • Click Next
  • Click Add
  • Select the account to be used for domain joining and click OK
  • Click Next, select "Delegate the following common tasks", and select "Create, delete, and manage user accounts"
  • Click Next then click Finish
  • From ADUC click View > Advanced Features
  • Right-click the OU and click Properties
  • Click the Security tab
  • Grant the domain join account the following permissions
    • Read
    • Write
    • Create all child objects
    • Delete all child objects

The username (sAMAccountName) must be 15 characters or fewer and all lowercase, otherwise cluster provisioning using this account will fail. This is not documented by Microsoft; I had to find it out the hard way by digging through log files and looking at how Microsoft had implemented domain-joined clusters. Microsoft are doing this using winbind/Samba, which is where this limitation comes from (that, combined with compatibility with Win2K). It's not clear to me why Microsoft are not using SSSD and realmd instead.
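
Since a failed provisioning run wastes a lot of time, it is worth checking the account name before you start. This is a minimal sketch of such a check based purely on the two rules I found above (length and case); the function name and example account names are hypothetical, and there may be further restrictions I have not hit.

```python
from typing import List

MAX_LENGTH = 15  # limit observed in this post, not an officially documented value


def check_join_account_name(name: str) -> List[str]:
    """Return a list of problems with a domain join account name (empty if none)."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("name is {0} characters, maximum is {1}".format(len(name), MAX_LENGTH))
    if name != name.lower():
        problems.append("name must be all lowercase")
    return problems


# Hypothetical account names.
for candidate in ("hdi-dev-join", "HDInsightDevJoinAccount"):
    print(candidate, check_join_account_name(candidate))
```
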

DNS

A forward DNS zone is created automatically when Azure AD Domain Services is provisioned, but reverse zones are not. HDInsight Premium relies upon Kerberos for authentication, which requires reverse DNS entries for the nodes in the cluster. As a result, we must configure (via the management server) reverse DNS zones for all the subnets that will contain HDInsight Premium clusters and enable secure updates.

The reverse DNS zones need to be configured based on the /8, /16 or /24 boundaries (classless ranges are not supported directly).
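
If you are not a DNS admin, working out which reverse zones a subnet needs can be confusing. The sketch below derives the in-addr.arpa zone names for an IPv4 subnet by rounding to the nearest octet boundary. It is my own illustration rather than anything Microsoft provides: subnets that do not sit on a boundary are covered by several zones (or by the enclosing /24), the function name is hypothetical, and the example subnets are made up.

```python
import ipaddress
from typing import List


def reverse_zone_names(cidr: str) -> List[str]:
    """Return the in-addr.arpa zone names needed to cover an IPv4 subnet."""
    net = ipaddress.ip_network(cidr, strict=True)
    if net.version != 4:
        raise ValueError("this sketch only handles IPv4 subnets")

    # Use the first of /8, /16, /24 that is at least as long as the subnet's
    # prefix; anything longer than /24 falls back to its enclosing /24.
    boundary = next((b for b in (8, 16, 24) if net.prefixlen <= b), 24)

    if net.prefixlen > boundary:
        blocks = [net.supernet(new_prefix=boundary)]
    else:
        blocks = list(net.subnets(new_prefix=boundary))

    zones = []
    for block in blocks:
        # Keep only the octets that identify the block, then reverse them.
        octets = str(block.network_address).split(".")[: boundary // 8]
        zones.append(".".join(reversed(octets)) + ".in-addr.arpa")
    return zones


# Hypothetical cluster subnets.
print(reverse_zone_names("10.0.1.0/24"))
print(reverse_zone_names("10.1.0.0/23"))
print(reverse_zone_names("10.0.1.64/26"))
```

Note that for a subnet smaller than a /24 the zone covers the whole enclosing /24, so other hosts in that range will be able to register in it too.
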

You might also want to consider adding conditional forwarding for your on-premise domains if you have connectivity to them.


Issues and Limitations

I've summarised below the main issues and limitations that I have come across (this is based on testing with HDInsight Premium Spark clusters):

  • HDInsight Premium is in public preview, which means that it is not subject to any SLAs.
  • The synchronisation lag can be quite large - in theory this should be 1 hour 20 minutes from on-premise AD to AAD DS. However, in practice this is more like 2 hours. You need to keep this in mind when troubleshooting permission / access issues.
  • The documentation for HDInsight is pretty bare bones and contains mistakes.
    • For example, this article https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-domain-joined-configure-use-powershell#run-the-powershell-script links to a repo in GitHub that is supposed to do the AAD DS configuration for you. However, apart from a README.md file it is an empty repo;
    • It does not explain the permissions required to domain join a cluster in enough detail e.g. on the OU, the exact DNS permissions, how to create reverse DNS zones (unless you are a DNS admin you won't know this);
    • There are special requirements for the username of the domain join account but these are not documented anywhere.
  • If you delete a cluster it leaves behind the DNS entries (forward and reverse), computer accounts, as well as the user and service principal objects. This obviously clutters AAD DS but can also cause problems if you want to do CI/CD and the objects already exist.
  • The components that are available with HDInsight are also not well documented e.g.
    • Jupyter is currently not available - presumably because it's not trivial to integrate with Kerberos. You can use Zeppelin though.
    • The Microsoft-provided Hue script action will not work because it does not support Kerberos, and a significant amount of effort would be required to add that support. In light of this you would have to use Ambari Hive views.
    • Oozie is not available on the cluster either.
    • Applications are not supported - which means you cannot add edge nodes via an ARM template.
  • Other things that are not documented include
    • If you are using Azure Data Factory (ADF) then Hive activities do not work.
    • Spark activities with ADF do work, but you have to disable CSRF protection in the livy.conf configuration file (you can do this via Ambari), which isn't a good idea from a security standpoint.
  • Ranger policies are only provided for Hive/Spark - they do not cover HDFS. I believe this is because of the limitations with Azure Storage authorisation and authentication listed here https://hadoop.apache.org/docs/current3/hadoop-azure/index.html#Limitations

