HDInsight Cluster Scaling


I recently had to come up with options for scaling an Azure HDInsight cluster out (adding worker nodes) or in (removing worker nodes) on a time-based schedule. At the time of writing HDInsight doesn't allow you to pause or stop a cluster - you have to delete the cluster if you do not want to incur costs. One way of potentially reducing costs is to use workload-specific clusters with Azure Blob Storage and Azure SQL Database for the metastore, then scale the cluster down outside of core hours when there is no processing to be done.

There are a number of ways to do this, including:

  • Using an Azure Resource Manager (ARM) template that has a parameter for the number of worker nodes, then simply running the template at a scheduled time with the appropriate number of worker nodes.
  • Using the Azure PowerShell module to scale the cluster with the Set-AzureRmHDInsightClusterSize cmdlet (as sketched below), and either running the script as a scheduled task from a VM or using Azure Automation.
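
For illustration, once authenticated against the subscription, a scale-out to three worker nodes is a one-liner with the ARM module (the resource group and cluster names here are the sample values used later in this post):

 # Scale the sample cluster out to three worker nodes
 Set-AzureRmHDInsightClusterSize -ResourceGroupName 'MY-RG-0001' -ClusterName 'vjt-hdi1' -TargetInstanceCount 3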


This post is going to show how to use a script that can be run either via Azure Automation or via a scheduled task running on a VM. The solution consists of:


  • An Azure Storage Account containing XML configuration files that include information on the cluster, the subscription within which it resides, the number of worker nodes when scaling out the cluster, the number of worker nodes when scaling in the cluster, and a list of email addresses of people to be notified when a scaling operation takes place
  • A PowerShell script that uses the Azure PowerShell module to scale the cluster out or in and send email notifications
  • Optionally an Azure Automation account from which to schedule the script to run


Azure Storage


Create a storage account (an LRS storage account is sufficient) and a private container. In the container, store an XML configuration file for each cluster, in the following format:


 <ClusterConfiguration>
     <SubscriptionName>MySubscriptionName</SubscriptionName>
     <ResourceGroupName>MY-RG-0001</ResourceGroupName>
     <ClusterName>vjt-hdi1</ClusterName>
     <MinWorkers>1</MinWorkers>
     <MaxWorkers>3</MaxWorkers>
     <Notify>hadoopsupport@acme.com,team@acme.com</Notify>
 </ClusterConfiguration>


 Where:

  • <SubscriptionName> is the subscription containing the cluster to be scaled out/in
  • <ResourceGroupName> is the resource group containing the HDInsight cluster to scale
  • <ClusterName> is the name of the cluster to scale
  • <MinWorkers> is the number of worker nodes in the cluster when performing a scale IN operation
  • <MaxWorkers> is the number of worker nodes in the cluster when performing a scale OUT operation
  • <Notify> is a comma separated list of email addresses to which notifications are to be sent


 NOTE: It is important that the XML configuration file is in UTF-8 format (without a byte order mark).

While one XML configuration file could conceivably contain the information for every cluster, there are a number of benefits to having one configuration file per cluster:

  • No need to write logic to parse the file and find the relevant part for the target cluster
  • I can keep each cluster configuration in version control and separately modify them
  • A mistake in the file is less likely to affect all clusters


PowerShell script


The HDInsight cluster scaling PowerShell script is available in my GitHub repository.

The script will need to be modified to suit your environment, but the key elements are described here. Since the script can be run either from Azure Automation or via a scheduled task on a Windows server, it supports sending emails from an internal mail server as well as from cloud email providers. The script assumes no authentication is required for the internal mail server (though adding authentication support would be trivial).


Email Providers


The script supports sending emails via SendGrid, Office365 or an internal / on-premises mail server. At present the latter is supported via hard-coded mail server details, but these could be provided via an XML configuration file or parameters to the script.
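
Since SendGrid and Office365 both expose SMTP endpoints, all three providers can be driven through Send-MailMessage. A minimal sketch, with placeholder addresses and server names (an internal, unauthenticated relay would omit the credential and SSL settings):

 $MailParams = @{
     From       = 'hdinsight-automation@acme.com'
     To         = 'hadoopsupport@acme.com'
     Subject    = 'HDInsight cluster scaling notification'
     Body       = 'Scaling operation completed.'
     SmtpServer = 'smtp.sendgrid.net'   # or smtp.office365.com / internal relay
     Port       = 587
     UseSsl     = $true
     Credential = $EmailCredential      # omit for an internal mail server
 }
 Send-MailMessage @MailParams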

Storing credentials on disk


There are other ways of doing this, but when the script runs via a scheduled task on a VM it expects the credentials for the SendGrid / Office365 email account to be stored on disk in encrypted form in an XML file. This is achieved using the Windows Data Protection API (DPAPI).

     1. First open a PowerShell console *as the user under which the scheduled task will run*. E.g.
         runas /user:SVC-PSAutomation powershell.exe
     2. Then read the password in
         $Password = Read-Host -assecurestring "Please enter your password"
         $Password | ConvertFrom-SecureString

     3. Paste the resulting encrypted string into the password tag of the XML file

      <credentials>
         <credential>
             <username>emailaccount@acme.onmicrosoft.com</username>
             <password>PasteEncryptedPasswordStringHere</password>
         </credential>
      </credentials>
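
At runtime the script can read the encrypted string back and rebuild a credential object. A minimal sketch, assuming the credentials file lives at a path of your choosing:

 # Rebuild a PSCredential from the DPAPI-encrypted password on disk
 [xml] $CredFile = Get-Content -Path 'D:\Scripts\credentials.xml'
 $UserName = $CredFile.credentials.credential.username
 $SecurePassword = $CredFile.credentials.credential.password | ConvertTo-SecureString
 $EmailCredential = New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $UserName, $SecurePassword

Because DPAPI ties the encryption to the user account (and machine), decryption only succeeds when the script runs as the same user on the same VM - which is why step 1 above matters.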


Key Elements of the script


Since the script can be executed via Azure Automation or via a scheduled task, the script needs to determine where it is running: if the $PSPrivateMetadata.JobId property exists then the script is running in Azure Automation. The following snippet therefore evaluates to true when the script is not running in Azure Automation:

 If( -not($PSPrivateMetadata.JobId) )
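
In context, the check might drive the two execution paths along these lines (the branch bodies are an illustrative sketch, not the script's exact code):

 If( -not($PSPrivateMetadata.JobId) )
 {
     # No Azure Automation job ID: running as a scheduled task, so load
     # credentials from the DPAPI-encrypted XML file on disk
 }
 Else
 {
     # Running inside Azure Automation: retrieve credentials from an
     # Automation credential asset instead (Get-AutomationPSCredential)
 }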


Next the script contains two try/catch blocks: one that authenticates using the "Classic" Azure PowerShell module, and the other using the ARM-based Azure PowerShell module.
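
A sketch of what those two blocks might look like; the cmdlet names are from the respective modules, and $AzureCredential is assumed to have been built earlier (from disk or from an Automation asset):

 Try
 {
     # "Classic" (Service Management) module - used for the storage cmdlets
     Add-AzureAccount -Credential $AzureCredential | Out-Null
     Select-AzureSubscription -SubscriptionName $SubscriptionName
 }
 Catch
 {
     Throw "Classic authentication failed: $($_.Exception.Message)"
 }
 Try
 {
     # ARM module - used for the HDInsight cmdlets
     Add-AzureRmAccount -Credential $AzureCredential | Out-Null
     Select-AzureRmSubscription -SubscriptionName $SubscriptionName | Out-Null
 }
 Catch
 {
     Throw "ARM authentication failed: $($_.Exception.Message)"
 }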

The Get-ClusterScalingConfigurationFile function uses the classic Azure PowerShell module cmdlets to retrieve the storage account key and create an Azure Storage account context, which is then used to download the blob/file from the storage account. Note it is important that the cluster configuration file is in UTF-8 format, otherwise parsing the contents of the file as XML will fail (especially if it's UTF-8 WITH BOM).
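
A minimal sketch of that retrieval, with placeholder account and container names (the classic Get-AzureStorageKey cmdlet supplies the key; the storage cmdlets supply the context and blob reference), after which the download itself looks like the snippet below:

 # Placeholder names - substitute your own storage account and container
 $StorageKey = (Get-AzureStorageKey -StorageAccountName 'mystorageaccount').Primary
 $Context = New-AzureStorageContext -StorageAccountName 'mystorageaccount' -StorageAccountKey $StorageKey
 $BlobReference = Get-AzureStorageBlob -Container 'cluster-configs' -Blob 'vjt-hdi1.xml' -Context $Context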

 # Download the blob into a byte array, decode it explicitly as UTF-8,
 # then parse the resulting string as XML
 [byte[]] $myByteArray = New-Object -TypeName byte[] -ArgumentList ($BlobReference.Length)
 $null = $BlobReference.ICloudBlob.DownloadToByteArray($myByteArray, 0)
 [xml] $BlobContent = [System.Text.Encoding]::UTF8.GetString($myByteArray)


The XML file should also not contain any extra new lines at the start or end of the XML block.

The Get-HDInsightQuota function uses the Get-AzureRmHDInsightProperties cmdlet to obtain the number of cores used and available for HDInsight; the script then uses this information when scaling out the cluster.
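
A hedged sketch of such a check; the QuotaCapability property names below follow the HDInsight capabilities API and may differ between versions of the AzureRM.HDInsight module:

 # Property names may vary between module versions
 $Capabilities = Get-AzureRmHDInsightProperties -Location 'North Europe'
 $CoresAvailable = ($Capabilities.QuotaCapability.MaxCoresAllowed - $Capabilities.QuotaCapability.CoresUsed)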

First we use the Get-AzureRmHDInsightCluster cmdlet to get the cluster's Azure Resource ID, then pass this to the Get-AzureRmResource cmdlet to obtain the VM sizes currently in use by the cluster - this information is not provided by Get-AzureRmHDInsightCluster itself.

 # Find the cluster, then read its compute profile (VM sizes and node counts)
 # from the underlying ARM resource
 $HDInsightClusterDetails = Get-AzureRmHDInsightCluster | Where-Object {
     $_.Name -eq $ClusterName
 }
 $ClusterSpec = Get-AzureRmResource -ResourceId $HDInsightClusterDetails.Id |
     Select-Object -ExpandProperty Properties |
     Select-Object -ExpandProperty computeProfile
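
From there the worker node role can be picked out of the compute profile; a sketch along these lines (the role and property names follow the ARM resource schema for HDInsight clusters):

 # The compute profile lists one role per node type (headnode, workernode, ...)
 $WorkerRole = $ClusterSpec.roles | Where-Object { $_.name -eq 'workernode' }
 $WorkerRole.targetInstanceCount      # current number of worker nodes
 $WorkerRole.hardwareProfile.vmSize   # e.g. Standard_D3_v2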
