How to enable LZO compression on HDInsight

This blog post explains how to enable LZO compression on an HDInsight cluster.

ARM Template 


You will need to modify the ARM template. Under clusterDefinition, in the configurations section, make two changes:

  • Add a core-site section that lists the available codecs (io.compression.codecs) and sets the LZO codec class (io.compression.codec.lzo.class)
  • Add a mapred-site section that enables map output compression and sets its compression codec class


"properties": {
                "clusterVersion": "[parameters('clusterVersion')]",
                "osType": "Linux",
                "clusterDefinition": {
                    "kind": "spark",
                    "configurations": {
                        "gateway": {
                            "restAuthCredential.isEnabled": true,
                            "restAuthCredential.username": "[parameters('clusterLoginUserName')]",
                            "restAuthCredential.password": "[parameters('clusterLoginPassword')]"
                        },
                        "core-site": {
                            "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec",
                            "io.compression.codec.lzo.class": "com.hadoop.compression.lzo.LzoCodec"
                        },
                        "mapred-site": {
                            "mapreduce.map.output.compress": "true",
                            "mapreduce.map.output.compression.codec": "com.hadoop.compression.lzo.LzoCodec"
                        },
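
After the cluster is deployed, it can be worth confirming that the core-site setting took effect. The following is a minimal sketch, assuming you run it from an SSH session on a cluster node where the hdfs command is on the PATH. has_lzo_codec is a hypothetical helper written for this post, not part of Hadoop or HDInsight.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds when a comma-separated codec list
# contains one of the LZO codec classes used in the template above.
has_lzo_codec() {
  # Remove spaces so "a, b" and "a,b" are treated the same way.
  local list=",${1// /},"
  case "$list" in
    *,com.hadoop.compression.lzo.LzopCodec,*) return 0 ;;
    *,com.hadoop.compression.lzo.LzoCodec,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Only query the cluster configuration when the hdfs CLI is available.
if command -v hdfs >/dev/null 2>&1; then
  codecs="$(hdfs getconf -confKey io.compression.codecs)"
  if has_lzo_codec "$codecs"; then
    echo "LZO codec is registered in io.compression.codecs"
  else
    echo "LZO codec is missing from io.compression.codecs" >&2
  fi
fi
```

The check only looks at the codec list. It does not prove that the native LZO libraries are installed, which is covered in the next section.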


Install compression libraries on cluster nodes

You will also need to install the compression libraries on the cluster nodes.

apt install -y liblzo2-2 liblzo2-dev hadooplzo hadoop-lzo hadooplzo-native


Similarly, if you are using Snappy, you will need to install the Snappy compression libraries:

apt install -y libsnappy1 libsnappy-dev
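
The two install commands above can be wrapped in one script so the same steps run on every node. This is a minimal sketch, assuming the script runs as root on each node (for example through an HDInsight script action). The package names come from the commands above. The select_packages helper and the WITH_SNAPPY and APPLY variables are hypothetical names introduced for illustration. The sketch uses apt-get rather than apt because apt-get is the interface intended for scripts.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Package lists taken from the install commands in this post.
LZO_PACKAGES="liblzo2-2 liblzo2-dev hadooplzo hadoop-lzo hadooplzo-native"
SNAPPY_PACKAGES="libsnappy1 libsnappy-dev"

# Hypothetical helper: prints the packages to install.
# Pass "1" as the first argument to include the Snappy libraries.
select_packages() {
  if [ "${1:-0}" = "1" ]; then
    echo "$LZO_PACKAGES $SNAPPY_PACKAGES"
  else
    echo "$LZO_PACKAGES"
  fi
}

packages="$(select_packages "${WITH_SNAPPY:-0}")"

# Assumption: APPLY=1 performs the install; by default the script
# only prints the command so it can be reviewed first.
if [ "${APPLY:-0}" = "1" ]; then
  apt-get update
  # Word splitting is intentional here: one argument per package name.
  apt-get install -y $packages
else
  echo "Would run: apt-get install -y $packages"
fi
```

Running the script with no variables set prints the LZO install command without changing the node. Setting WITH_SNAPPY=1 adds the Snappy packages, and APPLY=1 performs the installation.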


