Apache Hadoop Course & Book Reviews

I have recently started to get into Apache Hadoop, an open-source framework for the storage and processing of large data sets. For a good introduction to Hadoop I'd encourage you to visit the Wikipedia page for Hadoop or the project website at Apache.org.

I attended the Cloudera Administrator Training for Apache Hadoop course in London (at the Learning Tree building in Euston) and I am planning on sitting the CCAH exam soon. Overall the course was very good. While you could source most of the content from the official documentation, books, blog articles, the Hadoop mailing lists etc., you would have to dig around for most of the information. The material is presented clearly in a logical and structured format and is combined with practical exercises where you get to build your own 'real' cluster (albeit using VMs on multiple physical machines, which is difficult to do at home unless you happen to have some PCs lying around). You also have the benefit of being able to ask questions relevant to your deployment needs, and the practical advice on planning and architectural considerations is very helpful. Some of the material in this area is covered in Eric Sammer's Hadoop Operations (a digital copy of the book is provided on the course as part of a promotion Cloudera are running) but the course provides some additional insights. The only real downside is that there isn't much detail on MapReduce2 (aka MR2 or "YARN"), which shares some similarities with classic MapReduce but has a decidedly different architecture. MR2 is still in alpha release and is not considered production ready, so it is likely to change. On that basis I can understand why it's only touched upon, and I expect this will change once MR2 becomes stable.

If you are looking to learn more about Hadoop and are considering buying a book, here are my views on some of the books available.

Hadoop Operations - The book is written by Eric Sammer, a solution architect at Cloudera. As the title suggests, the book covers design considerations (architecture, hardware requirements, OS selection, network design etc.), installation, configuration, cluster management, monitoring and troubleshooting. I would highly recommend this book if you want to know about deploying and managing Hadoop clusters.

Hadoop: The Definitive Guide - The author, Tom White, is also a Cloudera employee. The book has a broader focus than just the design and management of clusters. There is some overlap with Hadoop Operations in that it covers installation, configuration and monitoring, but Eric Sammer's book goes into more detail on those topics. The Definitive Guide in turn goes into more detail about writing MapReduce jobs, input formats and compression, and has chapters dedicated to other Hadoop ecosystem projects such as HBase, Pig, Hive, ZooKeeper and Sqoop that are not covered in Hadoop Operations. The book contains very good explanations of HDFS internals, read/write file operations and how MapReduce jobs are executed.

Hadoop in Practice - This book, written by Alex Holmes, is more of a cookbook of problems and solutions relating to running a Hadoop cluster. The book covers topics such as getting data into and out of Hadoop, how to work with different types of data (i.e. file formats), how to deal with small files efficiently (as Hadoop is optimised for large streaming reads/writes) and processing compressed files. It also covers using libraries and tools such as Mahout (for machine learning), R, Hive and Pig. From an administration perspective the book has an excellent chapter on diagnosing and tuning performance problems. This chapter provides a structured approach to troubleshooting performance problems and practical information on detecting problems that nicely complements (and in many ways goes beyond) that provided in Hadoop Operations.

There are other books such as Hadoop in Action and Pro Hadoop, as well as books dedicated to Hadoop ecosystem projects such as HBase and Pig, but I have not read these.

The commercial companies in the market, such as Cloudera, Hortonworks and MapR, all maintain their own documentation, which can provide more detailed practical information that supplements the Apache documentation. I have found the Cloudera and Hortonworks documentation to be the most useful - they have really good information on setting up HA, Federation and Kerberos, as well as hardware and design recommendations.




