Apache Hadoop Architecture

February 04, 2013

As mentioned in a previous post, I've recently started to get into Apache Hadoop and I am planning to sit the CCAH exam. As part of this I produced some one-page diagrams depicting how key components of Hadoop function, such as HDFS, the NameNode, the DataNode, etc.

The diagrams are based on my understanding and interpretation of Hadoop architecture, which in turn is based on information gleaned from the project docs, books, blog articles, code, etc. If you spot an error in a diagram, let me know and I will update it accordingly. At some point I may add some explanatory text alongside the diagrams, but the intention is that you should be able to understand them without lengthy descriptions. If you want a copy of the original Visio document, let me know. If you improve the diagrams or make corrections, please do let me know.

2013-03-04:

Updated logical component diagram

2013-02-04:

Updated NameNode-SecondaryNameNode diagram
Added diagram of JobTracker and TaskTracker interaction

2013-02-05:

Updated NN, DN and SNN diagrams to reflect CDH4-related changes to the directory structure for NN metadata, SNN and DN storage directories

Hadoop - A Logical View

The diagram below provides a logical view of 'Core Hadoop', Hadoop ecosystem projects and additional services that may form part of a Hadoop cluster. There are lots of other projects that I could have included but did not, such as Mahout, RHIPE, etc.

'Hadoop core' is regarded as consisting of the Hadoop Distributed File System (HDFS) and the MapReduce distributed computing framework.

Storage of Files in HDFS

As the title suggests, the diagram is an overview of how files are stored in the Hadoop Distributed File System (HDFS).

As shown in the diagram:

The Hadoop Distributed File System (HDFS) sits on top of the operating system file system.
Files in HDFS are split into blocks, which are stored as files on the OS file system of the DataNodes.
Metadata about files and directories (name, permissions, owner, size, the blocks that make up a file, and so on) is stored on the NameNode, both persistently on disk and in memory.
The NameNode maintains an in-memory mapping of blocks to DataNodes.
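
The block splitting and the block-to-DataNode mapping described above can be sketched in a few lines of Python. This is a toy model, not Hadoop code: the names `ToyNameNode`, `split_into_blocks` and `add_file` are hypothetical, the 128 MB block size and replication factor of 3 are assumed values (both are configurable in a real cluster), and the round-robin replica placement is a simplification of what HDFS actually does.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MB); configurable in a real cluster
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the remaining bytes
    return sizes


class ToyNameNode:
    """Hypothetical stand-in for the NameNode's in-memory metadata."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.files = {}            # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> names of DataNodes holding a replica
        self._block_ids = itertools.count(1)

    def add_file(self, path, file_size):
        """Record a file's blocks and assign each block to DataNodes (round-robin)."""
        replicas = min(REPLICATION, len(self.datanodes))
        block_ids = []
        for _ in split_into_blocks(file_size):
            number = next(self._block_ids)
            block_id = f"blk_{number}"
            # Simplified placement: consecutive DataNodes, starting at a rotating offset.
            self.block_locations[block_id] = [
                self.datanodes[(number + i) % len(self.datanodes)]
                for i in range(replicas)
            ]
            block_ids.append(block_id)
        self.files[path] = block_ids
        return block_ids


if __name__ == "__main__":
    namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
    for block in namenode.add_file("/data/log.txt", 300 * 1024 * 1024):
        print(block, namenode.block_locations[block])
```

Under these assumptions a 300 MB file occupies three blocks: two full 128 MB blocks and a final 44 MB block, each with three replicas.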

HDFS - Reading Files

The diagram illustrates the communication that takes place under the hood when a file is read from HDFS.
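
As a rough illustration of the read path, the sketch below models the two conversations involved: the client first asks the NameNode which blocks make up the file and where their replicas live, then fetches the block data directly from the DataNodes. All class and function names (`ToyNameNode`, `ToyDataNode`, `read_file`) are hypothetical and this is not the real HDFS client API; in particular, falling back to the next replica on a `KeyError` is only a stand-in for real failure handling.

```python
class ToyDataNode:
    """Hypothetical DataNode that keeps block contents in a dictionary."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


class ToyNameNode:
    """Hypothetical NameNode: holds metadata only, never file contents."""

    def __init__(self):
        self.files = {}            # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> list of ToyDataNode objects

    def get_block_locations(self, path):
        return [(block_id, self.block_locations[block_id])
                for block_id in self.files[path]]


def read_file(namenode, path):
    """Read a file: metadata from the NameNode, data straight from the DataNodes."""
    data = bytearray()
    for block_id, datanodes in namenode.get_block_locations(path):
        for datanode in datanodes:
            try:
                data += datanode.read_block(block_id)
                break  # this block is done, move on to the next one
            except KeyError:
                continue  # replica not available here, try the next DataNode
        else:
            raise IOError(f"no replica available for {block_id}")
    return bytes(data)


if __name__ == "__main__":
    dn1, dn2 = ToyDataNode("dn1"), ToyDataNode("dn2")
    dn1.blocks["blk_1"] = b"hello "
    dn2.blocks["blk_2"] = b"hdfs"
    namenode = ToyNameNode()
    namenode.files["/demo.txt"] = ["blk_1", "blk_2"]
    namenode.block_locations = {"blk_1": [dn1, dn2], "blk_2": [dn1, dn2]}
    print(read_file(namenode, "/demo.txt"))
```

The point of the split is that file contents never pass through the NameNode, which only answers the metadata lookup.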

HDFS - Writing Files
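
A minimal sketch of the write path, assuming the usual pipeline model: the client asks the NameNode to allocate a block and a list of DataNodes, sends the data to the first DataNode, and each DataNode stores the block and forwards it to the next one before acknowledgements travel back. The names `ToyNameNode`, `ToyDataNode`, `allocate_block` and `write_file` are hypothetical, the 4-byte block size exists only to keep the demo small, and error recovery is reduced to a single check.

```python
class ToyDataNode:
    """Hypothetical DataNode that stores a block and forwards it downstream."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> bytes

    def write_block(self, block_id, data, downstream):
        self.blocks[block_id] = data
        # Forward to the next DataNode in the pipeline, if there is one.
        acks = downstream[0].write_block(block_id, data, downstream[1:]) if downstream else []
        return [self.name] + acks  # names of every node that stored the block


class ToyNameNode:
    """Hypothetical NameNode that allocates blocks and chooses a pipeline."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = min(replication, len(self.datanodes))
        self.files = {}  # file path -> ordered list of block IDs
        self._counter = 0

    def allocate_block(self, path):
        self._counter += 1
        block_id = f"blk_{self._counter}"
        # Simplified placement: consecutive DataNodes from a rotating start point.
        pipeline = [self.datanodes[(self._counter + i) % len(self.datanodes)]
                    for i in range(self.replication)]
        self.files.setdefault(path, []).append(block_id)
        return block_id, pipeline


def write_file(namenode, path, data, block_size=4):
    """Split data into blocks and push each block through its DataNode pipeline."""
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        block_id, pipeline = namenode.allocate_block(path)
        acks = pipeline[0].write_block(block_id, chunk, pipeline[1:])
        if len(acks) != len(pipeline):
            raise IOError(f"{block_id} was not stored on every DataNode")
    return namenode.files.get(path, [])


if __name__ == "__main__":
    nodes = [ToyDataNode(f"dn{i}") for i in range(1, 4)]
    namenode = ToyNameNode(nodes)
    print(write_file(namenode, "/demo.txt", b"hello hdfs"))
    print({node.name: sorted(node.blocks) for node in nodes})
```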

As the diagram shows, write operations are considerably more complex than reads, although all of this complexity is abstracted away from the user or programmer writing MapReduce jobs.

NameNode Operation

NameNode - Secondary NameNode Operation

NameNode - DataNode Communications

Client - NameNode Communications

TaskTracker & JobTracker Communications

Comments

Anand, 18 June 2013 at 16:00
Hi Vijay,

Very nice, informative blog post. I see that you have released the images under a Creative Commons license, so instead of recreating them I would like to use them as-is in some of my presentations. Would it be possible for you to share them in vector format or at a higher resolution?

Thanks
Vijay Thakorlal's Blog