Apache Hadoop Architecture

As mentioned in a previous post, I've recently started to get into Apache Hadoop and I am planning to sit the CCAH exam. As part of this I produced some one-page diagrams depicting how key components of Hadoop, such as HDFS, the NameNode and the DataNode, function.

The diagrams are based on my understanding and interpretation of Hadoop architecture, which in turn is based on information gleaned from the project docs, books, blog articles, code etc. If you spot an error in a diagram, let me know and I will update it accordingly. At some point I may add some explanatory text alongside the diagrams, but the intention is that you should be able to understand them without lengthy descriptions. If you want a copy of the original Visio document, let me know. If you improve or correct the diagrams, please let me know as well.


2013-03-04:
  • Updated logical component diagram


2013-02-04:
  • Updated NameNode-SecondaryNameNode diagram
  • Added diagram of JobTracker and TaskTracker interaction

2013-02-05:
  • Updated NN, DN and SNN diagrams to reflect CDH4 related changes to directory structure for NN metadata, SNN and DN storage directories.


Hadoop - A Logical View

The diagram below provides a logical view of 'Core Hadoop', Hadoop ecosystem projects and additional services that may form part of a Hadoop cluster. There are lots of other projects that I could have included but did not, such as Mahout and RHIPE.






Storage of Files in HDFS

As the title suggests the diagram is an overview of how files are stored in the Hadoop Distributed File System (HDFS).


As shown in the diagram:
  • The Hadoop Distributed File System (HDFS) sits on top of the operating system file system.
  • Files in HDFS are split into blocks, which are stored as files on the OS file system of the DataNode.
  • Metadata about files and directories (name, permissions, owner, size, the blocks that make up a file...) is stored on the NameNode, both persistently on disk and in memory.
  • The NameNode maintains an in-memory mapping of blocks to DataNodes.
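
The points above can be sketched as a toy model. This is not how HDFS is implemented: the names `split_into_blocks` and `ToyNameNode`, the round-robin replica placement and the tiny block size are all assumptions made for illustration. Real HDFS blocks are far larger and replica placement is rack-aware.

```python
from itertools import cycle


def split_into_blocks(data, block_size):
    """Split file contents into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNameNode:
    """Hypothetical stand-in for the NameNode's in-memory metadata."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = min(replication, len(self.datanodes))
        self.file_blocks = {}      # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> DataNodes holding a replica
        self._next_node = cycle(range(len(self.datanodes)))

    def add_file(self, path, data, block_size):
        """Record which blocks make up a file and where each replica lives."""
        block_ids = []
        for index, _block in enumerate(split_into_blocks(data, block_size)):
            block_id = f"{path}#blk_{index}"
            start = next(self._next_node)
            # Round-robin placement is an assumption; real HDFS is rack-aware.
            replicas = [self.datanodes[(start + r) % len(self.datanodes)]
                        for r in range(self.replication)]
            self.block_locations[block_id] = replicas
            block_ids.append(block_id)
        self.file_blocks[path] = block_ids


if __name__ == "__main__":
    namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
    namenode.add_file("/logs/a.txt", b"x" * 10, block_size=4)
    for block_id in namenode.file_blocks["/logs/a.txt"]:
        print(block_id, namenode.block_locations[block_id])
```

The two dictionaries mirror the last two bullets: one maps a file to its blocks, the other maps each block to the DataNodes that hold it.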

HDFS - Reading Files




The diagram illustrates the communication that takes place under the hood when a file is read from HDFS.
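
In outline, the client first asks the NameNode which blocks make up the file and where their replicas live, then fetches each block directly from a DataNode, so file data does not pass through the NameNode. A minimal sketch of that flow, with plain dictionaries standing in for the NameNode and DataNodes (all names and contents here are made up for illustration, and the sketch simply tries replicas in the order listed):

```python
# Hypothetical NameNode metadata: file path -> ordered (block ID, replica locations).
NAMENODE = {
    "/data/greeting.txt": [
        ("blk_1", ["dn1", "dn2"]),
        ("blk_2", ["dn2", "dn3"]),
    ],
}

# Hypothetical DataNode storage: DataNode name -> {block ID: block contents}.
DATANODES = {
    "dn1": {"blk_1": b"Hello, "},
    "dn2": {"blk_1": b"Hello, ", "blk_2": b"HDFS!"},
    "dn3": {"blk_2": b"HDFS!"},
}


def read_file(path, unavailable=()):
    """Look up block locations on the 'NameNode', then fetch each block
    from the first reachable 'DataNode' that holds a replica."""
    contents = b""
    for block_id, locations in NAMENODE[path]:      # step 1: metadata lookup
        for datanode in locations:                  # step 2: read the block data
            if datanode not in unavailable:
                contents += DATANODES[datanode][block_id]
                break
        else:
            # No replica of this block could be reached.
            raise IOError(f"no reachable replica for {block_id}")
    return contents


if __name__ == "__main__":
    print(read_file("/data/greeting.txt"))
    # Still readable with one DataNode down, because each block has another replica.
    print(read_file("/data/greeting.txt", unavailable={"dn1"}))
```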



HDFS - Writing Files




NameNode Operation



As the diagram shows, write operations are considerably more complex, although all of this complexity is abstracted away from the user or programmer writing MapReduce jobs.
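
Part of that extra complexity comes from replication: the client sends a block to the first DataNode in a pipeline, each DataNode forwards it to the next, and acknowledgements travel back up the chain. A toy sketch of the idea follows; the function name `write_block` and the dictionary used as storage are assumptions for illustration, and failure handling is left out entirely.

```python
def write_block(block, pipeline, storage):
    """Send one block down a pipeline of 'DataNodes'.

    Each node stores the block, forwards it to the rest of the pipeline,
    and acknowledges only once every downstream node has acknowledged.
    """
    if not pipeline:
        return True  # end of the pipeline: nothing left to wait for
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(block)  # store the replica locally
    return write_block(block, rest, storage)    # forward, then pass the ack upstream


if __name__ == "__main__":
    storage = {}
    acknowledged = write_block(b"block-0", ["dn1", "dn2", "dn3"], storage)
    print(acknowledged, sorted(storage))
```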

NameNode - Secondary NameNode Operation




NameNode - DataNode Communications




Client - NameNode Communications





TaskTracker & JobTracker Communications




Comments

  1. Hi Vijay,

    Very nice blog post (informative). I see that you have released the images under creative commons license so instead of recreating these images I would like to use them as is in some of my presentations. Would it be possible for you to share them in vector format or in higher resolution?

    Thanks

    Replies
    1. Hi Anand, they were created in Visio so I can export them to .svg or any other format supported by Visio (.svgz, PNG, TIFF etc.). Let me know if svg is okay and if so how you'd like me to send them to you.
      Vijay

    2. I'd love to get these in svg as well. My email is Mike (dot) Pluta (at) gmail (dot) com.

      Thank You!

