HDFS Architecture

A Detailed Explanation

HDFS is a distributed file system that stores data across multiple computers in a cluster. Since it spreads the load across multiple machines, it is ideal for large-scale storage.

What is HDFS (Hadoop Distributed File System)?

HDFS Architecture

HDFS is composed of a master-slave architecture that includes various elements.

A master node that handles all blocks on DataNodes. Controls and monitors DataNodes instances, allows access to files, stores all block records on DataNodes, etc.

1. NameNode

A secondary NameNode is activated to perform checkpoint when a NameNode runs out of disk space.

2. Secondary NameNode

Every slave machine that contains data organises a DataNode. DataNodes store data in ext3 or ext4 file formats.

3. DataNode

Generate checkpoints for namespace periodically. FSimage and editslogs are downloaded from active NameNode, merged locally to create a new image which is uploaded back to NameNode.

4. Checkpoint Node

Ensure high availability of data. When an active NameNode or DataNode fails, the backup node can be promoted to active and the active node switched over to the backup node.

5. Backup Node

HDFS splits the file into blocks of data. By default, blocks are 128 MB in size, but they can be changed to any size between 32 and 128 megabytes based on performance requirements.

6. Blocks

HDFS uses nameNode maintenance to maintain copies of blocks on multiple DataNodes. It keep track of under- and over-replicated blocks and add/delete copies accordingly.

Replication Management in HDFS Architecture

Learn about the features and benefits of HDFS architecture.