Hadoop Architecture – Detailed Explanation

What is Hadoop?

Hadoop is an open-source software framework designed for storing and analyzing massive amounts of data. Created by Doug Cutting and Mike Cafarella and inspired by Google's papers on the Google File System and MapReduce, it is now an Apache project widely used by companies for data storage, processing, and analysis. Hadoop is especially valuable for data scientists, offering a scalable and efficient way to manage extensive datasets.

Companies benefit from its ability to store and analyze data more efficiently than traditional methods. Many large technology companies, such as Facebook, Twitter, Yahoo, and LinkedIn, use Hadoop in their organizations to handle big data.

For example, Facebook uses Hadoop to manage and analyze vast amounts of user data, allowing the platform to enhance user experience, personalize content recommendations, and optimize ad targeting based on comprehensive data analysis.

Now, let’s explore Hadoop architecture and its four major components – HDFS (Hadoop Distributed File System), MapReduce, YARN (Yet Another Resource Negotiator), and Hadoop Common.

Hadoop Architecture

Hadoop stands as a robust platform for storing and processing vast amounts of data. It serves as a key solution for storing and analyzing data from diverse sources, including databases, web servers, and file systems. 

Built on the MapReduce programming algorithm, Hadoop architecture comprises four key components, each playing a crucial role in managing and processing extensive datasets.

  • HDFS (Hadoop Distributed File System)
  • MapReduce
  • YARN (Yet Another Resource Negotiator)
  • Common Utilities or Hadoop Common

Let’s explore each component in detail:

1. HDFS

HDFS, the cornerstone of Hadoop, is a distributed file system designed for scalability, fault tolerance, and high availability. The architecture involves master and slave nodes, specifically the NameNode and DataNode.

i) NameNode and DataNode:

The NameNode serves as the master in a Hadoop cluster, overseeing the DataNodes (slaves). Its primary role is to manage the file system metadata: the directory namespace, the mapping of files to blocks, and an edit log that records every change. The NameNode instructs DataNodes on operations such as block creation, deletion, and replication.

DataNodes, acting as slaves, are responsible for storing data in the Hadoop cluster. It’s recommended to have DataNodes with high storage capacity to accommodate a large number of file blocks.
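
As a minimal sketch of how a client interacts with this master/slave layout, the Java snippet below uses the HDFS FileSystem API to write and then read a small file. The NameNode address and file path are illustrative assumptions; in a real deployment, fs.defaultFS is normally taken from core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; usually supplied by core-site.xml.
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/user/demo/sample.txt");  // illustrative path

            // Write: the client asks the NameNode for target DataNodes,
            // then streams the bytes to those DataNodes directly.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello hdfs");
            }

            // Read: the NameNode returns block locations; the data itself
            // is read from the DataNodes holding the replicas.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }

            fs.close();
        }
    }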

(Figure: High-Level Architecture of Hadoop)

ii) Block in HDFS

In Hadoop, data is stored in blocks with a default size of 128 MB, which can be raised (for example, to 256 MB) through configuration. Large blocks keep the number of blocks per file small, which makes storing and processing large datasets efficient.

Block size should also be chosen with metadata growth in mind: the NameNode tracks every block in memory, so a small block size applied to a huge dataset inflates the metadata the NameNode must hold. HDFS lets administrators tune the block size to keep this in check.
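
The arithmetic below is a hedged illustration of that trade-off: it sets a per-client block size through Hadoop's Configuration class (dfs.blocksize is the real property, but the 256 MB value and the 1 GB file size are made up) and computes how many blocks such a file would occupy.

    import org.apache.hadoop.conf.Configuration;

    public class BlockSizeExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Illustrative override: request 256 MB blocks. The cluster-wide
            // default normally lives in hdfs-site.xml (dfs.blocksize, 128 MB).
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

            long blockSize = conf.getLong("dfs.blocksize", 128L * 1024 * 1024);
            long fileSize = 1024L * 1024 * 1024;  // a hypothetical 1 GB file

            // Number of blocks the file would occupy (the last block may be smaller).
            long blocks = (fileSize + blockSize - 1) / blockSize;
            System.out.println(blocks + " blocks of " + blockSize / (1024 * 1024) + " MB");
        }
    }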

iii) Replication Management

Replication in HDFS ensures data availability and fault tolerance. By default, Hadoop sets a replication factor of 3, creating copies of each file block for backup purposes. The NameNode keeps track of each data node’s block report, adjusting replicas in case of under- or over-replication.

High availability therefore rests on this replication rather than on external backups alone: if a DataNode fails, the data is not lost, because the NameNode detects the missing replicas and re-creates them on healthy nodes.
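
As a hedged sketch, the snippet below uses FileSystem.setReplication to raise the replication factor of a single file. The path and the target factor of 4 are illustrative, and the file is assumed to already exist on the cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/user/demo/sample.txt");  // hypothetical existing file

            // Raise the replication factor for this file from the default (3) to 4.
            // The NameNode notices the file is under-replicated against its new
            // target and schedules DataNodes to create the extra replica.
            fs.setReplication(path, (short) 4);

            FileStatus status = fs.getFileStatus(path);
            System.out.println("Replication factor: " + status.getReplication());
            fs.close();
        }
    }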

iv) Rack Awareness

Rack Awareness in Hadoop involves the physical grouping of nodes in the cluster. This information is used by the NameNode to select the closest DataNode, reducing network traffic and optimizing read/write operations.

Replica placement is also rack-aware: by default, HDFS keeps one replica on the writer’s rack and places the others on a different rack, so data survives the loss of an entire rack. This distributed layout allows parallel processing across multiple DataNodes, enabling high-speed access to massive amounts of stored data.
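
Rack awareness is configured rather than coded by applications, but as a small hedged sketch, the snippet below shows the standard property (net.topology.script.file.name) that points the NameNode at a topology script. The script path is hypothetical; the script itself would map node addresses to rack IDs such as /dc1/rack1.

    import org.apache.hadoop.conf.Configuration;

    // Hedged sketch: registering a topology script so the NameNode can resolve
    // each DataNode's rack. The script path is hypothetical; the property name
    // is the standard one used by Hadoop 2.x/3.x.
    public class RackAwarenessConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh");
            System.out.println("Topology script: " + conf.get("net.topology.script.file.name"));
        }
    }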

In short, HDFS is the storage backbone of Hadoop: the NameNode and the DataNodes together provide a robust, scalable foundation for distributed data processing.

2. MapReduce

MapReduce is an essential component of the Hadoop architecture that is designed to process and analyze vast amounts of data in a distributed and parallel fashion. Unlike traditional serial processing, which proves inefficient with big data, MapReduce divides its tasks into two main phases: Map and Reduce.

In a MapReduce job, a client submits work to the framework, which splits it into smaller tasks and schedules them across the cluster. Input data flows through the Map() and Reduce() functions: the Map function breaks the data down into key-value pairs, and the Reduce function aggregates those pairs into the final output. Multiple clients can submit jobs for processing concurrently.

Here is a detailed explanation of the Map Task and the Reduce Task; a minimal WordCount example follows the two lists.

Map Task:

  • RecordReader: Breaks the input split into records and presents each record to the Map function as a key-value pair.
  • Map: A user-defined function that processes the key-value pairs supplied by the RecordReader.
  • Combiner: An optional component that pre-aggregates map output locally, acting as a mini reducer.
  • Partitioner: Takes the key-value pairs produced in the Map phase and assigns them to shards, one per reducer.

Reduce Task:

  • Shuffle and Sort: Transfers map output to the reducers and sorts it by key while it is being shuffled.
  • Reduce: Receives the grouped values for each key and aggregates them into the final result.
  • OutputFormat: Writes the resulting key-value pairs to files through a RecordWriter.
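
The canonical WordCount job shows these pieces working together. The sketch below follows the standard Hadoop example: the mapper emits (word, 1) pairs, the same reducer class doubles as an optional combiner, and the input and output paths come from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: the RecordReader hands each line to map() as a (offset, line)
        // pair; the mapper emits (word, 1) for every token it finds.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: after shuffle and sort, each word arrives with all of its
        // counts; the reducer sums them and writes (word, total).
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional local reducer
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }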

3. YARN (Yet Another Resource Negotiator)

YARN, or Yet Another Resource Negotiator, is a vital component of the Hadoop framework, overseeing resource management and job scheduling. It separates the two concerns: a global ResourceManager allocates cluster resources, while a per-application ApplicationMaster negotiates resources and tracks its own job. NodeManagers monitor container resource usage (CPU, memory, disk, and network) on each machine and report back to the ResourceManager so resources can be allocated efficiently.
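
As a hedged sketch of how a client expresses its resource needs to YARN, the snippet below sets a few standard MapReduce-on-YARN properties; the property names are real, but the memory and vcore values are illustrative and the job is only configured, not submitted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class YarnJobConfig {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Run the job on YARN rather than the local runner.
            conf.set("mapreduce.framework.name", "yarn");

            // Per-container resource requests (illustrative values); the
            // ResourceManager and NodeManagers use these figures to place
            // containers and enforce their limits.
            conf.set("mapreduce.map.memory.mb", "2048");
            conf.set("mapreduce.reduce.memory.mb", "4096");
            conf.set("mapreduce.map.cpu.vcores", "1");

            Job job = Job.getInstance(conf, "yarn resource demo");
            System.out.println("Framework: " + conf.get("mapreduce.framework.name"));
        }
    }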

YARN’s key features include:

  • Multi-Tenancy: Manages numerous tasks concurrently without confusion, like chefs in a kitchen preparing their dishes.
  • Scalability: Adapts seamlessly to Hadoop team expansion, akin to adding players to a soccer team for a larger field.
  • Cluster-Utilization: Ensures every team member remains productive, likened to assigning roles to teammates during a game.
  • Compatibility: Versatile collaboration with various tools, making it akin to being friends with everyone, regardless of their preferred sport.

4. Common Utilities or Hadoop Common

Hadoop Common is a crucial but often overlooked component of the Hadoop ecosystem. It serves as the foundational layer that provides the utilities and shared libraries the other Hadoop modules depend on, encapsulating common functionality and ensuring seamless integration and interoperability across the entire framework.

Hadoop Common includes essential libraries and utilities, such as the Java Archive (JAR) files, necessary for executing MapReduce tasks and managing the distributed file system. 
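
A concrete example of such a shared utility is the Configuration class that ships in hadoop-common and that every module uses to load cluster settings. The sketch below is illustrative: the extra resource path is hypothetical, and the printed values depend entirely on the deployment.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class CommonConfigDemo {
        public static void main(String[] args) {
            // Configuration (from hadoop-common) loads core-default.xml and
            // core-site.xml from the classpath and layers overrides on top.
            Configuration conf = new Configuration();
            conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));  // hypothetical path

            // HDFS, MapReduce, and YARN all read their settings this way.
            System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
            System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
        }
    }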

Key functions of Hadoop Common:

  • Communication and Networking: It establishes the communication infrastructure for nodes in the Hadoop cluster, thereby ensuring seamless data exchange, job coordination, and cluster health.
  • Security: Hadoop Common prioritizes security in distributed computing. It generally includes features like authentication and authorization to safeguard data and resources in the Hadoop Distributed File System (HDFS).
  • Configuration Management: With tools for managing diverse hardware configurations in Hadoop clusters, administrators can fine-tune settings for optimal performance, allowing Hadoop applications to adapt to different setups.
  • Logging and Monitoring: Hadoop Common also provides logging and monitoring features for diagnosing issues, optimizing performance, and ensuring overall cluster health. This information helps administrators identify and address potential problems.

Advantages of Hadoop Architecture

  • Scalability and Cost-Effective Storage: Hadoop efficiently stores and distributes massive datasets across numerous low-cost servers. It allows businesses to process thousands of terabytes of data on scalable platforms.
  • Cost Reduction and Data Preservation: Traditional databases incur high costs in handling large datasets. Hadoop prevents data loss by avoiding down-sampling and preserving the integrity of the entire dataset.
  • Diverse Data Access: Hadoop facilitates access to a variety of data types (structured and unstructured) from a single source. It enables businesses to derive insights from diverse data sources such as social media and email communications.
  • Efficient Data Processing: Hadoop’s distributed file system reduces storage costs by mapping data across the cluster. It accelerates data processing by hosting processing tools on the same servers as the data.
  • Fault Tolerance: It ensures data integrity through duplication on multiple nodes, minimizing the risk of data loss in case of node failure.

Disadvantages of Hadoop Architecture

  • Complex Security Management: Managing security, especially for complex applications, is challenging; Kerberos-based security is not enabled by default and is non-trivial to configure, which can leave data exposed if set up incorrectly.
  • Security Risks and Java Vulnerabilities: Hadoop’s heavy reliance on Java makes it a frequent target for exploits, and encryption at the storage and network levels is not turned on by default, increasing vulnerability.
  • Inefficiency with Small Files: Hadoop struggles to handle many small files efficiently, because every file, however small, adds metadata to the NameNode and is processed as a separate block. It is not well suited to platforms that deal primarily with small data.
  • Version Compatibility and Stability: Organizations must ensure they use the latest stable version of Hadoop. Alternatively, opting for third-party vendors can address issues related to stability and compatibility.

Conclusion

Hadoop is a software framework for data processing. It is designed to be scalable, flexible, and easy to use. It is used to store and process large amounts of data, such as data from databases, web services, and flat files. Hadoop is used in a variety of applications, including data warehousing, Big Data analytics, and cloud computing.

Hadoop has several components that work together to process data. The core processing component is the MapReduce framework, which divides a job into small tasks, processes them in parallel, and then combines the results. Hadoop also uses a distributed file system called HDFS to store data, allowing large amounts of data to be stored efficiently across multiple machines. Hadoop is designed to be flexible and easy to use, and it can run in a variety of environments, including the cloud, on-premises installations, and the data centre.
