Hadoop Distributed File System
- To store a file in this architecture, HDFS splits it into fixed-size blocks (e.g., 64 MB) and stores them on workers (DataNodes).
- The mapping of blocks to DataNodes is determined by the NameNode.
- The NameNode (master) also manages the file system’s metadata and namespace.
- The namespace is the hierarchy of file and directory names, and the metadata is all the information a file system keeps in order to manage its files.
- Within this metadata, the NameNode records the location of every block (input split) on every DataNode.
- Each DataNode, usually one per node in a cluster, manages the storage attached to the node.
- Each DataNode is responsible for storing and retrieving its file blocks.
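The block-splitting step above can be sketched as follows. This is a conceptual illustration, not the real HDFS code; the 64 MB block size matches the classic default mentioned earlier.

```python
# Illustrative sketch: cutting a file into fixed-size HDFS-style blocks.
BLOCK_SIZE = 64 * 1024 * 1024  # the classic 64 MB default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs, one per block."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 200 MB file becomes three full 64 MB blocks and one 8 MB tail block.
print(len(split_into_blocks(200 * 1024 * 1024)))  # 4
```

Note that only the last block may be shorter than the block size; every other block is full.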
HDFS Features
Distributed file systems have special requirements:
- Performance
- Scalability
- Concurrency Control
- Fault Tolerance
- Security Requirements
HDFS Fault Tolerance
Block replication:
- To reliably store data in HDFS, file blocks are replicated in this system.
- HDFS stores a file as a set of blocks and each block is replicated and distributed across the whole cluster.
- The replication factor is set by the user and is three by default.
- Replica placement: The placement of replicas is another factor to fulfill the desired fault tolerance in HDFS.
- Replicas are stored on different nodes (DataNodes) located in different racks across the whole cluster:
- One replica on the same node where the original data is stored.
- One replica on a different node in the same rack.
- One replica on a different node in a different rack.
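The rack-aware placement above can be sketched with a toy cluster topology. The data structures and function below are hypothetical illustrations, not the NameNode's actual placement code.

```python
# Sketch of rack-aware replica placement for a replication factor of 3.
def place_replicas(writer_node, topology):
    """topology: rack name -> list of node names. Returns three nodes."""
    writer_rack = next(r for r, nodes in topology.items()
                       if writer_node in nodes)
    same_rack = [n for n in topology[writer_rack] if n != writer_node]
    other_rack = [n for r, nodes in topology.items()
                  if r != writer_rack for n in nodes]
    # Replica 1: writer's node; replica 2: same rack; replica 3: other rack.
    return [writer_node, same_rack[0], other_rack[0]]

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", topology))  # ['dn1', 'dn2', 'dn3']
```

Spreading the third replica to another rack is what lets the data survive the failure of an entire rack, not just a single node.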
- Heartbeats and Blockreports are periodic messages sent to the NameNode by each DataNode in a cluster.
- Receipt of a Heartbeat implies that the DataNode is functioning properly.
- Each Blockreport contains a list of all blocks on a DataNode.
- The NameNode collects these messages because it is the sole decision maker for all replicas in the system.
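A minimal sketch of how heartbeats translate into liveness decisions is shown below. The timeout value is illustrative; real HDFS derives its own heartbeat interval and expiry timeout from configuration.

```python
# Sketch: a NameNode considers a DataNode live only if its most recent
# heartbeat arrived within the timeout window.
def live_datanodes(last_heartbeat, now, timeout=30.0):
    """last_heartbeat: node name -> timestamp of its latest heartbeat."""
    return {node for node, t in last_heartbeat.items() if now - t <= timeout}

heartbeats = {"dn1": 100.0, "dn2": 95.0, "dn3": 40.0}
print(sorted(live_datanodes(heartbeats, now=110.0)))  # ['dn1', 'dn2']
```

When a DataNode drops out of the live set, the NameNode can schedule re-replication of the blocks it held, using the Blockreports to know which blocks those were.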
HDFS High Throughput
- Applications that run on HDFS typically have large data sets.
- Individual files are broken into large blocks to allow HDFS to decrease the amount of metadata storage required per file.
- The list of blocks per file will shrink as the size of individual blocks increases.
- By keeping large amounts of data sequentially within a block, HDFS provides fast streaming reads of data.
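The metadata savings claimed above are simple arithmetic, illustrated here with made-up block sizes:

```python
# Illustrative arithmetic: larger blocks mean fewer per-file block
# entries, hence less metadata for the NameNode to hold.
def num_blocks(file_size, block_size):
    return -(-file_size // block_size)  # ceiling division

GB, MB = 1024**3, 1024**2
print(num_blocks(1 * GB, 4 * MB))   # 256 block entries at 4 MB blocks
print(num_blocks(1 * GB, 64 * MB))  # 16 block entries at 64 MB blocks
```

Since the NameNode keeps all block metadata in memory, shrinking the per-file block list directly increases how many files the cluster can manage.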
HDFS Read Operation: Reading a File
- To read a file in HDFS, a user sends an “open” request to the NameNode to get the location of file blocks.
- For each file block, the NameNode returns the addresses of the set of DataNodes that hold a replica of that block.
- The number of addresses depends on the number of block replicas.
- The user calls the read function to connect to the closest DataNode containing the first block of the file.
- Then the first block is streamed from the respective DataNode to the user.
- The established connection is terminated and the same process is repeated for all blocks of the requested file until the whole file is streamed to the user.
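The read steps above can be sketched with toy client, NameNode, and DataNode objects. All class and method names here are hypothetical stand-ins, not the real HDFS client API.

```python
# Toy sketch of the HDFS read path: get block locations from the
# NameNode, then stream each block from one chosen replica.
class DataNode:
    def __init__(self, name, block_data):
        self.name, self.block_data = name, block_data

    def stream_block(self):
        # One connection per block: stream the data, then disconnect.
        return self.block_data

class NameNode:
    def __init__(self, block_map):
        self.block_map = block_map  # filename -> list of replica lists

    def get_block_locations(self, filename):
        return self.block_map[filename]

def read_file(namenode, filename, pick_replica=lambda reps: reps[0]):
    data = b""
    for replicas in namenode.get_block_locations(filename):
        # pick_replica stands in for "choose the closest DataNode".
        data += pick_replica(replicas).stream_block()
    return data

nn = NameNode({"f": [[DataNode("dn1", b"AB")], [DataNode("dn2", b"CD")]]})
print(read_file(nn, "f"))  # b'ABCD'
```

The key point the sketch preserves is that file data never flows through the NameNode: it only hands out locations, and the client streams blocks directly from DataNodes.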
HDFS Write Operation: Writing to a File
- To write a file in HDFS, a user sends a “create” request to the NameNode to create a new file in the file system namespace.
- If the file does not already exist, the NameNode notifies the user, who can then start writing data to the file by calling the write function.
- The first block of the file is written to an internal queue termed the data queue.
- A data streamer monitors this queue and writes each block to a DataNode.
- Each file block needs to be replicated by a predefined factor.
- The data streamer first sends a request to the NameNode to get a list of suitable DataNodes to store replicas of the first block.
- The streamer then stores the block in the first allocated DataNode.
- Afterward, the block is forwarded to the second DataNode by the first DataNode.
- The process continues until all allocated DataNodes receive a replica of the first block from the previous DataNode.
- Once this replication process is finalized, the same process starts for the second block.
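The forwarding chain above can be sketched as a toy replication pipeline. The plain dicts below stand in for DataNode block stores; this is an illustration of the forwarding idea, not the real write path.

```python
# Toy sketch of the HDFS replication pipeline: each DataNode stores the
# block locally, then forwards it to the next DataNode in the list that
# the NameNode allocated.
def pipeline_write(block_id, data, datanodes):
    if not datanodes:
        return  # last node in the pipeline: nothing left to forward to
    head, rest = datanodes[0], datanodes[1:]
    head[block_id] = data                  # store the replica locally
    pipeline_write(block_id, data, rest)   # forward to the next DataNode

dn1, dn2, dn3 = {}, {}, {}
pipeline_write("blk_1", b"payload", [dn1, dn2, dn3])
print(all("blk_1" in dn for dn in (dn1, dn2, dn3)))  # True
```

Pipelining this way means the client uploads each block only once; the DataNodes themselves carry the replication traffic down the chain.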