Hadoop Distributed File System
- To store a file in this architecture, HDFS splits it into fixed-size blocks (e.g., 64 MB) and stores them on workers (DataNodes).
- The mapping of blocks to DataNodes is determined by the NameNode.
- The NameNode (master) also manages the file system’s metadata and namespace.
- The namespace is the hierarchy of file and directory names, and the metadata is all the information a file system keeps in order to manage its files.
- Within this metadata, the NameNode records the location of every block (input split) on every DataNode.
- Each DataNode, usually one per node in a cluster, manages the storage attached to the node.
- Each DataNode is responsible for storing and retrieving its file blocks.
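The block-splitting step above can be sketched as follows. This is a conceptual illustration, not the real HDFS code; the 64 MB block size matches the classic default mentioned earlier.

```python
# Illustrative sketch: cutting a file into fixed-size HDFS-style blocks.
BLOCK_SIZE = 64 * 1024 * 1024  # the classic 64 MB default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs, one per block."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 200 MB file becomes three full 64 MB blocks and one 8 MB tail block.
print(len(split_into_blocks(200 * 1024 * 1024)))  # 4
```

Note that only the last block may be shorter than the block size; every other block is full.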
HDFS Features
Distributed file systems have special requirements:
- Performance
- Scalability
- Concurrency Control
- Fault Tolerance
- Security Requirements
HDFS Fault Tolerance
Block replication:
- To reliably store data in HDFS, file blocks are replicated in this system.
- HDFS stores a file as a set of blocks and each block is replicated and distributed across the whole cluster.
- The replication factor is set by the user and is three by default.
- Replica placement: The placement of replicas is another factor to fulfill the desired fault tolerance in HDFS.
- Replicas are stored on different nodes (DataNodes) located in different racks across the whole cluster:
- One replica on the same node where the original data is stored.
- One replica on a different node in the same rack.
- One replica on a different node in a different rack.
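The rack-aware placement above can be sketched with a toy cluster topology. The data structures and function below are hypothetical illustrations, not the NameNode's actual placement code.

```python
# Sketch of rack-aware replica placement for a replication factor of 3.
def place_replicas(writer_node, topology):
    """topology: rack name -> list of node names. Returns three nodes."""
    writer_rack = next(r for r, nodes in topology.items()
                       if writer_node in nodes)
    same_rack = [n for n in topology[writer_rack] if n != writer_node]
    other_rack = [n for r, nodes in topology.items()
                  if r != writer_rack for n in nodes]
    # Replica 1: writer's node; replica 2: same rack; replica 3: other rack.
    return [writer_node, same_rack[0], other_rack[0]]

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", topology))  # ['dn1', 'dn2', 'dn3']
```

Spreading the third replica to another rack is what lets the data survive the failure of an entire rack, not just a single node.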
- Heartbeats and Blockreports are periodic messages sent to the NameNode by each DataNode in a cluster.
- Receipt of a Heartbeat implies that the DataNode is functioning properly.
- Each Blockreport contains a list of all blocks on a DataNode.
- The NameNode collects these messages because it is the sole decision maker for all replicas in the system.
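A minimal sketch of how heartbeats translate into liveness decisions is shown below. The timeout value is illustrative; real HDFS derives its own heartbeat interval and expiry timeout from configuration.

```python
# Sketch: a NameNode considers a DataNode live only if its most recent
# heartbeat arrived within the timeout window.
def live_datanodes(last_heartbeat, now, timeout=30.0):
    """last_heartbeat: node name -> timestamp of its latest heartbeat."""
    return {node for node, t in last_heartbeat.items() if now - t <= timeout}

heartbeats = {"dn1": 100.0, "dn2": 95.0, "dn3": 40.0}
print(sorted(live_datanodes(heartbeats, now=110.0)))  # ['dn1', 'dn2']
```

When a DataNode drops out of the live set, the NameNode can schedule re-replication of the blocks it held, using the Blockreports to know which blocks those were.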
HDFS High Throughput
- Applications that run on HDFS typically have large data sets.
- Individual files are broken into large blocks to allow HDFS to decrease the amount of metadata storage required per file.
- The list of blocks per file will shrink as the size of individual blocks increases.
- By keeping large amounts of data sequentially within a block, HDFS provides fast streaming reads of data.
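The metadata savings claimed above are simple arithmetic, illustrated here with made-up block sizes:

```python
# Illustrative arithmetic: larger blocks mean fewer per-file block
# entries, hence less metadata for the NameNode to hold.
def num_blocks(file_size, block_size):
    return -(-file_size // block_size)  # ceiling division

GB, MB = 1024**3, 1024**2
print(num_blocks(1 * GB, 4 * MB))   # 256 block entries at 4 MB blocks
print(num_blocks(1 * GB, 64 * MB))  # 16 block entries at 64 MB blocks
```

Since the NameNode keeps all block metadata in memory, shrinking the per-file block list directly increases how many files the cluster can manage.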
HDFS Read Operation: Reading a File
- To read a file in HDFS, a user sends an “open” request to the NameNode to get the location of file blocks.
- For each file block, the NameNode returns the addresses of the set of DataNodes that hold a replica of that block.
- The number of addresses depends on the number of block replicas.
- The user calls the read function to connect to the closest DataNode containing the first block of the file.
- Then the first block is streamed from the respective DataNode to the user.
- The established connection is terminated and the same process is repeated for all blocks of the requested file until the whole file is streamed to the user.
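The read steps above can be sketched with toy client, NameNode, and DataNode objects. All class and method names here are hypothetical stand-ins, not the real HDFS client API.

```python
# Toy sketch of the HDFS read path: get block locations from the
# NameNode, then stream each block from one chosen replica.
class DataNode:
    def __init__(self, name, block_data):
        self.name, self.block_data = name, block_data

    def stream_block(self):
        # One connection per block: stream the data, then disconnect.
        return self.block_data

class NameNode:
    def __init__(self, block_map):
        self.block_map = block_map  # filename -> list of replica lists

    def get_block_locations(self, filename):
        return self.block_map[filename]

def read_file(namenode, filename, pick_replica=lambda reps: reps[0]):
    data = b""
    for replicas in namenode.get_block_locations(filename):
        # pick_replica stands in for "choose the closest DataNode".
        data += pick_replica(replicas).stream_block()
    return data

nn = NameNode({"f": [[DataNode("dn1", b"AB")], [DataNode("dn2", b"CD")]]})
print(read_file(nn, "f"))  # b'ABCD'
```

The key point the sketch preserves is that file data never flows through the NameNode: it only hands out locations, and the client streams blocks directly from DataNodes.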
HDFS Write Operation: Writing to a File
- To write a file in HDFS, a user sends a “create” request to the NameNode to create a new file in the file system namespace.
- If the file does not already exist, the NameNode notifies the user, who can then start writing data to the file by calling the write function.
- The first block of the file is written to an internal queue termed the data queue.
- A data streamer monitors this queue and writes each block to a DataNode.
- Each file block needs to be replicated by a predefined factor.
- The data streamer first sends a request to the NameNode to get a list of suitable DataNodes to store replicas of the first block.
- The streamer then stores the block in the first allocated DataNode.
- Afterward, the block is forwarded to the second DataNode by the first DataNode.
- The process continues until all allocated DataNodes receive a replica of the first block from the previous DataNode.
- Once this replication process is finalized, the same process starts for the second block.
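The forwarding chain above can be sketched as a toy replication pipeline. The plain dicts below stand in for DataNode block stores; this is an illustration of the forwarding idea, not the real write path.

```python
# Toy sketch of the HDFS replication pipeline: each DataNode stores the
# block locally, then forwards it to the next DataNode in the list that
# the NameNode allocated.
def pipeline_write(block_id, data, datanodes):
    if not datanodes:
        return  # last node in the pipeline: nothing left to forward to
    head, rest = datanodes[0], datanodes[1:]
    head[block_id] = data                  # store the replica locally
    pipeline_write(block_id, data, rest)   # forward to the next DataNode

dn1, dn2, dn3 = {}, {}, {}
pipeline_write("blk_1", b"payload", [dn1, dn2, dn3])
print(all("blk_1" in dn for dn in (dn1, dn2, dn3)))  # True
```

Pipelining this way means the client uploads each block only once; the DataNodes themselves carry the replication traffic down the chain.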