Hadoop Distributed File System (HDFS) - Cloud Computing

Hadoop Distributed File System

  • To store a file in this architecture, HDFS splits the file into fixed-size blocks (e.g., 64 MB) and stores them on workers (DataNodes).
  • The mapping of blocks to DataNodes is determined by the NameNode.
  • The NameNode (master) also manages the file system’s metadata and namespace.
  • The namespace is the area that maintains the metadata, and metadata refers to all the information a file system stores for the overall management of its files.
  • In its metadata, the NameNode stores the location of all input splits/blocks across the DataNodes.
  • Each DataNode, usually one per node in a cluster, manages the storage attached to the node.
  • Each DataNode is responsible for storing and retrieving its file blocks.
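The splitting and mapping described above can be sketched in a few lines of Python. This is an illustrative simulation, not the real HDFS API: the function names, the round-robin placement, and the node names are all hypothetical (the actual NameNode uses a rack-aware policy, covered below).

```python
# Hypothetical sketch: split a file into fixed-size blocks and record a
# block-to-DataNode mapping, the way a NameNode's metadata would.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default block size

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the size of each block a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def assign_blocks(blocks: list[int], datanodes: list[str]) -> dict[int, str]:
    """Toy round-robin placement; the real NameNode is rack-aware."""
    return {i: datanodes[i % len(datanodes)] for i in range(len(blocks))}

blocks = split_into_blocks(200 * 1024 * 1024)          # a 200 MB file
mapping = assign_blocks(blocks, ["dn1", "dn2", "dn3"])
print(len(blocks))  # 4 blocks: three full 64 MB blocks plus an 8 MB remainder
```

A 200 MB file thus occupies four blocks, the last one only partially filled; HDFS does not pad partial blocks to the full block size.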

HDFS- Features

Distributed file systems must meet special requirements:
  • Performance
  • Scalability
  • Concurrency Control
  • Fault Tolerance
  • Security Requirements

HDFS Fault Tolerance - Block Replication

  • To reliably store data in HDFS, file blocks are replicated in this system.
  • HDFS stores a file as a set of blocks and each block is replicated and distributed across the whole cluster.
  • The replication factor is set by the user and is three by default.
  • Replica placement: the placement of replicas is another factor in fulfilling the desired fault tolerance in HDFS.
  • Replicas are stored on different nodes (DataNodes) located in different racks across the whole cluster:
  • one replica on the same node where the original data is stored,
  • one replica on a different node in the same rack,
  • and one replica on a different node in a different rack.
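The three-way placement rule above can be sketched as a small function. This is a hypothetical simplification of the policy as described in these notes (function, rack, and node names are made up; the production placement logic also weighs load and network distance):

```python
# Sketch of the replica-placement rule described above: first replica on
# the writer's own node, second on another node in the same rack, third
# on a node in a different rack. All names here are hypothetical.

def place_replicas(writer: str, topology: dict[str, list[str]]) -> list[str]:
    """topology maps rack name -> list of DataNode names in that rack."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]
    # Second replica: a different node in the writer's rack.
    same_rack_node = next(n for n in topology[local_rack] if n != writer)
    # Third replica: any node in some other rack.
    other_rack = next(r for r in topology if r != local_rack)
    return [writer, same_rack_node, topology[other_rack][0]]

topo = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", topo))  # ['dn1', 'dn2', 'dn3']
```

Spreading the third replica to another rack is what lets HDFS survive the loss of an entire rack (e.g., a failed rack switch), not just a single node.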
  • Heartbeats and Blockreports are periodic messages sent to the NameNode by each DataNode in a cluster.
  • Receipt of a Heartbeat implies that the DataNode is functioning properly.
  • Each Blockreport contains a list of all blocks on a DataNode.
  • The NameNode collects these messages because it is the sole decision maker for all replicas in the system.

HDFS High Throughput

  • Applications that run on HDFS typically have large data sets.
  • Individual files are broken into large blocks to allow HDFS to decrease the amount of metadata storage required per file.
  • The list of blocks per file will shrink as the size of individual blocks increases.
  • By keeping large amounts of data sequentially within a block, HDFS provides fast streaming reads of data.
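A quick back-of-the-envelope calculation shows why large blocks shrink the per-file metadata. The sizes below are chosen only for illustration:

```python
# Illustration: the NameNode keeps one metadata entry per block, so the
# block list per file shortens as the block size grows.
import math

def blocks_needed(file_size: int, block_size: int) -> int:
    """Number of blocks (metadata entries) a file of file_size bytes needs."""
    return math.ceil(file_size / block_size)

one_gb = 1024 ** 3
print(blocks_needed(one_gb, 64 * 1024 ** 2))  # 16 entries with 64 MB blocks
print(blocks_needed(one_gb, 4 * 1024))        # 262144 entries with 4 KB blocks
```

A 1 GB file costs the NameNode 16 metadata entries at a 64 MB block size, versus over a quarter-million entries at a typical local-disk 4 KB block size; since the NameNode holds this metadata in memory, the difference directly bounds how many files the cluster can manage.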

HDFS Read Operation - Reading a File

  • To read a file in HDFS, a user sends an “open” request to the NameNode to get the location of file blocks.
  • For each file block, the NameNode returns the addresses of the set of DataNodes holding replicas of that block.
  • The number of addresses depends on the number of block replicas.
  • The user calls the read function to connect to the closest DataNode containing the first block of the file.
  • Then the first block is streamed from the respective DataNode to the user.
  • The established connection is terminated and the same process is repeated for all blocks of the requested file until the whole file is streamed to the user.
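The read flow above can be sketched with in-memory stand-ins for the NameNode and DataNodes. Everything here (the dictionaries, node names, and block IDs) is hypothetical; a real client streams bytes over the network and picks replicas by network distance.

```python
# Toy read flow: "open" asks the NameNode-like map for block locations,
# then each block is fetched from one replica in turn.

# Stand-in NameNode metadata: filename -> ordered (block id, replicas) list.
namenode = {"file.txt": [("blk_1", ["dn1", "dn2"]),
                         ("blk_2", ["dn2", "dn3"])]}
# Stand-in DataNode storage: node -> {block id -> bytes}.
datanodes = {"dn1": {"blk_1": b"hello "},
             "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
             "dn3": {"blk_2": b"world"}}

def read_file(filename: str) -> bytes:
    data = b""
    for block_id, replicas in namenode[filename]:  # the "open" response
        closest = replicas[0]  # a real client picks the nearest replica
        # Stream the block, then drop the connection and move to the next.
        data += datanodes[closest][block_id]
    return data

print(read_file("file.txt"))  # b'hello world'
```

Note that the client contacts the NameNode only for locations; the actual data bytes flow directly from the DataNodes, which keeps the NameNode off the data path.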

HDFS Write Operation - Writing to a File

  • To write a file in HDFS, a user sends a “create” request to the NameNode to create a new file in the file system namespace.
  • If the file does not already exist, the NameNode notifies the user, who can then start writing data to the file by calling the write function.
  • The first block of the file is written to an internal queue termed the data queue.
  • A data streamer monitors its writing into a DataNode.
  • Each file block needs to be replicated by a predefined factor.
  • The data streamer first sends a request to the NameNode to get a list of suitable DataNodes to store replicas of the first block.
  • The streamer then stores the block in the first allocated DataNode.
  • Afterward, the block is forwarded to the second DataNode by the first DataNode.
  • The process continues until all allocated DataNodes receive a replica of the first block from the previous DataNode.
  • Once this replication process is finalized, the same process starts for the second block.
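The hop-by-hop pipeline described above can be sketched as follows. This is a simplified simulation with hypothetical names; a real pipeline streams packets and acknowledgments concurrently rather than copying whole blocks node by node.

```python
# Toy write pipeline: the data streamer hands a block to the first
# allocated DataNode, which forwards it to the second, and so on until
# every allocated node holds a replica.

def pipeline_write(block_id: str, data: bytes,
                   allocated: list[str],
                   storage: dict[str, dict[str, bytes]]) -> None:
    """Forward the block along the chain dn[0] -> dn[1] -> ... -> dn[-1]."""
    for dn in allocated:  # each node stores the block, then forwards it on
        storage.setdefault(dn, {})[block_id] = data

storage: dict[str, dict[str, bytes]] = {}
for block_id, data in [("blk_1", b"first"), ("blk_2", b"second")]:
    # The NameNode would supply the list of suitable DataNodes per block.
    pipeline_write(block_id, data, ["dn1", "dn2", "dn3"], storage)

print(sorted(storage))          # ['dn1', 'dn2', 'dn3']
print(storage["dn3"]["blk_2"])  # b'second'
```

Because blocks are forwarded DataNode-to-DataNode, the client uploads each block only once regardless of the replication factor; the cluster's internal bandwidth carries the extra copies.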