- Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
- Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
- Hadoop runs applications using the MapReduce programming model, in which the data is processed in parallel across the nodes of the cluster (a minimal word-count sketch follows below).
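To make the model concrete, here is a minimal sketch of the classic word-count example using Hadoop's `org.apache.hadoop.mapreduce` API. The class names are illustrative placeholders, not part of the original text:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs in parallel, one task per input split; emits (word, 1) pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: receives all counts emitted for one word and sums them.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```

The framework handles splitting the input, scheduling map tasks near the data, and grouping all values for a key before the reduce phase runs.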
Users of Hadoop:
- Hadoop runs in production at some of the Internet's largest sites:
- Amazon Web Services: Elastic MapReduce
- AOL: Variety of uses, e.g., behavioral analysis & targeting
- eBay: Search optimization (532-node cluster)
- Facebook: Reporting/analytics, machine learning (1,100 machines)
- LinkedIn: People You May Know (2x50 machines)
- Twitter: Store + process tweets, log files, other data
- Yahoo: >36,000 nodes; the biggest cluster is 4,000 nodes
Hadoop Architecture
- Hadoop has a Master-Slave Architecture for both Storage & Processing
- Hadoop framework includes the following four modules:
- Hadoop Common: The Java libraries and utilities that provide file-system and OS-level abstractions and contain the scripts required to start Hadoop.
- Hadoop YARN: This is a framework for job scheduling and cluster resource management.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets (a minimal job driver is sketched after this list).
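To show how these modules fit together, here is a hedged sketch of a job driver: Hadoop Common supplies the configuration classes, the input and output paths live in HDFS, MapReduce defines the job, and YARN schedules it on the cluster when the job is submitted. The mapper and reducer classes refer to the word-count sketch above; the paths are placeholder assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // Hadoop Common configuration
        Job job = Job.getInstance(conf, "word count");     // MapReduce job definition
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);         // map phase (sketch above)
        job.setReducerClass(IntSumReducer.class);          // reduce phase (sketch above)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output reside in HDFS; these paths are placeholders.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        // Submitting hands the job to YARN for scheduling and execution.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```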
Hadoop Architecture
- The Hadoop core is divided into two fundamental layers:
- MapReduce engine
- Hadoop Distributed File System (HDFS)
- The MapReduce engine is the computation engine; it runs on top of HDFS, which acts as its data storage manager.
- HDFS: HDFS is a distributed file system inspired by the Google File System (GFS) that organizes files and stores their data across the machines of a distributed computing system.
- HDFS Architecture: HDFS has a master/slave architecture containing a single NameNode as the master and a number of DataNodes as workers (slaves); a short client-side sketch follows.
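For a concrete view of this architecture from a client's perspective (the NameNode resolves file metadata, while the DataNodes serve the actual blocks), here is a minimal sketch using the `org.apache.hadoop.fs.FileSystem` API. The NameNode URI and file path are placeholder assumptions:

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client contacts the NameNode for metadata only; block data is
        // streamed directly to/from the DataNodes. The URI is a placeholder.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/hello.txt");   // placeholder path

        // Write: the NameNode allocates blocks; DataNodes store the replicas.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: blocks are fetched from whichever DataNodes hold replicas.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[64];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }

        fs.close();
    }
}
```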