Introduction to Hadoop Framework, Architecture

  • Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models.
  • Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
  • Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes.
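The map-then-reduce flow described above can be sketched in plain Java. This is a local, single-process illustration of the classic word-count pattern, not real Hadoop code: an actual job would implement Hadoop's Mapper and Reducer classes, and the phases would run on different cluster nodes.

```java
import java.util.*;
import java.util.stream.*;

// Conceptual word-count sketch of the MapReduce pattern (local illustration only).
public class WordCountSketch {

    // Map phase: split each input line into (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // Shuffle + reduce phase: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.groupingBy(
                e -> e.getKey(),
                Collectors.summingInt(e -> e.getValue())));
    }

    public static void main(String[] args) {
        List<String> input = List.of("hadoop stores data", "hadoop processes data");
        Map<String, Integer> counts =
                reduce(input.stream().flatMap(WordCountSketch::map));
        System.out.println(counts); // each word mapped to its total count
    }
}
```

In real Hadoop the "shuffle" between the two phases moves intermediate pairs across the network so that all pairs with the same key reach the same reducer node.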

Users of Hadoop:
  • Hadoop is in production at some of the Internet's largest sites:
    • Amazon Web Services: Elastic MapReduce
    • AOL: Variety of uses, e.g., behavioral analysis & targeting
    • eBay: Search optimization (532-node cluster)
    • Facebook: Reporting/analytics, machine learning (1,100 machines)
    • LinkedIn: People You May Know (2x50 machines)
    • Twitter: Store + process tweets, log files, other data
    • Yahoo: >36,000 nodes; the biggest cluster is 4,000 nodes

Hadoop Architecture

  • Hadoop has a master-slave architecture for both storage and processing.
  • The Hadoop framework includes the following four modules:
  • Hadoop Common: These are Java libraries that provide file system and OS level abstractions and contain the necessary Java files and scripts required to start Hadoop.
  • Hadoop YARN: This is a framework for job scheduling and cluster resource management.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop MapReduce: This is a system for parallel processing of large data sets.

Hadoop Architecture

  • The Hadoop core is divided into two fundamental layers:
    • MapReduce engine
    • Hadoop Distributed File System (HDFS)
  • The MapReduce engine is the computation engine that runs on top of HDFS, which serves as its data storage manager.
  • HDFS: A distributed file system, inspired by the Google File System (GFS), that organizes files and stores their data on a distributed computing system.
  • HDFS Architecture: HDFS has a master/slave architecture containing a single NameNode as the master and a number of DataNodes as workers (slaves).
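The master/slave split can be made concrete with a small model in plain Java. This is a hypothetical sketch of the idea, not the real NameNode API: the master holds only metadata (which blocks make up a file, and which DataNodes hold replicas of each block), while the DataNodes store the actual bytes.

```java
import java.util.*;

// Conceptual model of the HDFS NameNode's metadata role (illustration only;
// class and method names here are invented for the sketch).
public class NameNodeSketch {
    // File path -> ordered list of block ids making up the file.
    private final Map<String, List<Integer>> fileToBlocks = new HashMap<>();
    // Block id -> DataNodes holding a replica of that block.
    private final Map<Integer, List<String>> blockToNodes = new HashMap<>();
    private int nextBlockId = 0;

    // Register a file split into numBlocks blocks, each replicated on the given nodes.
    void addFile(String path, int numBlocks, List<String> replicaNodes) {
        List<Integer> blocks = new ArrayList<>();
        for (int i = 0; i < numBlocks; i++) {
            int id = nextBlockId++;
            blocks.add(id);
            blockToNodes.put(id, replicaNodes);
        }
        fileToBlocks.put(path, blocks);
    }

    // A client asks the master which DataNodes hold the file's first block;
    // it then reads the bytes directly from those workers.
    List<String> locateFirstBlock(String path) {
        return blockToNodes.get(fileToBlocks.get(path).get(0));
    }

    public static void main(String[] args) {
        NameNodeSketch nn = new NameNodeSketch();
        nn.addFile("/logs/day1", 3, List.of("datanode-1", "datanode-2"));
        System.out.println(nn.locateFirstBlock("/logs/day1"));
    }
}
```

The key design point the sketch illustrates: clients contact the single master only for metadata lookups, and all bulk data traffic flows between clients and the DataNodes, which is what lets one NameNode coordinate thousands of workers.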