- Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
- Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
- Hadoop runs applications using the MapReduce programming model, in which the data is processed in parallel across the nodes of the cluster (a minimal word-count sketch follows below).
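To make the model concrete, here is a minimal sketch of the classic word-count example using Hadoop's `org.apache.hadoop.mapreduce` API. The class names are illustrative placeholders, not part of the original text:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs in parallel, one task per input split; emits (word, 1) pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: receives all counts emitted for one word and sums them.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```

The framework handles splitting the input, scheduling map tasks near the data, and grouping all values for a key before the reduce phase runs.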
Users of Hadoop:
- Hadoop runs in production at some of the Internet's largest sites:
- Amazon Web Services: Elastic MapReduce
- AOL: Variety of uses, e.g., behavioral analysis & targeting
- eBay: Search optimization (532-node cluster)
- Facebook: Reporting/analytics, machine learning (1,100 machines)
- LinkedIn: People You May Know (2x50 machines)
- Twitter: Store + process tweets, log files, other data
- Yahoo: >36,000 nodes; the biggest cluster is 4,000 nodes
Hadoop Architecture
- Hadoop has a Master-Slave Architecture for both Storage & Processing
- Hadoop framework includes the following four modules:
- Hadoop Common: The Java libraries and utilities that provide file-system and OS-level abstractions and contain the scripts required to start Hadoop.
- Hadoop YARN: This is a framework for job scheduling and cluster resource management.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets (a minimal job driver is sketched after this list).
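To show how these modules fit together, here is a hedged sketch of a job driver: Hadoop Common supplies the configuration classes, the input and output paths live in HDFS, MapReduce defines the job, and YARN schedules it on the cluster when the job is submitted. The mapper and reducer classes refer to the word-count sketch above; the paths are placeholder assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // Hadoop Common configuration
        Job job = Job.getInstance(conf, "word count");     // MapReduce job definition
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);         // map phase (sketch above)
        job.setReducerClass(IntSumReducer.class);          // reduce phase (sketch above)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output reside in HDFS; these paths are placeholders.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        // Submitting hands the job to YARN for scheduling and execution.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```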
Hadoop Architecture
- The Hadoop core is divided into two fundamental layers:
- MapReduce engine
- Hadoop Distributed File System (HDFS)
- The MapReduce engine is the computation engine; it runs on top of HDFS, which acts as its data storage manager.
- HDFS: HDFS is a distributed file system inspired by the Google File System (GFS) that organizes files and stores their data across the machines of a distributed computing system.
- HDFS Architecture: HDFS has a master/slave architecture containing a single NameNode as the master and a number of DataNodes as workers (slaves); a short client-side sketch follows.
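For a concrete view of this architecture from a client's perspective (the NameNode resolves file metadata, while the DataNodes serve the actual blocks), here is a minimal sketch using the `org.apache.hadoop.fs.FileSystem` API. The NameNode URI and file path are placeholder assumptions:

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client contacts the NameNode for metadata only; block data is
        // streamed directly to/from the DataNodes. The URI is a placeholder.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/hello.txt");   // placeholder path

        // Write: the NameNode allocates blocks; DataNodes store the replicas.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: blocks are fetched from whichever DataNodes hold replicas.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[64];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }

        fs.close();
    }
}
```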