- What is the difference between on demand and reserved instances?
- What are the provisions provided by Amazon Virtual Private Cloud?
- What do you understand by MapReduce?
- Explain how MapReduce works.
- What is an input reader in reference to MapReduce?
- Explain combiners.
- Explain what you understand by speculative execution.
- When do reducers play their role in a MapReduce task?
- How is MapReduce related to cloud computing?
- How does fault tolerance work in MapReduce?
- In MapReduce, what is a scarce system resource? Explain.
- What are the various input and output types supported by MapReduce?
- Explain the general MapReduce algorithm:
- Write a short note on the disadvantages of MapReduce
What is the difference between on demand and reserved instances?
- On-demand instances let users pay for computing capacity by the hour, according to what they actually use, whereas reserved instances have users commit in advance to each instance they want to reserve.
- On-demand instances give users a working environment that needs little up-front planning, whereas reserved instances give users a discount on the hourly charge of an instance and an easy way to manage instances.
- With on-demand instances the provider maintains the hardware and the user's fixed costs become much smaller variable costs, whereas reserved instances offer an easy way to keep the overall bill balanced.
What are the provisions provided by Amazon Virtual Private Cloud?
Amazon Virtual Private Cloud (VPC) lets you create a private, isolated networking infrastructure in which to run Amazon Web Services resources.
- It lets you define a virtual network topology that resembles a traditional data center, so that everything can be controlled and managed from one place.
- It provides complete control over the IP address range, the creation of subnets, and the configuration of network gateways and route tables.
- Its network configuration is easy to customize; for example, you can create a public subnet for resources that need Internet access.
- It allows you to create multiple security layers and provides network access control lists with which you can control access to Amazon EC2 instances.
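The address-range and subnet points above can be illustrated without touching any AWS API. The sketch below uses only Python's standard `ipaddress` module; the range `10.0.0.0/16`, the `/24` subnet size, and the `public`/`private` labels are assumptions chosen for illustration.

```python
import ipaddress
from itertools import islice

# Hypothetical private address range (an assumption, not taken from the text).
vpc_range = ipaddress.ip_network("10.0.0.0/16")

# Carve the range into /24 subnets and keep the first two.
first_two = list(islice(vpc_range.subnets(new_prefix=24), 2))
layout = {"public": first_two[0], "private": first_two[1]}

for name, net in layout.items():
    print(name, net, "addresses:", net.num_addresses)
```

In a real VPC the same planning step would be followed by creating the subnets and attaching route tables and gateways to them.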
What do you understand by MapReduce?
MapReduce is a software framework created by Google. Its primary focus is distributed computing, specifically the processing of large data sets on a cluster of many computers. The framework takes its inspiration from the map and reduce functions of functional programming.
Explain how MapReduce works.
Processing can occur on data stored in a file system (unstructured) or in a database (structured).
The MapReduce framework works in two main steps:
1. Map step
2. Reduce step
Map step: The master node accepts an input (the problem) and splits it into smaller sub-problems. It then distributes these sub-problems to the worker nodes, which solve them.
Reduce step: Once a worker node has solved its sub-problem, it returns the solution to the master node, which collects the solutions from all worker nodes and combines them into a single solution for the input originally provided to the master node.
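The two steps can be sketched in a single process. This is a minimal word-count illustration, not a distributed implementation; the function names `split_input`, `map_task`, and `reduce_step` and the sample text are assumptions made for the example.

```python
from collections import defaultdict

def split_input(text, parts):
    """Master: split the input lines into `parts` smaller sub-problems."""
    lines = text.splitlines()
    return [lines[i::parts] for i in range(parts)]

def map_task(lines):
    """Worker: emit a (word, 1) pair for every word in its chunk."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def reduce_step(mapped_outputs):
    """Combine the workers' partial results into one solution."""
    totals = defaultdict(int)
    for pairs in mapped_outputs:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

text = "the quick fox\nthe lazy dog\nthe end"
chunks = split_input(text, parts=2)
result = reduce_step(map_task(chunk) for chunk in chunks)
print(result)
```

In a real cluster each `map_task` call would run on a different worker node; here they simply run one after another.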
What is an input reader in reference to MapReduce?
The input reader, as the name suggests, has two primary functions:
1. Reading the input
2. Splitting it into sub-parts
The input reader accepts the user-supplied input and splits it into parts, each of which is assigned to a map function. An input reader always reads data from a stable storage source, to avoid problems.
Define the purpose of the Partition function in MapReduce framework
In the MapReduce framework, each map function generates key/value pairs. The partition function accepts a key and the number of reducers and returns the index of the reducer that should receive that key. Generally the key is hashed and the hash is taken modulo the number of reducers.
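The hash-and-modulo scheme can be sketched as follows. The function name `partition` is hypothetical, and `zlib.crc32` is used here only as one stable hash; a real framework may use a different hash function.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Return the index of the reducer responsible for `key`."""
    # crc32 is used instead of the built-in hash() because it gives
    # the same value for the same string on every run.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["apple", "banana", "apple", "cherry"]
assignments = [(k, partition(k, 3)) for k in keys]
print(assignments)
```

Because the index depends only on the key, every occurrence of a key goes to the same reducer, whichever mapper emitted it.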
Explain combiners.
Combiners are used to increase the efficiency of a MapReduce job. They reduce the amount of data that has to be transferred across the network to the reducers by aggregating map output locally first. As a safe practice, a MapReduce job should never depend on combiners being executed, since their execution is not guaranteed.
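A small sketch shows both points: the combiner shrinks the data sent to the reducers, and the final result is the same whether or not it runs. The function names and the sample line are assumptions for illustration.

```python
from collections import Counter

def map_words(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(w, 1) for w in line.split()]

def combine(pairs):
    """Pre-aggregate pairs on the mapper's side before they cross the network."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def reduce_counts(pairs):
    """Sum the values for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

raw = map_words("a b a a b c")
combined = combine(raw)
print(len(raw), "pairs without a combiner,", len(combined), "with one")
print(reduce_counts(raw) == reduce_counts(combined))
```

This only works because addition is associative and commutative; an operation without those properties generally cannot be reused as a combiner unchanged.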
Explain what you understand by speculative execution.
MapReduce runs on a large number of computers, known as nodes, connected via a network. In a large cluster there is always a possibility that one machine will not perform as quickly as the others, which delays its task and therefore the whole job. Speculative execution avoids this by running multiple instances of the same map task on different machines and using the result of whichever instance finishes first.
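The idea can be simulated on one machine with threads standing in for nodes. In this sketch the node names and the delays are invented, and `run_map_attempt` is a hypothetical helper; a real framework would also cancel the slower attempt rather than let it finish.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_map_attempt(node, delay, data):
    """Simulate one attempt of the same map task on a given node."""
    time.sleep(delay)  # stand-in for how slow the node is
    return node, sum(data)

data = [1, 2, 3, 4]
with ThreadPoolExecutor(max_workers=2) as pool:
    attempts = [
        pool.submit(run_map_attempt, "slow-node", 0.5, data),
        pool.submit(run_map_attempt, "backup-node", 0.05, data),
    ]
    # Take the result of whichever attempt completes first.
    done, _ = wait(attempts, return_when=FIRST_COMPLETED)
    winner, result = next(iter(done)).result()

print(winner, result)
```

Both attempts compute the same answer, so it does not matter which one wins; only the completion time changes.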
When do reducers play their role in a MapReduce task?
The reduce function in a MapReduce job does not begin before all the map tasks are completed, because a reducer needs every value for a key before it can process that key. Reducers can, however, begin copying intermediate key-value pairs from the mappers as soon as individual mappers have their output ready. Overall, reducers start fetching data early but start reducing only once the map phase has finished.
How is MapReduce related to cloud computing?
The MapReduce framework contains most of the key architecture principles of cloud computing such as:
- Scale: The framework is able to expand itself in direct proportion to the number of machines available.
- Reliable: The framework is able to compensate for a lost node and restart the task on a different node.
- Affordable: A user can start small and over time can add more hardware.
Because of these features, the MapReduce framework has become a popular platform for developing cloud applications.
How does fault tolerance work in MapReduce?
In a MapReduce job, the master pings each worker periodically. If a worker does not respond, the master marks it as failed. Even completed map tasks on that worker are rescheduled, because their output was stored on the local disk of the failed worker. MapReduce is therefore able to handle large-scale failures simply by re-executing tasks. The master node saves its state at periodic checkpoints, and if it fails it can be restarted from the last checkpoint.
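The rescheduling rule can be sketched as a pure function. This is a simplified model with invented task and worker names; `reschedule` is hypothetical, it assumes at least one worker answered the ping, and it ignores in-progress state and reduce tasks.

```python
def reschedule(assignments, responded):
    """Move every map task held by a silent worker to a healthy one."""
    failed = set(assignments.values()) - set(responded)
    healthy = sorted(responded)  # assumption: at least one worker responded
    new_plan = {}
    for i, task in enumerate(sorted(assignments)):
        worker = assignments[task]
        # Tasks on failed workers are re-run even if they had completed,
        # because their output lived on the failed worker's local disk.
        new_plan[task] = healthy[i % len(healthy)] if worker in failed else worker
    return failed, new_plan

assignments = {"map-0": "w1", "map-1": "w2", "map-2": "w2"}
responded = {"w1", "w3"}  # w2 did not answer the ping
failed, plan = reschedule(assignments, responded)
print(failed, plan)
```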
In MapReduce, what is a scarce system resource? Explain.
A scarce resource is one that is available to the system only in limited quantity. In MapReduce, network bandwidth is a scarce resource. It is conserved by using the local disks and memory of the machines in the cluster to store data during tasks. The scheduler takes the location of the input files into account and aims to schedule each task on a machine that already holds its input.
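A locality-aware choice of worker might look like the sketch below. The replica map, the worker names, and the `pick_worker` helper are all assumptions for illustration; real schedulers also consider intermediate options such as a worker on the same network switch.

```python
def pick_worker(split_locations, idle_workers):
    """Prefer an idle worker that already stores the input split locally."""
    for worker in idle_workers:
        if worker in split_locations:
            return worker, "local"
    # No idle worker holds the data, so the input must cross the network.
    return idle_workers[0], "remote"

replicas = {"split-A": {"w1", "w2"}, "split-B": {"w3"}}
idle = ["w4", "w2"]
print(pick_worker(replicas["split-A"], idle))  # a local copy exists on w2
print(pick_worker(replicas["split-B"], idle))  # no idle worker has split-B
```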
What are the various input and output types supported by MapReduce?
The MapReduce framework provides the user with many different input and output types. In text mode, for example, each line is treated as a key/value pair: the key is the offset of the line from the beginning of the file and the value is the contents of the line. Which type to use is up to the user, who can also add functionality to support new input and output types.
Explain task granularity
In MapReduce, the map phase is subdivided into M pieces and the reduce phase into R pieces. Each worker is assigned a group of tasks, which improves dynamic load balancing and also speeds up recovery when a worker fails.
With the help of two examples, name the purpose of the map and reduce functions
Distributed grep: A line is emitted by the map function if it matches a pattern. The reduce function is an identity function that copies the supplied intermediate data to the output.
Term vector per host: The map function emits a (hostname, term vector) pair for every input document. The reduce function adds together all the term vectors generated for a host and discards any infrequent terms.
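The distributed grep example can be sketched in a few lines. The pattern, the sample lines, and the function names `grep_map` and `identity_reduce` are assumptions for illustration; everything runs in one process rather than across workers.

```python
import re

def grep_map(line, pattern):
    """Emit the line only if it matches the pattern."""
    return [line] if re.search(pattern, line) else []

def identity_reduce(intermediate):
    """Copy the supplied intermediate data to the output unchanged."""
    return list(intermediate)

lines = ["error: disk full", "ok", "error: timeout", "warning"]
matched = [out for line in lines for out in grep_map(line, r"^error")]
output = identity_reduce(matched)
print(output)
```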
Explain the general MapReduce algorithm:
The MapReduce algorithm has four main phases:
1. Map
2. Combine
3. Shuffle and sort
4. Output
Mappers execute on unsorted key/value pairs and create the intermediate keys. Once these are ready, the combiners bring together, on each mapper, the key/value pairs that share a key. The shuffle and sort phase is performed by the framework itself; its role is to group the data by key and transfer it. Once that is complete, processing proceeds to the output phase.
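The four phases can be chained in a short single-process sketch. The records, the choice of a per-city maximum as the operation, and all function names are assumptions for illustration; the shuffle here is an in-memory sort rather than a network transfer.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    """Turn an unsorted input record into a (key, value) pair."""
    city, temp = record.split(",")
    return [(city, int(temp))]

def combine_phase(pairs):
    """Keep only the local maximum per key on one mapper."""
    best = {}
    for key, value in pairs:
        best[key] = max(value, best.get(key, value))
    return list(best.items())

def shuffle_and_sort(mapper_outputs):
    """Merge all mapper outputs, sort by key, and group values per key."""
    merged = sorted((p for out in mapper_outputs for p in out), key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(merged, key=itemgetter(0))]

def output_phase(grouped):
    """Produce the final value for each key."""
    return {key: max(values) for key, values in grouped}

mapper_inputs = [["paris,21", "oslo,9", "paris,25"], ["oslo,12", "paris,19"]]
mapper_outputs = [
    combine_phase([p for rec in recs for p in map_phase(rec)])
    for recs in mapper_inputs
]
grouped = shuffle_and_sort(mapper_outputs)
result = output_phase(grouped)
print(grouped)
print(result)
```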
Write a short note on the disadvantages of MapReduce
Some of the shortcomings of MapReduce are:
- The one-input, two-phase data flow is rigid; that is, it does not allow for multi-step processing of records.
- Because it is based on a procedural programming model, the framework requires custom code even for simple operations.
- Because the map and reduce functions are opaque, they cannot easily be optimized.