Hadoop Streaming, a versatile feature since Hadoop 0.14.1, empowers developers to write MapReduce programs in languages like Ruby, Perl, Python, C++, and more without being confined to Java. The utility leverages UNIX standard streams, allowing any program that reads from standard input (STDIN) and writes to standard output (STDOUT) to act as a mapper or reducer.
With Hadoop Streaming, non-Java developers find an accessible path to process vast amounts of data using familiar tools and languages, enhancing the Hadoop ecosystem's flexibility and inclusivity.
How Hadoop Streaming Works
Hadoop Streaming works by launching the mapper and reducer programs as separate processes and connecting them to the framework through Unix pipes: input records are fed to each program's standard input, and the key-value pairs it writes to standard output are handed to the next stage of the MapReduce pipeline.
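For instance, a minimal word-count mapper could be written as the Python sketch below. The file name mapper.py and the word-count example are assumptions chosen for illustration; the only contract Streaming imposes is reading lines from stdin and writing tab-separated key-value pairs to stdout.

```python
#!/usr/bin/env python3
# mapper.py -- minimal word-count mapper sketch (file name is an assumption).
# Hadoop Streaming feeds each line of the input split to this script on stdin
# and treats every line written to stdout as a tab-separated key/value pair.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```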
The six steps involved in the working of Hadoop Streaming are:
- Step 1: The input data is automatically split into chunks or blocks, typically 64 MB to 128 MB in size. Each chunk of data is processed by a separate mapper.
- Step 2: The mapper reads the input data from standard input (stdin) and generates intermediate key-value pairs based on the logic of the mapper function, which are written to standard output (stdout).
- Step 3: The intermediate key-value pairs are sorted and partitioned based on their keys, ensuring that all values with the same key are directed to the same reducer.
- Step 4: The sorted key-value pairs are passed to the reducers for further processing, with all pairs sharing a key arriving at the same reducer (a single reducer may handle many distinct keys).
- Step 5: The reducer function, implemented by the developer, performs the required computations or aggregations on the data and generates the final output, which is written to standard output (stdout); a matching reducer sketch follows this list.
- Step 6: The final output generated by the reducers is stored in the specified output location in HDFS.
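To make the reduce step concrete, here is a hedged sketch of a word-count reducer that pairs with the mapper above. Because Streaming delivers the sorted pairs as plain lines on stdin rather than grouped by key as in the Java API, the script detects key boundaries itself; reducer.py is again an assumed file name.

```python
#!/usr/bin/env python3
# reducer.py -- word-count reducer sketch (file name is an assumption).
# Input arrives on stdin already sorted by key, one "word\tcount" pair per line,
# so the running total can be flushed whenever the key changes.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, _, count = line.strip().partition("\t")
    try:
        count = int(count)
    except ValueError:
        continue  # skip malformed lines
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Flush the final key after the input is exhausted.
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job built from these two scripts is typically launched with the Streaming jar, roughly along the lines of `hadoop jar hadoop-streaming-*.jar -input /user/data/input -output /user/data/output -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`, where the jar location and the HDFS paths are placeholders that vary by installation.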
The distributed nature of Hadoop enables parallel execution of mappers and reducers across a cluster of machines, providing scalability and fault tolerance. The data processing is efficiently distributed across multiple nodes, allowing for faster processing of large-scale datasets.