dev notes


06 Aug 2014

Tag: hadoop

#Hadoop modules: * Hadoop Common: The common utilities that support the other Hadoop modules. * Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. * Hadoop YARN: A framework for job scheduling and cluster resource management. * Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Related projects: Cassandra™: A scalable multi-master database with no single points of failure. HBase™: A scalable, distributed database that supports structured data storage for large tables. Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying. Pig™: A high-level data-flow language and execution framework for parallel computation. Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.

##MapReduce A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The “MapReduce System” (also called “infrastructure” or “framework”) orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

“Map” step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

“Reduce” step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.


Map is the name of a higher-order function that applies a given function to each element of a list, returning a list of results. It is often called apply-to-all when considered in functional form. This is an example of functoriality.

For example, if we define a function square as follows: square x = x * x Then calling map square [1,2,3,4,5] will return [1,4,9,16,25], as map will go through the list and apply the function square to each element.

Language Map Map 2 lists Map n lists Note
Python map(func,list) map(func,list1,list2) map(func,list1,list2,…) Returns a list in Python 2 and an iterator in Python 3.
Scala (list1,list2) (list1,list2,list3) note: more than 3 not possible.


fold – also known variously as reduce, accumulate, aggregate, compress, or inject – refers to a family of higher-order functions that analyze a recursive data structure and recombine through use of a given combining operation the results of recursively processing its constituent parts, building up a return value. Typically, a fold is presented with a combining function, a top node of a data structure, and possibly some default values to be used under certain conditions. The fold then proceeds to combine elements of the data structure’s hierarchy, using the function in a systematic way

Language Left fold Right fold Left fold without initial value Right fold Right fold
Python 2.x reduce(func, list, initval) reduce(lambda x,y: func(y,x), reversed(list), initval) reduce(func, list) reduce(lambda x,y: func(y,x), reversed(list))
Python 3.x functools.reduce(func, list, initval) functools.reduce(lambda x,y: func(y,x), reversed(list), initval) functools.reduce(func, list) functools.reduce(lambda x,y: func(y,x), reversed(list))
Java 8+ stream.reduce(initval, func)   stream.reduce(func)  
Scala list.foldLeft(initval)(func)(initval /: list)(func) list.foldRight(initval)(func)(list :\ initval){func} list.reduceLeft(func) list.reduceRight(func)

High Order Function

Use case scenario

Hadoop Spark Hbase Redis Kylin, which is an OLAP Engine for Hadoop, has been developed by eBay.

comments powered by Disqus