Distributed Computing for Data Mining

How can we extract knowledge from large data sets?

Typically, data is stored on networks of commodity hardware (cheap, off-the-shelf machines) within data centers. A major challenge with this computing model is the failure of individual machines. A single server may stay up for about three years (~1,000 days), so with 1,000 servers you can expect roughly one failure per day, and with 1M servers you can expect about 1,000 failures per day.
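A quick back-of-the-envelope check of those failure rates (a sketch; it assumes failures are independent and a mean server lifetime of roughly 1,000 days):

MEAN_LIFETIME_DAYS = 3 * 365  # one server lasts ~3 years, i.e. ~1,000 days

def expected_failures_per_day(num_servers):
    # With independent failures, roughly num_servers / lifetime servers die per day.
    return num_servers / MEAN_LIFETIME_DAYS

print(expected_failures_per_day(1_000))      # ~1 failure per day
print(expected_failures_per_day(1_000_000))  # ~1,000 failures per day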

One approach is to replicate data across multiple machines so that it survives individual failures. However, transferring data over the network is expensive and time intensive, so a core idea of distributed computing is to move computation to the data rather than moving data to the computation. Hadoop and Spark address these problems.

Distributed file systems give you a global namespace across machines. Typical usage patterns include huge files (hundreds of GBs to TBs), no in-place updates (append-only logs), and large streaming reads. HDFS is optimized for these patterns.

MapReduce

MapReduce is a style of programming designed for:

- Easy parallelization
- Invisible management of hardware/software failures
- Easy management of very large datasets
- Very little required memory (since data is read from and written to disk)

There are several implementations of MapReduce, including Hadoop and Spark.

It is important that the distribution of keys output by the map function is roughly uniform. Skew in the keys leads to skew in the workload of the reducers assigned to those keys.
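One common mitigation is local pre-aggregation, often called a combiner: each mapper emits one partial sum per key instead of one pair per occurrence, so a heavily repeated key produces far less shuffle traffic and reducer work. The helper below is an illustrative sketch, not part of the notes above.

from collections import defaultdict

def combine_locally(pairs):
    # Pre-aggregate (key, value) pairs on the mapper before the shuffle,
    # so a heavily repeated key yields one partial sum per mapper
    # instead of one pair per occurrence.
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    yield from partial.items()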

Example: Word Count

You have a huge text document and you want to count the number of times each word appears (e.g., analyzing a log file).

Map: For each word in the document, output a key-value pair where the key is the word and the value is 1.

def map(doc):
    for word in doc.split():
        yield (word, 1)

Group by key: Sort and shuffle the output of the mappers so that all values for a given key are grouped together.

import itertools

def group_by_key(pairs):
    # Sorting brings all pairs with the same key next to each other,
    # so groupby can collect their values.
    pairs.sort()
    for key, group in itertools.groupby(pairs, key=lambda x: x[0]):
        yield (key, [x[1] for x in group])

Reduce: For each key and its associated list of values, sum the values.

def reduce(key, values):
    yield (key, sum(values))
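Putting the three phases together on a single machine (a toy driver for illustration only; a real MapReduce framework distributes the map and reduce calls across workers and performs the shuffle over the network):

def word_count(docs):
    # Map phase: apply the map() defined above (it shadows the builtin) to every document.
    pairs = [pair for doc in docs for pair in map(doc)]
    # Shuffle phase: group all values belonging to the same key.
    grouped = group_by_key(pairs)
    # Reduce phase: sum the counts for each word.
    for key, values in grouped:
        yield from reduce(key, values)

print(dict(word_count(["the quick brown fox", "the lazy dog"])))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}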

Spark

The two major limitations of MapReduce are:

- A rigid programming model (every computation must be expressed as map and reduce steps)
- A performance bottleneck due to disk I/O (intermediate results are written to disk between jobs)

Spark is a general-purpose cluster computing system that addresses these limitations. It is dataflow based: you define a series of transformations on data, and Spark figures out how to execute them in parallel. It is meant to be more expressive and efficient than MapReduce, and higher-level APIs such as DataFrames and SQL make it easier to work with data.
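For example, the word count from the MapReduce section collapses into a few chained transformations. This is a sketch assuming a local PySpark installation; input.txt is a hypothetical file.

from pyspark import SparkContext

sc = SparkContext("local[*]", "word-count")

counts = (sc.textFile("input.txt")                  # hypothetical input file
            .flatMap(lambda line: line.split())     # one word per record
            .map(lambda word: (word, 1))            # (word, 1) pairs, as in the map phase
            .reduceByKey(lambda a, b: a + b))       # sum counts, as in the reduce phase

print(counts.collect())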

Resilient Distributed Datasets (RDDs)

RDDs are the core data structure in Spark: immutable, distributed collections of objects. You can perform transformations on RDDs to create new RDDs, and Spark will optimize the execution of these transformations.

They are essentially a partitioned collection of records that can be cached in memory across machines. They are fault-tolerant: if a partition is lost, it can be recomputed from its lineage, i.e., by replaying the transformations that produced it from the original source.
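A minimal sketch of caching and fault tolerance, reusing the sc from the previous example (logs.txt is hypothetical): the filtered RDD is kept in memory, and if a cached partition is lost, Spark recomputes it from its lineage.

errors = sc.textFile("logs.txt").filter(lambda line: "ERROR" in line)
errors.cache()   # keep this RDD's partitions in memory across jobs

print(errors.count())                                 # first action: computes and caches
print(errors.filter(lambda l: "disk" in l).count())   # later actions reuse the cached partitions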

Task Scheduling

Spark supports general DAGs of tasks, where each task is a unit of work sent to a worker. The DAG scheduler breaks the computation into stages at shuffle boundaries; each stage is a set of tasks that can run in parallel. The task scheduler then schedules the tasks within each stage. Functions are pipelined together when possible, and tasks are scheduled in a cache-aware and partition-aware manner.
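As a rough illustration (same PySpark assumptions as above): narrow transformations such as map and filter are pipelined into a single stage that runs within each partition, while a wide transformation such as reduceByKey requires a shuffle and therefore starts a new stage.

rdd = sc.parallelize(range(1_000_000), numSlices=8)

result = (rdd.map(lambda x: (x % 10, x))          # stage 1: pipelined with the filter below
             .filter(lambda kv: kv[1] % 2 == 0)   # still stage 1: no data movement
             .reduceByKey(lambda a, b: a + b))    # shuffle boundary: stage 2

print(result.collect())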

Libraries

Spark ships with higher-level libraries built on the core engine, including Spark SQL, MLlib (machine learning), GraphX (graph processing), and Spark Streaming.

Spark vs. Hadoop + MapReduce

Spark is generally faster (it caches intermediate results in memory rather than writing them to disk between jobs) and more expressive (general dataflow rather than a fixed map-then-reduce structure), which is how it addresses the two MapReduce limitations listed above.