Distributed Data-Parallel

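Distributed data parallelism splits the data across several nodes, has each node process its own shard in parallel, and combines the results when done.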

Since several nodes work together, we need to worry about the network latency among the nodes.
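
The split, process, and combine pattern can be sketched with plain Scala collections. This is only an illustration under stated assumptions: each inner Vector stands in for one node's shard, and the names DataParallelSketch, shard, and distributedSum are hypothetical, not part of Spark.

```scala
object DataParallelSketch {
  // Split the data into roughly equal shards, one per hypothetical "node".
  def shard[A](data: Vector[A], nodes: Int): Vector[Vector[A]] = {
    val size = math.max(1, math.ceil(data.length.toDouble / nodes).toInt)
    data.grouped(size).toVector
  }

  // Each "node" reduces its own shard independently, then the partial
  // results are combined into the final answer.
  def distributedSum(data: Vector[Int], nodes: Int): Int = {
    val partials = shard(data, nodes).map(_.sum) // on a real cluster this step runs in parallel
    partials.sum                                 // on a real cluster this step crosses the network
  }
}
```

For example, DataParallelSketch.distributedSum((1 to 10).toVector, 3) sums three shards separately and then adds the three partial sums. On a real cluster the last step is where network latency shows up.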

Concerns

Partial failure

A subset of the machines involved in the computation may crash.

Latency

Certain operations take much longer than others because they require network communication.

This issue exists in every kind of distributed system.

Spark

Retains fault tolerance

Spark replays functional transformations, like those on Scala collections, over the original dataset to recompute any data that was lost.
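
A minimal sketch of the replay idea, using only the Scala standard library. It is not Spark's implementation. The names LineageSketch, Partition, lineage, and compute are hypothetical and chosen for illustration. The point it tries to show is that when the source data is immutable and the transformations are pure functions, a lost result can always be rebuilt by applying the recorded transformations again.

```scala
object LineageSketch {
  // A hypothetical partition that remembers its immutable source data and the
  // list of transformations applied to it, rather than only the final result.
  final case class Partition(source: Vector[Int], lineage: Vector[Int => Int]) {
    // Recording a transformation does not touch the source data.
    def map(f: Int => Int): Partition = copy(lineage = lineage :+ f)

    // Rebuild the result by replaying every recorded transformation, in order,
    // over the original data.
    def compute: Vector[Int] =
      source.map(x => lineage.foldLeft(x)((acc, f) => f(acc)))
  }
}
```

With val p = LineageSketch.Partition(Vector(1, 2, 3), Vector.empty).map(_ + 1).map(_ * 2), calling p.compute yields Vector(4, 6, 8). If that result were lost in a crash, calling p.compute again would reproduce it from the source.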

Reduces latency

Spark keeps all data immutable and in memory.

As a result, Spark can be up to 100 times faster than its predecessor, Hadoop.

Spark also offers a richer set of APIs, whereas Hadoop is basically just MapReduce.
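
To make the MapReduce model concrete, here is a rough word-count sketch written with Scala collections rather than with Hadoop or Spark. The name MapReduceSketch and the function wordCount are hypothetical. The comments mark which collection operation plays the role of each MapReduce phase, under the assumption that words are separated by whitespace.

```scala
object MapReduceSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      // "Map" phase: emit one record per word.
      .flatMap(line => line.split("\\s+").filter(_.nonEmpty).toSeq)
      // "Shuffle": group identical words together.
      .groupBy(identity)
      // "Reduce" phase: collapse each group into a count.
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

Calling MapReduceSketch.wordCount(Seq("a b a", "b c")) gives Map("a" -> 2, "b" -> 2, "c" -> 1). Spark's collection-like API exposes operations such as map, flatMap, and filter directly, so a pipeline like this one reads much like ordinary Scala code.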