Distributed Data-Parallel
Distributed data parallelism will:
- split the data across several nodes
- have each node work on its portion of the data independently, in parallel
- combine the results when done (optional)
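The three steps above can be sketched with plain Scala collections, simulating the nodes as local partitions worked on by concurrent Futures (the partition count and the summing job are illustrative assumptions, not Spark API):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DataParallelSketch extends App {
  val data     = (1 to 100).toVector   // the full dataset
  val numNodes = 4                     // assumed "cluster" size

  // 1. split the data across "nodes" (here: local partitions)
  val partitions = data.grouped(data.size / numNodes).toVector

  // 2. each node works on its own partition independently, in parallel
  val partialSums = Future.traverse(partitions)(p => Future(p.sum))

  // 3. combine the partial results when done
  val total = Await.result(partialSums, 10.seconds).sum
  println(total) // 5050
}
```

The combine step works here because summation is associative; in general, the per-node results must be combinable with an order-insensitive operation.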
Since several nodes work together, we need to worry about the network latency among the nodes.
Concerns
- Partial failure: a crash of a subset of the machines.
- Latency: certain operations cost more time than others due to network communication. This issue exists in all kinds of distributed systems.
Why Spark
Spark
- retains fault tolerance: Spark replays functional transformations, like those on Scala collections, over the original dataset.
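A minimal illustration of that replay idea with plain Scala collections (the "lost result" scenario and the transformation pipeline are assumptions for the sketch; Spark tracks this lineage of transformations automatically):

```scala
object LineageSketch extends App {
  // the original, immutable dataset is never modified
  val original = Vector(1, 2, 3, 4, 5)

  // a pipeline of pure functional transformations (the "lineage")
  val lineage: Vector[Int] => Vector[Int] =
    ds => ds.map(_ * 10).filter(_ > 20)

  val result = lineage(original)   // Vector(30, 40, 50)

  // if a computed result is lost in a crash, nothing was mutated,
  // so the same transformations can simply be replayed on the original
  val recomputed = lineage(original)
  assert(result == recomputed)
  println(recomputed) // Vector(30, 40, 50)
}
```

Immutability is what makes this safe: because the transformations are pure and the input never changes, recomputation is guaranteed to give the same answer.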
- reduces latency: Spark keeps all the data immutable and in-memory.
As a result, Spark can be up to 100 times faster than its predecessor, Hadoop.
Spark also offers a richer set of APIs; Hadoop is basically just MapReduce.
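To see the difference in expressiveness, compare a word count written with rich collection combinators (Spark's RDD API offers the same style) against the bare emit-pairs-then-reduce-per-key shape that MapReduce imposes. Both snippets below are illustrative plain-Scala sketches, not Spark or Hadoop code:

```scala
object ApiStyleSketch extends App {
  val words = Vector("spark", "hadoop", "spark", "scala", "spark")

  // rich combinator style: one readable pipeline
  val counts = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // bare MapReduce style: map to (key, 1) pairs, then reduce per key
  val mapped  = words.map(w => (w, 1))
  val reduced = mapped.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  // both compute the same counts; the combinator version says it directly
  assert(counts == reduced)
  println(counts("spark")) // 3
}
```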