Spark
Spark was designed as a successor to MapReduce, and it has several advantages over it:
- Its API is friendlier for developers.
- It keeps intermediate data in RAM instead of writing it to disk between steps.
- It supports multi-stage computation (a DAG of stages), rather than MapReduce's fixed map-then-reduce pipeline.
Spark's core abstraction is the RDD (Resilient Distributed Dataset).
RDDs support two kinds of operations:
- actions: count, saveAsTextFile, etc.
- transformations: map, filter, reduceByKey, etc.
Transformations are lazy: they only describe a new RDD derived from an existing one. Actions actually trigger the computation and return a result (or write output).
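The lazy-transformation / eager-action split above can be sketched in pure Python. This is not real Spark code; MiniRDD is a toy class invented here to show that transformations only record work, while an action runs the whole pipeline:

```python
class MiniRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions run the pipeline."""
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # deferred transformations, applied only on an action

    # --- transformations: just record the op and return a NEW MiniRDD ---
    def map(self, f):
        return MiniRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):
        return MiniRDD(self._data, self._ops + [("filter", p)])

    # --- actions: only now is the recorded pipeline actually executed ---
    def collect(self):
        out = self._data
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = MiniRDD([1, 2, 3, 4, 5])
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # [4, 16]
print(evens_squared.count())    # 2
```

In real Spark the shape is the same: chaining filter and map is free, and nothing runs until collect or count is called.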
Transformations can be classified by whether they require a shuffle. map is a narrow transformation: each output partition depends on a single input partition, so no data moves across the network and it is cheap. (Like every transformation, it still produces a new RDD, since RDDs are immutable.) reduceByKey is a wide transformation: records with the same key must be brought together, which requires shuffling data across the network and marks a stage boundary, so it costs more time.
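The difference can be illustrated in pure Python. This is a toy word-count sketch, not actual Spark code: the partitions list and the uppercasing step are invented here just to show why map stays within a partition while reduceByKey must regroup data first:

```python
from collections import defaultdict

# Two "partitions" of (word, 1) pairs, as a word-count job might hold them.
partitions = [
    [("spark", 1), ("rdd", 1), ("spark", 1)],
    [("rdd", 1), ("spark", 1)],
]

# map-like step: narrow -- each partition is processed independently,
# no data moves between partitions.
mapped = [[(word.upper(), n) for word, n in part] for part in partitions]

# reduceByKey-like step: wide -- pairs with the same key may live in
# different partitions, so they must first be shuffled (regrouped by key)
# before the per-key reduction can run.
shuffled = defaultdict(list)
for part in mapped:
    for key, n in part:
        shuffled[key].append(n)

reduced = {key: sum(ns) for key, ns in shuffled.items()}
print(reduced)  # {'SPARK': 3, 'RDD': 2}
```

In a real cluster the regrouping step means serializing records and sending them over the network, which is why wide transformations like reduceByKey dominate the cost of a job.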