Transformations, Actions, Caching
Recall
In Scala sequential collections, we have:
- Transformer: returns a new collection.
- Accessor: returns a single value.
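As a quick illustration of this distinction on a plain Scala List (map and reduce are standard collection methods):

```scala
// Transformer vs. accessor on a Scala sequential collection.
val xs = List(1, 2, 3)

val doubled = xs.map(_ * 2)    // transformer: returns a new collection
val total   = xs.reduce(_ + _) // accessor: returns a single value

println(doubled) // List(2, 4, 6)
println(total)   // 6
```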
Basic
In Spark, we have:
- Transformation: returns a new RDD. It's lazy.
- Action: returns a computed result from an RDD. It's eager.
It is very important that transformations are lazy and actions are eager. By deferring computation until an action actually needs a result, Spark can combine operations and avoid materializing and shipping intermediate data, which reduces network traffic and latency.
For example, if we apply map on an RDD:
// use parallelize function of SparkContext(sc)
// to create a RDD from a Scala collection
val wordsRDD = sc.parallelize(largeList)
val lengthsRDD = wordsRDD.map(_.length)
Nothing happens on the cluster yet; lengthsRDD
is just a reference to a deferred computation. The actual computation is invoked only by actions.
val totalLength = lengthsRDD.reduce(_ + _)
// Now, the map function runs on the cluster,
// and the total length is computed.
Common functions
- Transformations (lazy)
  - map
  - flatMap
  - filter
  - distinct: returns the distinct elements of the RDD.
- Actions (eager)
  - collect: returns all the elements of the RDD.
  - count: returns the number of elements in the RDD.
  - take(n): returns the first n elements of the RDD.
  - reduce: combines all the elements in the RDD.
  - foreach: applies a function to each element of the RDD.
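To see the transformation/action split without a cluster, here is a sketch using a plain Scala view as a stand-in for an RDD. A view is lazy, so forcing it plays the role of an action; this is an analogy in plain Scala, not actual Spark code:

```scala
// Lazy view as a stand-in for an RDD (plain Scala, no cluster needed).
val words = List("spark", "scala", "rdd")

// "Transformations": nothing is computed yet on a view.
val lengths = words.view.map(_.length).filter(_ > 3)

// "Actions": force evaluation and return concrete results.
val collected = lengths.toList        // like collect
val n         = lengths.size          // like count
val firstTwo  = lengths.take(2).toList // like take(2)
val total     = lengths.sum           // like reduce(_ + _)

println(collected) // List(5, 5)
println(total)     // 10
```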
Caching
So far, we have discussed what Spark has in common with Scala collections. Now let's look at a key difference between the two: caching.
Spark allows users to cache data in memory to speed up repeated access.
We mark an RDD for caching with persist().
Note that persist() is itself lazy: the data is actually cached the first time an action computes it.
val logs = largeLogs.filter(_.contains("error")).persist()
val firstLogs = logs.take(10)
In this case, once firstLogs
is computed, Spark caches logs
in memory, so later actions on it can reuse the cached data instead of re-running the filter.
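The benefit of caching can be sketched with the same view analogy: an unforced view re-runs its pipeline on every traversal, while a materialized collection (playing the role of a persisted RDD) computes it once. Again, this is plain Scala, not Spark:

```scala
// Why caching matters, sketched with a lazy Scala view as a stand-in
// for an RDD (plain Scala, no Spark required).
var evals = 0
val pipeline = (1 to 100).view.filter { x => evals += 1; x % 10 == 0 }

// Without caching: every traversal re-runs the whole filter.
val r1 = pipeline.toList
val r2 = pipeline.toList
val evalsWithoutCache = evals // filter ran 200 times (100 per traversal)

// With "caching": materialize once, then reuse the result.
evals = 0
val cached = pipeline.toList // like persist() followed by an action
val first  = cached.take(5)  // reuses cached data, no recomputation
val count  = cached.size
val evalsWithCache = evals   // filter ran only 100 times
```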