Transformations, Actions, Caching

Recall

In Scala sequential collections, we have:

  • Transformers

    Return new collections as results (e.g., map, filter).

  • Accessors

    Return single values as results (e.g., reduce, fold).
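
For example, on a plain Scala List (the data here is made up for illustration):

val words = List("spark", "rdd", "cache")
val lengths = words.map(_.length) // transformer: returns a new collection
val total = lengths.reduce(_ + _) // accessor: returns a single value (13)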

Basics

In Spark, we have:

  • Transformations

    Return new RDDs as results. They are lazy: their result is not computed immediately.

  • Actions

    Return values computed from an RDD. They are eager: the computation runs as soon as the action is called.

It's crucial that transformations are lazy and actions are eager: by deferring work until an action actually needs a result, Spark can optimize the whole chain of transformations, avoiding unnecessary passes over the data and unnecessary network communication.

For example, suppose we apply map to an RDD:

// Use the parallelize function of the SparkContext (sc)
// to create an RDD from a Scala collection.
val wordsRDD = sc.parallelize(largeList)
val lengthsRDD = wordsRDD.map(_.length)

Nothing happens on the cluster yet: lengthsRDD is just a reference to a not-yet-computed RDD. The actual computation is triggered only by actions.

val totalLength = lengthsRDD.reduce(_ + _)
// Now the map function runs on the cluster,
// and the total length is computed.
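
Laziness also lets Spark avoid work entirely. In the sketch below (logLines is a made-up list of strings), take(2) only evaluates as much of the data as it needs to produce two results:

val logLines = List("info: ok", "error: disk", "info: ok", "error: net")
val firstTwoErrors = sc.parallelize(logLines)
  .filter(_.contains("error")) // transformation: lazy, nothing runs yet
  .take(2)                     // action: computes only as much as needed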

Common functions

  • Transformations (lazy)

    • map

    • flatMap

    • filter

    • distinct

      Returns a new RDD containing the distinct elements of the RDD.

  • Actions (eager)

    • collect

      Returns an array containing all elements of the RDD.

    • count

      Returns the number of elements in the RDD.

    • take

      Returns the first n elements of the RDD.

    • reduce

      Combines the elements of the RDD using the given binary operator.

    • foreach

      Applies a function to each element of the RDD.
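
Putting a few of these together, a minimal sketch (assuming sc is an existing SparkContext; the data is made up for illustration):

val nums = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
val doubled = nums.map(_ * 2).distinct() // transformations: lazy, nothing runs yet
val count = doubled.count()              // action: triggers a job, returns 3
val firstTwo = doubled.take(2)           // action: returns 2 elements as an Array
val sum = doubled.reduce(_ + _)          // action: 2 + 4 + 6 = 12
doubled.foreach(println)                 // action: println runs on the executors, not the driver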

Caching

So far, we have discussed what Spark RDDs and Scala collections have in common. Now let's turn to a major difference between the two: caching.

Spark allows users to cache data in memory, which greatly improves performance when the same RDD is used by several actions.

We use persist() to mark an RDD for caching. (cache() is a shorthand that persists with the default in-memory storage level.)

val logs = largeLogs.filter(_.contains("error")).persist()
val firstLogs = logs.take(10)

In this case, the filtered data is computed and cached when the action take(10) runs. Later actions on logs can then read the cached data from memory instead of re-running the filter over largeLogs.
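
Continuing the example, any further action on logs can now hit the cache (a minimal sketch):

// These actions reuse the cached data instead of
// filtering largeLogs again from scratch.
val numErrors = logs.count()
val firstTwenty = logs.take(20)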