RDD: Spark distributed collection

Resilient Distributed Datasets (RDD)

RDD is similar to immutable sequential Scala collection.

abstract class RDD[T] {
  def map[U](f: T => U): RDD[U]
  def flatMap[U](f: T => TraversableOnce[U]): RDD[U]
  def filter(f: T => U): RDD[U]
  def reduce(f: T => U): RDD[U]
}

Create a RDD

Transform from an existing RDD

For example, use map to transform a RDD.
Create from SparkContext (SparkSession)

SparkContext helps you to talk to Spark. There are 2 important functions:
- parallelize: It converts Scala collection to RDD.
- textFile: It reads files from HDFS(Hadoop's File System), or local machine.