RDD: Spark's distributed collection
Resilient Distributed Datasets (RDD)
An RDD is similar to an immutable, sequential Scala collection.
abstract class RDD[T] {
  def map[U](f: T => U): RDD[U]
  def flatMap[U](f: T => TraversableOnce[U]): RDD[U]
  def filter(f: T => Boolean): RDD[T]
  def reduce(f: (T, T) => T): T
}
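To see the analogy, the same four operations already exist on plain immutable Scala collections. A minimal sketch using a local List (no Spark needed; values are illustrative):

```scala
object CollectionAnalogy {
  def main(args: Array[String]): Unit = {
    val xs = List(1, 2, 3, 4)

    // map: apply a function to every element
    println(xs.map(_ * 2))            // List(2, 4, 6, 8)

    // flatMap: map each element to a collection, then flatten
    println(xs.flatMap(x => List(x, x)))  // List(1, 1, 2, 2, 3, 3, 4, 4)

    // filter: keep elements satisfying a predicate
    println(xs.filter(_ % 2 == 0))    // List(2, 4)

    // reduce: combine elements pairwise into one value
    println(xs.reduce(_ + _))         // 10
  }
}
```

An RDD offers the same interface, but the elements are partitioned across the machines of a cluster.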
Create an RDD
- Transform from an existing RDD. For example, use map to transform an existing RDD into a new one.
- Create from a SparkContext (SparkSession). The SparkContext is your handle for talking to Spark. It has 2 important functions:
  - parallelize: converts a local Scala collection to an RDD.
  - textFile: reads a file from HDFS (Hadoop's Distributed File System) or the local machine and returns an RDD of its lines.
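Both ways of creating an RDD can be sketched as below. This assumes a local Spark installation; the app name and file path are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object CreateRddExample {
  def main(args: Array[String]): Unit = {
    // A SparkSession wraps the SparkContext, our handle to Spark.
    val spark = SparkSession.builder()
      .appName("create-rdd-example")  // illustrative name
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // parallelize: convert a Scala collection to an RDD
    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // Transform an existing RDD into a new RDD with map
    val doubled = nums.map(_ * 2)
    println(doubled.reduce(_ + _))  // 30

    // textFile: read lines from HDFS or the local machine
    // (hypothetical path, shown for illustration)
    // val lines = sc.textFile("hdfs:///data/input.txt")

    spark.stop()
  }
}
```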