Spark SQL

Hadoop stores unstructured data. It is easy to use, but it is hard to get high performance out of it.

This is where Spark SQL comes in. It sits on top of Spark RDDs and provides two things:

  • support for structured data
  • improved performance

Concepts

  • relation: table
  • attribute: column
  • record/tuple: row

DataFrame

The DataFrame is the core abstraction of Spark SQL. It is conceptually like a table. Its main characteristics:

  • A DataFrame is an RDD of records together with a schema.

  • A DataFrame is untyped: its records are generic Rows, not instances of a Scala type.

  • Transformations on DataFrames are also untyped: columns are referred to by name, and mistakes are caught only at runtime.
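
For example, records come back as untyped Rows and column types are supplied by the caller. A minimal sketch, assuming a peopleDF with columns name and age like the one built in the next section:

val first = peopleDF.head()                          // an untyped Row
val name = first.getAs[String]("name")               // type supplied by the caller, checked at runtime
val adults = peopleDF.filter(peopleDF("age") > 17)   // the result is again an untyped DataFrame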

Create a DataFrame

Just as core Spark starts with a SparkContext, Spark SQL starts with a SparkSession.
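
A minimal sketch of creating one; the application name and local master are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("spark-sql-notes")   // placeholder app name
  .master("local[*]")           // for local experiments; omit when submitting to a cluster
  .getOrCreate()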

There are two ways to create a DataFrame.

  • create a DataFrame from an existing RDD

    import spark.implicits._ // brings the .toDF conversion into scope

    val tupleRDD: RDD[(String, Int)] = ???
    val tupleDF = tupleRDD.toDF("name", "age") // supply the column names
    

    A case class makes this easier, since the column names are taken from its field names.

    case class Person(name: String, age: Int)
    
    val peopleRDD: RDD[Person] = ???
    val peopleDF = peopleRDD.toDF // columns "name" and "age" are inferred from Person
    

    It is also possible to define a custom schema explicitly; see the sketch after this list.

  • create a DataFrame from a data source

    Spark supports JSON, CSV, and several other formats.

    val df = spark.read.json("./names.json")
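
A custom schema (mentioned above) can be defined explicitly instead of being inferred. A minimal sketch using StructType; the field names, types, and variable names are only illustrative:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("name", StringType,  nullable = true),
  StructField("age",  IntegerType, nullable = true)
))

// from an RDD of Rows ...
val rowRDD: RDD[Row] = ???
val peopleDF2 = spark.createDataFrame(rowRDD, schema)

// ... or while reading from a source
val namesDF = spark.read.schema(schema).json("./names.json")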
    

Query a DataFrame

First, we need to register the DataFrame as a temporary SQL view.

peopleDF.createOrReplaceTempView("people")

Then, run a SQL query through the SparkSession.

val adultsDF = spark
  .sql("SELECT * FROM people WHERE age > 17")

The SQL statements available are largely those supported by HiveQL.
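
For example, aggregations work against the same people view. A minimal sketch (countsDF is just an illustrative name):

val countsDF = spark.sql(
  "SELECT age, COUNT(*) AS cnt FROM people GROUP BY age ORDER BY age")

countsDF.show() // prints the result as a small table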