Spark SQL

Hadoop stores unstructured data. It is easy to use, but it is hard to get high performance out of it.

This is where Spark SQL comes in. It sits on top of Spark RDDs and provides two things:

  • support for structured data
  • improved performance

Concepts

  • relation: table
  • attribute: column
  • record/tuple: row

DataFrame

The DataFrame is the core abstraction of Spark SQL. It is conceptually like a table. Its main characteristics:

  • A DataFrame is an RDD of records together with a schema.

  • A DataFrame is untyped: its records are generic Rows, not instances of a Scala type.

  • Transformations on DataFrames are also untyped: columns are referred to by name, and mistakes are caught only at runtime.
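
For example, records come back as untyped Rows and column types are supplied by the caller. A minimal sketch, assuming a peopleDF with columns name and age like the one built in the next section:

val first = peopleDF.head()                          // an untyped Row
val name = first.getAs[String]("name")               // type supplied by the caller, checked at runtime
val adults = peopleDF.filter(peopleDF("age") > 17)   // the result is again an untyped DataFrame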

Create a DataFrame

Just as core Spark starts with a SparkContext, Spark SQL starts with a SparkSession.
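
A minimal sketch of creating one; the application name and local master are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("spark-sql-notes")   // placeholder app name
  .master("local[*]")           // for local experiments; omit when submitting to a cluster
  .getOrCreate()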

There are two ways to create a DataFrame.

  • create a DataFrame from an existing RDD

    import spark.implicits._ // brings the .toDF conversion into scope

    val tupleRDD: RDD[(String, Int)] = ???
    val tupleDF = tupleRDD.toDF("name", "age") // supply the column names
    

    A case class makes this easier, since the column names are taken from its field names.

    case class Person(name: String, age: Int)
    
    val peopleRDD: RDD[Person] = ???
    val peopleDF = peopleRDD.toDF // columns "name" and "age" are inferred from Person
    

    It is also possible to define a custom schema explicitly; see the sketch after this list.

  • create a DataFrame from a data source

    Spark supports JSON, CSV, and several other formats.

    val df = spark.read.json("./names.json")
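
A custom schema (mentioned above) can be defined explicitly instead of being inferred. A minimal sketch using StructType; the field names, types, and variable names are only illustrative:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("name", StringType,  nullable = true),
  StructField("age",  IntegerType, nullable = true)
))

// from an RDD of Rows ...
val rowRDD: RDD[Row] = ???
val peopleDF2 = spark.createDataFrame(rowRDD, schema)

// ... or while reading from a source
val namesDF = spark.read.schema(schema).json("./names.json")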
    

Query a DataFrame

First, we need to register the DataFrame as a temporary SQL view.

peopleDF.createOrReplaceTempView("people")

Then, run a SQL query through the SparkSession.

val adultsDF = spark
  .sql("SELECT * FROM people WHERE age > 17")

The SQL statements available are largely those supported by HiveQL.
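
For example, aggregations work against the same people view. A minimal sketch (countsDF is just an illustrative name):

val countsDF = spark.sql(
  "SELECT age, COUNT(*) AS cnt FROM people GROUP BY age ORDER BY age")

countsDF.show() // prints the result as a small table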