Handling JSON in Spark
Mapping JSON to a Custom Class with Gson
With Gson, you can read a JSON dataset and map each record to a custom class, MyClass.
Since Gson is not serializable, each executor needs its own Gson object. Also, MyClass must be serializable in order to pass it between executors.
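For illustration, a minimal MyClass might look like the sketch below. The fields (name, value) are hypothetical placeholders for whatever your JSON records actually contain; Scala case classes are serializable out of the box, so instances can safely be shipped between executors.

// Hypothetical example: the actual fields depend on your JSON schema.
// Scala case classes extend Serializable by default.
case class MyClass(name: String, value: Int)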
Note that a file offered as a JSON file here is not a typical JSON file. Each line must contain a separate, self-contained, valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
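For example, an input file matching the hypothetical MyClass above could look like this, with one complete JSON object per line:

{"name": "foo", "value": 1}
{"name": "bar", "value": 2}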
import org.apache.spark.rdd.RDD

val sc: org.apache.spark.SparkContext // An existing SparkContext

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "path/to/my_class.json"
val linesRdd: RDD[String] = sc.textFile(path)

// Map each JSON line to a MyClass instance.
// A Gson object is created for every line here.
val myClassRdd: RDD[MyClass] = linesRdd.map { l =>
  val gson = new com.google.gson.Gson()
  gson.fromJson(l, classOf[MyClass])
}
If creating a Gson object per line becomes too costly, the mapPartitions method can be used to optimize this. With it, there will be one Gson instance per partition instead of one per line:
val myClassRdd: RDD[MyClass] = linesRdd.mapPartitions { partition =>
  // One Gson instance is created per partition and reused for every line in it.
  val gson = new com.google.gson.Gson()
  partition.map(l => gson.fromJson(l, classOf[MyClass]))
}
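As a quick sanity check, you can trigger an action on the resulting RDD, for example:

// Hypothetical usage: print a couple of parsed records and count the total.
myClassRdd.take(2).foreach(println)
println(myClassRdd.count())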