Handling JSON in Spark
Mapping JSON to a Custom Class with Gson
With Gson, you can read a JSON dataset and map each record to a custom class, MyClass.
Since Gson is not serializable, each executor needs its own Gson object. Also, MyClass must be serializable in order to pass it between executors.
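For illustration, a minimal MyClass might look like the sketch below. The fields (name, value) are hypothetical placeholders for whatever your JSON records actually contain; Scala case classes are serializable out of the box, so instances can safely be shipped between executors.

// Hypothetical example: the actual fields depend on your JSON schema.
// Scala case classes extend Serializable by default.
case class MyClass(name: String, value: Int)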
Note that a file offered as a JSON file here is not a typical JSON file. Each line must contain a separate, self-contained, valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
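For example, an input file matching the hypothetical MyClass above could look like this, with one complete JSON object per line:

{"name": "foo", "value": 1}
{"name": "bar", "value": 2}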
import org.apache.spark.rdd.RDD

val sc: org.apache.spark.SparkContext // An existing SparkContext

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "path/to/my_class.json"
val linesRdd: RDD[String] = sc.textFile(path)

// Map each JSON line to a MyClass instance.
// A Gson object is created for every line here.
val myClassRdd: RDD[MyClass] = linesRdd.map { l =>
  val gson = new com.google.gson.Gson()
  gson.fromJson(l, classOf[MyClass])
}
If creating a Gson object per line becomes too costly, the mapPartitions method can be used to optimize this. With it, there will be one Gson instance per partition instead of one per line:
val myClassRdd: RDD[MyClass] = linesRdd.mapPartitions { partition =>
  // One Gson instance is created per partition and reused for every line in it.
  val gson = new com.google.gson.Gson()
  partition.map(l => gson.fromJson(l, classOf[MyClass]))
}
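As a quick sanity check, you can trigger an action on the resulting RDD, for example:

// Hypothetical usage: print a couple of parsed records and count the total.
myClassRdd.take(2).foreach(println)
println(myClassRdd.count())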