Spark: save DataFrame metadata and reuse it
When reading a dataset with a lot of files (in my case from Google Cloud Storage), spark.read works for a long time before the first manipulation. I'm not sure exactly why, but my guess is that it maps the files and samples them to infer the schema.
My question is: is there an option to save this metadata collected about the DataFrame and reuse it in other work on the dataset?
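(For reference, the "metadata" collected in this pass is mostly the schema Spark infers by sampling the JSON files. A minimal PySpark sketch to inspect what was inferred, using the bucket path from the update below:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# This read triggers the slow pass: listing the files and sampling them
# to infer a schema.
df = spark.read.json("gs://bucket-name/table_name")

# The inferred schema is the part of that work worth saving and reusing.
df.printSchema()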
-- update --
The data is arranged like this:
gs://bucket-name/table_name/day=yyyymmdd/many_json_files
When I run:
df = spark.read.json("gs://bucket-name/table_name")
it takes a lot of time. I wish I could do the following:
df = spark.read.json("gs://bucket-name/table_name")
df.saveMetadata("gs://bucket-name/table_name_metadata")
and in another session:
df = spark.read.metadata("gs://bucket-name/table_name_metadata").json("gs://bucket-name/table_name")
... <some df manipulation> ...
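(Spark has no saveMetadata or read.metadata, but the effect can be approximated, because a DataFrame's schema serializes to JSON. A sketch of that idea in PySpark; the local file name is hypothetical, and a GCS object would serve equally well:)

import json
from pyspark.sql.types import StructType

# First session: pay the inference cost once, then persist the schema as JSON.
df = spark.read.json("gs://bucket-name/table_name")
with open("table_name_schema.json", "w") as f:
    f.write(df.schema.json())

# Later session: reload the schema and hand it to the reader,
# which skips inference entirely.
with open("table_name_schema.json") as f:
    schema = StructType.fromJson(json.load(f))
df = spark.read.schema(schema).json("gs://bucket-name/table_name")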
We only need to infer the schema once and can then reuse it for the later files, since we have a lot of files that share the same schema. Like this:
// Infer the schema from a single representative file
val df0 = spark.read.json("first_file_we_wanna_spark_to_info.json")
val schema = df0.schema
// Read the other files with the saved schema, skipping inference
val df = spark.read.schema(schema).json("donnot_info_schema.json")
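(When inference is unavoidable, it can also be made cheaper: the JSON reader's samplingRatio option limits how many input objects are sampled for schema inference, at the risk of missing fields that appear only in unsampled records. A PySpark sketch:)

# Infer the schema from roughly 10% of the input rows instead of all of them.
df = spark.read.option("samplingRatio", 0.1).json("gs://bucket-name/table_name")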