Spark: save DataFrame metadata and reuse it
When reading a dataset with a lot of files (in my case from Google Cloud Storage), spark.read works for a long time before the first manipulation. I'm not sure exactly why, but my guess is that it maps the files and samples them to infer the schema.
My question is: is there an option to save this metadata collected about the DataFrame and reuse it in other work on the dataset?
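(For reference, the "metadata" collected in this pass is mostly the schema Spark infers by sampling the JSON files. A minimal PySpark sketch to inspect what was inferred, using the bucket path from the update below:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# This read triggers the slow pass: listing the files and sampling them
# to infer a schema.
df = spark.read.json("gs://bucket-name/table_name")

# The inferred schema is the part of that work worth saving and reusing.
df.printSchema()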
-- update --
The data is arranged like this:
gs://bucket-name/table_name/day=yyyymmdd/many_json_files
When I run:
df = spark.read.json("gs://bucket-name/table_name")
it takes a lot of time. I wish I could do the following:
df = spark.read.json("gs://bucket-name/table_name")
df.saveMetadata("gs://bucket-name/table_name_metadata")
and in another session:
df = spark.read.metadata("gs://bucket-name/table_name_metadata").json("gs://bucket-name/table_name")
... <some df manipulation> ...
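(Spark has no saveMetadata or read.metadata, but the effect can be approximated, because a DataFrame's schema serializes to JSON. A sketch of that idea in PySpark; the local file name is hypothetical, and a GCS object would serve equally well:)

import json
from pyspark.sql.types import StructType

# First session: pay the inference cost once, then persist the schema as JSON.
df = spark.read.json("gs://bucket-name/table_name")
with open("table_name_schema.json", "w") as f:
    f.write(df.schema.json())

# Later session: reload the schema and hand it to the reader,
# which skips inference entirely.
with open("table_name_schema.json") as f:
    schema = StructType.fromJson(json.load(f))
df = spark.read.schema(schema).json("gs://bucket-name/table_name")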
We only need to infer the schema once and can then reuse it for the later files, since we have a lot of files that share the same schema. Like this:
// Infer the schema from a single representative file
val df0 = spark.read.json("first_file_we_wanna_spark_to_info.json")
val schema = df0.schema
// Read the other files with the saved schema, skipping inference
val df = spark.read.schema(schema).json("donnot_info_schema.json")
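(When inference is unavoidable, it can also be made cheaper: the JSON reader's samplingRatio option limits how many input objects are sampled for schema inference, at the risk of missing fields that appear only in unsampled records. A PySpark sketch:)

# Infer the schema from roughly 10% of the input rows instead of all of them.
df = spark.read.option("samplingRatio", 0.1).json("gs://bucket-name/table_name")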