pyspark - Spark function for loading a parquet file into memory -


I have an RDD loaded from a parquet file using Spark SQL:

data_rdd = sqlContext.read.parquet(filename).rdd

I have noticed that the actual file-reading operation only gets executed once there is an aggregation function triggering a Spark job.

I need to measure the computation time of the job without the time it takes to read the data file (i.e. the same input RDD (DataFrame) is already there because it was created with Spark SQL).

Is there a function that triggers loading of the file into the executors' memory?

I have tried .cache(), but it seems it's still triggering the reading operation as part of the job.

Spark is lazy and only performs computations when they are needed. You can .cache() and then .count() the lines:

data_rdd = sqlContext.read.parquet(filename).rdd
data_rdd.cache()
data_rdd.count()

Any set of computations that follow will start from the cached state of data_rdd, since you read the whole table using count().
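For reference, a minimal end-to-end sketch of this approach, assuming a local SparkSession (the modern equivalent of sqlContext), a hypothetical parquet path, and a placeholder map/count standing in for the real job:

import time
from pyspark.sql import SparkSession

# Assumed setup: local SparkSession and a hypothetical parquet path.
spark = SparkSession.builder.appName("timing-sketch").getOrCreate()
filename = "/path/to/data.parquet"  # hypothetical path

data_rdd = spark.read.parquet(filename).rdd

# Materialize the data in executor memory first, so the file read is
# not included in the timed computation.
data_rdd.cache()
data_rdd.count()  # forces the parquet read and fills the cache

# Now time only the computation; the input is served from the cache.
start = time.perf_counter()
result = data_rdd.map(lambda row: row[0]).count()  # stand-in for the real job
elapsed = time.perf_counter() - start
print("computation time: %.3f s" % elapsed)

Note that RDD .cache() uses memory-only storage by default, so if the data does not fit in executor memory, some partitions may still be recomputed from the file on subsequent actions.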

