pyspark - Spark function for loading a parquet file into memory
I have an RDD loaded from a parquet file using Spark SQL:

data_rdd = sqlContext.read.parquet(filename).rdd

I have noticed that the actual file-reading operation is only executed once there is an aggregation function that triggers a Spark job.

I need to measure the computation time of the job without the time it takes to read the data file (i.e. on the same input RDD/DataFrame, since it is created with Spark SQL).

Is there a function that triggers loading of the file into the executors' memory?

I have tried .cache(), but it seems it still triggers the reading operation as part of the job.
Spark is lazy and only computes when it needs to. You can .cache() and then .count() the lines:

data_rdd = sqlContext.read.parquet(filename).rdd
data_rdd.cache()
data_rdd.count()
Any set of computations that follows will start from the cached state of data_rdd, since count() has already read the whole table.
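To time only the computation, one possible approach (a minimal sketch; the path and the aggregation used below are placeholders, and the SQLContext setup assumes an older Spark API matching the question) is to fill the cache with count() first and start the timer only afterwards:

import time

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="timing-example")
sqlContext = SQLContext(sc)

filename = "/path/to/data.parquet"  # placeholder path

# Lazy: nothing is read from disk yet.
data_rdd = sqlContext.read.parquet(filename).rdd

# cache() only marks the RDD for caching; count() is the action that
# actually scans the whole parquet file and fills the executors' memory.
data_rdd.cache()
data_rdd.count()

# From here on the input is served from the cache, so the timer below
# measures only the computation (this aggregation is just an example).
start = time.time()
result = data_rdd.map(lambda row: row[0]).distinct().count()
print("computation took %.3f s" % (time.time() - start))

Note that caching holds the data as deserialized objects in executor memory; if the dataset does not fit, parts of it will be recomputed from the file and the measurement will again include read time.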