pyspark - Spark function for loading a Parquet file into memory
I have an RDD loaded from a Parquet file using Spark SQL:

data_rdd = sqlContext.read.parquet(filename).rdd

I have noticed that the actual file-read operation is only executed once an aggregation function triggers a Spark job.

I need to measure the computation time of the job without the time it takes to read the data file (i.e. as if the same input RDD/DataFrame were already there, because it was created via Spark SQL).

Is there a function that triggers loading of the file into the executors' memory?

I have tried .cache(), but it seems the read operation is still triggered as part of the first job.
Spark is lazy and only computes when needed. You can .cache() and then .count() the lines:

data_rdd = sqlContext.read.parquet(filename).rdd
data_rdd.cache()
data_rdd.count()

Any computations that follow will then start from the cached state of data_rdd, since count() forces a read of the whole table and populates the cache.