spark.sql
pipeline
from pyspark.sql import SparkSession
query and execution
The execution plan depends on the query. Spark's optimizer (Catalyst) decides how the query is actually executed.
tempView
RDD vs. DF vs. SQL
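DF is usually faster than RDD, because the optimizer can plan DF queries; in PySpark, RDD code also pays for moving rows through Python.
With DF, errors such as a reference to a missing column are caught when the query is analyzed, before any job runs; with RDD they appear only when the job executes.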
functions
mapping functions
generating functions
aggregating functions
user defined functions
time processing
xxx.withColumn('unixtime', f.unix_timestamp('time')).limit(5).toPandas()
window functions
user_window = Window.partitionBy("ip").orderBy("unixtime")
access_log_ts.select("ip", "unixtime", f.row_number().over(user_window).alias("count"), f.lag("unixtime").over(user_window).alias("lag"), f.lead("unixtime").over(user_window).alias("lead")).limit(5).toPandas()