dataframe and sql

spark.sql

pipeline

from pyspark.sql import SparkSession
spark_session = SparkSession.builder.enableHiveSupport().appName('spark sql').master('local').getOrCreate()
spark_session.sql("""
	show databases
""").toPandas()

query and execution

The execution plan depends on the query. Spark's optimizer (Catalyst) decides what the physical plan will be.

tempView

RDD vs. DF vs. SQL

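DF is usually faster than RDD, because DF and SQL queries go through the Spark optimizer while RDD code runs as written
With DF, errors such as a misspelled column name are found when the query is built and analyzed, before any job runs (in Scala or Java, malformed DF calls also fail at compile time, unlike SQL strings)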

functions

mapping functions

generating functions

aggregating functions

user defined functions

time processing

from pyspark.sql import functions as f
xxx.withColumn('unixtime', f.unix_timestamp('time')).limit(5).toPandas()

window functions

from pyspark.sql import Window, functions as f
user_window = Window.partitionBy("ip").orderBy("unixtime")
access_log_ts.select("ip", "unixtime", f.row_number().over(user_window).alias("count"), f.lag("unixtime").over(user_window).alias("lag"), f.lead("unixtime").over(user_window).alias("lead")).limit(5).toPandas()