years

Tags: partition functions

Description

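The years() function is a partition transform function for timestamp and date columns: it tells a data source to partition data into years. It can be used only with the partitionedBy() method of DataFrameWriterV2 and does not produce a usable column of year values. To extract the year component as an integer, use year() instead.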

Parameters

col: Column or str - the target timestamp or date column, given as a Column or a column name

Return Value

Column - a partition transform expression to pass to partitionedBy(); it is not a column of integer years

Example

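from pyspark.sql import SparkSession
from pyspark.sql.functions import year, years

# Create a Spark session
spark = SparkSession.builder.appName("years_example").getOrCreate()

# Create a DataFrame with sample timestamps
data = [("2023-01-15 10:30:00",), ("2022-12-31 23:59:59",), ("2024-03-20 15:45:30",)]
df = spark.createDataFrame(data, ["timestamp"])
df = df.withColumn("timestamp", df.timestamp.cast("timestamp"))

# Write the data partitioned by year. years() is valid only inside partitionedBy().
# "catalog.db.events" is a placeholder table name; this call requires a configured
# DataSource V2 catalog that supports partition transforms.
df.writeTo("catalog.db.events").partitionedBy(years("timestamp")).createOrReplace()

# To see the year of each row, use year() (singular), which returns an integer column.
# Calling withColumn("year", years("timestamp")) fails, because years() is not a
# regular column expression.
df.withColumn("year", year("timestamp")).show()

# Output:
# +-------------------+----+
# |          timestamp|year|
# +-------------------+----+
# |2023-01-15 10:30:00|2023|
# |2022-12-31 23:59:59|2022|
# |2024-03-20 15:45:30|2024|
# +-------------------+----+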

Notes

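Partitions data by the year of a timestamp or date value
Returns a partition transform expression, not an integer column; use year() to extract the year as an integer
Valid only as an argument to DataFrameWriterV2.partitionedBy()
Can be used in combination with other partition functions like months() and days()
Handling of NULL input values is determined by the data source that implements the transform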
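
To make "partitioning data by year" concrete, here is a minimal conceptual sketch in plain Python. It is not Spark's implementation and does not use the Spark API: the helper name partition_by_year and the fourth sample row are assumptions introduced for illustration. It only shows the idea that rows whose timestamps fall in the same calendar year end up in the same partition.

```python
from collections import defaultdict
from datetime import datetime


def partition_by_year(rows):
    """Group timestamp strings by calendar year.

    Hypothetical helper: a conceptual stand-in for what a year-based
    partition transform achieves, not how any data source implements it.
    """
    partitions = defaultdict(list)
    for ts in rows:
        parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        partitions[parsed.year].append(ts)
    return dict(partitions)


# Sample timestamps; the last row is an added illustrative value so that
# one partition holds more than one row.
rows = [
    "2023-01-15 10:30:00",
    "2022-12-31 23:59:59",
    "2024-03-20 15:45:30",
    "2023-07-04 08:00:00",
]

partitions = partition_by_year(rows)
for year_key in sorted(partitions):
    print(year_key, partitions[year_key])
```

In this sketch the two 2023 rows share a partition, while 2022 and 2024 each get their own. A real data source decides the physical layout and the exact partition values itself.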