months

Tags: partition functions

Description

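The months() function is a partition transform function that maps a timestamp or date column to month-granularity partition values. It is used to partition a table by month when writing with DataFrameWriterV2. It does not extract the calendar month as a regular column; use month() for that.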

Parameters

col: Column or str - the timestamp or date column (or its name) to partition on

Return Value

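Column - a partition transform expression intended for DataFrameWriterV2.partitionedBy(). It is not a column of month numbers (1-12) and cannot be evaluated in select() or withColumn().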

Example

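from pyspark.sql import SparkSession
from pyspark.sql.functions import month, months

# Create a Spark session
spark = SparkSession.builder.appName("months_example").getOrCreate()

# Create a DataFrame with sample timestamps
data = [("2023-01-15 10:30:00",), ("2022-12-31 23:59:59",), ("2024-03-20 15:45:30",)]
df = spark.createDataFrame(data, ["timestamp"])
df = df.withColumn("timestamp", df.timestamp.cast("timestamp"))

# Partition the table by month when writing.
# "catalog.db.table" is a placeholder; the catalog must support
# DataSource V2 tables for this write to succeed.
df.writeTo("catalog.db.table").partitionedBy(months("timestamp")).createOrReplace()

# months() cannot be evaluated in withColumn() or select().
# To get the month number as a regular column, use month() instead.
df = df.withColumn("month", month("timestamp"))
df.show()

# Output:
# +-------------------+-----+
# |          timestamp|month|
# +-------------------+-----+
# |2023-01-15 10:30:00|    1|
# |2022-12-31 23:59:59|   12|
# |2024-03-20 15:45:30|    3|
# +-------------------+-----+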

Notes

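The function is a partition transform for timestamp and date values; it can be used only with the partitionedBy() method of DataFrameWriterV2
It does not return the month number; use month() to extract the month (1-12) as a regular column
Useful for partitioning data by month
Related partition transform functions include years(), days(), hours() and bucket(); partitionedBy() accepts several transforms at once
How NULL input values are partitioned is determined by the target data source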
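
The difference between partitioning by month and extracting a month number can be illustrated in plain Python, without Spark. The sketch below is not how Spark implements months(); Spark leaves the actual partition value to the data source. It assumes the encoding that Apache Iceberg documents for its month transform (whole months since 1970-01), and the helper names month_of_year and months_since_epoch are hypothetical.

```python
from datetime import datetime


def month_of_year(ts: datetime) -> int:
    """Calendar month (1-12), i.e. what month() extracts."""
    return ts.month


def months_since_epoch(ts: datetime) -> int:
    """Hypothetical helper: whole months between 1970-01 and ts.

    Assumption: mirrors the month transform as defined by Apache Iceberg;
    other data sources may encode month partitions differently.
    """
    return (ts.year - 1970) * 12 + (ts.month - 1)


samples = [
    datetime(2023, 1, 15, 10, 30, 0),
    datetime(2024, 1, 20, 15, 45, 30),
]

for ts in samples:
    # Both rows share month_of_year == 1 but fall in different month partitions.
    print(ts, month_of_year(ts), months_since_epoch(ts))
```

Under this assumption, January 2023 and January 2024 land in separate partitions (636 and 648) even though both have calendar month 1, which is why a month-granularity partition cannot be read as a 1-12 value.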