bucket

Tags: partition functions

Description

The bucket() function is a partition transform function that distributes rows into a specified number of buckets by hashing the input column. It accepts any data type and is useful for evenly distributing data across partitions. Note that it cannot be used as an ordinary column expression: it is only valid when defining a table's partitioning via the partitionedBy() method of DataFrameWriterV2.
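
Conceptually, the transform behaves like a positive hash modulo: each row lands in pmod(hash(col), numBuckets). Here is a minimal sketch of that idea using the ordinary hash() and pmod() column functions (pmod() requires Spark 3.4+). This is an approximation for illustration only; the hash actually used for the table's partition layout is defined by the catalog/table format, so these numbers need not match the real bucket ids.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bucket_concept").getOrCreate()
df = spark.createDataFrame([("A",), ("B",), ("C",), ("D",), ("E",)], ["value"])

# Approximate bucket id: a positive hash modulo 3, always in 0..2.
# The real transform's hash is defined by the table format, so these
# values need not match the actual partition layout.
df.select("value", F.pmod(F.hash("value"), F.lit(3)).alias("approx_bucket")).show()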

Parameters

  • numBuckets: int or Column - the number of buckets to partition the data into
  • col: Column or str - the column whose hash determines the bucket

Return Value

Column - a partition transform expression; the transform assigns each row a bucket number in the range 0 to numBuckets - 1

Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import bucket

# Create a Spark session
spark = SparkSession.builder.appName("bucket_example").getOrCreate()

# Create a DataFrame with sample data
data = [("A",), ("B",), ("C",), ("D",), ("E",)]
df = spark.createDataFrame(data, ["value"])

# Partition the target table into 3 hash buckets on the "value" column.
# bucket() is only valid inside partitionedBy() of DataFrameWriterV2, and
# the target catalog must support V2 writes (e.g. Apache Iceberg).
# The table name below is illustrative.
df.writeTo("demo.db.bucketed_table").partitionedBy(bucket(3, "value")).createOrReplace()

# Using bucket() as an ordinary column expression, e.g.
# df.withColumn("bucket", bucket(3, "value")), raises an AnalysisException.
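
The same layout can also be declared in SQL DDL for catalogs that support partition transforms; a sketch assuming an Apache Iceberg catalog, with illustrative catalog, database, and table names:

# Sketch only: assumes a configured Iceberg catalog; the catalog,
# database, and table names are illustrative.
spark.sql("""
    CREATE TABLE demo.db.bucketed_sql (value STRING)
    USING iceberg
    PARTITIONED BY (bucket(3, value))
""")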

Notes

  • The function uses hash partitioning to distribute rows into buckets
  • The transform assigns bucket numbers in the range 0 to numBuckets - 1
  • Can only be used in combination with the partitionedBy() method of DataFrameWriterV2; using it in other expressions (e.g. withColumn() or select()) raises an AnalysisException
  • Works with any data type and is useful for evenly distributing data across partitions (see the sketch after this list)
  • Deterministic: the same input value is always assigned to the same bucket
  • A NULL input maps to a NULL bucket value
  • Choose the number of buckets carefully: too few yields oversized partitions, too many yields many small files
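
As a rough illustration of the even-distribution and determinism notes above, the conceptual hash-modulo expression from the Description can be grouped and counted. This only approximates the transform; the real table layout is determined by the catalog.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bucket_distribution").getOrCreate()
df = spark.range(1000).withColumn("value", F.col("id").cast("string"))

# Group by the approximate bucket id and count rows per bucket;
# with a well-behaved hash the counts come out roughly equal.
df.groupBy(F.pmod(F.hash("value"), F.lit(3)).alias("approx_bucket")).count().show()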