bucket
Tags: partition functions
Description
The bucket() function is a partition transform function that distributes rows into a fixed number of buckets by hashing the input column. It accepts a column of any data type and is useful for spreading data evenly across partitions; it is applied through the partitionedBy() method of DataFrameWriterV2 rather than in ordinary column expressions.
Parameters
numBuckets: Integer or Column - the number of buckets to partition the data into
col: Column or column name (str) - the column to use for partitioning
Return Value
Column - a partition transform expression; each row's bucket index is an integer from 0 to numBuckets-1
Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import bucket
# Create a Spark session
spark = SparkSession.builder.appName("bucket_example").getOrCreate()
# Create a DataFrame with sample data
data = [("A",), ("B",), ("C",), ("D",), ("E",)]
df = spark.createDataFrame(data, ["value"])
# bucket() is a partition transform, so it is applied through the
# partitionedBy() method of DataFrameWriterV2, not with withColumn() or select().
# "catalog.db.bucketed_table" is a placeholder name; the write requires a catalog
# that supports DataFrameWriterV2 (for example, an Iceberg catalog).
df.writeTo("catalog.db.bucketed_table").partitionedBy(
    bucket(3, "value")
).createOrReplace()
# The table's data is hash-partitioned into 3 buckets on "value"; rows with the
# same value always land in the same bucket.
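Per the bucket() signature, numBuckets can also be supplied as a Column rather than a plain integer, and the partition column can be passed as a Column object instead of a name. A minimal sketch of that variant, reusing the placeholder table name above:
from pyspark.sql.functions import bucket, col, lit
# Same write as before, with both arguments given as Columns.
df.writeTo("catalog.db.bucketed_table").partitionedBy(
    bucket(lit(3), col("value"))
).createOrReplace()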
Notes
- The function uses hash partitioning to distribute data into buckets (a conceptual sketch follows this list)
- Each row's bucket index is an integer from 0 to numBuckets-1
- Useful for evenly distributing data across partitions
- Can be used with any data type
- Can only be used in combination with the partitionedBy() method of DataFrameWriterV2
- The same input value is always assigned to the same bucket
- Returns NULL if the input is NULL
- The number of buckets should be chosen carefully: too few buckets produce oversized partitions, and a skewed value distribution can still leave some buckets much larger than others
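To build intuition for the hash-based, deterministic assignment described above, here is a small self-contained sketch that approximates a bucket index with Spark's hash() SQL function. This is only an illustration under the assumption that the transform behaves like hash-modulo: the actual bucket transform is evaluated by the target catalog and is not guaranteed to use the same hash, so the computed indices need not match the physical layout.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
spark = SparkSession.builder.appName("bucket_concept").getOrCreate()
df = spark.createDataFrame([("A",), ("B",), ("C",), ("D",), ("E",)], ["value"])
# Approximate bucket index: non-negative hash of the value modulo the bucket count.
# Running this twice yields identical indices, showing the assignment is deterministic.
df.withColumn("approx_bucket", expr("pmod(hash(value), 3)")).show()
Note that hash() maps NULL to a fixed seed value, so this sketch does not reproduce the NULL behavior described in the notes above.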