bucket
Tags: partition functions
Description
The bucket() function is a partition transform function that distributes rows into a fixed number of buckets by hashing the input column. It works with any data type and helps spread data evenly across partitions. It can only be used together with the partitionedBy() method of DataFrameWriterV2, as shown in the sketch below.
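A minimal sketch of the call shape (the catalog and table names are hypothetical, and a catalog that supports DataFrameWriterV2 transforms is assumed):

df.writeTo("my_catalog.db.events").partitionedBy(bucket(16, col("user_id"))).createOrReplace()

A full, runnable version appears in the Example section below.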
Parameters
numBuckets
: int - the number of buckets to partition the data into (a Column is also accepted)
col
: Column or str - the column whose hashed value determines the bucket
Return Value
Column - a partition transform expression; the resulting bucket numbers range from 0 to numBuckets - 1. The column can only be resolved inside partitionedBy() and cannot be selected or evaluated directly.
Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import bucket, col

# Create a Spark session
spark = SparkSession.builder.appName("bucket_example").getOrCreate()

# Create a DataFrame with sample data
data = [("A",), ("B",), ("C",), ("D",), ("E",)]
df = spark.createDataFrame(data, ["value"])

# bucket() is only valid inside DataFrameWriterV2.partitionedBy().
# The catalog/table name below is illustrative and assumes a catalog
# that supports the bucket transform (e.g., Apache Iceberg).
df.writeTo("my_catalog.db.bucket_example") \
    .partitionedBy(bucket(3, col("value"))) \
    .createOrReplace()

# Note: df.withColumn("bucket", bucket(3, "value")) would raise an
# AnalysisException, because the transform cannot be evaluated as an
# ordinary column expression.
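Routing the transform through the writer lets the catalog record the bucket layout in the table metadata, so later reads can prune buckets instead of recomputing the hash. Once written, the table can be read back like any other (table name as assumed above):

bucketed = spark.table("my_catalog.db.bucket_example")
bucketed.show()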
Notes
- Can only be used together with the partitionedBy() method of DataFrameWriterV2; using it in select() or withColumn() raises an AnalysisException
- The target table's catalog must support DataSourceV2 partition transforms (for example, Apache Iceberg)
- Uses hash partitioning to distribute rows into buckets numbered 0 to numBuckets - 1
- Can be used with any data type
- The same input value is always assigned to the same bucket (see the sketch after this list)
- The number of buckets should be chosen carefully to avoid skewed bucket sizes
- Available since Spark 3.1
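To see why identical inputs always land in the same bucket, the assignment can be approximated with Spark's built-in Murmur3 hash() function. This is an illustration of the idea only; the exact hash a given catalog applies may differ (Iceberg, for instance, hashes a type-specific encoding of the value):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, hash as spark_hash

spark = SparkSession.builder.appName("bucket_sketch").getOrCreate()
df = spark.createDataFrame([("A",), ("B",), ("A",)], ["value"])

# Positive remainder of a Murmur3 hash: identical inputs hash
# identically, so repeated values receive the same bucket id.
n = 3
h = spark_hash(col("value"))
df.withColumn("approx_bucket", (h % n + n) % n).show()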