bucket

Tags: partition functions

Description

The bucket() function is a partition transform function that distributes rows into a specified number of buckets by hashing the input column. It accepts any data type and is useful for evenly distributing data across partitions. Note that it cannot be used as an ordinary column expression: it is only valid when defining a table's partitioning via the partitionedBy() method of DataFrameWriterV2.
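
Conceptually, the transform behaves like a positive hash modulo: each row lands in pmod(hash(col), numBuckets). Here is a minimal sketch of that idea using the ordinary hash() and pmod() column functions (pmod() requires Spark 3.4+). This is an approximation for illustration only; the hash actually used for the table's partition layout is defined by the catalog/table format, so these numbers need not match the real bucket ids.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bucket_concept").getOrCreate()
df = spark.createDataFrame([("A",), ("B",), ("C",), ("D",), ("E",)], ["value"])

# Approximate bucket id: a positive hash modulo 3, always in 0..2.
# The real transform's hash is defined by the table format, so these
# values need not match the actual partition layout.
df.select("value", F.pmod(F.hash("value"), F.lit(3)).alias("approx_bucket")).show()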

Parameters

  • numBuckets: int or Column - the number of buckets to partition the data into
  • col: Column or str - the column whose hash determines the bucket

Return Value

Column - a partition transform expression; the transform assigns each row a bucket number in the range 0 to numBuckets - 1

Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import bucket

# Create a Spark session
spark = SparkSession.builder.appName("bucket_example").getOrCreate()

# Create a DataFrame with sample data
data = [("A",), ("B",), ("C",), ("D",), ("E",)]
df = spark.createDataFrame(data, ["value"])

# Partition the target table into 3 hash buckets on the "value" column.
# bucket() is only valid inside partitionedBy() of DataFrameWriterV2, and
# the target catalog must support V2 writes (e.g. Apache Iceberg).
# The table name below is illustrative.
df.writeTo("demo.db.bucketed_table").partitionedBy(bucket(3, "value")).createOrReplace()

# Using bucket() as an ordinary column expression, e.g.
# df.withColumn("bucket", bucket(3, "value")), raises an AnalysisException.
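
The same layout can also be declared in SQL DDL for catalogs that support partition transforms; a sketch assuming an Apache Iceberg catalog, with illustrative catalog, database, and table names:

# Sketch only: assumes a configured Iceberg catalog; the catalog,
# database, and table names are illustrative.
spark.sql("""
    CREATE TABLE demo.db.bucketed_sql (value STRING)
    USING iceberg
    PARTITIONED BY (bucket(3, value))
""")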

Notes

  • The function uses hash partitioning to distribute rows into buckets
  • The transform assigns bucket numbers in the range 0 to numBuckets - 1
  • Can only be used in combination with the partitionedBy() method of DataFrameWriterV2; using it in other expressions (e.g. withColumn() or select()) raises an AnalysisException
  • Works with any data type and is useful for evenly distributing data across partitions (see the sketch after this list)
  • Deterministic: the same input value is always assigned to the same bucket
  • A NULL input maps to a NULL bucket value
  • Choose the number of buckets carefully: too few yields oversized partitions, too many yields many small files
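
As a rough illustration of the even-distribution and determinism notes above, the conceptual hash-modulo expression from the Description can be grouped and counted. This only approximates the transform; the real table layout is determined by the catalog.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bucket_distribution").getOrCreate()
df = spark.range(1000).withColumn("value", F.col("id").cast("string"))

# Group by the approximate bucket id and count rows per bucket;
# with a well-behaved hash the counts come out roughly equal.
df.groupBy(F.pmod(F.hash("value"), F.lit(3)).alias("approx_bucket")).count().show()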