Making mapPartitions Accept Partition Functions with More Than One Argument
There are cases where we need to perform a certain operation on each partition of the data, and mapPartitions is one of the most common ways to do so. Sometimes the operation requires a more complicated procedure, which in turn means the function executing it needs more than one parameter.
However, according to the documentation, this transformation only accepts a function that takes a single parameter: an iterator over the rows of the partition.
The question, then, is simple: how do we make the function accept more than one parameter?
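To see the limitation concretely, here is a minimal sketch of the documented usage, where the partition function takes nothing but the iterator (count_rows is a hypothetical name, and df is the dataframe introduced below):

def count_rows(iterator):
    # emit one value per partition: the number of rows in it
    yield sum(1 for _ in iterator)

counts = df.rdd.mapPartitions(count_rows)

There is no room in this signature for extra parameters such as param_a or param_b.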
I came across this challenge recently. I thought it was one of Spark's limitations that couldn't be engineered around, until I found a solution that turned out to be extremely simple.
To keep it brief, imagine that we have a dataframe that has been repartitioned into a certain number of partitions.
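As a concrete setup, a sketch might look like the following; the column names a and b, the row values, and the partition count are assumptions made for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# a toy dataframe with the columns used later; the values are arbitrary
df = spark.createDataFrame(
    [(1, 2), (3, 4), (5, 6), (7, 8)],
    ['a', 'b'],
)

# redistribute the rows across a fixed number of partitions
df = df.repartition(2)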
To apply an operation on each partition, we pass the corresponding function as the argument to mapPartitions. The trick is to wrap the partition function inside an outer function that takes the extra parameters; calling the outer function returns a partition function that already carries those values with it (a closure), yet still matches the single-iterator signature.
def func(param_a, param_b):
    # the inner function closes over param_a and param_b,
    # yet takes only the iterator, as mapPartitions requires
    def partition_func(iterator):
        total = 0
        for row in iterator:
            total += (row['a'] + row['b']) * param_a
        total = total - param_b
        # mapPartitions expects an iterable, so wrap the result in a list
        return [total]
    return partition_func

# compute the total value for each partition
totals = df.rdd.mapPartitions(func(param_a, param_b))
The above code returns an RDD in which each partition contains a single element: the value of total for that partition. To get more insight into what the result looks like, take a look at the code snippet below.
totals.glom().collect()
The above code will return the following.
[
[total_for_partition_0],
[total_for_partition_1],
[total_for_partition_2],
]
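As a side note, calling collect() without glom() flattens the single-element lists, so combining the per-partition totals on the driver is straightforward. This follow-up is illustrative, not part of the original recipe:

# collect() flattens the per-partition lists into
# [total_for_partition_0, total_for_partition_1, ...]
per_partition = totals.collect()

# combine the per-partition totals into one grand total
grand_total = sum(per_partition)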
Thank you for reading.