Making mapPartitions Accept Partition Functions with More Than One Argument

There are cases where we need to perform a certain operation on each data partition. In Spark, one of the most common ways to do this is mapPartitions. Sometimes such an operation requires a more complicated procedure, which in the end means the function executing it needs more than one parameter.

However, according to the documentation, this transformation only accepts a function that takes a single parameter, namely an iterator over the rows of a partition.
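
For reference, a standard partition function looks something like this minimal sketch (count_rows is just a hypothetical example, not part of any Spark API):

# the usual shape: one iterator in, one iterable out
def count_rows(iterator):
	yield sum(1 for _ in iterator)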

Therefore, the question is simply: how do we make the function accept more than one parameter?

I came across this challenge recently. I thought it was one of those limitations in Spark that couldn't be engineered around, until I found a solution that is extremely simple.

To keep it brief, just imagine that we have a dataframe that has been repartitioned into a certain number of partitions.
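
As a concrete setup, a minimal sketch might look like the following; the column names a and b and the partition count of 3 are assumptions made for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# a toy dataframe with columns "a" and "b", split into 3 partitions
df = spark.createDataFrame(
	[(1, 2), (3, 4), (5, 6), (7, 8)],
	["a", "b"],
).repartition(3)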

To apply an operation on each partition, we pass the corresponding function as the parameter of mapPartitions. The trick is a closure: an outer function receives the extra parameters, while the inner function, the one actually handed to mapPartitions, keeps the iterator as its single parameter and reads the extra parameters from the enclosing scope.

def func(param_a, param_b):
	# the inner function still takes a single iterator,
	# but can access param_a and param_b via the closure
	def partition_func(iterator):
		total = 0
		for row in iterator:
			total += (row["a"] + row["b"]) * param_a

		total = total - param_b
		# mapPartitions expects an iterable, so wrap the result in a list
		return [total]

	return partition_func

# the extra parameters we want the partition function to use
param_a, param_b = 2, 1

# compute the total value for each partition
total = df.rdd.mapPartitions(func(param_a, param_b))

The code above returns an RDD whose partitions each contain a single element: the total for that partition. To get more insight into what the result looks like, take a look at the code snippet below.

total.glom().collect()

The above code will return the following.

[
	[total_for_partition_0],
	[total_for_partition_1],
	[total_for_partition_2],
]
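
Since each partition yields a single number, combining them into one grand total takes only one more action, for instance (grand_total is just an illustrative name):

# sum the per-partition totals into a single value
grand_total = total.sum()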

Thank you for reading.