Failure When Overwriting A Parquet File Might Result in Data Loss
Several critical issues can catch you out when using Spark, and one of them is data loss when a failure occurs during a write.
Recently I ran into exactly this while overwriting a Parquet file. Let me reproduce it in a simplified way, using Spark in local mode.
Suppose we have a simple dataframe df.
from pyspark.sql import SparkSession

# Spark running in local mode, as mentioned above.
spark = SparkSession.builder.master('local[*]').getOrCreate()

df_elements = [
    ('row_a', 'row_b', 'row_c'),
] * 100000
df = spark.createDataFrame(df_elements, ['a', 'b', 'c'])
Now let's store the DataFrame as a Parquet file.
df.write.mode('overwrite').parquet('path_to_the_parquet_files')
Once the write finishes, you should see several partition files in that directory.
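If you want to check, a quick listing of the output directory (a minimal sketch; the exact file names will differ from run to run) should show the part files plus Spark's _SUCCESS marker.

import os

# List what Spark wrote; expect _SUCCESS plus a number of part-*.parquet files.
for name in sorted(os.listdir('path_to_the_parquet_files')):
    print(name)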
Now let's make the overwrite fail in the middle.
df.write.mode('overwrite').parquet('path_to_the_parquet_files')
While the above code is running, press Ctrl + C to interrupt it.
Go back to path_to_the_parquet_files and you will find that all the previous files (the ones written before the second parquet write) have been removed.
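You can confirm the loss by trying to read the path back. The snippet below is just a sketch of such a check; depending on your Spark version the read will typically fail with an AnalysisException (for example, because no schema can be inferred), since no committed parquet files are left in the directory.

# Attempt to read the directory after the interrupted overwrite.
try:
    print(spark.read.parquet('path_to_the_parquet_files').count())
except Exception as e:
    # Typically an AnalysisException: the old files are gone and the new
    # ones were never committed.
    print('Read failed:', e)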
While investigating this issue, I found a YouTube video titled Delta Lake for Apache Spark - Why do we need Delta Lake for Spark?. Watch it if you want to know more.
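For completeness, here is a minimal sketch of the same overwrite done through Delta Lake, assuming the delta-spark package is installed (pip install delta-spark); the session setup and the path path_to_the_delta_table are illustrative, not part of the original experiment. Delta commits an overwrite atomically through its transaction log, so an interrupted write leaves the previous version of the table intact.

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Illustrative Delta-enabled local session; requires the delta-spark package.
builder = (
    SparkSession.builder
    .master('local[*]')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog',
            'org.apache.spark.sql.delta.catalog.DeltaCatalog')
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# The new files only become visible when the transaction log commit succeeds,
# so interrupting this write does not delete the previous table version.
df.write.format('delta').mode('overwrite').save('path_to_the_delta_table')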