Failure When Overwriting A Parquet File Might Result in Data Loss
Several critical issues can catch you out when using Spark, and one of them is data loss when a failure occurs during a write.
Recently I ran into exactly this while overwriting a Parquet file. Let me reproduce it in a simplified way, using Spark in local mode.
Suppose we have a simple dataframe df.
from pyspark.sql import SparkSession

# Spark running in local mode, as mentioned above.
spark = SparkSession.builder.master('local[*]').getOrCreate()

df_elements = [
    ('row_a', 'row_b', 'row_c'),
] * 100000
df = spark.createDataFrame(df_elements, ['a', 'b', 'c'])
Now let's store the DataFrame as a Parquet file.
df.write.mode('overwrite').parquet('path_to_the_parquet_files')
Once the write finishes, you should see several partition files in that directory.
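If you want to check, a quick listing of the output directory (a minimal sketch; the exact file names will differ from run to run) should show the part files plus Spark's _SUCCESS marker.

import os

# List what Spark wrote; expect _SUCCESS plus a number of part-*.parquet files.
for name in sorted(os.listdir('path_to_the_parquet_files')):
    print(name)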
Now let's make the overwrite fail in the middle.
df.write.mode('overwrite').parquet('path_to_the_parquet_files')
While the above code is running, press Ctrl + C to interrupt it.
Go back to path_to_the_parquet_files and you will find that all the previous files (the ones written before the second parquet write) have been removed.
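You can confirm the loss by trying to read the path back. The snippet below is just a sketch of such a check; depending on your Spark version the read will typically fail with an AnalysisException (for example, because no schema can be inferred), since no committed parquet files are left in the directory.

# Attempt to read the directory after the interrupted overwrite.
try:
    print(spark.read.parquet('path_to_the_parquet_files').count())
except Exception as e:
    # Typically an AnalysisException: the old files are gone and the new
    # ones were never committed.
    print('Read failed:', e)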
While investigating this issue, I found a YouTube video titled Delta Lake for Apache Spark - Why do we need Delta Lake for Spark?. Watch it if you want to know more.
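For completeness, here is a minimal sketch of the same overwrite done through Delta Lake, assuming the delta-spark package is installed (pip install delta-spark); the session setup and the path path_to_the_delta_table are illustrative, not part of the original experiment. Delta commits an overwrite atomically through its transaction log, so an interrupted write leaves the previous version of the table intact.

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Illustrative Delta-enabled local session; requires the delta-spark package.
builder = (
    SparkSession.builder
    .master('local[*]')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog',
            'org.apache.spark.sql.delta.catalog.DeltaCatalog')
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# The new files only become visible when the transaction log commit succeeds,
# so interrupting this write does not delete the previous table version.
df.write.format('delta').mode('overwrite').save('path_to_the_delta_table')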