Resolving Attributes Data Inconsistency with Union By Name


Published: August 21, 2019

In my previous article, Union Operation After Left-anti Join Might Result in Inconsistent Attributes Data, I showed that the attributes data can end up inconsistent when two data frames are combined after an inner join. The cause is that union matches columns by position rather than by name. The solution from that article is really simple: reorder the columns of one data frame with the select command so that they match the other one before calling union. Here’s a simple example.

unioned_df = joined_df.union(df.select(*joined_df.columns))

However, I recently did a little investigation in PySpark’s GitHub repo. I jumped into the DataFrame module’s code and found a method called unionByName. Its docstring has a short statement explaining what the method does: “The difference between this function and :func:`union` is that this function resolves columns by name (not by position).”

Let’s take a look at a simple example (taken from Spark’s GitHub repo):

>>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
>>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])

>>> df1.unionByName(df2).show()

+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+
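To see where the second row (6, 4, 5) comes from, here is a minimal plain-Python sketch of the two behaviours. The helpers union_by_position and union_by_name are hypothetical names made up for illustration; they only mimic the semantics on lists of tuples and are not how Spark implements either method.

```python
def union_by_position(cols1, rows1, cols2, rows2):
    """Mimic union: keep the first frame's column names and append rows as-is."""
    if len(cols1) != len(cols2):
        raise ValueError("both inputs must have the same number of columns")
    return cols1, rows1 + rows2


def union_by_name(cols1, rows1, cols2, rows2):
    """Mimic unionByName: reorder the second frame's values to match cols1."""
    if sorted(cols1) != sorted(cols2):
        raise ValueError("both inputs must have the same column names")
    # For each column of the first frame, find its position in the second frame.
    idx = [cols2.index(c) for c in cols1]
    reordered = [tuple(row[i] for i in idx) for row in rows2]
    return cols1, rows1 + reordered


cols1, rows1 = ["col0", "col1", "col2"], [(1, 2, 3)]
cols2, rows2 = ["col1", "col2", "col0"], [(4, 5, 6)]

by_position = union_by_position(cols1, rows1, cols2, rows2)[1]
by_name = union_by_name(cols1, rows1, cols2, rows2)[1]

print(by_position)  # [(1, 2, 3), (4, 5, 6)]  -> 4 lands under col0, which is wrong
print(by_name)      # [(1, 2, 3), (6, 4, 5)]  -> 6 lands under col0, as intended
```

In the by-name version, the value 6 belongs to col0 in the second frame, so it moves to the first position before the rows are appended.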

What does it mean? Well, we have two solutions here: either use the select approach (as mentioned in the previous article) or simply use the unionByName method, which is available since Spark 2.3.

Thank you for reading.


Albertus Kelvin
