TIL: Quick Dataframe Column Rename in Spark

August 31, 2020
[scala] [spark] [pyspark] [python] [big data]

This is a quick and useful tip I learned recently. It's quite common to need to rename the columns of a DataFrame, and it turns out this is quite straightforward in Spark. The following example shows how you can replace `.` characters in column names:

  df.toDF(df.columns.map(x => x.replace(".", "_")): _*)

The same thing can be achieved in PySpark as follows:

  df.toDF(*[c.replace(".", "_") for c in df.columns])
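
To see it end to end, here's a minimal self-contained PySpark sketch; the `user.id`/`user.name` column names are just made up for illustration:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # A small DataFrame whose column names contain dots.
  df = spark.createDataFrame(
      [(1, "alice"), (2, "bob")],
      ["user.id", "user.name"],
  )

  # toDF replaces all column names at once, so we rebuild the full list.
  renamed = df.toDF(*[c.replace(".", "_") for c in df.columns])
  print(renamed.columns)  # ['user_id', 'user_name']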

I'm using dot replacement as an example, of course; any other text transformation works here too. Still, this particular rename is quite useful when you consume data from sources where `.` has no special meaning. Without it, whenever we want to reference such a column we have to escape it in backticks, as `` `column.name` ``, because Spark otherwise interprets the dot as access to a nested column. Of course, another solution would be casting the data into a nested column structure, but that deserves another post 😉.
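
To make the escaping issue concrete, here's a short sketch continuing with the hypothetical `user.id` column from above:

  # Fails with an analysis error: Spark reads the dot as
  # "field id of a struct column named user".
  # df.select("user.id")

  # Works, but the backticks are easy to forget:
  df.select("`user.id`").show()

  # After the rename, no escaping is needed:
  renamed.select("user_id").show()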
