Cloud Data Fusion: Using Spark SQL for Column Transformations

Justin Taras
3 min read · Feb 26, 2024

TL;DR: While data transformation tools like Wrangler offer extensive features, you may occasionally require custom functionality, such as complex case statements across multiple columns. In these scenarios, the Scala Spark plugin allows you to write custom SQL for scalable data transformations.

Using the Scala Plugin

The Scala plugin is an optional plugin you can add through the Hub that lets you write custom Scala code against the data in your pipelines. The plugin receives its input from the upstream plugin, and the data is represented as either an RDD or a DataFrame depending on your requirements. The schema of the RDD/DataFrame matches the schema of the input, so you know which columns you’ll be working with. The Spark SQL libraries are pre-loaded, which makes it easy to start writing your own SQL code to process your data. Below is an example pipeline with the Scala plugin added.

Pipeline with Scala Plugin. See “Spark Transform”

The plugin has many libraries pre-loaded, but you can optionally add more as needed. There’s an input at the bottom of the plugin page that allows you to add additional libraries.

I set “Compile at Deployment Time” to false. This skips validation of the code at the plugin level; otherwise you may get some strange errors during validation. I always set it to false so I can run validation tests during pipeline preview and catch any syntax or runtime errors there.
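Before getting to the example, it helps to see the shape of the code the plugin expects: a transform function that receives the input and returns the output. The skeleton below is a minimal DataFrame-based sketch; the exact signature (and the RDD-based variant) can differ between plugin versions, so treat it as an assumption to verify against the plugin’s documentation.

import org.apache.spark.sql.DataFrame

// Minimal passthrough skeleton: the incoming DataFrame carries the
// upstream plugin's schema as its columns, and the DataFrame you return
// is what gets handed to the downstream plugin.
def transform(df: DataFrame): DataFrame = {
  df // replace with your own transformation logic
}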

Use Case: Complex Case Logic

In this example we are using the built-in sample datasets that come with Data Fusion. This dataset contains the US states and their abbreviations. I’d like to do some silly transformations on the data that would be kind of difficult to do in Wrangler, especially across multiple columns. Below is the code I used to apply my transformations.

SQL Case Statement Logic in Scala Plugin

I first create a Spark session for running the SQL commands. I then take the input DataFrame and create a temporary view so I can use the power of Spark SQL to query it. Most transforms in Wrangler are single-column based, so you need to write custom JEXL expressions if your transformation needs to take the values of other columns into account.
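If you want to adapt this, here is a minimal sketch of the kind of code that goes into the plugin editor, assuming the DataFrame-based transform signature and a simple two-column (name, abbreviation) schema; verify both against your own pipeline.

import org.apache.spark.sql.{DataFrame, SparkSession}

def transform(df: DataFrame): DataFrame = {
  // Get (or create) a SparkSession for running SQL commands.
  val spark = SparkSession.builder().getOrCreate()

  // Register the incoming DataFrame as a temporary view for Spark SQL.
  df.createOrReplaceTempView("states")

  // Apply case logic that looks at multiple columns in plain SQL.
  spark.sql("""
    SELECT
      CASE
        WHEN name LIKE 'South%' AND abbreviation = 'SD' THEN 'Go Bison!'
        ELSE name
      END AS name,
      abbreviation
    FROM states
  """)
}

Note that the returned DataFrame’s columns should line up with the output schema you configure on the plugin, so the SELECT needs to carry every column you want to pass downstream.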

Here is the equivalent JEXL expression for the last case statement condition, where I’m looking for states that begin with “South” and have the abbreviation “SD”:


set-column :name (name =^ "South " && abbreviation == "SD") ? "Go Bison!" : name

JEXL is a much more developer-centric approach and, while very powerful, can be tricky to use. It took me a while to figure out how to write this, though Google Gemini was pretty good at getting me 95% of the way there. SQL, on the other hand, is fairly ubiquitous and can be very simple to implement.

CASE
  WHEN name LIKE 'South%' AND abbreviation = 'SD' THEN 'Go Bison!'
  ELSE name
END

Concluding Thoughts

SQL is a highly effective language for data transformation. For straightforward transformations, Wrangler is an excellent choice due to its extensive feature set and ease of use. However, when encountering more complex scenarios that Wrangler may not fully support, the Scala Plugin seamlessly bridges the gap. It offers a user-friendly interface to implement custom transformations, leveraging the power of SQL.


Justin Taras

I’m a Google Customer Engineer interested in all things data. I love helping customers leverage their data to build new and powerful data-driven applications!