Cloud Data Fusion: Using Terraform to run ephemeral Data Fusion Instances

Justin Taras
6 min read · Jun 3, 2022

TL;DR Some users of Data Fusion only have a small number of pipelines to run on a daily basis. This can make an always-on Data Fusion deployment expensive, considering it will sit idle the majority of the time. You can leverage Terraform to automate the deployment of Data Fusion environments for situations where you only need the platform for a short duration of time.

Data Fusion Deployments

Many enterprise customers today run Data Fusion as their core ETL engine. In these situations, Data Fusion is continually running pipelines around the clock, so the deployment of Data Fusion is typically a one-time event. But what about smaller businesses that have just a handful of batch pipelines to run on a daily or weekly basis?

A Data Fusion Enterprise deployment runs around $3K/month ($3.40/hr) plus Dataproc execution costs. For a business where data pipelines are running 24/7, the instance cost is negligible. But for a business with a small number of pipelines running infrequently, the cost of an always-running Data Fusion instance can be high relative to the utility of the service.

In these situations, Terraform can be leveraged to dramatically reduce the cost of running the Data Fusion instance. Continuing the example above, if a Data Fusion instance were deployed for 4 hours a day to run pipelines, it would only cost on average 4 hours x $3.40/hr x 30 days = $408 per month plus Dataproc execution costs. This can make running pipelines on Data Fusion more attainable for smaller businesses that want to use the service but don't have the pipeline volume to justify the investment.

A question that often comes up is: why not use a Basic or Developer instance to save cost? For running production pipelines, Enterprise is the recommended deployment configuration. It has the highest level of availability of the three editions and can also support a larger number of concurrently running pipelines. Basic and Developer are good for development and testing but shouldn't be used for production. Read here to find out more about the differences between the Data Fusion editions.

Deploy an Instance with Terraform

Deploying a Data Fusion instance with Terraform is fairly straightforward. In this example we'll be deploying a private IP-based Data Fusion instance with an already configured IP reservation. In situations where we are continually redeploying the environment, it makes sense to have that reservation already defined so it's one less thing we have to add to our Terraform workflow.

In most production deployments, VPC peering will not be required. VPC peering is used when developing pipelines in Data Fusion Studio. Studio runs in the Data Fusion tenant VPC and therefore doesn't natively have connectivity to on-premises data sources. Peering the Data Fusion VPC with the user VPC bridges those networks and enables connectivity back to on-premises systems or data sources running on GCE (for example, a database on GCE). In production, the deployed pipelines run on Dataproc in the user's VPC, so they have connectivity by default. In this example, we will not be peering the VPCs for this very reason: this is a production system that will be running pipelines, not developing them. To read up on the different Data Fusion deployment models, check out this page.

My Terraform workflow and folder layout look like the following.

Terraform Folder Layout

The workflow is distributed across 4 files.

  • main.tf contains the resource configuration to deploy the instance
  • terraform.tfvars maps the variables to values
  • variables.tf contains the variable definitions
  • versions.tf contains the provider configuration

Below is my main.tf file which contains the resource I want to deploy. Note that I’ve parameterized most aspects of the configuration. I want to make this as flexible as possible so it can be used beyond just redeploying a daily production instance.

main.tf file
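
As a rough sketch, a parameterized private instance resource looks something like this. The variable names are illustrative, not necessarily the exact ones from my file:

```hcl
# main.tf -- illustrative sketch of a parameterized, private Data Fusion instance.
# Variable names (var.instance_name, var.ip_allocation, etc.) are examples.
resource "google_data_fusion_instance" "instance" {
  name    = var.instance_name
  project = var.project_id
  region  = var.region
  type    = var.instance_type   # "ENTERPRISE" for production deployments
  version = var.cdf_version     # e.g. "6.7.1"

  private_instance              = true
  enable_stackdriver_logging    = true
  enable_stackdriver_monitoring = true

  network_config {
    network       = var.network        # the user VPC
    ip_allocation = var.ip_allocation  # the pre-configured IP reservation, e.g. "10.89.48.0/22"
  }
}
```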

Next up is the variables.tf file. This file contains the variables I want to use in my configuration. You can see them referenced in the main.tf file with a var. prefix.

variables.tf
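
A sketch of the matching variable definitions, again with illustrative names and descriptions:

```hcl
# variables.tf -- illustrative definitions for the variables referenced in main.tf
variable "project_id" {
  type        = string
  description = "GCP project to deploy the instance into"
}

variable "region" {
  type        = string
  description = "Region for the Data Fusion instance"
}

variable "instance_name" {
  type        = string
  description = "Name of the Data Fusion instance"
}

variable "instance_type" {
  type        = string
  description = "BASIC, DEVELOPER, or ENTERPRISE"
  default     = "ENTERPRISE"
}

variable "cdf_version" {
  type        = string
  description = "Data Fusion version to deploy"
}

variable "network" {
  type        = string
  description = "User VPC associated with the private instance"
}

variable "ip_allocation" {
  type        = string
  description = "Pre-reserved IP range for the tenant project peering"
}

variable "credentials_file" {
  type        = string
  description = "Path to the service account key used to provision the instance"
}
```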

The last file is the versions.tf file. This contains the provider we want to use along with its configuration. Most important is the mapping we provide to the service account we will use to provision Data Fusion. Note the credentials variable pointing to the service account used below.

versions.tf
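
A minimal sketch of that file, assuming a credentials_file variable that points at the service account key:

```hcl
# versions.tf -- provider requirements and configuration (illustrative).
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 4.0"
    }
  }
}

# The credentials argument points at the service account key used to
# provision Data Fusion.
provider "google" {
  credentials = file(var.credentials_file)
  project     = var.project_id
  region      = var.region
}
```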

Last is the variable mapping file. Here we map the variables defined in variables.tf to the values we want.

terraform.tfvars file
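
For example, a terraform.tfvars along these lines; the values shown are placeholders to swap for your own:

```hcl
# terraform.tfvars -- example values only; substitute your own.
project_id       = "my-gcp-project"
region           = "us-central1"
instance_name    = "daily-batch-cdf"
instance_type    = "ENTERPRISE"
cdf_version      = "6.7.1"
network          = "my-vpc"
ip_allocation    = "10.89.48.0/22"
credentials_file = "./keys/terraform-sa.json"
```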

Once these files have been configured and saved, you can run your Terraform code to deploy Data Fusion. Run terraform init to initialize your working directory.

Running Terraform init

terraform plan describes what Terraform will execute. You can see the configuration that was generated from the variables and what infrastructure will be deployed.

Running Terraform Plan

Finally, terraform apply deploys the configuration. After 10–15 minutes, the Data Fusion instance will be fully deployed and available.

Running Terraform Apply

What about adding pipelines?

In our example, we want our instance to be fully configured with pipelines ready to go, right? To do that, we'll need to modify our Terraform workflow to add the pipelines after the instance is deployed. In this example, I pulled my pipelines into a local repository and referenced the full path in the resource. I also added dependencies so that the pipelines would only be added once the instance is fully deployed.

Modified main.tf file with added resources for pipelines
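
As a sketch, the added resources look something like the following. This assumes the community CDAP provider's cdap_application resource, which takes a pipeline name and the exported pipeline JSON; the pipeline name and file path are placeholders, so verify the exact schema against the provider docs:

```hcl
# Appended to main.tf -- load an exported pipeline once the instance exists.
# "gcs_to_bq_daily" and the pipelines/ path are hypothetical examples.
resource "cdap_application" "gcs_to_bq_daily" {
  name = "gcs_to_bq_daily"
  spec = file("${path.module}/pipelines/gcs_to_bq_daily.json")

  # Only add the pipeline after the instance is fully deployed.
  depends_on = [google_data_fusion_instance.instance]
}

# Repeat a cdap_application block for each exported pipeline JSON file.
```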

Since I’m using a different provider, I’ll have to modify my versions file. Note the new CDAP provider in the required providers block. The provider itself has two items of note:

  • host: the endpoint of the Data Fusion instance.
  • token: to interact with the Data Fusion API you’ll need an access token; the provider uses a generated token to load your pipelines.

Modified versions.tf file with new CDAP provider
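
A sketch of the updated file, assuming the community CDAP provider and an access token pulled from the active Google credentials via the google_client_config data source. The endpoint pattern shown is an assumption to verify against your instance's actual API URL:

```hcl
# versions.tf with the CDAP provider added (illustrative).
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 4.0"
    }
    cdap = {
      # Community CDAP provider; verify the source address in the Terraform registry.
      source = "GoogleCloudPlatform/cdap"
    }
  }
}

provider "google" {
  credentials = file(var.credentials_file)
  project     = var.project_id
  region      = var.region
}

# Access token generated from the active Google credentials.
data "google_client_config" "current" {}

provider "cdap" {
  # Assumed endpoint pattern:
  # https://<instance>-<project>-dot-<region short-name>.datafusion.googleusercontent.com/api/
  host  = "https://${var.instance_name}-${var.project_id}-dot-${var.cdf_region_map[var.region]}.datafusion.googleusercontent.com/api/"
  token = data.google_client_config.current.access_token
}
```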

We will need to modify the variables file to include entries for project_id and a map of region short-names (cdf_region_map). The Data Fusion instance endpoint uses the region short-name as part of its URL; the variable maps each full region name to its short-name so we can build the endpoint.

Modified variables.tf file with new variables for region mapping and project_id
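
As a sketch, the region map variable could look like this (project_id is already declared in the earlier variables sketch). The short-name values are examples that you should verify against your instance's endpoint URL:

```hcl
# Added to variables.tf -- maps a full region name to the short-name used in
# the instance endpoint URL. Entries shown are illustrative examples.
variable "cdf_region_map" {
  type = map(string)
  default = {
    "us-central1" = "usc1"
    "us-west1"    = "usw1"
  }
}
```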

Once the changes have been incorporated, rerun terraform apply to redeploy the instance.

Freshly deployed Data Fusion Instance

Check in Data Fusion to see the three pipelines we loaded.

Freshly deployed pipelines!

We now have a deployed Data Fusion instance with our pipelines ready to launch!

Concluding Thoughts

While Data Fusion is intended to be a service you deploy once and continually use, Terraform can help in situations where you only need the service periodically and don’t want to spend money on idle resources. Beyond that, it can be a really powerful way to simplify the configuration and deployment of your Data Fusion environments.

To run this in production, look to a tool like Jenkins that can automate the build-out of your environment on a scheduled basis. Then you can use an external scheduler like Cloud Composer to run your pipelines.

The only real downside to this ephemeral approach is that any metadata or lineage collected as part of the pipelines’ execution is lost when the instance is deleted. If keeping this metadata is important, I’d recommend exploring a persistent Data Fusion environment.

In the next article we’ll explore some advanced configurations: adding namespaces, adding compute profiles, and customizing your Data Fusion environment.


Justin Taras

I’m a Google Customer Engineer interested in all things data. I love helping customers leverage their data to build new and powerful data-driven applications!