Cloud Data Fusion: Customizing Compute Profiles at Runtime

Justin Taras
3 min read · Jan 4, 2022

With Data Fusion, pipeline developers can create custom compute profiles to “right size” the Dataproc instance running the pipeline. These profiles are independent of the pipeline itself, which enables developers to swap out profiles at runtime depending on how much data is being processed.

Select from a list of available compute profiles at runtime

In the image above, a user can select the compute profile that best fits their needs once the pipeline is deployed. Developers can further customize the selected compute profile by clicking its “Customize” link. Ultimately, the ability to customize existing compute profiles helps prevent compute profile proliferation: having a base template to build from reduces the overall number of compute profiles to build and manage.
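The same list of profiles is also available over CDAP’s REST API. Below is a minimal sketch that assumes a Data Fusion instance named datafusion12 in us-central1 (swap in your own instance name and region); it lists the user-defined compute profiles in the default namespace.

###get an access token and the instance's CDAP API endpoint
export AUTH_TOKEN=$(gcloud auth print-access-token)
export CDAP_ENDPOINT=$(gcloud beta data-fusion instances describe \
--location=us-central1 \
--format="value(apiEndpoint)" \
datafusion12)
###list the user-defined compute profiles in the default namespace
curl -H "Authorization: Bearer ${AUTH_TOKEN}" \
"${CDAP_ENDPOINT}/v3/namespaces/default/profiles"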

Customize elements of the compute profile

For instance, a profile may have a default number of Dataproc workers. For a given pipeline, the profile can be customized to increase or decrease that number depending on the resources required.

Modify the default settings of a profile for a deployed pipeline

From a governance standpoint, there may be some elements of a profile an admin doesn’t want changed. When the profile is initially created, each field that can be modified has a small padlock icon next to it. This gives admins a way to limit the customizability of the compute profile: clicking the padlock locks the given setting in place.

Establishing the default values for the compute profile

When a developer goes to customize one of those fields, it will no longer show up as a customizable option. Note that the Region, Network, and Subnet are now unavailable for customization.

The fields that were locked are no longer accessible
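Under the hood, the padlock corresponds to an editability flag on each property in the profile definition. The sketch below shows roughly what creating such a profile could look like through CDAP’s profile REST endpoint. The profile name “locked-dataproc” is made up for illustration, and the isEditable flag is an assumption based on the format of an exported profile, so check it against a profile you export yourself. It reuses the AUTH_TOKEN and CDAP_ENDPOINT variables from the snippet above.

###sketch: create a profile whose region, network, and subnet cannot be changed at runtime
###("locked-dataproc" and the isEditable flag are illustrative; compare with an exported profile)
curl -X PUT -H "Authorization: Bearer ${AUTH_TOKEN}" \
"${CDAP_ENDPOINT}/v3/namespaces/default/profiles/locked-dataproc" \
-d '{
  "description": "Dataproc profile with locked network settings",
  "provisioner": {
    "name": "gcp-dataproc",
    "properties": [
      {"name": "region", "value": "us-central1", "isEditable": false},
      {"name": "network", "value": "default", "isEditable": false},
      {"name": "subnet", "value": "default", "isEditable": false},
      {"name": "workerNumNodes", "value": "2", "isEditable": true}
    ]
  }
}'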

Customizing compute profiles through REST calls

If a developer wants to customize the compute profile of a pipeline, they can supply the elements they want to modify as runtime parameters. In the example below, I want to use a user-defined compute profile named “test” and make the following changes to the profile:

  • Increase the number of workers from 2 to 3
  • Increase the worker disk size to 500 GB
  • Change the worker disk type to pd-ssd
  • Enable Secure Boot (secureBootEnabled=true)
  • Enable vTPM (vTpmEnabled=true)
  • Enable integrity monitoring (integrityMonitoringEnabled=true)

The code below walks through passing these cluster profile parameters in at runtime. The customized parameters are supplied in JSON format as the POST payload.

###set env vars
export AUTH_TOKEN=$(gcloud auth print-access-token)
export INSTANCE_ID=datafusion12
export REGION=us-central1
###get data fusion endpoint
export CDAP_ENDPOINT=$(gcloud beta data-fusion instances describe \
--location=$REGION \
--format="value(apiEndpoint)" \
${INSTANCE_ID})
###start the job with the customized compute profile
curl -X POST -H "Authorization: Bearer ${AUTH_TOKEN}" \
"${CDAP_ENDPOINT}/v3/namespaces/default/apps/dfdf_v6/workflows/DataPipelineWorkflow/start" \
-d '{
  "system.profile.name":"test",
  "system.profile.properties.workerNumNodes":3,
  "system.profile.properties.workerDiskGB":500,
  "system.profile.properties.workerDiskType":"pd-ssd",
  "system.profile.properties.secureBootEnabled":"true",
  "system.profile.properties.vTpmEnabled":"true",
  "system.profile.properties.integrityMonitoringEnabled":"true"
}'

Note that the format for submitting custom parameters follows the pattern: system.profile.properties.[property name]

*Note: If a developer customizes a parameter whose modification the admin has “locked,” the pipeline will ignore the user-provided value and use the default.
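A quick way to confirm the run was submitted is to list the workflow’s run records, reusing the variables set above (same app and workflow names as in the start request):

###check the status of the run we just started
curl -H "Authorization: Bearer ${AUTH_TOKEN}" \
"${CDAP_ENDPOINT}/v3/namespaces/default/apps/dfdf_v6/workflows/DataPipelineWorkflow/runs"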

While this example is quite simple, these customizations could be extended to specify which service account to use or which project to run the job in. The full list of property names can be found by exporting a custom compute profile.
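One way to pull that list is to fetch the profile definition over REST and read the property names from the response (here the “test” profile from the example above):

###export the "test" profile to see the full set of property names
curl -H "Authorization: Bearer ${AUTH_TOKEN}" \
"${CDAP_ENDPOINT}/v3/namespaces/default/profiles/test"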

For more information on compute profiles and how to use them, check out the links below:


Justin Taras

I’m a Google Customer Engineer interested in all things data. I love helping customers leverage their data to build new and powerful data-driven applications!