Cloud Data Fusion: Tracking Pipeline Spend

Justin Taras
3 min read · Apr 19, 2024


TL;DR: Using cluster labels in compute profiles is a great way to track spend at the pipeline level.

Data Fusion Workload Tracking

The primary engine for running data pipelines in Data Fusion is Dataproc. Data Fusion costs are based on a monthly fee per instance type, while Dataproc costs are primarily driven by the number of pipelines and the amount of data processed.

For many customers, running pipeline jobs within a specific project allows all pipeline spending to be accurately accounted for in that project’s billing. However, there are scenarios where a centralized team might run pipelines for various business units, or a centralized deployment is used in a multi-tenant fashion. In these cases, tracking the various workloads running on Dataproc becomes important.

The key to this is using cluster labels. A label is a key-value pair that you can attach to a pipeline or a compute profile, enabling grouping in cloud billing reports. With these labels, you can track individual pipelines or group collections of pipelines as workloads, allowing you to monitor spending effectively.

You can apply cluster labels at three layers within Data Fusion, and this article will look at each approach:

  • Compute Profiles
  • Pipeline Level
  • Runtime Arguments

Adding Cluster Labels to Compute Profiles

When you design a compute profile, you can optionally add a label that will be applied to every Dataproc cluster provisioned by a pipeline that uses the profile. This is helpful in situations where you have namespaces configured by team or by workload. In the example below, I’m using the key “workload” and the value “marketing” for the cluster label.

Use Cluster Label setting to apply default labels on compute profiles
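If you manage profiles through the CDAP REST API rather than the UI, you can also read the profile back to confirm the label stuck. Here’s a minimal sketch: the profile name marketing-profile is a hypothetical example, and it reuses the AUTH_TOKEN and CDAP_ENDPOINT variables set up in the API example later in this post.

# Inspect a compute profile to confirm the label property is set
# ("marketing-profile" in the "default" namespace is a hypothetical name;
# AUTH_TOKEN and CDAP_ENDPOINT are set as shown in the API example below)
curl -s -H "Authorization: Bearer ${AUTH_TOKEN}" \
"${CDAP_ENDPOINT}/v3/namespaces/default/profiles/marketing-profile"

# The Dataproc provisioner properties in the response should include an
# entry like: {"name": "clusterLabels", "value": "workload|marketing"}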

Adding Cluster Labels to Deployed Pipelines

You can also add this label to already-deployed pipelines by saving it as a runtime argument directly on the pipeline. Whenever the pipeline runs, the label is applied to the provisioned Dataproc cluster. In the example below, I’m using the key “workload” and the value “marketing” for the cluster label. Note that the key:value pair is supplied through the runtime argument system.profile.properties.clusterLabels.

Apply Labels to the Deployed Pipeline’s Runtime Arguments
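Saving runtime arguments in the UI has the same effect as setting preferences on the application through the CDAP preferences API. Here’s a minimal sketch, assuming a deployed pipeline named test in the default namespace and the same AUTH_TOKEN and CDAP_ENDPOINT variables from the API example below:

# Save the cluster label as an application preference so every run of the
# deployed pipeline picks it up (assumes an app named "test" in the
# "default" namespace)
curl -X PUT -H "Authorization: Bearer ${AUTH_TOKEN}" \
"${CDAP_ENDPOINT}/v3/namespaces/default/apps/test/preferences" \
-d '{"system.profile.properties.clusterLabels":"workload|marketing"}'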

Applying Labels at Runtime

If you’re using an external scheduler like Composer and would like to dynamically pass in labels at runtime, you can do so with runtime arguments. A runtime argument is a JSON payload submitted with a pipeline run API call that can customize aspects of the compute profile or the pipeline itself. In the example below, we pass the same runtime arguments shown in the UI as part of the JSON payload for the start-pipeline API call. While this example is REST-based, the same runtime configuration payload can be supplied in Composer through the CloudDataFusionStartPipelineOperator.

# Get an OAuth access token for the Authorization header
export AUTH_TOKEN=$(gcloud auth print-access-token)

# Set the Data Fusion instance name and region
export INSTANCE_ID=[instance_id]
export REGION=[region]

# Look up the instance's CDAP API endpoint
export CDAP_ENDPOINT=$(gcloud beta data-fusion instances describe \
--location=${REGION} \
--format="value(apiEndpoint)" \
${INSTANCE_ID})

# Start the pipeline, overriding the compute profile and cluster labels
# through runtime arguments in the JSON payload
curl -X POST -H "Authorization: Bearer ${AUTH_TOKEN}" \
"${CDAP_ENDPOINT}/v3/namespaces/default/apps/test/workflows/DataPipelineWorkflow/start" \
-d '{"system.profile.name":"SYSTEM:autoscaling-dataproc","system.profile.properties.clusterLabels":"workload|marketing"}'

Conclusion

How you configure your cluster labels will vary depending on how you design your compute profiles and namespaces. However, with a little configuration, you can track spending for teams and pipelines. Moreover, with this billing data flowing into BigQuery, teams can build reports that track spending over time, making it easy to spot trends in consumption within a workload, or to catch spikes and drop-offs that may indicate issues with either the source or the pipelines themselves.
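For example, with the standard billing export enabled, a query along these lines groups Dataproc spend by the workload label. This is a sketch, not a drop-in report: my-project.billing.gcp_billing_export_v1_XXXXXX is a placeholder for your own billing export table.

# Group Dataproc spend by the "workload" cluster label
# (the table below is a placeholder for your own standard billing export)
bq query --use_legacy_sql=false '
SELECT
  (SELECT l.value FROM UNNEST(labels) l WHERE l.key = "workload") AS workload,
  ROUND(SUM(cost), 2) AS total_cost
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`
WHERE service.description = "Cloud Dataproc"
GROUP BY workload
ORDER BY total_cost DESC'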


Written by Justin Taras

I’m a Google Customer Engineer interested in all things data. I love helping customers leverage their data to build new and powerful data driven applications!
