Cloud Data Fusion: Using RBAC to Enforce Data Access

Justin Taras
8 min read · Feb 1, 2023

TL;DR You can use a combination of RBAC and pipeline service accounts to scope data access for a team or project to just the data required for their development. This shifts the authorization pattern from Dataproc to the actual pipeline itself.

Securing Data Fusion with RBAC and Namespaces

RBAC (Role-Based Access Control) is a relatively new feature that enables you to enforce access controls on user principals, service accounts, or groups. This allows an admin to control what actions a user can perform, like creating a pipeline or adding a secure key.

These RBAC controls are tightly integrated with the concept of Namespaces in Data Fusion. Namespaces are logical containers for managing teams, projects, or pipelines, which helps with organizing pipelines in a multi-tenant deployment.

Combined with RBAC, Namespaces enable an admin to isolate teams and projects from each other and scope a principal’s access to just the Namespaces they require. For instance, you may have pipelines that handle financial data that you don’t want someone from marketing viewing. There are several roles that can be assigned to a principal to scope their access:

  • Viewer: View pipelines but not run them in a namespace
  • Operator: Run/configure deployed pipelines in a namespace
  • Developer: Design and deploy pipelines in a namespace
  • Editor: Full access to all resources in a namespace
  • Instance Admin: Full access to all resources in all namespaces

For information on what permissions each role has, please visit this link for the breakdown.

What’s the Problem?

When Data Fusion is deployed, a special service account is created with very broad data access permissions (the Data Fusion API Service Agent). This account provides data access during pipeline development as well as when pipelines are previewed (read-only access). This service account is oftentimes added to other GCP projects where data required for development lives. The challenge is that this service account is shared across namespaces, so every team gets the same data access. For some customers this isn’t an issue, but for others with strict rules governing data access, it can be problematic.

Connection Profiles and RBAC to the Rescue!

To get around the broad permissions of the instance service account, we can build connection profiles in Wrangler that override it. This approach ensures data access can be scoped per namespace to just what each team needs.

Let’s look at the following scenario. Say you have a pipeline you need to develop and you don’t want to assign the instance service account to the project containing the data. As a Data Fusion admin, you would do the following to scope access to just the data required:

  • Create a new namespace
  • Create a service account with scoped permissions
  • Load the service account key into the Data Fusion secure store
  • Create a connection profile for the data using the service account loaded in the previous step
  • Assign users the developer role in the namespace
  • Build Pipelines!

Let’s go through this piece by piece to build this out.

First, create a new namespace for this pipeline development work. If you want to automate this via Terraform, see this link.
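If you’d rather script this step than click through the UI (or use Terraform), here’s a minimal sketch using the CDAP namespace REST API. The instance, region, and namespace names below are placeholders, and it assumes you’re authenticated with gcloud.

import subprocess
import requests

# Placeholders: swap in your instance, region, and namespace name.
INSTANCE = "my-instance"
REGION = "us-central1"
NAMESPACE = "my-team-ns"

# Look up the instance's CDAP API endpoint and reuse gcloud's credentials.
CDAP_ENDPOINT = subprocess.check_output(
    ["gcloud", "beta", "data-fusion", "instances", "describe", INSTANCE,
     "--location", REGION, "--format", "value(apiEndpoint)"],
    text=True,
).strip()
TOKEN = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()

# CDAP creates (or updates) a namespace with a PUT to /v3/namespaces/<name>.
resp = requests.put(
    f"{CDAP_ENDPOINT}/v3/namespaces/{NAMESPACE}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"description": "Isolated namespace for this team's pipelines"},
)
resp.raise_for_status()
print(resp.status_code, resp.text)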

Create a new namespace

Next, create a service account that you want to use for data access in this namespace. Once the service account is generated, create an access key for it in JSON format. Take the output and upload it to the Data Fusion secure store in the namespace defined above. I detail all the steps involved in this blog post: https://medium.com/@jtaras/cloud-data-fusion-adding-a-service-account-to-the-secure-store-8fb987af917a. While adding a key doesn’t have a Terraform module, you can read about doing it via the REST API here.
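As a rough sketch of that REST route (the blog post above walks through the UI), the CDAP secure store accepts a PUT with the key material in the request body. The endpoint placeholder and key file name below are assumptions for illustration; mykeyname is the key name we’ll reference later in the secure macro.

import subprocess
import requests

CDAP_ENDPOINT = "https://<instance-api-endpoint>"  # from gcloud beta data-fusion instances describe
NAMESPACE = "my-team-ns"  # the namespace created earlier (placeholder)
KEY_NAME = "mykeyname"    # referenced later as ${secure(mykeyname)}

TOKEN = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()

# Read the downloaded service account key (hypothetical file name) and store it
# as a secure key; Data Fusion keeps it encrypted and out of pipeline configs.
with open("sa-key.json") as f:
    key_material = f.read()

resp = requests.put(
    f"{CDAP_ENDPOINT}/v3/namespaces/{NAMESPACE}/securekeys/{KEY_NAME}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "description": "Scoped service account for this namespace's BigQuery access",
        "data": key_material,
        "properties": {},
    },
)
resp.raise_for_status()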

With the service account key loaded into the secure store, we can now create a connection that uses it to limit the scope of access. Navigate to Wrangler and click “Add Connection”.

Click Add Connection and Select BigQuery

In the connection profile, provide a name for the connection and add the secure macro to the service account JSON field. Note, you’ll need to toggle the radio button from “File Path” to JSON. In this example I’m adding the key I loaded in the prior step using the secure macro format: ${secure(mykeyname)}

Create Connection with Secure Macro

Also note the fields Project ID and Dataset Project ID. The Project ID field represents the project WHERE the BigQuery jobs will run. By default that is the Data Fusion service project (auto-detect). Whatever this project is, the service account must have the BigQuery Job User role applied in it…otherwise, when you try to get a sample of the data, the job will time out waiting for resources. The Dataset Project ID is used when the dataset exists in a project other than the one the BigQuery job runs in. If you want to perform this action via the REST API, click here.
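For reference, here’s roughly what that REST call can look like; it goes through the studio service of the system pipeline app. I’m approximating the path and the BigQuery connector property names, so treat this as a sketch and confirm the exact payload against the linked docs (or against a connection you created in the UI) for your Data Fusion version.

import subprocess
import requests

CDAP_ENDPOINT = "https://<instance-api-endpoint>"  # from gcloud beta data-fusion instances describe
NAMESPACE = "my-team-ns"   # placeholder namespace
CONNECTION = "finance-bq"  # placeholder connection name

TOKEN = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()

# ASSUMPTION: the path and property names below approximate the connection
# management API; verify them against the documentation for your version.
connection_spec = {
    "name": CONNECTION,
    "category": "BigQuery",
    "plugin": {
        "name": "BigQuery",
        "type": "connector",
        "properties": {
            "project": "auto-detect",                 # project the BigQuery jobs run in
            "datasetProject": "data-owning-project",  # placeholder project that holds the dataset
            "serviceAccountType": "JSON",
            # CDAP resolves this macro at runtime; it is not a Python template.
            "serviceAccountJSON": "${secure(mykeyname)}",
        },
    },
}

resp = requests.put(
    f"{CDAP_ENDPOINT}/v3/namespaces/system/apps/pipeline/services/studio"
    f"/methods/v1/contexts/{NAMESPACE}/connections/{CONNECTION}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=connection_spec,
)
resp.raise_for_status()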

When you click on the connection, it will show that you don’t have any entries…which is correct, because the service account doesn’t yet have permission to access the tables in the dataset.

No Data Available!

With the connection in place, we can now focus on assigning the proper roles. For the service account you created, you will need to add the appropriate role on the BigQuery dataset we wish to access.

Add Viewer Permissions For the Dataset
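If you’d rather script that grant than click through the console, here’s a minimal sketch with the google-cloud-bigquery client; the dataset and service account names are placeholders.

from google.cloud import bigquery

# Placeholders: the project that owns the dataset, the dataset ID, and the
# scoped service account created for this namespace.
DATASET = "data-owning-project.finance_dataset"
SA_EMAIL = "namespace-sa@my-datafusion-project.iam.gserviceaccount.com"

client = bigquery.Client()
dataset = client.get_dataset(DATASET)

# Append a READER entry for the service account, the dataset-level
# equivalent of the viewer grant shown in the screenshot above.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(role="READER", entity_type="userByEmail", entity_id=SA_EMAIL)
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])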

When we go back into the Data Fusion Wrangler UI and select the newly created BigQuery Connection we should now see the dataset!

Only the dataset we granted permissions for shows up!

Clicking on the dataset now shows the tables available for access!

Tables available for access.

When you click on a table you should be brought into the Wrangler UI. If the Wrangler UI just sits there and spins, that’s a good indication that you don’t have the BigQuery Job User role applied to your service account in the project you want the job to run in.

Wrangler UI with sampled data

Enforcing with RBAC

So how does RBAC come into the picture now that we can scope the access? The Editor role is the second most privileged role after the Instance Admin role. The Editor role has the special permission of being able to create defined connections. This is important because this role can set up all the connections that are available to users with the Developer role. The Developer role DOES NOT have the ability to create/edit/delete a connection, only to consume what is available. This model is incredibly powerful because it allows an admin to scope a namespace to JUST the data sources required for the use case or team. To demonstrate this, I have created a new user to access the namespace we created. Note, you can add these roles using the REST API shown here.
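As a sketch of how that grant might be scripted (the exact condition format for namespace-scoped roles is spelled out in the RBAC docs linked earlier, so verify the resource name before using it), this adds a conditional binding for the predefined roles/datafusion.developer role on the project hosting the instance. All of the names below are placeholders.

import subprocess
import requests

PROJECT_ID = "my-datafusion-project"   # project hosting the Data Fusion instance
MEMBER = "user:demo.user@example.com"  # the demo user
ROLE = "roles/datafusion.developer"

# ASSUMPTION: this condition scopes the role to a single namespace; confirm the
# exact resource.name format against the RBAC documentation.
CONDITION = {
    "title": "my-team-ns-only",
    "expression": (
        'resource.name == "projects/my-datafusion-project/locations/us-central1'
        '/instances/my-instance/namespaces/my-team-ns"'
    ),
}

TOKEN = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()
headers = {"Authorization": f"Bearer {TOKEN}"}
crm = f"https://cloudresourcemanager.googleapis.com/v1/projects/{PROJECT_ID}"

# Read-modify-write the project IAM policy; version 3 is required for conditions.
policy = requests.post(
    f"{crm}:getIamPolicy",
    headers=headers,
    json={"options": {"requestedPolicyVersion": 3}},
).json()
policy["version"] = 3
policy.setdefault("bindings", []).append(
    {"role": ROLE, "members": [MEMBER], "condition": CONDITION}
)
requests.post(f"{crm}:setIamPolicy", headers=headers, json={"policy": policy}).raise_for_status()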

Add the Demo User with the developer role to the newly created namespace

Once the permissions are added, click on the user to see the assigned developer role.

Assigned Permissions on the Namespace

Now, there is a caveat with this approach. For this to work, the user cannot have the Project Viewer role in the project that Data Fusion is deployed in, as it will override any RBAC-defined role. You can get a breakdown of this here for more information. From a best-practices perspective, Data Fusion should be in its own service project to track costs and enforce security. If this is the case, removing the Project Viewer role from users shouldn’t be a problem.

So, how do users get access without seeing the instance URL in the service UI? The URL can be easily constructed for users based on the format below:

Go to specific namespace
https://INSTANCE_NAME-PROJECT_ID.REGION_NAME.datafusion.googleusercontent.com/cdap/ns/NAMESPACE_NAME

or go to default namespace
https://INSTANCE_NAME-PROJECT_ID.REGION_NAME.datafusion.googleusercontent.com/cdap/ns
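If you’re handing these links out regularly, a tiny helper keeps the format straight (the values in the example call are placeholders):

def namespace_url(instance, project, region, namespace=None):
    """Build the Data Fusion UI link for a namespace (or the default one)."""
    base = f"https://{instance}-{project}.{region}.datafusion.googleusercontent.com/cdap/ns"
    return f"{base}/{namespace}" if namespace else base

print(namespace_url("my-instance", "my-project", "us-central1", "my-team-ns"))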

With this URL, users can log into the specific namespace in which they have permissions. They can also navigate to other namespaces once they are in the UI. If they try to access a namespace they don’t have permissions for, they will see the following error:

Unauthorized Access

If users log into the correct namespace, they are presented with the standard homepage. When they go into Wrangler and view the connections, they will only see the ones already predefined.

Here’s the single connection that’s available:

The only available defined connection.

If they try to create a new connection, they will receive an error!

Error when creating a new connection in wrangler

Once a user has selected a connection they can then create a pipeline! As I write this, there is a bug associated with using macros with connections. This will be resolved as of 6.7.3 or 6.8.1. The workaround is to disable the connection and manually add the service account JSON and project ID details.

For example, here is my BigQuery plugin with my connection enabled.

Enabled Connection in Plugin

Here’s my modified plugin to get around the bug mentioned above.

It’s an easy change to make, but one that you won’t have to worry about in the next release.

Concluding Thoughts

RBAC on Data Fusion can be an incredibly powerful tool to limit users’ access to data sources. While it does require up-front administrative work, it can greatly improve data security by delivering isolated development environments within a single instance.

Some food for thought:

  • To best take advantage of RBAC, create your Data Fusion instance in its own, standalone project. That way the default instance service account won’t be able to access any data already in the project.
  • The Editor role is very powerful. Use it sparingly. Most users should only have the Developer role.
  • Take some time upfront to design how you want to organize your pipelines with namespaces.
  • Leverage Google Groups when assigning RBAC permissions. That way, you don’t have to do it on a principal by principal level.
  • If you take the approach of using authorization at the plugin level, you don’t need data access roles on your Dataproc service account! This shifts the authorization for data to the pipeline rather than with Dataproc. This way you can have a single service account with minimal permissions used across teams.
  • If you use authorization at the pipeline level, make sure you use secure keys to store the service account…otherwise it will show up as plaintext in the logs! Handle service accounts with care!

Happy Pipelining!

Justin Taras

I’m a Google Customer Engineer interested in all things data. I love helping customers leverage their data to build new and powerful data driven applications!