Cloud Data Fusion: Connecting to CloudSQL via SSL/TLS

Justin Taras
5 min read · Oct 25, 2022

TL;DR How to configure your Data Fusion pipelines to support SSL/TLS connectivity to CloudSQL MySQL.

Reasoning

I was recently working with a customer that needed SSL/TLS connectivity for all clients connecting to their CloudSQL MySQL environment. They wanted to use Data Fusion to move data incrementally into BigQuery. Sounded easy enough; this is a very common pattern we see with Data Fusion. In this scenario, I typically advise the customer to use the CloudSQL Auth Proxy to provide that secure connection, but they required fully end-to-end encrypted connections. If you know how the CloudSQL Auth Proxy works, the connection is encrypted between the Auth Proxy client and the CloudSQL service; however, the connection between the local MySQL client and the Auth Proxy client is not. In most situations this is fine because the CloudSQL Auth Proxy runs in an isolated environment, but that was not the case for this customer due to regulatory/security requirements. As a result, the customer had to configure their pipeline to use client-side SSL.

CloudSQL Configuration

Once a CloudSQL instance has been deployed and is running, click on the Connections tab, then “Security,” to view the settings for enabling SSL.

CloudSQL Connections Tab under Security; Check to Only Allow SSL Connections

Changing the setting to allow only SSL connections will require a restart. Once the instance has restarted, you can generate a new client certificate. This will be used to secure the connection between Data Fusion (and Dataproc, when pipelines are deployed) and CloudSQL MySQL.

CloudSQL Connections Tab under Security; Add new Client Certification for Connectivity

The output will look like the following, with server/client certificates and a client private key for authentication. Be sure to download all of the components before you close the window, and store them on your local machine so they can be used later.

New Client Certificate Output
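If you prefer the command line, the same configuration can be done with gcloud. A rough sketch, where the instance and certificate names are placeholders:

# Require SSL/TLS for all connections (this restarts the instance)
gcloud sql instances patch [INSTANCE_NAME] --require-ssl

# Create a new client certificate; the private key is written to client-key.pem
gcloud sql ssl client-certs create [CERT_NAME] client-key.pem --instance=[INSTANCE_NAME]

# Download the client certificate and the server CA certificate
gcloud sql ssl client-certs describe [CERT_NAME] --instance=[INSTANCE_NAME] --format="value(cert)" > client-cert.pem
gcloud sql instances describe [INSTANCE_NAME] --format="value(serverCaCert.cert)" > server-ca.pem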

Create KeyStore and TrustStore

Take the client credentials downloaded in the prior step and move them to Cloud Shell to create the required KeyStore and TrustStore. In truth, they can be moved to any platform that has internet connectivity and the Google Cloud SDK installed.

keytool -import -file server-ca.pem -alias MySQLCACert -keystore truststore

openssl pkcs12 -export -in client-cert.pem -inkey client-key.pem -name "mysqlclient" -out client-keystore.p12

keytool -importkeystore -srckeystore client-keystore.p12 -srcstoretype pkcs12 -srcstorepass welcome1 -destkeystore keystore -deststoretype jks -deststorepass welcome1

The first command imports the CloudSQL server certificate into your TrustStore. The second command converts your client key and certificate files into a PKCS #12 archive. The final command imports the PKCS #12 archive into your local KeyStore. For more information on this process, visit the MySQL page on configuring SSL connectivity.

Once these have been created, upload them to a Google Cloud Storage Bucket. We’ll need them stored there so we can source them when provisioning Dataproc.
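For example, something like the following copies both files up to the bucket (the bucket name and ssl/ prefix are placeholders that match the initialization action below):

gsutil cp keystore truststore gs://[bucket name]/ssl/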

Uploaded KeyStore and TrustStore to GCS Bucket

Dataproc Initialization Action

Data Fusion uses Dataproc and Spark to execute pipelines. Spark is a distributed processing engine, so pipelines are executed across multiple compute workers on Dataproc. Therefore, we need to ensure that the KeyStore and TrustStore are present on each node of the Dataproc cluster. The recommended way to do this type of customization is at runtime via an initialization action. Below is an example script: it downloads the KeyStore and TrustStore, places them in a directory, and adjusts permissions for access. The action is executed across all cluster nodes.

#!/bin/bash
# Copyright 2015 Google, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -o errexit
set -o nounset
set -o xtrace

mkdir -p /tmp/sec
gsutil cp gs://[bucket name]/ssl/truststore /tmp/sec
gsutil cp gs://[bucket name]/ssl/keystore /tmp/sec
chmod 644 /tmp/sec/keystore
chmod 644 /tmp/sec/truststore

Keep this script in a Cloud Storage bucket so that your Dataproc cluster can access it at runtime.
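For example (the script name and scripts/ prefix here are just placeholders):

gsutil cp cloudsql-ssl-init.sh gs://[bucket name]/scripts/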

Configure the Pipeline

The CloudSQL plugins don’t specifically have an option for SSL/TLS, but we can use the standard MySQL plugin for this configuration.

Simple Pipeline

Below is my plugin configuration for enabling SSL/TLS connections.

Note that since these files are on the local Dataproc filesystem, we need to use a file URI: the local path must be prefixed with file:. Also note that because these credentials are stored locally on Dataproc, they are only accessible to deployed pipelines. That means that if you try to validate the plugin in Studio, you’ll get failures. Remember to use macros to defer validation to runtime; in my example, I added a macro for the IP address to override the configuration.
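For reference, the SSL-related connection arguments for the MySQL plugin end up looking something like the following. These are standard MySQL Connector/J properties; the paths match the initialization action above, and the passwords are whatever you chose when creating the KeyStore and TrustStore (welcome1 in the keytool example earlier):

useSSL=true
requireSSL=true
trustCertificateKeyStoreUrl=file:/tmp/sec/truststore
trustCertificateKeyStorePassword=[truststore password]
clientCertificateKeyStoreUrl=file:/tmp/sec/keystore
clientCertificateKeyStorePassword=welcome1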

Configure Compute Profile

With the pipeline deployed, I configured the compute profile to use the initialization action. The initialization action runs as the Dataproc cluster is provisioned and loads the KeyStore and TrustStore for our pipeline.

system.profile.properties.initActions
Configure Runtime Arguments to use the Initialization Action.
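As a sketch, the runtime argument looks something like this (the script path is wherever you stored the initialization action; the file name is illustrative):

system.profile.properties.initActions = gs://[bucket name]/scripts/cloudsql-ssl-init.sh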

The alternative approach would be to build a customized Dataproc image with the KeyStore and TrustStore baked into the image, OR to use a long-running cluster. I highly recommend the initialization action route because it gives you better security (most recent images), less operational overhead (ephemeral clusters), and the most flexibility with the least effort (just add new code to the action).

SUCCESS!

Concluding Thoughts

While this process works well for production pipelines, it can be painful to develop pipelines with this model. Not being able to preview your pipelines can make development slow and tedious. I highly recommend having a development CloudSQL instance that doesn’t need full end-to-end encryption, so you can design your pipelines in Studio and preview your pipeline design.

Happy Pipelining!


Justin Taras

I’m a Google Customer Engineer interested in all things data. I love helping customers leverage their data to build new and powerful data-driven applications!