Cloud Data Fusion: Connecting to CloudSQL via SSL/TLS
TL;DR: How to configure your Data Fusion pipelines for SSL/TLS connectivity between Data Fusion and CloudSQL MySQL.
Reasoning
I was recently working with a customer that needed SSL/TLS connectivity for all clients connecting to their CloudSQL MySQL environment. They wanted to use Data Fusion to move data incrementally to BigQuery. Sounded easy enough; this is a very common pattern we see with Data Fusion. In this scenario, I typically advise the customer to use the CloudSQL Auth Proxy to provide that secure connection, but they required fully end-to-end encrypted connections. If you know how the CloudSQL Auth Proxy works, the connection is encrypted between the Auth Proxy client and the CloudSQL service; however, the connection between the local MySQL client and the Auth Proxy client is not. In most situations this is fine because the CloudSQL Auth Proxy runs in an isolated environment, but that was not the case for this customer due to regulatory/security requirements. This required the customer to configure their pipeline to use client-side SSL.
CloudSQL Configuration
Once a CloudSQL instance has been deployed and is running, click on the Connections tab, followed by “Security”, to view the settings for enabling SSL.
Changing the setting to allow only SSL connections will require a restart. Once the instance has restarted, you can generate a new client certificate. This will be used to secure the connection between Data Fusion (and Dataproc, when pipelines are deployed) and CloudSQL MySQL.
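The same settings can also be applied from the command line with gcloud. This is a sketch; the instance name (my-instance), certificate name (datafusion-client), and key filename are placeholders you would replace with your own:

```shell
# Require SSL for all connections (this triggers a restart of the instance).
gcloud sql instances patch my-instance --require-ssl

# Generate a new client certificate. The private key is written to the local
# file given as the second argument and cannot be downloaded again later.
gcloud sql ssl client-certs create datafusion-client client-key.pem \
    --instance=my-instance
```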
The output will include the server certificate, the client certificate, and a client private key for authentication. Be sure to download all three components before you close the window, and store them on your local machine so they can be used later.
Create KeyStore and TrustStore
Take the client credentials downloaded in the prior step and move them to Cloud Shell to create the required KeyStore and TrustStore. In truth, they can be moved to any platform that has internet connectivity and the Google Cloud SDK installed.
keytool -import -file server-ca.pem -alias MySQLCACert -keystore truststore
openssl pkcs12 -export -in client-cert.pem -inkey client-key.pem -name "mysqlclient" -out client-keystore.p12
keytool -importkeystore -srckeystore client-keystore.p12 -srcstoretype pkcs12 -srcstorepass welcome1 -destkeystore keystore -deststoretype jks -deststorepass welcome1
The first command imports the CloudSQL server certificate into your TrustStore. The second command converts your client key and certificate files into a PKCS #12 archive. The final command imports that PKCS #12 archive into your local KeyStore. For more information on this process, visit the MySQL page on configuring SSL connectivity.
Once these have been created, upload them to a Google Cloud Storage Bucket. We’ll need them stored there so we can source them when provisioning Dataproc.
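The upload might look like the following sketch; `[bucket name]` is a placeholder for your own bucket, and the `ssl/` prefix simply matches the paths used by the initialization action later in this post:

```shell
# Copy the KeyStore and TrustStore to a bucket that the Dataproc
# service account can read at cluster-provisioning time.
gsutil cp truststore keystore gs://[bucket name]/ssl/
```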
Dataproc Initialization Action
Data Fusion uses Dataproc and Spark to execute pipelines. Spark is a distributed engine, so pipelines are executed across multiple compute workers on Dataproc. Therefore, we need to ensure that the KeyStore and TrustStore are present on every node of the Dataproc cluster. For this type of customization, the recommended approach is to do it at runtime via an initialization action. Below is an example script: it downloads the KeyStore and TrustStore, places them in a directory, and adjusts permissions for access. This action will be executed across all cluster nodes.
#!/bin/bash
# Copyright 2015 Google, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -o errexit
set -o nounset
set -o xtrace
mkdir -p /tmp/sec
gsutil cp gs://[bucket name]/ssl/truststore /tmp/sec
gsutil cp gs://[bucket name]/ssl/keystore /tmp/sec
chmod 644 /tmp/sec/keystore
chmod 644 /tmp/sec/truststore
Keep this script in a Cloud Storage bucket so that your Dataproc cluster can access it at runtime.
Configure the Pipeline
The CloudSQL plugins don’t have a specific option for SSL/TLS, but we can use the standard MySQL plugin for this configuration.
Below is my plugin configuration for enabling SSL/TLS connections.
Note that since these files are on the local Dataproc filesystem, we need to use a file URI: the local path must be prefixed with file:. Also, because these credentials are stored locally on Dataproc, they are only accessible to deployed pipelines. That means that if you try to validate the plugin in Studio, you’ll get failures, so remember to use macros to override the runtime validation. In my example, I added a macro for the IP address to override the configuration.
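As a sketch, the plugin’s connection arguments could carry the SSL settings via MySQL Connector/J properties. The property names below are Connector/J’s client/trust-store options, the /tmp/sec paths come from the initialization action, and the welcome1 passwords assume the same values used when the stores were created earlier; adjust all of these to your own setup:

```properties
useSSL=true&requireSSL=true&verifyServerCertificate=true&clientCertificateKeyStoreUrl=file:/tmp/sec/keystore&clientCertificateKeyStorePassword=welcome1&trustCertificateKeyStoreUrl=file:/tmp/sec/truststore&trustCertificateKeyStorePassword=welcome1
```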
Configure Compute Profile
With the pipeline deployed, I configured the compute profile to use the initialization action. The initialization action runs once Dataproc is up and loads the KeyStore and TrustStore for our pipeline.
system.profile.properties.initActions
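For example, set the property to point at the script uploaded earlier. The script filename and path here are illustrative; `[bucket name]` is a placeholder for your own bucket:

```properties
system.profile.properties.initActions=gs://[bucket name]/scripts/cloudsql-ssl-init.sh
```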
The alternative approaches would be to have a customized Dataproc image with the KeyStore and TrustStore baked in, or to have a long-running cluster. I highly recommend the initialization-action route because it gives you better security (most recent images), less operational support (ephemeral clusters), and the most flexibility with the least effort (just add new code to the action).
Concluding Thoughts
While this process works well for production pipelines, it can be kinda painful to develop pipelines with this model. Not being able to preview your pipelines makes development slow and tedious. I highly recommend having a development CloudSQL instance that doesn’t need full end-to-end encryption, so you can develop your pipelines in Studio and preview your pipeline design.
Happy Pipelining!