Dataproc Serverless: Python Package Management through Conda
TL;DR: Use Conda to package up Python dependencies for your Dataproc Serverless jobs.
Python Package Management
When running a distributed application like PySpark, you must make sure your Python dependencies are installed on every node of your cluster. This ensures that the driver or executor running on a given machine has access to the dependencies needed to run the code. The primary supported way to do this with Dataproc Serverless is through a custom container image. With this model, you use a Dockerfile to build your dependencies into a container image that you reference when you deploy your Dataproc Serverless job.
While the custom image is the typically recommended approach, there are other avenues you can take to load your dependencies without building one. In this article we'll explore Conda package management as an alternative to custom container images.
What is Conda and how is it used?
Conda is an open-source, cross-platform package and environment manager that can install and manage packages from the Anaconda repository and other sources. It greatly simplifies the process of installing, running, and updating various software packages and their dependencies.
Conda creates isolated environments for managing the Python dependencies required for PySpark applications, and it allows packaging of these environments using conda-pack to ensure consistent dependencies across all Spark nodes when deploying on YARN or other cluster managers.
Initialize Conda Environment
Since Dataproc Serverless is based on Linux, you'll need a Linux environment for your environment builds.
To start the environment setup, create a directory for the Miniconda3 installer.
sudo mkdir -p /opt/miniconda3
sudo wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /opt/miniconda3/miniconda.sh
Next, run the installer to deploy Miniconda.
sudo bash /opt/miniconda3/miniconda.sh -b -u -p /opt/miniconda3
sudo rm -rf /opt/miniconda3/miniconda.sh
With Miniconda3 installed, you need to add the Conda path to your user's bash profile and source it so the bin directory is on your path.
echo 'PATH=$PATH:/opt/miniconda3/bin' >> ~/.profile
source ~/.profile
Next you need to initialize Conda by running the conda init command. Initialization makes some changes to your user's .bashrc file, and you'll need to source those changes before you can create your environment.
conda init
source ~/.bashrc
With this work complete, we can now create a Conda environment. It's important to align your Python version with what the Dataproc Serverless runtime supports. Below is a mapping of runtimes to Python versions.
- Runtime 1.1 = Python 3.10
- Runtime 2.0 = Python 3.10
- Runtime 2.1 = Python 3.11
- Runtime 2.2 = Python 3.12
Create Conda Environment
With Conda initialized, we can now create individual environments for different use cases that require different packages or Python versions. Below is an example of creating an environment with Python 3.11 using the conda-forge channel.
conda create -y -n [conda environment name] -c conda-forge python=3.11
Next, activate the environment and install the packages required for your PySpark code. To create an archive later, you must also include the conda-pack package.
conda activate [conda environment name]
conda install numba pandas pyarrow psycopg2 conda-pack
To archive the active environment, run the conda pack command.
conda pack -o [archive name].tar.gz
Lastly, copy the archive to a GCS bucket so you can reference it in your Dataproc Serverless jobs.
gsutil cp [archive name].tar.gz gs://[bucket]/
Add Archive to Serverless Dataproc Job
To use the archive in your job, you'll need to apply some additional settings that reference it.
gcloud dataproc batches submit pyspark gs://[bucket]/deps_test/numba_test.py \
--project [project id] \
--region [region] \
--batch [batch id] \
--version 2.1 \
--archives gs://[bucket]/[archive name].tar.gz#[conda environment name] \
--subnet default \
--properties spark.executorEnv.PYSPARK_PYTHON=./[conda environment name]/bin/python,spark.dataproc.driverEnv.PYSPARK_PYTHON=./[conda environment name]/bin/python
You'll need to add the --archives flag pointing to the packed archive. Attaching the # symbol and the environment name to the archive path tells Dataproc to unpack the archive into a directory with that name, which is the directory the PYSPARK_PYTHON properties reference.
Dataproc won't find the Python executable inside the unpacked archive on its own, so you need to point explicitly to the full unpacked Python path. Note that the top-level directory is your environment name.
- spark.executorEnv.PYSPARK_PYTHON points to the python bin for the executor. The path must be ./[conda environment name]/bin/python
- spark.dataproc.driverEnv.PYSPARK_PYTHON points to the python bin for the driver. The path must be ./[conda environment name]/bin/python
When you run the job, you'll see output confirming that PYSPARK_PYTHON points to the path you set in your properties.
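The test script referenced in the gcloud command above isn't shown in this article, but here is a minimal sketch of what such a script could look like. It simply confirms that both the driver and the executors are running the interpreter and packages from the packed environment; the file name numba_test.py is taken from the example above, while the contents are an illustrative assumption, not the exact script.
# numba_test.py (hypothetical sketch): verify the job runs on the packed Conda environment
import sys
import numba
import pandas as pd
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("conda-archive-test").getOrCreate()

    # The driver should report the interpreter from the unpacked archive.
    print(f"driver python: {sys.executable}")

    # Run a trivial task so each executor reports its interpreter and package versions too.
    def describe(_):
        yield f"executor python: {sys.executable}, numba {numba.__version__}, pandas {pd.__version__}"

    for line in spark.sparkContext.parallelize(range(4), 4).mapPartitions(describe).collect():
        print(line)

    spark.stop()
If the archive is wired up correctly, every reported path will sit under the directory named after your environment.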
Concluding Thoughts
Using Conda archives is a great way to package up dependencies for your PySpark jobs. It can be much more efficient than building custom images for your different workloads. With this approach you can always use the latest runtime images and don't have to constantly rebuild your own. You can easily build these environments as part of a CI/CD process and push the archives to a centralized bucket for many different jobs to utilize.
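As a rough illustration of that kind of automation, here is a sketch of a build script a CI pipeline might run. It wraps the same conda commands shown earlier and uploads the result with the google-cloud-storage client; the environment name, package list, and bucket name are placeholders you'd swap for your own.
# build_env.py (hypothetical CI sketch): build, pack, and upload a Conda environment
import subprocess
from google.cloud import storage  # assumes google-cloud-storage is installed on the build machine

ENV_NAME = "pyspark-deps"  # placeholder environment name
PACKAGES = ["numba", "pandas", "pyarrow", "psycopg2", "conda-pack"]
ARCHIVE = f"{ENV_NAME}.tar.gz"
BUCKET = "your-archive-bucket"  # placeholder bucket

def run(cmd):
    # Fail the pipeline immediately if any build step errors out.
    subprocess.run(cmd, check=True)

# Create the environment with the Python version that matches your target runtime.
run(["conda", "create", "-y", "-n", ENV_NAME, "-c", "conda-forge", "python=3.11"])

# Install the job's dependencies into that environment.
run(["conda", "install", "-y", "-n", ENV_NAME, "-c", "conda-forge"] + PACKAGES)

# Pack the environment into a relocatable archive.
run(["conda", "pack", "-n", ENV_NAME, "-o", ARCHIVE])

# Push the archive to the centralized bucket for jobs to reference.
storage.Client().bucket(BUCKET).blob(ARCHIVE).upload_from_filename(ARCHIVE)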
One thing to watch out for is building the archive on an incompatible platform. I recently spent some time working with a customer who was creating the archive on a Mac with an ARM processor. Dataproc runs on x86-64, not ARM, so the job crashed when they tried to run it on Dataproc. The good news is that you can easily rebuild these environments on another platform as long as you have a list of the packages needed.