Building Apache Livy 0.8.0 for Spark 3.x

Justin Taras
3 min read · Mar 21, 2022

TL;DR The 0.7.0 release of Livy was built for Spark 2.4.5 and Scala 2.11. Newer versions of Spark, built on Scala 2.12, will not run with the current Livy release. To get support for Spark 3.x, Livy needs to be rebuilt from the master branch with a few changes in order to run Spark 3 workloads.

Dependencies

You will need the following dependencies set up to build Livy 0.8.0:

  • maven (I used 3.6.3)
  • Java 8 (I used Oracle JDK)
  • Python 3
  • R 3.x
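Before starting, it's worth confirming all four are actually on your PATH. This pre-flight check is a sketch of my own (the binary names are assumptions for a typical Linux box — R in particular may be installed differently on your distribution):

```shell
# Pre-flight check: report which build tools are on the PATH.
found=0; missing=0; report=""
for tool in mvn java python3 R; do
  if command -v "$tool" >/dev/null 2>&1; then
    found=$((found + 1)); report="$report$tool: found; "
  else
    missing=$((missing + 1)); report="$report$tool: MISSING; "
  fi
done
echo "$report($found found, $missing missing)"
```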

Livy Build

One thing I found in trying to get this working was a couple of missing libraries:

  • jsr311-api-1.1.1.jar
  • jersey-core-1.19.jar
  • scala-compiler-2.12.14.jar

The build ran fine, but I received errors when trying to start up and run Livy. Most of these were ClassNotFoundException errors and were fixed once I relocated the libraries to Livy’s classpath. I’ve added them as dependencies in the livy-server pom.xml so they will be in place when the build is run.
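For reference, the Maven coordinates for those three jars are below. This is a sketch of what the added livy-server pom.xml entries look like; double-check the versions against what your build actually pulls in:

```xml
<!-- Added so these jars land on Livy's classpath at build time -->
<dependency>
  <groupId>javax.ws.rs</groupId>
  <artifactId>jsr311-api</artifactId>
  <version>1.1.1</version>
</dependency>
<dependency>
  <groupId>com.sun.jersey</groupId>
  <artifactId>jersey-core</artifactId>
  <version>1.19</version>
</dependency>
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-compiler</artifactId>
  <version>2.12.14</version>
</dependency>
```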

The pom.xml files can be found here. The big change was aligning the Scala version of the Spark build with the Livy build. There were also changes needed to the scala-maven-plugin versions.

Before building, I set JAVA_HOME and added Maven and Java to my PATH.

export PATH=/home/user/apache-maven-3.6.3/bin:$PATH
export JAVA_HOME=/home/user/jdk1.8.0_321
export PATH=$PATH:/home/user/jdk1.8.0_321/bin

The command below builds Livy with the Spark 3 profile.

mvn clean package -B -V -e \
-Pspark-3.0 \
-DskipTests \
-DskipITs \
-Dmaven.javadoc.skip=true

The build should take a couple of minutes to complete.

Deploying Livy

Very minimal configuration is needed to get Livy started. Here is what I used in the livy.conf configuration file.

livy.spark.master = yarn
livy.spark.deploy-mode = client
livy.server.session.timeout=1h
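If you need Livy to listen somewhere other than the default port, the same file takes livy.server.port. This line is optional and was not part of the config I used; 8998 is already the default:

```
livy.server.port = 8998
```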

I added these to livy-env.sh to initialize the environment upon startup. These will differ depending on your environment setup.

export SPARK_HOME=/usr/lib/spark
export SPARK_CONF_DIR=/etc/spark/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export LIVY_LOG_DIR=/var/log/livy
export PYSPARK_PYTHON=/opt/conda/miniconda3/bin/python3
export PYSPARK_DRIVER_PYTHON=/opt/conda/miniconda3/bin/python3

Also make sure that a log4j.properties file is created so there’s adequate logging for troubleshooting.
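If you don’t have one handy, a minimal console-logging sketch is below (Livy ships a log4j.properties.template in its conf directory you can copy instead; the pattern here is illustrative, not what Livy ships):

```properties
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
```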

Starting Livy

Once the binaries have been deployed on the server you wish to run Data Fusion on, you can start Livy. Before doing this, make sure Livy’s bin directory has been added to your PATH.

export PATH=$PATH:/home/user/0.8.0-incubating-SNAPSHOT/bin
livy-server start

Once the server has started you can create a session for testing.

curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions
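The create call returns a JSON description of the new session. Here is a small sketch of pulling the session id out of the response body; the $resp value below is an illustrative sample, not captured from a live server — in practice you would capture the body of the curl call above:

```shell
# Illustrative sample of a create-session response; in practice,
# capture the body of the curl call above into $resp.
resp='{"id":0,"kind":"pyspark","state":"starting"}'
# Parse out the id; it names the session in the /sessions/<id>/... calls.
session_id=$(printf '%s' "$resp" | python3 -c 'import json,sys; print(json.load(sys.stdin)["id"])')
echo "session id: $session_id"
```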

With the session starting, you can use the following command to check on the sessions.

curl localhost:8998/sessions/ | python -m json.tool
[Screenshot: successful Spark session in Livy]

Now send a super simple math operation to Livy to see if the session and Spark work. This will compute the sum of 1 + 2 using Spark.

curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"1 + 2"}'

Immediately after running this command, you should get a response that looks like the following. This means the job was submitted and is running; its progress is 0%.

{"id":0,"code":"1 + 2","state":"running","output":null,"progress":0.0,"started":1647894533358,"completed":0}

You can run the following to check the status of the first statement run in session 0.

curl localhost:8998/sessions/0/statements/0

The result should look like this if the job is successful.

{"id":0,"code":"1 + 2","state":"available","output":{"status":"ok","execution_count":2,"data":{"text/plain":"3"}},"progress":1.0,"started":1647894533358,"completed":1647894533359}
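Since the interesting part of that response is buried under output.data, here is a sketch of extracting just the computed value, using the exact response above as input:

```shell
# The statement response shown above, with the computed value pulled out.
resp='{"id":0,"code":"1 + 2","state":"available","output":{"status":"ok","execution_count":2,"data":{"text/plain":"3"}},"progress":1.0,"started":1647894533358,"completed":1647894533359}'
result=$(printf '%s' "$resp" | python3 -c 'import json,sys; print(json.load(sys.stdin)["output"]["data"]["text/plain"])')
echo "result: $result"   # prints: result: 3
```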

Justin Taras

I’m a Google Customer Engineer interested in all things data. I love helping customers leverage their data to build new and powerful data-driven applications!