Building Apache Livy 0.8.0 for Spark 3.x
TL;DR The 0.7.0 release of Livy was built for Spark 2.4.5 and Scala 2.11. Newer versions of Spark are built on Scala 2.12 and will not run with the current Livy release. To get support for Spark 3.x, Livy needs to be rebuilt from the master branch with a few changes so it can run Spark 3 workloads.
Dependencies
You will need the following dependencies set up to build Livy 0.8.0 (you can verify them with the quick check shown after this list):
- maven (I used 3.6.3)
- Java 8 (I used Oracle JDK)
- Python 3
- R 3.x
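If you want to confirm the toolchain before starting the build, a quick sanity check (assuming everything is already on your PATH) looks like this:
mvn -version
java -version
python3 --version
R --version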
Livy Build
One issue I ran into while trying to get this to work was a few missing libraries:
- jsr311-api-1.1.1.jar
- jersey-core-1.19.jar
- scala-compiler-2.12.14.jar
The build ran fine, but I received errors when trying to start up and run Livy. Most of these were class-not-found exceptions and were fixed once I placed the missing libraries on Livy’s classpath. I’ve added them as dependencies in the Livy server pom.xml so they will be in place when the build is run.
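For reference, the extra dependencies added to the Livy server pom.xml look roughly like this (these are the usual Maven coordinates for those artifacts; adjust the versions if your build pulls in different ones):
<dependency>
  <groupId>javax.ws.rs</groupId>
  <artifactId>jsr311-api</artifactId>
  <version>1.1.1</version>
</dependency>
<dependency>
  <groupId>com.sun.jersey</groupId>
  <artifactId>jersey-core</artifactId>
  <version>1.19</version>
</dependency>
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-compiler</artifactId>
  <version>2.12.14</version>
</dependency>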
The pom.xml files can be found here. The big change was aligning the Scala version used for the Spark build with the one used for the Livy build. There were also changes needed to the scala-maven-plugin versions.
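As a rough sketch of what that alignment means (the property names below are illustrative placeholders, not the exact contents of the Livy poms), the goal is for the build properties to agree on the Scala 2.12 line that Spark 3 is built against:
<properties>
  <!-- illustrative only: keep Livy on the same Scala line as Spark 3 -->
  <scala.binary.version>2.12</scala.binary.version>
  <scala.version>2.12.14</scala.version>
</properties>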
Before building, I set JAVA_HOME and added Maven and Java to my PATH.
export PATH=/home/user/apache-maven-3.6.3/bin:$PATH
export JAVA_HOME=/home/user/jdk1.8.0_321
export PATH=$PATH:/home/user/jdk1.8.0_321/bin
The command below runs the build with the Spark 3 profile enabled.
mvn clean package -B -V -e \
-Pspark-3.0 \
-DskipTests \
-DskipITs \
-Dmaven.javadoc.skip=true
The build should take a couple of minutes to complete.
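When the build finishes, the distributable zip ends up under the assembly module (assuming the default Livy build layout; the exact file name depends on the version string). That is the archive to copy to the server Livy will run on:
ls assembly/target/apache-livy-*-bin.zip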
Deploying Livy
Very little configuration is needed to get Livy started. Here is what I used for the livy.conf configuration file.
livy.spark.master = yarn
livy.spark.deploy-mode = client
livy.server.session.timeout=1h
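If you need Livy to listen on a different port, that can be set in the same file; the default is 8998, which is what the curl examples later in this post assume.
livy.server.port = 8998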
I added these to livy-env.sh to initialize the environment upon startup. These will differ depending on your environment setup.
export SPARK_HOME=/usr/lib/spark
export SPARK_CONF_DIR=/etc/spark/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export LIVY_LOG_DIR=/var/log/livy
export PYSPARK_PYTHON=/opt/conda/miniconda3/bin/python3
export PYSPARK_DRIVER_PYTHON=/opt/conda/miniconda3/bin/python3
Also make sure that the log4j.properties file is created so there’s adequate logging for troubleshooting.
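A minimal log4j.properties along these lines is enough to get readable console output (this is just a sketch based on the template Livy ships with; tune the levels and layout as needed):
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
log4j.logger.org.eclipse.jetty=WARN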
Starting Livy
Once the binaries have been deployed on the server you wish to run data fusion on, you can start Livy. Before doing this, make sure the Livy bin directory has been added to your PATH.
export PATH=$PATH:/home/user/0.8.0-incubating-SNAPSHOT/bin
livy-server start
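Assuming Livy is listening on the default port (8998), a quick way to confirm the server came up is to hit the REST API; an empty session list means it is ready:
curl localhost:8998/sessions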
Once the server has started you can create a session for testing.
curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions
While the session is starting, you can use the following command to list the sessions and check their state and logs.
curl localhost:8998/sessions/ | python -m json.tool
Now send a very simple statement to Livy to check that the session and Spark are working. This will compute the sum of 1 + 2 in the PySpark session.
curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"1 + 2"}'
Immediately after running this command, you should get a response that looks like the following. This means that the statement was submitted and is running. The progress of this job is 0%.
{"id":0,"code":"1 + 2","state":"running","output":null,"progress":0.0,"started":1647894533358,"completed":0}
You can run the following to check the status of the first statement run in session 0.
curl localhost:8998/sessions/0/statements/0
The result should look like this if the job is successful.
{"id":0,"code":"1 + 2","state":"available","output":{"status":"ok","execution_count":2,"data":{"text/plain":"3"}},"progress":1.0,"started":1647894533358,"completed":1647894533359}