Thursday, January 10, 2019

Compiling Spark Jobs using SBT on Build Machines

Deploying a Spark job can be challenging, since no two Spark jobs are quite the same and deployment times vary widely from one job to the next. A few simple strategies can make the deployment considerably more efficient.

Consider the following steps required to deploy a Spark job that is written in Scala and compiled using sbt:
  1. Get latest project files (requires git or subversion etc.)
  2. Download dependencies (requires sbt)
  3. Compile Spark application (requires sbt)
  4. Run unit tests (requires sbt)
  5. Package Spark application to a .jar file (requires sbt)
  6. Upload the .jar file to an accessible location (e.g. an S3 bucket) (requires aws-cli)
  7. Run spark-submit against the master node using the S3 location (requires Spark)
This can be achieved using a deployment pipeline created in a CI tool like TeamCity or Jenkins.
Build machines used for a deployment pipeline are usually provisioned to be lightweight, with very few applications installed.
In fact, if the only application installed on a build machine is Docker, then that is enough.
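
The seven steps above can be sketched as a single shell script. This is only an illustration, assuming every tool is available wherever the script runs; the bucket, jar path, main class, and master URL are hypothetical placeholders, and it assumes the cluster can read jars from the S3 location. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch of the seven deployment steps. All values below are hypothetical.
set -eu

BUCKET="${BUCKET:-s3://my-example-bucket/jobs}"                        # hypothetical S3 location
JAR="${JAR:-target/scala-2.11/my-spark-job-assembly-1.0.jar}"          # hypothetical sbt assembly output
MAIN_CLASS="${MAIN_CLASS:-com.example.MySparkJob}"                     # hypothetical entry point
SPARK_MASTER="${SPARK_MASTER:-spark://spark-master.example.com:7077}"  # hypothetical master URL

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

deploy() {
  run git pull                     # 1. get latest project files
  run sbt update                   # 2. download dependencies
  run sbt compile                  # 3. compile the Spark application
  run sbt test                     # 4. run unit tests
  run sbt assembly                 # 5. package the application into a .jar
  run aws s3 cp "$JAR" "$BUCKET/"  # 6. upload the .jar
  # 7. submit the job, pointing Spark at the uploaded .jar
  run spark-submit --master "$SPARK_MASTER" --deploy-mode cluster \
    --class "$MAIN_CLASS" "$BUCKET/$(basename "$JAR")"
}

deploy
```

Steps 2 to 5 all depend on sbt, which is the part a lightweight build machine has to run inside a container.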

Since our build machines don't have sbt installed, a Docker image with Scala and sbt is required to run sbt assembly. An image from Docker Hub such as spikerlabs/scala-sbt can do this. However, running sbt assembly with this image takes a long time to complete, sometimes as long as 30 minutes! This is because all the dependencies of your Spark job must be downloaded before the application can be compiled.

Performing this operation every time you need to deploy your Spark job is costly. To avoid these download times, the dependencies can be downloaded once into an sbt Docker image that is tailored to your Spark job. That image can then be used as part of the deployment pipeline.

The following steps will need to be carried out for this to be achieved:

1. cd into the project folder.

2. Run the following docker command to start a spikerlabs/scala-sbt container in interactive mode.
docker run -i -v "/$(pwd)/":/app -w "//app" --name "my-scala-sbt-container" "spikerlabs/scala-sbt:scala-2.11.8-sbt-0.13.15" bash
Note that a specific Scala and sbt version should always be pinned in the tag; otherwise the "latest" tag could pull in breaking changes that cause the compilation to fail.

3. Once inside the container, run the following commands.
> sbt assembly # to download dependencies, compile, and package the spark job
> exit # once the packaging is complete
4. This leaves behind a stopped container called my-scala-sbt-container, which now holds the downloaded dependencies. It needs to be exported, then imported as an image, as follows:
docker export --output="./my-scala-sbt-container.tar" "my-scala-sbt-container"
docker import "./my-scala-sbt-container.tar" [INTERNAL_REPOSITORY_URL]/my-scala-sbt-image:[VERSION_NUMBER]
Where [INTERNAL_REPOSITORY_URL] is a company-wide Docker repository location (e.g. Nexus),
and where [VERSION_NUMBER] needs to be bumped up from the previous version.

Note that the import converts the container, which has all the necessary dependencies, into an image called [INTERNAL_REPOSITORY_URL]/my-scala-sbt-image:[VERSION_NUMBER].

The full name is needed when docker push is executed. Docker works out the target repository from the image name itself, so the name must begin with the repository URL, which can seem a little non-intuitive.

5. To publish the local Docker image to the internal Docker repository, run the following docker push command:
docker image push [INTERNAL_REPOSITORY_URL]/my-scala-sbt-image:[VERSION_NUMBER]
Where [INTERNAL_REPOSITORY_URL] is a company-wide Docker repository location (e.g. Nexus),
and where [VERSION_NUMBER] is the same as in the previous step.
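
The naming rules in steps 2, 4 and 5 can be captured in a few small shell helpers. This is a minimal sketch rather than part of the original pipeline: the function names, the registry host nexus.example.com:8082, and the MAJOR.MINOR.PATCH version scheme are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Succeeds only when the image reference carries an explicit tag other than "latest".
is_pinned() {
  local last="${1##*/}"   # final path component, so a registry port such as host:5000 is ignored
  case "$last" in
    *:latest) return 1 ;;
    *:?*)     return 0 ;;
    *)        return 1 ;;
  esac
}

# Builds the full name used by both docker import and docker push.
image_ref() {
  local registry="$1" name="$2" version="$3"
  echo "${registry}/${name}:${version}"
}

# Bumps the last component of a MAJOR.MINOR.PATCH version (assumed scheme).
bump_patch() {
  local major minor patch
  IFS=. read -r major minor patch <<< "$1"
  echo "${major}.${minor}.$((patch + 1))"
}

is_pinned "spikerlabs/scala-sbt:scala-2.11.8-sbt-0.13.15" && echo "base image tag is pinned"

NEXT_VERSION="$(bump_patch "1.0.3")"   # hypothetical previous version
image_ref "nexus.example.com:8082" "my-scala-sbt-image" "$NEXT_VERSION"
# prints nexus.example.com:8082/my-scala-sbt-image:1.0.4
```

Building the name in one place keeps the import and push commands in agreement, since both must use exactly the same string.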

The published image can now be used in the CI deployment pipeline, where the step that runs sbt assembly is executed with docker run as follows:
cat <<EOF >
#!/usr/bin/env bash
set -ex
sbt assembly
EOF
docker run \
  --rm \
  -v "/$(pwd)/..":/app \
  -v "/$(pwd)/":/ \
  -w "//" \
  --entrypoint=sh \
  [INTERNAL_REPOSITORY_URL]/my-scala-sbt-image:[VERSION_NUMBER] \
  //
That should cut up to 30 minutes from the deployment pipeline, depending on how many dependencies the Spark application requires.
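
As a rough end-to-end sketch of such a pipeline step, the script below writes the build script and then runs it inside the pre-built image. It assumes a Linux build agent, so it uses plain paths without the extra leading slashes. The script name run-assembly.sh, the /app mount point, and the image reference are hypothetical choices, not the ones from the original pipeline. The docker command is only printed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
set -eu

IMAGE="${IMAGE:-nexus.example.com:8082/my-scala-sbt-image:1.0.4}"  # hypothetical image reference
PROJECT_DIR="${PROJECT_DIR:-$(pwd)}"                               # project checkout on the build agent
SCRIPT_NAME="run-assembly.sh"                                      # hypothetical script name

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

ci_assembly_step() {
  # Write the script that the container will execute.
  cat > "$PROJECT_DIR/$SCRIPT_NAME" <<'EOF'
#!/usr/bin/env bash
set -ex
sbt assembly
EOF

  # Mount the project at /app and run the script with sh as the entrypoint.
  run docker run \
    --rm \
    -v "$PROJECT_DIR:/app" \
    -w /app \
    --entrypoint=sh \
    "$IMAGE" \
    "/app/$SCRIPT_NAME"
}

ci_assembly_step
```

Because the project is bind-mounted, the assembled .jar lands in the project's target folder on the build agent, ready for the upload step.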