Deploying a Spark job can be challenging, especially considering that no two Spark jobs are the same. Deployment times can vary significantly depending on a wide range of factors. Making the deployment efficient can greatly improve the process, and that can be achieved with a few simple strategies.
Consider the following steps required to deploy a Spark job that is built in Scala and compiled using sbt:
- Get the latest project files (requires git, subversion, etc.)
- Download dependencies (requires sbt)
- Compile the Spark application (requires sbt)
- Run unit tests (requires sbt)
- Package the Spark application into a .jar file (requires sbt)
- Upload the .jar file to an accessible location, e.g. an S3 bucket (requires aws-cli)
- Run spark-submit against the master node using the S3 location (requires spark)
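As a rough sketch, the steps above could be strung together in a single shell script. The jar path, bucket, main class, and master URL below are hypothetical placeholders rather than values from a real project, and the script defaults to a dry run that only prints each command. Cluster deploy mode is assumed here because the jar lives in S3.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical placeholders; replace them with values from your own project.
JAR_PATH="${JAR_PATH:-target/scala-2.11/my-spark-job-assembly.jar}"
S3_JAR_URL="${S3_JAR_URL:-s3://my-bucket/jars/my-spark-job-assembly.jar}"
MAIN_CLASS="${MAIN_CLASS:-com.example.MySparkJob}"
SPARK_MASTER="${SPARK_MASTER:-spark://master-node:7077}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print a command, and execute it unless this is a dry run.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

deploy() {
  run git pull        # get the latest project files
  run sbt update      # download dependencies
  run sbt compile     # compile the Spark application
  run sbt test        # run unit tests
  run sbt assembly    # package into a .jar (needs the sbt-assembly plugin)
  run aws s3 cp "$JAR_PATH" "$S3_JAR_URL"   # upload the .jar
  # submit the job, pointing Spark at the uploaded .jar
  run spark-submit --master "$SPARK_MASTER" --deploy-mode cluster \
    --class "$MAIN_CLASS" "$S3_JAR_URL"
}

deploy
```

Running it with DRY_RUN=0 would execute the commands for real, which requires every tool in the list to be installed on the machine. That requirement is exactly what the rest of this post works around.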
Usually build machines used for a deployment pipeline are provisioned so they are lightweight with very few applications installed.
In fact, if the only application installed on a build machine is docker, then that is enough.
Since our build machines don't have sbt installed, a docker image with scala and sbt is required to run sbt assembly. A docker image found on docker hub, such as spikerlabs/scala-sbt, can achieve this. However, running sbt assembly using this docker image can take a very long time to complete, sometimes as long as 30 minutes! This is because all the necessary dependencies for your Spark job need to be downloaded before your Spark application can be compiled.
Performing this operation every time you need to deploy your Spark job is costly. To avoid these download times on every build, the dependencies first need to be downloaded into an sbt docker image that is tailored to your Spark job. This image can then be used as part of the deployment pipeline.
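To get a feel for how much is downloaded, it can help to look at the size of sbt's caches after a build. The following is a minimal sketch, assuming sbt 0.13's default locations (the Ivy cache under ~/.ivy2 and sbt's own boot files under ~/.sbt). The function name cache_report is made up for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Report the size (in KB) of each sbt-related cache directory under a home
# directory. Assumes sbt 0.13 defaults: ~/.ivy2 (Ivy cache), ~/.sbt (boot files).
cache_report() {
  report_home="${1:-$HOME}"
  for report_dir in "$report_home/.ivy2" "$report_home/.sbt"; do
    if [ -d "$report_dir" ]; then
      # du -sk prints "<size-in-KB><TAB><path>"; keep only the size
      echo "present $report_dir $(du -sk "$report_dir" | cut -f1)K"
    else
      echo "missing $report_dir"
    fi
  done
}

cache_report "$HOME"
```

Run inside a fresh container, both directories would be expected to be missing or nearly empty. After a full sbt assembly they hold the downloaded dependencies, and it is this content that a tailored image carries along.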
To build such a tailored image, the following steps need to be carried out:
1. cd into the project folder.
2. Run the following docker command to start a spikerlabs/scala-sbt container in interactive mode.

       docker run -i -v "/$(pwd)/":/app -w "//app" --name "my-scala-sbt-container" "spikerlabs/scala-sbt:scala-2.11.8-sbt-0.13.15" bash

   Note that a specific scala and sbt version should always be pinned, otherwise the "latest" tag could pull in breaking changes that cause compilation to fail.
3. Once in interactive mode within the docker container, run the following commands within the container.

       > sbt assembly # to download dependencies, compile, and package the Spark job
       > exit # once the packaging is complete

4. This leaves a docker container called my-scala-sbt-container, which needs to be exported and then imported as an image, as follows:

       docker export --output="./my-scala-sbt-container.tar" "my-scala-sbt-container"
       docker import "./my-scala-sbt-container.tar" [INTERNAL_REPOSITORY_URL]/my-scala-sbt-image:[VERSION_NUMBER]

   Where [INTERNAL_REPOSITORY_URL] is a company-wide docker repository location (e.g. Nexus), and where [VERSION_NUMBER] needs to be bumped up from the previous version.
   Note that the import converts a container that has all the necessary dependencies into an image called [INTERNAL_REPOSITORY_URL]/my-scala-sbt-image:[VERSION_NUMBER]. This name will be needed when docker push is executed. Unfortunately, docker push requires the image name to begin with the docker repository URL, which seems a little non-intuitive.
5. To publish the local docker image to the internal docker repository, run the following docker push command:

       docker image push [INTERNAL_REPOSITORY_URL]/my-scala-sbt-image:[VERSION_NUMBER]

   Where [INTERNAL_REPOSITORY_URL] is a company-wide docker repository location (e.g. Nexus), and where [VERSION_NUMBER] is the same as in the previous step.

This image can now be used in the CI deployment pipeline, where the step to run sbt assembly can be done with the docker run command as follows:

    cat <<EOF > assembly.sh
    #!/usr/bin/env bash
    set -ex
    sbt assembly
    EOF
    docker run \
      --rm \
      -v "/$(pwd)/..":/app \
      -v "/$(pwd)/assembly.sh":/assembly.sh \
      -w "//app" \
      --entrypoint=sh \
      [INTERNAL_REPOSITORY_URL]/my-scala-sbt-image:[VERSION_NUMBER] \
      //assembly.sh

That should improve the build timings by removing up to 30 minutes from the deployment pipeline, depending on how many dependencies are required for the Spark application.
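Since the image name has to be built the same way every time a new version is published, the export, import, and push steps lend themselves to a small helper script. The following is only a sketch: the registry host nexus.example.com:5000, the function names, and the MAJOR.MINOR.PATCH version scheme are assumptions for illustration, and by default it prints the docker commands rather than running them.

```shell
#!/usr/bin/env bash
set -eu

REGISTRY="${REGISTRY:-nexus.example.com:5000}"   # hypothetical internal registry
IMAGE_NAME="my-scala-sbt-image"
CONTAINER="my-scala-sbt-container"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Increment the PATCH part of a MAJOR.MINOR.PATCH version string.
bump_patch() {
  version_prefix="${1%.*}"    # everything before the last dot
  version_patch="${1##*.}"    # everything after the last dot
  echo "${version_prefix}.$((version_patch + 1))"
}

# Build the full image reference, prefixed with the registry so it can be pushed.
image_ref() {
  echo "${REGISTRY}/${IMAGE_NAME}:$1"
}

# Print a command, and execute it unless this is a dry run.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Export the prepared container, import it as the next image version, push it.
publish_image() {
  new_ref="$(image_ref "$(bump_patch "$1")")"
  run docker export --output="./${CONTAINER}.tar" "$CONTAINER"
  run docker import "./${CONTAINER}.tar" "$new_ref"
  run docker image push "$new_ref"
}

publish_image "1.0.3"   # previous version; the published image is tagged 1.0.4
```

Keeping the version bump in one place makes it harder to overwrite an image that the pipeline is already using. The same image reference can then be reused in the docker run step of the CI pipeline.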