Deploying a Spark job can be challenging, especially since no two Spark jobs are quite the same. Deployment times can vary significantly depending on a wide range of factors, but a few simple strategies can make the process noticeably more efficient.

Consider the following steps required to deploy a Spark job that is built in `scala` and compiled using `sbt`:

- Get latest project files (requires `git` or `subversion` etc.)
- Download dependencies (requires `sbt`)
- Compile Spark application (requires `sbt`)
- Run unit tests (requires `sbt`)
- Package Spark application into a .jar file (requires `sbt`)
- Upload the .jar file to an accessible location, e.g. an s3 bucket (requires `aws-cli`)
- Spark submit using the s3 location to the master node (requires `spark`)
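
Taken together, and leaving out any CI-specific wiring, those steps amount to something like the following sketch. The bucket, jar name, master URL, and main class below are hypothetical placeholders rather than values from this project:

```bash
#!/usr/bin/env bash
set -ex

# Get latest project files
git pull

# Download dependencies, compile, run unit tests, and package the fat .jar
sbt clean test assembly

# Upload the .jar to an accessible location (hypothetical bucket and key)
aws s3 cp target/scala-2.11/my-spark-job-assembly-1.0.jar \
  s3://my-artifacts-bucket/my-spark-job.jar

# Spark submit using the s3 location (hypothetical master URL and main class)
spark-submit \
  --master spark://spark-master:7077 \
  --class com.example.MySparkJob \
  s3://my-artifacts-bucket/my-spark-job.jar
```

Every tool in the list above (git, sbt, aws-cli, spark) has to be present wherever this script runs, which is exactly the problem on a minimal build machine.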

Usually, build machines used for a deployment pipeline are provisioned to be lightweight, with very few applications installed. In fact, if the only application installed on a build machine is docker, then that is enough.

Since our build machines don't have sbt installed, a docker image with `scala` and `sbt` is required to run `sbt assembly`. A docker image found on docker hub, like spikerlabs/scala-sbt, can achieve this. However, running `sbt assembly` using this docker image will take a significant amount of time to complete, sometimes as long as 30 minutes! This is because all the necessary dependencies for your Spark job need to be downloaded before your Spark application can be compiled.

Performing this operation every time you need to deploy your Spark job is costly. To avoid these download times and make builds more efficient, the dependencies first need to be downloaded onto a specific sbt docker image tailored to your Spark job, which can then be used as part of the deployment pipeline.
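
Concretely, "downloading the dependencies onto the image" just means letting sbt populate its usual cache directories inside a container before that container is turned into an image. A quick way to see what ends up baked in (assuming the container runs as root, so the caches live under /root) is:

```bash
# The Ivy cache and the sbt boot directory hold the downloaded dependencies
# and the sbt/scala runtime jars; these locations are sbt defaults, not
# project-specific settings.
du -sh /root/.ivy2 /root/.sbt
```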

The following steps need to be carried out to achieve this:

1. `cd` into the project folder.

2. Run the following docker command to start a `spikerlabs/scala-sbt` container in interactive mode:

   ```bash
   docker run -i \
     -v "/$(pwd)/":/app \
     -w "//app" \
     --name "my-scala-sbt-container" \
     "spikerlabs/scala-sbt:scala-2.11.8-sbt-0.13.15" bash
   ```

   Note that a specific `scala` and `sbt` version will always be required, otherwise the "latest" tag could fail during compilation with any breaking changes.
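
   To double-check what the pinned tag actually provides, the versions can be printed from inside the container (assuming `scala` and `sbt` are on the image's PATH, which the tag name suggests):

   ```bash
   scala -version   # should report Scala 2.11.8
   sbt sbtVersion   # should report sbt 0.13.15
   ```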

3. Once in interactive mode within the docker container, run the following commands:

   ```bash
   > sbt assembly   # to download dependencies, compile, and package the spark job
   > exit           # once the packaging is complete
   ```
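
   Because the project folder is volume-mounted into the container, the packaged .jar is also visible on the host. With default sbt-assembly settings (an assumption, since the build configuration isn't shown here), it lands under the Scala-versioned target directory:

   ```bash
   # Run on the host from the project folder; path follows the sbt-assembly default.
   ls target/scala-2.11/*-assembly-*.jar
   ```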

4. This will create a docker container called `my-scala-sbt-container`, which will need to be exported, then imported as an image, as follows:

   ```bash
   docker export --output="./my-scala-sbt-container.tar" "my-scala-sbt-container"
   docker import "./my-scala-sbt-container.tar" [INTERNAL_REPOSITORY_URL]/my-scala-sbt-image:[VERSION_NUMBER]
   ```

   Where `[INTERNAL_REPOSITORY_URL]` is a company-wide docker repository location (e.g. Nexus), and where `[VERSION_NUMBER]` needs to be bumped up from the previous version.

   Note that the import allows a container that has all the necessary dependencies to be converted into an image called `[INTERNAL_REPOSITORY_URL]/my-scala-sbt-image:[VERSION_NUMBER]`. This will be needed when `docker push` is executed. Unfortunately, the docker API for a `push` command requires the image name to be exactly the same as the docker repository URL, which seems a little non-intuitive.
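
   Before pushing, it can be worth confirming that the freshly imported image exists locally under the expected name; `docker image ls` accepts the repository name as a filter:

   ```bash
   # Lists only the imported image; the repository URL is a placeholder.
   docker image ls [INTERNAL_REPOSITORY_URL]/my-scala-sbt-image
   ```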

5. To publish the local docker image to the internal docker repository, run the following `docker push` command:

   ```bash
   docker image push [INTERNAL_REPOSITORY_URL]/my-scala-sbt-image:[VERSION_NUMBER]
   ```

   Where `[INTERNAL_REPOSITORY_URL]` is a company-wide docker repository location (e.g. Nexus), and where `[VERSION_NUMBER]` is the same as in the previous step.
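
   Depending on how the internal repository is configured, this push may first require authenticating against it; `docker login` with the repository host (a placeholder here) is the usual route:

   ```bash
   # Prompts for credentials for the internal registry; URL is a placeholder.
   docker login [INTERNAL_REPOSITORY_URL]
   ```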

This can now help in the CI deployment pipeline, where the step to run `sbt assembly` can be done with the `docker run` command as follows:

```bash
cat <<EOF > assembly.sh
#!/usr/bin/env bash
set -ex
sbt assembly
EOF

docker run \
  --rm \
  -v "/$(pwd)/..":/app \
  -v "/$(pwd)/assembly.sh":/assembly.sh \
  -w "//" \
  --entrypoint=sh \
  [INTERNAL_REPOSITORY_URL]/my-scala-sbt-image:[VERSION_NUMBER] \
  //assembly.sh
```

That should improve the build timings, shaving up to 30 minutes off the deployment pipeline, depending on how many dependencies are required for the Spark application.