At svds, we’ll often run spark on yarn in production. Add some artful tuning and this works pretty well. However, developers typically build and test spark application in standalone mode… not on yarn.
Rather than get bitten by the ideosyncracies involved in running spark on yarn -vs- standalone when you go to deploy, here’s a way to set up a development environment for spark that more closely mimics how it’s used in the wild.
A simple yarn “cluster” on your laptop
Run a docker image for a cdh standalone instance
docker run -d --name=mycdh svds/cdh
when the logs
docker logs -f mycdh
stop going wild, you can run the usual hadoop-isms to set up a workspace
docker exec -it mycdh hadoop fs -ls /
docker exec -it mycdh hadoop fs -mkdir -p /tmp/blah
Then, it’s pretty straightforward to run spark against yarn
docker exec -it mycdh \
--master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
Note that you can submit a spark job to run in either “yarn-client” or “yarn-cluster” modes.
In “yarn-client” mode, the spark driver runs outside of yarn and logs to console and all spark executors run as yarn containers.
In “yarn-cluster” mode, all spark executors run as yarn containers, but then the spark driver also runs as a yarn container. Yarn manages all the logs.
You can also run the spark shell so that any workers spawned run in yarn
docker exec -it mycdh spark-shell --master yarn-client
docker exec -it mycdh pyspark --master yarn-client
SparkPi is all fine and dandy, but how do I run a real application?
Let’s make up an example. Say you build your spark project on your laptop in the
When you build with maven or sbt, it typically builds and leaves jars under a
/Users/myname/mysparkproject/target/ directory… for sbt, it’ll look like
The idea here is to make these jars directly accessible from both your laptop’s build process as well as from inside the cdh container.
When you start up the
cdh container, map this local host directory up and
into the container
docker run -d -v ~/mysparkproject/target:/target --name=mycdh svds/cdh
-v option will make
/target within the container.
sbt clean assembly
leaves a jar under
~/mysparkproject/target, which the container sees as
/target and you can run jobs using something like
docker exec -it mycdh \
--master yarn-cluster \
--name MyFancySparkJob-name \
--class org.markmims.MyFancySparkJob \
--name arg makes it easier to find in the midst of multiple yarn jobs.
While a spark job is running, you can get its yarn “applicationId” from
docker exec -it mycdh yarn application -list
or if it finished already just list things out with more conditions
docker exec -it mycdh yarn application -list -appStates FINISHED
You can dig through the yarn-consolidated logs after the job is done by using
docker exec -it mycdh yarn logs -applicationId <applicationId>
Web consoles are critical for application development. Spend time up front getting ports open or forwarded correctly for all environments. Don’t wait until you’re actually trying to debug something critical to figure out how to forward ports to see the staging UI in all environments.
Yarn ResourceManager UI
Yarn gives you quite a bit of info about the system right from the ResourceManager on its ip address and webgui port (usually 8088)
Spark Staging UI
Yarn also conveniently proxies access to the spark staging UI for a given application. This looks like
Ports and Docker
There are a few ways to deal with accessing port
8088 of the yarn resource
manager from outside of the docker container. I typically use ssh for
everything and just forward ports out to
localhost on the host. However,
most people will expect to access ports directly on the
address. To do that, you have to map each port when you first spin up the
cdh container using the
-p 8088 option
docker run -d -v target -p 8088 --name=mycdh svds/cdh
Then you should be good to go with something like
open http://`docker-machine ip`:8088/
to access the yarn console.
Tips and Gotchas
The docker image
svds/cdhis quite large (2GB). I like to do a separate
docker pullfrom any
docker runcommands just to isolate the download. In fact, I recommend pinning the cdh version for the same reason… so
docker pull svds/cdh:5.4.0for instance, then refer to it that way throughout
docker run -d --name=mycdh svds/cdh:5.4.0and that’ll insure you’re not littering your laptop’s filesystem with docker layers from multiple cdh versions. The bare
svds/cdh:latest) floats with the most recent cloudera versions
I’m using a CDH container here… but there’s an HDP one on the way as well. Keep an eye out for it on svds’s dockerhub page
web consoles and forwarding ports through SSH
Ok, so the downside here is that the image is fat. The upside is that it lets you play with the full suite of CDH-based tools. I’ve tested out (besides the spark variations above)
docker exec mycdh impala-shell
docker exec mycdh hbase shell
echo "show tables;" | docker exec mycdh beeline -u jdbc:hive2://localhost:10000 -n username -p password -d org.apache.hive.jdbc.HiveDriver