Scaling a 2000-node Hadoop cluster on EC2/Ubuntu with Juju

Written by Mark Mims and James Page

Lately we’ve been fleshing out our testing frameworks for Juju and Juju Charms. There’s lots of great stuff going on here, so we figured it’s time to start posting about it.

First off, the coolest thing we did during last month’s Ubuntu Developer Summit (UDS) was get the go-ahead to spend more time/effort/money scale-testing Juju.

The Plan

pick a service that scales
spin up a cluster of units for this service
try to run it in a way that actively engages all units of the cluster
repeat:
- instrument
- profile
- optimize
- grow

James, Kapil, Juan, Ben, and Mark sat down over the course of a couple of nights at UDS to take a crack at it. We chose Hadoop. We started with 40 nodes and iterated up 100, 500, 1000 and 2000. Here’re some notes on the process.

Hadoop

Hadoop was a pretty obvious choice here. It’s a great actively-maintained project with a large community of users. It scales in a somewhat known manner, and the hadoop charm makes it super-simple to manage. There are also several known benchmarks that are pretty straightforward to get going, and distribute load throughout the cluster.

There’s an entire science/art to tuning hadoop jobs to run optimally given the characteristics of a particular cluster. Our sole goal in tuning hadoop benchmarks was to engage the entire cluster and profile juju during various activities throughout an actual run. For our purposes, we’re in no hurry… a slower/longer run gives us a good profiling picture for managing the nodes themselves under load (with a sufficient mix of i/o -vs- cpu load).

EC2

Surprisingly enough, we don’t really have that many servers just lying around… so EC2 to the rescue.

Disclaimer… we’re testing our infrastructure tools here, not benchmarking hadoop in EC2. Some folks advocate running hadoop in a cloudy virtualized environment… while some folks are die-hard server huggers. That’s actually a really interesting discussion. It comes down to the actual jobs/problems you’re trying to solve and how those jobs fit in your data pipeline. Please note that we’re not trying to solve that problem here or even provide realistic benchmarking data to contribute to the discussion… we’re simply testing how our infrastructure tools perform at scale.

If you do run hadoop in EC2, Amazon’s Elastic Map Reduce service is likely to perform better at scale in EC2 than just running hadoop itself on general purpose instances. Amazon can do all sorts of stuff internally to show hadoop lots of love. We chose not to use EMR because we’re interested in testing how juju performs with generic Ubuntu Server images, not EMR… at least for now.

Note that stock EC2 accounts limit you to something like 20 instances. To grow beyond that, you have to ask AWS to bump up your limits.

Juju

We started scale testing from a fresh branch of juju trunk… what gets deployed to the PPA nightly… this freed us up to experiment with live changes to add instrumentation, profiling information, and randomly mess with code as necessary. This also locks in the branch of juju that the scale testing environment uses.

As usual, juju will keep track of the state of our infrastructure going forward and we can make changes as necessary via juju commands. To bootstrap and spin up the initial environment we’ll just use shell scripts wrapping juju commands.

Spinning up a cluster

These scripts are really just hadoop versions of some standard juju demo scripts such as those used for a simple rails stack or a more realistic HA wiki stack.

The hadoop scripts for EC2 will get a little more complex as we grow simply because we don’t want AWS to think we’re a DoS attack… we’ll pace ourselves during spinup.

From the hadoop charm’s readme, the basic steps to spinning up a simple combined hdfs and mapreduce cluster are:

juju bootstrap

juju deploy hadoop hadoop-master
juju deploy -n3 hadoop hadoop-slavecluster

juju add-relation hadoop-master:namenode hadoop-slavecluster:datanode
juju add-relation hadoop-master:jobtracker hadoop-slavecluster:tasktracker

which we expand on a bit to start with a base startup script that looks like:

#!/bin/bash

juju_root="/home/ubuntu/scale"
juju_env=${1:-"-escale"}

###

echo "deploying stack"

juju bootstrap $juju_env

deploy_cluster() {
  local cluster_name=$1

  juju deploy $juju_env --repository "$juju_root/charms" --constraints="instance-type=m1.large" --config "$juju_root/etc/hadoop-master.yaml" local:hadoop ${cluster_name}-master

  juju deploy $juju_env --repository "$juju_root/charms" --constraints="instance-type=m1.medium" --config "$juju_root/etc/hadoop-slave.yaml" -n 37 local:hadoop ${cluster_name}-slave

  juju add-relation $juju_env ${cluster_name}-master:namenode ${cluster_name}-slave:datanode
  juju add-relation $juju_env ${cluster_name}-master:jobtracker ${cluster_name}-slave:tasktracker

  juju expose $juju_env ${cluster_name}-master

}

deploy_cluster hadoop

echo "done"

and then manually adjust this for cluster size.

Configuring Hadoop

Note that we’re specifying constraints to tell juju to use different sized ec2 instances for different juju services. We’d like an m1.large for the hadoop master

juju deploy ... --constraints "instance-type=m1.large" ... hadoop-master

and m1.mediums for the slaves

juju deploy ... --constraints "instance-type=m1.medium" ... hadoop-slave

Note that we’ll also pass config files to specify different heap sizes for the different memory footprints

juju deploy ... --config "hadoop-master.yaml" ... hadoop-master

where hadoop-master.yaml looks like

# m1.large
hadoop-master:
  heap: 2048
  dfs.block.size: 134217728
  dfs.namenode.handler.count: 20
  mapred.reduce.parallel.copies: 50
  mapred.child.java.opts: -Xmx512m
  mapred.job.tracker.handler.count: 60
#  fs.inmemory.size.mb: 200
  io.sort.factor: 100
  io.sort.mb: 200
  io.file.buffer.size: 131072
  tasktracker.http.threads: 50
  hadoop.dir.base: /mnt/hadoop

and

juju deploy ... --config "hadoop-slave.yaml" ... hadoop-slave

where hadoop-slave.yaml looks like

# m1.medium
hadoop-slave:
  heap: 1024
  dfs.block.size: 134217728
  dfs.namenode.handler.count: 20
  mapred.reduce.parallel.copies: 50
  mapred.child.java.opts: -Xmx512m
  mapred.job.tracker.handler.count: 60
#  fs.inmemory.size.mb: 200
  io.sort.factor: 100
  io.sort.mb: 200
  io.file.buffer.size: 131072
  tasktracker.http.threads: 50
  hadoop.dir.base: /mnt/hadoop

Note also that we also have our juju environment configured to use instance-store images… juju defaults to ebs-rooted images, but that’s not a great idea with hdfs. You specify this by adding a default-image-id into your ~/.juju/environments.yaml file. This gave each of our instances an extra ~400G local drive on /mnt… hence the hadoop.dir.base of /mnt/hadoop in the config above.

40 nodes and 100 nodes

Both the 40-node and 100-node runs went as smooth as silk. The only thing to note was that it took a while to get AWS to increase our account limits to allow for 100+ nodes.

500 nodes

Once we had permission from Amazon to spin up 500 nodes on our account, we initially just naively spun up 500 instances… and quickly got throttled.

No particular surprise, we’re not specifying multiplicity in the ec2 api, nor are we using an auto scaling group… we must look like a DoS attack.

The order was eventually fulfilled, and juju waited around for it. Everything ran as expected, it just took about an hour and 15 minutes to spin up the stack. This gave us a nice little cluster with HDFS storage of almost 200TB

The hadoop terasort job was run from the following script

#!/bin/bash

SIZE=10000000000
NUM_MAPS=1500
NUM_REDUCES=1500
IN_DIR=in_dir
OUT_DIR=out_dir

hadoop jar /usr/lib/hadoop/hadoop-examples*.jar teragen -Dmapred.map.tasks=${NUM_MAPS} ${SIZE} ${IN_DIR}

sleep 10

hadoop jar /usr/lib/hadoop/hadoop-examples*.jar terasort -Dmapred.reduce.tasks=${NUM_REDUCES} ${IN_DIR} ${OUT_DIR}

which, with a replfactor of 3, engaged the entire cluster just fine, and ran terasort with no problems

Juju itself seemed to work great in this run, but this brought up a couple of basic optimizations against the EC2 api:

- pass the '-n' options directly to the provisioning agent... don't expand `juju deploy -n <num_units>` and `juju add-unit -n <num_units>` in the client
- pass these along all the way to the ec2 api... don't expand these into multiple api calls

We’ll add those to the list of things to do.

1000 nodes

Onward, upward!

To get around the api throttling, we start up batches of 99 slaves at a time with a 2-minute wait between each batch

#!/bin/bash

juju_env=${1:-"-escale"}
juju_root="/home/ubuntu/scale"
juju_repo="$juju_root/charms"

############################################

timestamp() {
  date +"%G-%m-%d-%H%M%S"
}

add_more_units() {
  local num_units=$1
  local service_name=$2

  echo "sleeping"
  sleep 120

  echo "adding another $num_units units at $(timestamp)"
  juju add-unit $juju_env -n $num_units $service_name
}

deploy_slaves() {
  local cluster_name=$1
  local slave_config="$juju_root/etc/hadoop-slave.yaml"
  local slave_size="instance-type=m1.medium"
  local slaves_at_a_time=99
  #local num_slave_batches=10

  juju deploy $juju_env --repository $juju_repo --constraints $slave_size --config $slave_config -n $slaves_at_a_time local:hadoop ${cluster_name}-slave
  echo "deployed $slaves_at_a_time slaves"

  juju add-relation $juju_env ${cluster_name}-master:namenode ${cluster_name}-slave:datanode
  juju add-relation $juju_env ${cluster_name}-master:jobtracker ${cluster_name}-slave:tasktracker

  for i in {1..9}; do
    add_more_units $slaves_at_a_time ${cluster_name}-slave
    echo "deployed $slaves_at_a_time slaves at $(timestamp)"
  done
}

deploy_cluster() {
  local cluster_name=$1
  local master_config="$juju_root/etc/hadoop-master.yaml"
  local master_size="instance-type=m1.large"

  juju deploy $juju_env --repository $juju_repo --constraints $master_size --config $master_config local:hadoop ${cluster_name}-master

  deploy_slaves ${cluster_name}

  juju expose $juju_env ${cluster_name}-master
}

main() {
  echo "deploying stack at $(timestamp)"

  juju bootstrap $juju_env --constraints="instance-type=m1.xlarge"

  sleep 120
  deploy_cluster hadoop

  echo "done at $(timestamp)"
}
main $*
exit 0

We experimented with more clever ways of doing the spinup (too little coffee at this point of the night)… but the real fix is to get juju to take advantage of multiplicity in api calls. Until then, timed batches work just fine.

Juju spun the cluster up in about 2 and a half hours. It had about 380TB of HDFS storage

The terasort job that was run from the script above with

SIZE=10000000000
NUM_MAPS=3000
NUM_REDUCES=3000

eventually completed.

2000 nodes

After the 1000-node run, we chose to clean up from the previous job and just add more nodes to that same cluster.

Again, to get around the api throttling, we added batches of 99 slaves at a time with a 2-minute wait between each batch until we got near 2000 slaves.

This gave us almost 760TB of HDFS storage

and was running fine

but was stopped early b/c waiting for the job to complete would’ve just been silly at this point. With our naive job config, we’re considerably past the point of diminishing returns for adding nodes to the actual terasort, and we’d captured the profiling info we needed at this point.

Juju spun up 1972 slaves in just over seven hours total. Profiling showed that juju was spending a lot of time serializing stuff into zookeeper nodes using yaml. It looks like python’s yaml implementation is python, and not just wrapping libyaml. We tested a smaller run replacing the internal yaml serialization with json.. Wham! two orders of magnitude faster. No particular surprise.

Lessons Learned

Ok, so at the end of the day, what did we learn here?

What we did here is the way developing for performance at scale should be done… start with a naive, flexible approach and then spend time and effort obtaining real profiling information. Follow that with optimization decisions that actually make a difference. Otherwise it’s all just a crapshoot based on where developers think the bottlenecks might be.

Things to do to juju as a result of these tests:

streamline our implementation of ‘-n’ options
- the client should pass the multiplicity to the provisioning agent
- the provisioning agent should pass the multiplicity to the EC2 api
don’t use yaml to marshall data in and out of zookeeper
replace per-instance security groups with per-instance firewalls

What’s Next?

So that’s a big enough bite for one round of scale testing.

Next up:

land a few of the changes outlined above into trunk. Then, spin up another round of scale tests to look at the numbers.
more providers (other clouds as well as a MaaS lab too)
regular scale testing?
- can this coincide with upstream scale testing for projects like hadoop?
test scaling for various services? What does this look like for other stacks of services?

Wishlist

find some better test jobs! benchmarks are boring… perhaps we can use this compute time to mine educational data or cure cancer or something?
perhaps push juju topology information further into zk leaf nodes? Are there transactional features in more recent versions of zk that we can use?
use spot instances on ec2. This is harder because you’ve gotta incorporate price monitoring.

Software, Physics, Data, Mountains

...and other random associations

Scaling a 2000-node Hadoop Cluster on EC2/Ubuntu With Juju