2011-11-22
(cloud, charmschool, juju)
Wanna learn more about juju?
Drop by Charm School:
Details from Jorge’s post:
We're holding a Charm School on IRC.
juju Charm School is a virtual event where a juju expert
is available to answer questions about writing your own
juju charms. The intended audience is people who deploy
software and want to contribute charms to the wider devops
community to make deploying in the public and private
cloud easy.
Attendees are more than welcome to:
Ask questions about juju and charms
Ask for help modifying existing scripts and making charms out of them
Ask for peer review on existing charms you might be working on
Though not required, we recommend that you have juju installed
and configured if you want to get deep into the event.
2011-11-08
(cloud, hadoop, juju)
#########################################################
NOTE: Repost
The Ubuntu project “ensemble” is now publicly known as “juju”. This is a repost of an older article, Monitoring Hadoop Benchmarks TeraGen/TeraSort with Ganglia, to reflect the new names and updates to the API.
#########################################################
Here I’m using new features of Ubuntu Server (namely juju) to easily deploy Ganglia alongside a small Hadoop cluster to play around with monitoring some benchmarks like Terasort.
Short Story
Deploy hadoop and ganglia using juju:
$ juju bootstrap
$ juju deploy --repository ~/charms local:hadoop-master namenode
$ juju deploy --repository ~/charms local:ganglia jobmonitor
$ juju deploy --repository ~/charms local:hadoop-slave datacluster
$ juju add-relation namenode datacluster
$ juju add-relation jobmonitor datacluster
$ for i in {1..6}; do juju add-unit datacluster; done
$ juju expose jobmonitor
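If you find yourself repeating that sequence, here's a rough sketch of it as a shell function. The names run and deploy_cluster and the DRY_RUN switch are mine, not part of juju; the juju commands themselves are exactly the ones listed above.

```shell
#!/bin/sh
# Sketch only: wraps the deployment steps from the listing above.
# run, deploy_cluster and DRY_RUN are illustrative names, not juju features.

# run CMD...: execute the command, or just print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# deploy_cluster REPO EXTRA: deploy hadoop + ganglia from the local charm
# repository REPO, then add EXTRA more hadoop-slave units
deploy_cluster() {
    repo=$1
    extra=$2
    run juju bootstrap
    run juju deploy --repository "$repo" local:hadoop-master namenode
    run juju deploy --repository "$repo" local:ganglia jobmonitor
    run juju deploy --repository "$repo" local:hadoop-slave datacluster
    run juju add-relation namenode datacluster
    run juju add-relation jobmonitor datacluster
    i=0
    while [ "$i" -lt "$extra" ]; do
        run juju add-unit datacluster
        i=$((i + 1))
    done
    run juju expose jobmonitor
}
```

Try it with DRY_RUN=1 deploy_cluster ~/charms 6 first to see what it would run before pointing it at a real EC2 account.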
When all is said and done (and EC2 has caught up), run the jobs
$ juju ssh namenode/0
ubuntu$ sudo -su hdfs
hdfs$ hadoop jar hadoop-*-examples.jar teragen -Dmapred.map.tasks=100 -Dmapred.reduce.tasks=100 100000000 in_dir
hdfs$ hadoop jar hadoop-*-examples.jar terasort -Dmapred.map.tasks=100 -Dmapred.reduce.tasks=100 in_dir out_dir
While these are running, we can run
$ juju status
to get the URL for the jobmonitor ganglia web frontend
http://<jobmonitor-instance-ec2-url>/ganglia/
and see…
and a little later as the jobs run…
Of course, I’m just playing around with ganglia at the moment… For real performance, I’d change my juju config file to choose larger (and ephemeral) EC2 instances instead of the defaults.
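Back to the juju status step for a second: rather than reading the ganglia hostname out of the status output by eye, you could pull it out with a little awk. This is only a sketch. It assumes the status output has a "N: {dns-name: host, instance-id: id}" line per machine followed by a "machine: N" line under each unit, and unit_url is a name I made up.

```shell
#!/bin/sh
# Sketch only: unit_url is an illustrative helper, not part of juju.
# It reads `juju status` text on stdin and prints the ganglia URL for
# the machine that hosts the given unit.
unit_url() {
    awk -v unit="$1" '
        # machine lines:  0: {dns-name: host, instance-id: i-...}
        $2 == "{dns-name:" {
            id = $1;   sub(/:$/, "", id)      # "0:"    -> "0"
            host = $3; sub(/,$/, "", host)    # "host," -> "host"
            dns[id] = host
        }
        # the unit key, e.g. "jobmonitor/0:"
        $1 == (unit ":") { in_unit = 1; next }
        # first "machine: N" after the unit key
        in_unit && $1 == "machine:" {
            if ($2 in dns) print "http://" dns[$2] "/ganglia/"
            exit
        }
    '
}
```

Usage would be something like juju status | unit_url jobmonitor/0.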
A Few Details…
Let’s grab the charms necessary to reproduce this.
First, let’s install juju and set up our charms.
$ sudo apt-get install juju charm-tools
Note that I’m describing all this using an Ubuntu laptop to run the juju CLI because that’s how I roll, but you can certainly use a Mac to drive your Ubuntu services in the cloud. The juju CLI is already available in ports, but I’m not sure which version. Homebrew packages are in the works. Windows should work too, but I don’t have a clue.
$ mkdir -p ~/charms/oneiric
$ cd ~/charms/oneiric
$ charm get hadoop-master
$ charm get hadoop-slave
$ charm get ganglia
That’s about all that’s really necessary to get you up and benchmarking/monitoring.
I’ll do another post on how to adapt your own charms to use monitoring and the monitor juju interface as part of the “Core Infrastructure” series I’m writing for charm developers. I’ll go over the process of what I had to do to get the hadoop-slave service talking to monitoring services like ganglia.
Until then, clone/test/enjoy… or better yet, fork/adapt/use!
2011-11-08
(cloud, juju, hadoop)
#########################################################
NOTE: Repost
The Ubuntu project “ensemble” is now publicly known as “juju”. This is a repost of an older article, Painless Hadoop / Ubuntu / EC2, to reflect the new names and updates to the API.
#########################################################
Thanks Michael Noll for the posts where I first learned how to do this stuff:
I’d like to run his exact examples, but this time around I’ll use juju for hadoop deployment/management.
The Short Story
Setup
install/configure juju client tools
$ sudo apt-get install juju charm-tools
$ mkdir ~/charms && charm getall ~/charms
run hadoop services with juju
$ juju bootstrap
$ juju deploy --repository ~/charms local:hadoop-master namenode
$ juju deploy --repository ~/charms local:hadoop-slave datanodes
$ juju add-relation namenode datanodes
optionally add datanodes to scale horizontally
$ juju add-unit datanodes
$ juju add-unit datanodes
$ juju add-unit datanodes
(you can add/remove these later too)
Scaling is so easy there’s no point in separate standalone -vs- multinode versions of the setup.
Data and Jobs
Load your data and jars
$ juju ssh namenode/0
ubuntu$ sudo -su hdfs
hdfs$ cd /tmp
hdfs$ wget http://files.markmims.com/gutenberg.tar.bz2
hdfs$ tar xjvf gutenberg.tar.bz2
copy the data into hdfs
hdfs$ hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
run mapreduce jobs against the dataset
hdfs$ hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar wordcount -Dmapred.map.tasks=20 -Dmapred.reduce.tasks=20 gutenberg gutenberg-output
That’s it!
Now, again with some more details…
Installing juju
Install juju client tools onto your local machine…
$ sudo apt-get install juju charm-tools
We’ve got the juju CLI in ports now too for Mac clients (Homebrew is in progress).
Now generate your environment settings with
$ juju
and then edit ~/.juju/environments.yaml to use your EC2 keys. It’ll look something like:
environments:
  sample:
    type: ec2
    control-bucket: juju-<hash>
    admin-secret: <hash>
    access-key: <your ec2 access key>
    secret-key: <your ec2 secret key>
    default-series: oneiric
In real life you’d probably want to set default-instance-type to at least m1.large too, but I’ll give some examples of that in later posts.
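A quick sanity check before bootstrapping can save a round trip to EC2. Here's a sketch that just greps the file for the keys shown in the sample above; check_env is my own name for it, and it doesn't validate the values or the YAML structure.

```shell
#!/bin/sh
# Sketch only: check_env is an illustrative helper, not part of juju.
# It reports which of the keys from the sample listing are absent from
# the given environments.yaml. It does not parse YAML or check values.
check_env() {
    file=$1
    missing=0
    for key in type control-bucket admin-secret access-key secret-key default-series; do
        if ! grep -q "^[[:space:]]*$key:" "$file"; then
            echo "missing: $key" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Run it as check_env ~/.juju/environments.yaml; it returns non-zero and names the missing keys on stderr if anything from the sample isn't there.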
Hadoop
Grab the juju charms
Make a place for charms to live
$ mkdir -p ~/charms/oneiric
$ cd ~/charms/oneiric
$ charm get hadoop-master
$ charm get hadoop-slave
(optionally, you can charm getall but it’ll take a bit to pull all charms).
Start the Hadoop Services
Spin up a juju environment
$ juju bootstrap
wait a minute or two for EC2 to comply. You’re welcome to watch the water boil with
$ juju status
or even
$ watch -n30 juju status
which’ll give you output like
$ juju status
2011-07-12 15:20:54,978 INFO Connecting to environment.
The authenticity of host 'ec2-50-17-28-19.compute-1.amazonaws.com (50.17.28.19)' can't be established.
RSA key fingerprint is c5:21:62:f0:ac:bd:9c:0f:99:59:12:ec:4d:41:48:c8.
Are you sure you want to continue connecting (yes/no)? yes
machines:
  0: {dns-name: ec2-50-17-28-19.compute-1.amazonaws.com, instance-id: i-8bc034ea}
services: {}
2011-07-12 15:21:01,205 INFO 'status' command finished successfully
Next, you need to deploy the hadoop services:
$ juju deploy --repository ~/charms local:hadoop-master namenode
$ juju deploy --repository ~/charms local:hadoop-slave datanodes
now you simply relate the two services:
$ juju add-relation namenode datanodes
Relations are where the juju special sauce is, but more about that in another post.
You can tell everything’s happy when juju status gives you something like (looks a bit different, but basics are the same):
$ juju status
2011-07-12 15:29:20,331 INFO Connecting to environment.
machines:
  0: {dns-name: ec2-50-17-28-19.compute-1.amazonaws.com, instance-id: i-8bc034ea}
  1: {dns-name: ec2-50-17-0-68.compute-1.amazonaws.com, instance-id: i-4fcf3b2e}
  2: {dns-name: ec2-75-101-249-123.compute-1.amazonaws.com, instance-id: i-35cf3b54}
services:
  namenode:
    formula: local:hadoop-master-1
    relations: {hadoop-master: datanodes}
    units:
      namenode/0:
        machine: 1
        relations:
          hadoop-master: {state: up}
        state: started
  datanodes:
    formula: local:hadoop-slave-1
    relations: {hadoop-master: namenode}
    units:
      datanodes/0:
        machine: 2
        relations:
          hadoop-master: {state: up}
        state: started
2011-07-12 15:29:23,685 INFO 'status' command finished successfully
Loading Data
Log into the master node
$ juju ssh namenode/0
and become the hdfs user
ubuntu$ sudo -su hdfs
pull the example data
hdfs$ cd /tmp
hdfs$ wget http://files.markmims.com/gutenberg.tar.bz2
hdfs$ tar xjvf gutenberg.tar.bz2
and copy it into hdfs
hdfs$ hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
Running Jobs
Similar to above, but now do
hdfs$ hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar wordcount gutenberg gutenberg-output
you might want to explicitly set the number of map and reduce tasks to use…
hdfs$ hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar wordcount -Dmapred.map.tasks=20 -Dmapred.reduce.tasks=20 gutenberg gutenberg-output
depending on the size of the cluster you decide to spin up.
You can look at logs on the slaves by
$ juju ssh datanodes/0
ubuntu$ tail /var/log/hadoop/hadoop-hadoop-datanode*.log
ubuntu$ tail /var/log/hadoop/hadoop-hadoop-tasktracker*.log
similarly for subsequent slave nodes if you’ve spun them up
$ juju ssh datanodes/1
or
$ juju ssh datanodes/2
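Once there are more than a handful of slaves, typing unit names gets old. Here's a sketch that lists the units of a service from the status output so you can loop over them. It assumes each unit shows up as a "service/N:" key on a line of its own, as in the status output earlier; list_units is just a name I picked.

```shell
#!/bin/sh
# Sketch only: list_units is an illustrative helper, not part of juju.
# It reads `juju status` text on stdin and prints one unit name per
# line for the given service (e.g. datanodes/0, datanodes/1, ...).
list_units() {
    awk -v svc="$1" '
        # unit keys look like "datanodes/0:" alone on a line
        NF == 1 && index($1, svc "/") == 1 && $1 ~ /\/[0-9]+:$/ {
            name = $1
            sub(/:$/, "", name)
            print name
        }
    '
}
```

Something like juju status | list_units datanodes prints the names, which you can then feed to juju ssh one at a time.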
Horizontal Scaling
To resize your cluster,
$ juju add-unit datanodes
or even
$ for i in {1..10}; do juju add-unit datanodes; done
Wait for juju status to show everything in a happy state and then run your jobs.
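If you'd rather not watch status by hand, here's a rough polling sketch. It treats "happy" as a unit-level "state: started" line like the ones in the status output above, which is my assumption about what to look for. The names count_started and wait_for_units and the JUJU override are made up for the example.

```shell
#!/bin/sh
# Sketch only: count_started, wait_for_units and the JUJU variable are
# illustrative, not part of juju.

# count_started: read `juju status` text on stdin and print how many
# units report "state: started" (relation lines like {state: up} are
# not counted because they do not match the whole line)
count_started() {
    grep -c '^[[:space:]]*state: started[[:space:]]*$' || true
}

# wait_for_units N [INTERVAL]: poll until at least N units are started,
# sleeping INTERVAL seconds (default 30) between polls
wait_for_units() {
    want=$1
    interval=${2:-30}
    while :; do
        have=$(${JUJU:-juju} status | count_started)
        [ "$have" -ge "$want" ] && return 0
        echo "waiting: $have/$want units started" >&2
        sleep "$interval"
    done
}
```

For a namenode plus eleven datanodes that would be wait_for_units 12. There's no timeout in this sketch, so it will loop forever if a unit never comes up.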
I was able to add slave nodes in the middle of a run… they pick up load and crank.
Check out the juju status output for a simple 10-slave cluster here