-
Notifications
You must be signed in to change notification settings - Fork 0
Home
Before you do anything else, make sure you have the right version of [lein] (https://github.com/technomancy/leiningen). If you don't have the right version, run lein upgrade and hen run lein clean, deps.
If inexplicable things happen, make sure your local repository (~/.m2) has the latest versions of all dependencies. We've gotten f'd by outdated versions more than once. We're using development versions of a few things after all.
One issue we've run into is having an outdated version of Cascalog. The solution seems to be the following:
rm -r ~/.m2/repository/cascalogNow that you have the right code, you need to download and install the elastic-mapreduce ruby script. Installation is simple: unzip the the file you downloaded, then add that location to your .bash_profile, akin to this:
PATH="/path/to/elastic-mapreduce-ruby:${PATH}"
export PATHThen either run the command source ~/.bash_profile or open a new terminal window and work in that. You'll set up the credentials for actually using elastic-mapreduce below.
Finally, and most importantly, pronounce lein as LINE and not LEAN else the cluster will shutdown in a non-deterministic way.
Save the json object below in a file called credentials.json in the forma-deploy directory.
{
"access-id":"AKIAJ56QWQ45GBJELGQA",
"private-key":"6L7JV5+qJ9yXz1E30e3qmm4Yf7E1Xs4pVhuEL8LV",
"key-pair":"forma-keypair",
"key-pair-file":"~/.ssh/id_rsa-forma-keypair",
"log-uri":"s3n://reddemrlogs"
}The id_rsa-forma-keypair file referenced in credentials.json is required. Ask Dan, Sam, or Robin for it. Once you have it, be sure the permissions are set correctly - too loose and you won't be able to do much. The command to properly set the permissions is.
$ chmod 600 ~/.ssh/id_rsa-forma-keypairFirst, upload the latest forma-clj/dev/forma_bootstrap.sh script to S3 here:
https://s3.amazonaws.com/reddconfig/bootstrap-actions/forma_bootstrap.sh
Note that the above URL is hard coded here:
https://github.com/reddmetrics/forma-deploy/blob/master/src/forma/hadoop/cluster.clj#L241
In forma-deploy/src/forma/hadoop/cli.clj is the CLI for deploying to cluster. From within the forma-deploy directory, use this:
lein run --type large --emr --size 25There are three supported cluster types, defined in forma-deploy/src/forma/hadoop/cluster.clj: large which launches instances of type m1.large (ami-4db76624), high-memory which launches instances of type m2.4xlarge, and cluster-compute which launches instances of type cc1.4xlarge.
It takes awhile to boot up the cluster. Here are a few helpful commands for monitoring and terminating a cluster:
$ elastic-mapreduce --list --active
$ # terminate all active clusters/job flows
$ elastic-mapreduce --list --active --terminate
$ # get a cluster/job flow id, and terminate it
$ elastic-mapreduce --terminate --jobflow j-ABABABASABA
$ # log into master node
elastic-mapreduce --ssh --jobflow j-ABABABABAMore helpful hints from AWS: http://aws.amazon.com/developertools/2264
You can use the option --all to list all past jobs on EMR, not just the ones from the previous week or two.
You can also monitor the cluster via the AWS EMR console:
https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1
And to monitor nodes coming online, see the AWS EC2 console:
https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances
If your cluster gets killed due to spot price spikes, try moving it to a different availability zone (AZ). For now you need to edit the file src/forma/hadoop/cluster.clj and add a line like the following (choose your own AZ) in the boot-emr! function:
--availability-zone us-east-1eAfter the cluster on EMR is in the Waiting state, we're ready. Let's lein uberjar the forma-clj project and then upload it to the job tracker:
$ lein uberjar
# Creates forma-0.2.0-SNAPSHOT-standalone.jarThen from within forma-deploy:
$ elastic-mapreduce --list # Get the job tracker URL (e.g., ec2-67-202-41-140.compute-1.amazonaws.com)
$ sftp -i ~/.ssh/id_rsa-forma-keypair [email protected]
$ put forma-0.2.0-SNAPSHOT-standalone.jar
# alteratively, for the scp-inclined:
$ scp -i ~/.ssh/id_rsa-forma-keypair forma-0.2.0-SNAPSHOT-standalone.jar [email protected]:Next SSH in, create a new screen, and become the hadoop user:
$ ssh -i ~/.ssh/id_rsa-forma-keypair [email protected]
$
$ # this gets you a fresh, ready-to-use "window" in screen,
$ # and turns on logging for anything printed to stdout.
$ screen -LmThen you're ready to run a job! Try this, for 500m data in Indonesia.
$ hadoop jar forma-0.2.0-SNAPSHOT-standalone.jar forma.hadoop.jobs.modis s3n://modisfiles/MOD13A1 s3n://pailbucket/master * :IDNOr launch a REPL and run your job:
$ hadoop jar forma-0.2.0-SNAPSHOT-standalone.jar clojure.mainOnce in the REPL:
(use 'forma.hadoop.jobs.scatter)
(in-ns 'forma.hadoop.jobs.scatter)
(ultrarunner "/user/hadoop/checkpoint"
"s3n://formaresults/staticbuckettemp"
"s3n://formaresults/finalbuckettemp"
"s3n://formaresults/finaloutput")It can be useful to log into slave nodes on occasion, whether to check memory usage or browse local log files. You can do this by grabbing the private IP address of an instance using the Hadoop Jobtracker. For example, this page gives you the list of nodes you're running:
http://ec2-23-20-44-190.compute-1.amazonaws.com:9100/machines.jsp?type=activeThe Host column gives you the private IP address for each node, for example 10.28.100.164. Log into the AWS console and go to the EC2 page, and enter the private IP address in the search/filter box. The instance this matches will become the only one visible. Click as usual to get its public DNS address. If you have function emr() { ssh -i ~/.ssh/id_rsa-forma-keypair hadoop@$@; } in your ~/.bash_profile, you can login using the public DNS like this:
$ emr ec2-23-20-44-190.compute-1.amazonaws.comNice tutorial for getting started. And here are the basic commands you'll need:
Launching:
launch: screen
launch with no welcome screen: screen -m
launch with logging: screen -Lm
launch and immediately disconnect: screen -dLm
Once in screen:
disconnect and kill: ^-d
disconnect without killing: ^-a d
view stdout history: ^-a [
see key bindings: ^-a ?
Reconnect:
get list of existing screens (if more than one): screen -r
reconnect to existing screen (if only one): screen -r
reconnect to most recent screen (if one or more): screen -rr
reconnect to specific screen: screen -r internal-address.ttys.other-stuff
If you somehow get disconnected and screen says something is still attached to a "window", use this to force a disconnect and reconnect immediately:
screen -dr internal-address.ttys.other-stuff
Sign into the jobtracker by
http://<Master DNS>:9100/jobtracker.jsp
For example,
http://ec2-107-21-135-90.compute-1.amazonaws.com:9100/jobtracker.jsp
You can find the DNS here.
Sometimes the forma_bootstrap.sh is messed up, or it's not public on S3, causing the EMR cluster to automatically fail and shutdown. If this happens, stay calm! Just comment out the bootstrap step here, spin up a small cluster, upload forma_bootstrap.sh to the job tracker, run it, and investigate the errors.
Make sure you have the latest version of the EMR CLI tools. Defaults change across versions, and the bootstrap script as of 2/13/2011 expects the version released in December 2011.
To examine the EMR logs, navigate to the reddemrlogs bucket on the S3 console or some other S3 access service. The folders are tagged with with the Job Flow ID, which can be found by navigating to Elastic MapReduce tab on the AWS console. Click on the relevant EMR job, and the Job Flow ID will be at the top of the descriptive panel.
If there are memory errors, try out a few things listed on this StackOverflow page
If there's a problem with bootstrapping the cluster, it can be useful to run the bootstrapping script on the AMI we use: ami-4db76624.
hadoop jar forma-0.2.0-SNAPSHOT-standalone.jar clojure.main