Spinning Up a Spark Cluster on Spot Instances: Step by Step (insightdataengineering.com)
37 points by ddrum001 on Oct 17, 2015 | 10 comments


Very nice. I prefer using the built-in command-line EC2 scripts packaged with all Spark versions.

https://spark.apache.org/docs/latest/ec2-scripts.html

You can even specify the spot instance price, instance type, number of nodes, etc.

e.g.

./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 --zone=us-west-1a --spot-price=0.2 --instance-type=m4.4xlarge launch my-spark-cluster
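The same script also manages the cluster lifecycle via subcommands. A sketch, reusing the cluster name and credentials from the example above (note that spot instances generally cannot be stopped and restarted, only terminated, so stop/start applies to on-demand clusters):

```shell
# SSH into the master of a running cluster
./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 login my-spark-cluster

# Pause an on-demand cluster and bring it back later
./spark-ec2 --region=us-west-1 stop my-spark-cluster
./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 start my-spark-cluster

# Tear everything down when finished
./spark-ec2 --region=us-west-1 destroy my-spark-cluster
```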


Indeed, the script works quite well for Spark - but we also wanted to provide a guide for those who would like to understand the config files more deeply, especially those who want to 'tinker' with them down the road (e.g. adding nodes to the cluster). We also find it helpful for learning to set up Spark on other platforms, and for setting up other systems that don't have EC2 scripts yet.


If you don't care to do all the setup yourself, we've recently announced Dataproc as a fully-managed service including support for Preemptible VMs: https://cloud.google.com/dataproc/ .

Disclaimer: I work on Compute Engine, specifically Preemptible VMs, but didn't work on Dataproc (though I did add --preemptible to bdutil!)


It only specifies pricing per CPU - how is it affected by memory per node? Is that configurable?


I wish they had been clearer, sorry about that (I'll send them the equivalent of a pull request): you pay for Dataproc at a rate of $0.01/"core"/hour regardless of which instance shape you use. However, you still pay for the underlying compute and storage; the $0.01/core/hour is the "service" fee.
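The pricing model described above is easy to work out by hand. A minimal sketch (the cluster shape and hours here are made-up illustration values, not Dataproc defaults; the VM cost itself is extra):

```python
def dataproc_service_fee(vcpus_per_node, nodes, hours, rate=0.01):
    """Dataproc service fee: $0.01 per core per hour, regardless of
    instance shape. Returns the fee in dollars; underlying Compute
    Engine and storage charges are billed separately."""
    return vcpus_per_node * nodes * hours * rate

# e.g. a 5-node cluster of 4-vCPU machines running for 10 hours:
fee = dataproc_service_fee(vcpus_per_node=4, nodes=5, hours=10)
print(f"${fee:.2f}")  # $2.00 in service fees, plus the VM cost itself
```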


Neat!

EMR also added support for Spark this year: https://aws.amazon.com/elasticmapreduce/details/spark/


There's a script for that, actually:

http://spark.apache.org/docs/latest/ec2-scripts.html


Man, I wish I could use Spark at work, but it uses a Mavenized build that requires open internet access for dependency download. Not worth it when I have to register every dependent JAR file by hand in my internal repository.


You can use an uberjar (fat jar) and package the dependencies as a single jar.
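With a Mavenized build, one common way to do this is the Maven Shade plugin, which bundles all declared dependencies into a single "uber" JAR at package time. A minimal pom.xml sketch (the plugin version is an assumption; marking Spark itself as `provided` so the cluster's own copy is used is a common convention, not something stated in the thread):

```xml
<!-- pom.xml fragment: build an uberjar with the Maven Shade plugin -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.4.1</version>
      <executions>
        <execution>
          <!-- Bind shading to the package phase: `mvn package` emits the fat jar -->
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

Declaring spark-core with `<scope>provided</scope>` keeps Spark's own classes out of the fat jar, so only your application's dependencies need to be present.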


Why wouldn't you just use EMR with YARN and Spark 1.5?



