Spinning Up a Spark Cluster on Spot Instances: Step by Step (insightdataengineering.com)
37 points by ddrum001 on Oct 17, 2015 | 10 comments


Very nice. I prefer using the built-in command-line EC2 scripts packaged with all Spark versions.

https://spark.apache.org/docs/latest/ec2-scripts.html

You can even specify the spot instance price, instance type, number of nodes, etc.

e.g.

./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 --zone=us-west-1a --spot-price=0.2 --instance-type=m4.4xlarge launch my-spark-cluster
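The same script also manages the cluster lifecycle via subcommands. A sketch, reusing the cluster name and credentials from the example above (note that spot instances generally cannot be stopped and restarted, only terminated, so stop/start applies to on-demand clusters):

```shell
# SSH into the master of a running cluster
./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 login my-spark-cluster

# Pause an on-demand cluster and bring it back later
./spark-ec2 --region=us-west-1 stop my-spark-cluster
./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 start my-spark-cluster

# Tear everything down when finished
./spark-ec2 --region=us-west-1 destroy my-spark-cluster
```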


Indeed, the script works quite well for Spark - but we also wanted to provide a guide for those who would like to understand the config files more deeply, especially those who want to 'tinker' with them down the road (e.g. adding nodes to the cluster). We also find it helpful for learning to set up Spark on other platforms, and for setting up other systems that don't have EC2 scripts yet.


If you don't care to do all the setup yourself, we've recently announced Dataproc as a fully-managed service including support for Preemptible VMs: https://cloud.google.com/dataproc/ .

Disclaimer: I work on Compute Engine, specifically Preemptible VMs, but didn't work on Dataproc (though I did add --preemptible to bdutil!)


It only specifies pricing per CPU - how is it affected by memory per node? Is that configurable?


I wish they had been clearer, sorry about that (I'll send them the equivalent of a pull request): you pay for Dataproc at a rate of $0.01/"core"/hour regardless of which instance shape you use. However, you still pay for the underlying compute and storage; the $0.01/core/hour is the "service" fee.
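The pricing model described above is easy to work out by hand. A minimal sketch (the cluster shape and hours here are made-up illustration values, not Dataproc defaults; the VM cost itself is extra):

```python
def dataproc_service_fee(vcpus_per_node, nodes, hours, rate=0.01):
    """Dataproc service fee: $0.01 per core per hour, regardless of
    instance shape. Returns the fee in dollars; underlying Compute
    Engine and storage charges are billed separately."""
    return vcpus_per_node * nodes * hours * rate

# e.g. a 5-node cluster of 4-vCPU machines running for 10 hours:
fee = dataproc_service_fee(vcpus_per_node=4, nodes=5, hours=10)
print(f"${fee:.2f}")  # $2.00 in service fees, plus the VM cost itself
```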


Neat!

EMR also added support for Spark this year: https://aws.amazon.com/elasticmapreduce/details/spark/


There's a script for that, actually:

http://spark.apache.org/docs/latest/ec2-scripts.html


Man, I wish I could use Spark at work, but it uses a Mavenized build that requires open internet access for dependency download. Not worth it when I have to register every dependent JAR file by hand in my internal repository.


You can use an uberjar (fat jar) and package the dependencies as a single jar.
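With a Mavenized build, one common way to do this is the Maven Shade plugin, which bundles all declared dependencies into a single "uber" JAR at package time. A minimal pom.xml sketch (the plugin version is an assumption; marking Spark itself as `provided` so the cluster's own copy is used is a common convention, not something stated in the thread):

```xml
<!-- pom.xml fragment: build an uberjar with the Maven Shade plugin -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.4.1</version>
      <executions>
        <execution>
          <!-- Bind shading to the package phase: `mvn package` emits the fat jar -->
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

Declaring spark-core with `<scope>provided</scope>` keeps Spark's own classes out of the fat jar, so only your application's dependencies need to be present.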


Why wouldn't you just use EMR with YARN and Spark 1.5?



