Spark Clusters on AWS EC2 - Reading and Writing S3 Data - Predicting Flight Delays with Spark Part 1


This tutorial will give you a picture of the overall workflow needed to set up a predictive model in Spark using Scala and Python. Since we're going to do this from scratch, we're going to have to go through the entire process, from gathering our data to munging it and merging it together. After we've done that, we're going to perform two modeling exercises. First, we're going to perform a supervised classification to try to predict whether or not a flight will be delayed. Second, we're going to perform a regression to try to predict how much a flight will be delayed. This will be presented as a series: the first part is dedicated to setting up a Spark cluster on Amazon AWS, then we're going to perform some exploratory analysis and feature creation, and finally we'll perform our modeling and see what we get back! However, before we get there, we're going to need to understand some of the basic terminology. This predictive pipeline is going to take data from S3, convert it into Spark DataFrames, and use the Pipelines API in MLlib to automate our transformations and predictions. So let's walk through the basics.

What is MLlib?

DataFrame before continuing.

How does MLlib compare to Mahout?

If you're coming from any big data experience or the Hadoop ecosystem, you'll likely have heard of Mahout, another Apache project. Mahout runs on MapReduce (so it's much slower than Spark) but has had more time to mature. It's certainly a very stable, very operationally powerful system. However, my bet is on Spark, and in some respects it seems that Mahout's is as well. If you look at the latest release of Mahout (0.11.0 as of the time of this writing), you'll see that they're starting to allow Mahout to plug into Spark as a backend. I'm not familiar enough with the Mahout system or its future to describe in detail what will happen, but I think it's safe to say that using and developing with Spark is a safe bet.

What are Pipelines?

Pipelines are a concept that was borrowed from Python's extremely popular scikit-learn library. Scikit-learn is an excellent library for machine learning on a single computer and has an awesome abstraction called the pipeline. It makes for very repeatable transformations and processing, something that Spark really aims for as well. The core idea that defines the Pipeline concept is that you're going to apply a series of transformations to your data and finally fit a model at the end. It'll become clearer as we move forward.
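To make the idea a little more concrete, here is a minimal sketch in plain Python. It is deliberately not the MLlib or scikit-learn API. The class names (FillMissing, MinutesToHours, MeanThreshold, SimplePipeline) are hypothetical and exist only to show the shape of the concept: a list of transformation steps applied in order, followed by a model that is fit on the transformed data.

```python
class FillMissing:
    """Transformer: replaces missing (None) values with a default."""
    def __init__(self, default):
        self.default = default

    def transform(self, values):
        return [self.default if v is None else v for v in values]


class MinutesToHours:
    """Transformer: converts delay minutes to hours."""
    def transform(self, values):
        return [v / 60.0 for v in values]


class MeanThreshold:
    """Toy model: learns the mean, then flags values above it with 1."""
    def fit(self, values):
        self.threshold = sum(values) / len(values)
        return self

    def predict(self, values):
        return [1 if v > self.threshold else 0 for v in values]


class SimplePipeline:
    """Applies every transformer in order, then fits or uses the model."""
    def __init__(self, transformers, model):
        self.transformers = transformers
        self.model = model

    def _apply(self, values):
        for step in self.transformers:
            values = step.transform(values)
        return values

    def fit(self, values):
        self.model.fit(self._apply(values))
        return self

    def predict(self, values):
        return self.model.predict(self._apply(values))


pipeline = SimplePipeline([FillMissing(0), MinutesToHours()], MeanThreshold())
pipeline.fit([None, 30, 60, 150])
print(pipeline.predict([120, 30, None]))
```

The point is that the same transformations run at fit time and at prediction time, which is what makes the whole thing repeatable.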

The Data

As I mentioned before, we are going to predict flight delays. However, the data we're going to use isn't just flight information but weather information as well. This will make for an interesting exercise because we're going to be able to answer questions like "if it snows the night before, what is the likelihood that a United flight will be delayed?" I think this is a fun pair of datasets to use.

Flight Data

We are going to use some flight data from the Stat-Computing Data Expo. In this tutorial I am going to use just 2007 and 2008, but if you're feeling adventurous, feel free to download more!

Weather Data

The weather data is from the second QCLCD link. You're going to see a lot of tar and zip files. Those are what we're going to be downloading. All in all, this data is of a decent size. No, it's not "big data", but it's definitely getting towards medium data. So I won't be performing these computations on my computer; I'm going to be performing them on a Spark cluster, which we will set up shortly! Go ahead and download all the 2007 and 2008 data.

Loading the Data

Now we can certainly load data from our local computer to our cluster, however this is a bit inefficient and makes our exercises much less repeatable. So what you should do is put all the data in an Amazon S3 bucket so that it's easy to access. One of my best practices is to use an Amazon S3 bucket that I call "datasets" or something similar. From there, I'm able to save lots of interesting datasets to play around with later. You can use the simple web UI to upload data into Amazon S3. Once you've created that bucket, I would recommend creating two folders in it. The first should be flight_data, where you'll put the aforementioned flight data, and the second should be weather_data, where you'll put the aforementioned weather data. If you're feeling lost, this tutorial should help you out.

Spark Clusters on Amazon EC2

Now there are multiple ways of setting up a cluster on AWS, however by far the simplest way is just to use the built-in script distributed with Spark. Not everyone knows this, but Spark has built-in support for launching a cluster of basically any size on Amazon EC2. The official documentation is here, but I've found it a bit sparse. So, I've decided to write it out in a bit more detail with examples of the basic commands. What's really exceptional about doing things this way is that it's extremely repeatable and automatically sets up HDFS and Spark on our master and slave machines.


You're going to need an AWS account and you're going to need to have set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your Amazon EC2 access key ID and secret access key. You can find instructions about how to do that on the Amazon AWS Documentation website. Please, please don't ever commit these keys to version control, as you will pay dearly for it later on when someone starts up max GPU clusters to mine bitcoin on your account.
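Before launching anything, it can save some head scratching to confirm that both variables are actually visible to your shell. Here is a small sketch in Python that only checks for their presence and never prints their values. The function name missing_aws_vars is my own for this example and is not part of any AWS or Spark tooling.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Both AWS credential variables are set.")
```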

Launching a Cluster

Now for the fun part! Navigate to the Spark directory and cd into the ec2 directory. What I'm going to do is create a 7 node cluster (one master plus six slaves) called airplane-cluster. I'm going to use the default instance size/type and specify a region. To simplify accessing S3 data I'm also going to pass in the --copy-aws-credentials option.
./spark-ec2 -k <keypair> -i <key-file> --region=us-west-2 -s 6 -v 1.4.1 --copy-aws-credentials launch airplane-cluster
The -s option specifies the number of slaves in this cluster, while the -v option specifies the Spark version. To state the obvious, you should be filling in the <keypair> and <key-file> placeholders with your own AWS information. Doing that will launch our cluster on Amazon! This does cost money and I'm not responsible for any charges that you incur, however keep in mind that they're pretty cheap overall. This whole tutorial shouldn't cost you much more than a couple of dollars. Just be sure to shut off your instances and remove your EBS volumes when you're done. Your cluster will take some time to start, maybe 15 or 20 minutes, so feel free to grab a cup of coffee. Sometimes you'll see some launch errors. Sometimes these can be ignored and other times they can't. If you have any trouble with the rest of the tutorial, just kill the cluster and start fresh. That's what's awesome about this startup script, it's just so easy!

Other Options

When you're creating a cluster there are plenty of other options that you can pass if you like. You can launch under a VPC, in a specific zone, and in a specific subnet as well. Elaborating on these options is outside the scope of this tutorial, but feel free to ask questions in the comments section. Typically these options solve pretty specific business cases so, especially if you're learning, they can be ignored.

Logging into the Cluster

Once you've got your cluster all booted up, you're going to want to SSH into it to run some commands and all that. Luckily, the command to log in to the cluster is simple.
./spark-ec2 -k <keypair> -i <key-file> --region=us-west-2 login airplane-cluster
That will log you into the master node, which allows you to start the Spark shell or access HDFS. What we're able to do at that point is use something like the Spark shell with one command.

Pausing a Cluster

Because of this startup time, if you're running a cluster, you're at times going to want to pause it for a little bit while you jump into a meeting or onto a phone call. It may also be that you're performing some analysis and realize that you need to wait for some other data, but don't want to kill your cluster because you're hoping to pick up where you left off. To do this you're going to stop the cluster, which avoids being charged for compute time while you still incur the storage costs associated with AWS EBS. This might sound a bit strange, but basically the idea is that you don't want to start fresh, you just want to pause things for a little bit and pick up where you left off.
./spark-ec2 -k <keypair> -i <key-file> --region=us-west-2 stop airplane-cluster

Restarting a Cluster

Once we've paused, we need a way to get going again. We can start our cluster again with a command you may have already guessed!
./spark-ec2 -k <keypair> -i <key-file> --region=us-west-2 start airplane-cluster

Killing a Cluster

Once you've run all your calculations, you're going to want to destroy the cluster. *This will remove all data and all code and cannot be undone.* When you perform this action, please be aware that you will not be able to recover anything, so be sure that this is what you do indeed want to do. The command to do so can be found below.
./spark-ec2 -k <keypair> -i <key-file> --region=us-west-2 destroy airplane-cluster
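The login, stop, start and destroy commands differ only in the action word, and a mistyped cluster name is an easy mistake to make when you're retyping them. As a rough sketch, you could keep the cluster name in one constant and build each command from it. The function name cluster_command and the keypair placeholders below are my own and are not part of spark-ec2.

```python
CLUSTER_NAME = "airplane-cluster"
ACTIONS = ("login", "stop", "start", "destroy")


def cluster_command(action, keypair, key_file,
                    region="us-west-2", cluster_name=CLUSTER_NAME):
    """Build the spark-ec2 argument list for one lifecycle action."""
    if action not in ACTIONS:
        raise ValueError("unsupported action: " + repr(action))
    return [
        "./spark-ec2",
        "-k", keypair,
        "-i", key_file,
        "--region=" + region,
        action, cluster_name,
    ]


# Hypothetical keypair name and key file; substitute your own.
for action in ("stop", "start"):
    print(" ".join(cluster_command(action, "my-keypair", "my-keypair.pem")))
```

Since destroy cannot be undone, you may still prefer to type that one by hand.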

HDFS on The Cluster

There is an HDFS that is set up when we start up using the EC2 scripts. This is incredibly convenient because it sets everything up for you. Naturally, there are some things that you should be aware of moving forward.

Storing Ephemeral HDFS Data

Sometimes you are just going to add data to the cluster and do not intend to persist it across sessions, even though you may need to restart your cluster at some point. The ephemeral-hdfs allows you to store just this kind of data. This data does not persist across node/cluster restarts. We can access this HDFS through the folder ephemeral-hdfs in the root directory of our master node. For example, to list all the folders on our ephemeral-hdfs we would run the following command while we were ssh'd into the server.
./ephemeral-hdfs/bin/hadoop dfs -ls /

Storing Persistent HDFS Data

The persistent HDFS is the one that survives pauses and restarts of our cluster/nodes. With the defaults provided by the spark-ec2 command, there isn't that much space available on each node for this HDFS, however you can expand it by specifying a size when you create your cluster with the option --ebs-vol-size. By default, this HDFS service is not running. To start the persistent HDFS, you'll need to stop the ephemeral one and start the persistent one.
Of course you'll reverse that for the opposite effect.

Storing Data to Your Local Computer

You're likely going to want to save the output of your programs, so saving it to HDFS or even your nodes just won't do. That's when you'll want to write to S3 or download the data when you're done. If you choose the latter route, you should use a tool like rsync or scp to copy the data from the node to your local machine. However, remember that you're limited by the available space on the master node. Therefore I recommend just writing out to S3.

Monitoring Your Cluster

Plenty of times you're going to want to monitor your cluster. Most simply, you're going to want to know what is going on with the applications that you're running. To access the Spark application information, you'll need to navigate to the master node's web UI, which is served at the URL of your Spark master node on port 8080. There is other monitoring that can be used, like Ganglia, however I won't be diving into those details. Keep in mind that this web UI is *public by default*. That means that anyone can go to that URL and see what you're running, as well as the stdout and stderr. While you're playing around this might not be an issue, but moving forward you might want to change those security rules if you're going to be using this in production.

Reading & Writing S3 Data

Writing directly to S3 is actually super easy because Spark (assuming you used the --copy-aws-credentials option) already has all that it needs to access S3. However, sometimes it may not work or you may not be able to do this, so let's go over some other ways of doing this. On your master node you can just run this in your standard bash shell.
You can also specify them inline in your applications when you read in the data, although remember that this is a potential security issue.
In Scala:
val data = sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@/path/")
In Python:
data = sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@/path/")
Once that's completed then we can read and write from/to Amazon S3.
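If you'd rather not embed keys in the URI at all, one commonly used alternative is to set them on the Hadoop configuration that Spark uses. The sketch below is a hedged example and not the only way to do it. The helper name configure_s3n is mine, and reaching the configuration through sc._jsc.hadoopConfiguration() in PySpark relies on a private attribute, so treat that part as an assumption. The property names fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey are the ones Hadoop's s3n filesystem reads.

```python
import os


def configure_s3n(hadoop_conf, environ=None):
    """Copy AWS keys from the environment into a Hadoop configuration.

    `hadoop_conf` can be any object with a set(key, value) method.
    """
    env = os.environ if environ is None else environ
    hadoop_conf.set("fs.s3n.awsAccessKeyId", env["AWS_ACCESS_KEY_ID"])
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", env["AWS_SECRET_ACCESS_KEY"])


# In the pyspark shell, where `sc` already exists, this might be used as:
# configure_s3n(sc._jsc.hadoopConfiguration())
```

This keeps the keys out of your source files and out of any paths that end up in logs.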

Demonstration of Reading and Writing to S3

This part assumes that you've been following this tutorial so far. So assuming that you have followed the instructions for the flight data mentioned at the top, you can just run a couple of commands in your spark / pyspark shell. Notice how easy that is because we can specify wildcards. This may take some time to write out to S3, but be patient because it will work! Please note that you're going to have to swap in your own personal bucket name. My bucket is named b-datasets, so change that to whatever you've decided to call your bucket!
In Scala:
val x = sc.textFile("s3n://b-datasets/flight_data/*") // we can just specify all the files.
x.take(5) // to make sure we read it correctly
In Python:
x = sc.textFile("s3n://b-datasets/flight_data/*") # we can just specify all the files.
x.take(5) # to make sure we read it correctly
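For the write side, a minimal sketch in Python might look like the following. The helper name copy_text and the output folder flight_data_copy are hypothetical and only for illustration. It relies on the RDD method saveAsTextFile, which writes a folder of part files and fails if the destination already exists.

```python
def copy_text(sc, source_uri, dest_uri):
    """Read every matching text file and write the lines back out under dest_uri."""
    lines = sc.textFile(source_uri)
    lines.saveAsTextFile(dest_uri)
    return lines


# In the pyspark shell, where `sc` already exists, with a hypothetical output folder:
# copy_text(sc, "s3n://b-datasets/flight_data/*", "s3n://b-datasets/flight_data_copy")
```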
Once that's completed, you should see the results in your AWS console.

Conclusion for Part 1

Alright, this has been extensive! We've downloaded and prepped our data by putting it into Amazon S3. We then launched a cluster on Amazon EC2 using the simple provided scripts. After launching that cluster, we were able to ssh into our master node and start the shell. Finally, we read and wrote data from Amazon S3 using our credentials in a very easy and straightforward way. However, we haven't even begun modeling yet! Such is the work of a data scientist: you're going to have to do a lot of setup just to get started. Subscribe on the right hand side to receive notifications about the next part in this series!
