Reading and Writing S3 Data with Apache Spark

Overview

In our previous article, we worked with some flight data. I uploaded it manually to S3 into a folder called flight_data; now we can read and write directly to S3. While you certainly don't need to, I set up my Spark cluster using the tool provided with Spark. When you create a Spark cluster on AWS EC2, you should use the --copy-aws-credentials flag to save yourself some headache when you want to read your input from, or write your output to, S3. However, sometimes that may not work, or you may not be able to use it, so let's go over some other ways of supplying your credentials. On your master node, you can just export the following variables in your standard bash shell. You don't need to be running on a cluster or even on EC2 to do this; the machine just needs the correct information (i.e. your access key and secret key).
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
You can also specify them inline in your applications when you read in the data, although remember that this is a potential security issue.
// Scala
val data = sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@/path/")
# Python
data = sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@/path/")
Once your credentials are in place, we can read from and write to Amazon S3.
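If the exported variables don't get picked up for some reason, one possible fallback is to hand them to Hadoop's s3n filesystem yourself. Here is a minimal sketch in Python of how that might look. The helper names s3n_hadoop_conf and apply_s3n_conf are made up for illustration, and the second one leans on sc._jsc, a private PySpark attribute, so treat this as an assumption to verify on your Spark version rather than a recommended API. The property names fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey are the ones the s3n connector reads.

```python
import os


def s3n_hadoop_conf(env=None):
    """Map the AWS_* environment variables to Hadoop's s3n property names.

    `env` defaults to os.environ; passing a plain dict makes this easy to try out.
    """
    env = os.environ if env is None else env
    try:
        access_key = env["AWS_ACCESS_KEY_ID"]
        secret_key = env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        raise RuntimeError("missing environment variable: %s" % missing.args[0])
    return {
        "fs.s3n.awsAccessKeyId": access_key,
        "fs.s3n.awsSecretAccessKey": secret_key,
    }


def apply_s3n_conf(spark_context, conf):
    """Copy each property onto the SparkContext's Hadoop configuration."""
    # Assumption: _jsc is PySpark's (private) handle to the JavaSparkContext.
    hadoop_conf = spark_context._jsc.hadoopConfiguration()
    for key, value in conf.items():
        hadoop_conf.set(key, value)
```

In the pyspark shell, calling apply_s3n_conf(sc, s3n_hadoop_conf()) before the first textFile call should let you use plain s3n:// paths without putting the keys in the URL.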

Demonstration of Reading and Writing to S3

This part assumes that you've been following this tutorial so far and have uploaded the flight data as described at the top. If so, you can just run a couple of commands in your spark / pyspark shell. Reading is easy because the path accepts wildcards, so a single call picks up every file in the folder. Writing out to S3 may take some time, but be patient because it will work! Please note that you're going to have to swap in your own bucket name. My bucket is named b-datasets, so change that to whatever you've decided to call your bucket!
// Scala
val x = sc.textFile("s3n://b-datasets/flight_data/*") // the wildcard selects all the files
x.take(5) // to make sure we read it correctly
x.saveAsTextFile("s3n://b-datasets/flight_data2/")
# Python
x = sc.textFile("s3n://b-datasets/flight_data/*") # the wildcard selects all the files
x.take(5) # to make sure we read it correctly
x.saveAsTextFile("s3n://b-datasets/flight_data2/")
Once that's completed, you should see the below in your AWS console.

[Image: S3 Flight Folders]
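If you'd rather check the result from code than from the console, here is a rough sketch of what to look for. saveAsTextFile writes a folder of part-NNNNN files rather than a single file, and with Hadoop's default output committer settings it also drops an empty _SUCCESS marker when the job finishes. The function name summarize_output is hypothetical, and it works on a plain list of key names that you would fetch yourself with whatever S3 listing tool you prefer.

```python
def summarize_output(keys, prefix):
    """Report whether a saveAsTextFile output folder looks complete.

    keys:   iterable of S3 key names (strings) from any bucket listing
    prefix: the output folder, e.g. "flight_data2/"
    """
    folder = prefix.rstrip("/") + "/"
    key_set = set(keys)
    # Assumption: the default committer wrote a _SUCCESS marker on completion.
    complete = (folder + "_SUCCESS") in key_set
    # One part file is written per partition of the RDD.
    part_files = sorted(k for k in key_set if k.startswith(folder + "part-"))
    return {"complete": complete, "part_files": part_files}
```

A missing _SUCCESS marker usually means the write is still running or failed partway, so it's worth waiting or rerunning before reading flight_data2 back in.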
