Building Spark for your Cluster to Support Hive SQL and YARN

Introduction

/If you're reading this and wondering how to get set up on your local computer, you might want to read this guide to getting Apache Spark set up on your local machine./ While the Spark documentation can definitely be helpful at times, it doesn't always include user-friendly guides for the simpler tasks. This guide will walk you through building Spark for your cluster, covering both Scala 2.11 and Scala 2.10 - the two builds are pretty similar. Setting up Spark to work with YARN can be a bit of a pest, especially if you already have a cluster up and running, so hopefully this makes things a bit more straightforward. At its core this is a guide for upgrading or installing Spark on your cluster. /If you're just arriving here and haven't played with Spark before, I recommend downloading one of the pre-built distributions - it'll save you a lot of time and headache!/

Requirements

You're going to need Scala installed on your machine to build and use Spark, and if you're building with Maven you'll need that installed as well; installation guides are available on the Scala and Apache Maven websites. You'll likely have to set your $JAVA_HOME variable and possibly others. Feel free to leave a comment if you're having trouble getting things installed or set up correctly.
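As a quick sanity check, you can confirm the whole toolchain is visible from your shell before kicking off a build. The java_home call below is OS X specific and just an example - point $JAVA_HOME at whichever JDK you actually have installed.
export JAVA_HOME=$(/usr/libexec/java_home)   # OS X; on Linux, point this at your JDK directory
java -version
scala -version
mvn -version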

Download

First, we've got to download the Spark project. While you can download prebuilt Spark packages for certain Hadoop distributions, I always like to start with the raw source package and build it out to meet my requirements. So head on over to the Spark Downloads Page and get the primary package. One thing that hangs me up a fair amount is that the link under step 4 is actually just a link to the list of mirrors - not the package itself. Click that link, pick a mirror, and download from there. Once you've downloaded it, go ahead and verify it with a checksum.
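For example, you can compute a digest of the tarball and compare it against the checksum file published alongside the release. The exact algorithm and file names depend on the release, so treat this as a sketch:
shasum spark-1.4.1.tgz          # SHA-1 digest
shasum -a 512 spark-1.4.1.tgz   # SHA-512, if that's what the release publishes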

Building

Once you've got that downloaded, we're going to have to build it. Some instructions are given here, but it's not always clear what the steps should be in your situation, so let's walk through exactly what I'm doing on OS X. It should be fairly similar on Linux as well.

Decompressing the file

Navigate to the directory containing the downloaded Spark tar file and run the following command:
tar -xvf spark-1.4.1.tgz
You may have to modify the version number to suit the version of Spark that you downloaded, but that should be pretty straightforward. Once that's complete you should have a directory in that same folder labelled spark-1.4.1 or something similar. Now the Spark documentation's build instructions are thorough, but they won't always give you everything you want out of the box. For example, if you build Spark with SBT you won't have support for PySpark, which can be annoying if that's how you're looking to code. Likewise, you won't get access to the Hive query engine unless you explicitly build with Hive support. Not always a huge deal, but especially on my local machine I'd like to have access to it all.

Building with Maven

Now if you're going to build Spark with Maven, you can build it to work with YARN and PySpark. Because this is a big project, the first thing you should do is set some memory options for Maven in your shell. Here's what's recommended:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
Once you've done that, let's get building. Before you start, check which Scala version you have:
scala -version
If you've got 2.11, skip ahead to the Building for Scala 2.11 section; if not, just keep going!

Building for Scala 2.10

One command should work for you right away! This is going to give you support for YARN, Hive, and the Hive Thrift server, which is probably best for your local environment.
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
Now this will take a while so it might be worth it to go grab some coffee.
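When the build finishes, the assembly jar should show up under the assembly module. The exact path and jar name depend on the Spark, Scala, and Hadoop versions you built against, so the listing below is just an example of what to look for:
ls assembly/target/scala-2.10/
# expect something like spark-assembly-1.4.1-hadoop2.4.0.jar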

Building for Scala 2.11

If you're going to build for Scala 2.11, you'll have to run one extra script before the Maven build.
./dev/change-version-to-2.11.sh
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests clean package
I've had some trouble with the Thrift server on 2.11, and you may as well. Removing the -Phive-thriftserver option will let Spark build correctly, but you won't be able to use the Thrift server. If that's a requirement, I recommend building with Scala 2.10.
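In other words, if the 2.11 build fails for you at the Thrift server module, it's worth trying the same build with that one profile dropped:
./dev/change-version-to-2.11.sh
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.11 -Phive -DskipTests clean package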

Building with SBT

Building with SBT (the de facto standard Scala build tool) is supported as well. You can pass in the same profiles and properties that you would to the Maven build, since they're derived from the same base. The "get everything" build can be found below. Remember, however, that this will not include PySpark; you'll need to build with Maven if you want to use that.
build/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests clean package

Making sure the Upgrade/Install Worked

Now you should be all set, but before you go ahead and update all the environment variables that point at the Spark version you're working with, you should test the new build to make sure it's working - issues sometimes come up that you haven't seen before. I recommend putting this build in the same directory as your other Spark versions and testing it out with some of the code you had written before. You probably want to start in local mode just to see if you can get some code running - create some RDDs and DataFrames and that sort of thing.
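Here's a minimal smoke test along those lines, run from the newly built directory (the master URL and the tiny DataFrame are just examples):
./bin/spark-shell --master local[2]
Then, inside the shell:
// create a small RDD and DataFrame to confirm the build works end to end
val rdd = sc.parallelize(1 to 100)
rdd.count()                                   // should return 100
import sqlContext.implicits._
val df = Seq((1, "a"), (2, "b")).toDF("id", "letter")
df.show()
If you built with the Hive profiles, the sqlContext that spark-shell gives you should be a HiveContext, so you can poke at Hive-backed tables from here as well.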

Questions or comments?
