Struggling to conquer Apache Spark?

Learning is hard enough as it is but when you bring in distributed computing frameworks in sophisticated programming languages - things don't get any easier. While self-study can certainly help, without a good guide, things are always more difficult than they should be. That's why I created Spark Tutorials, to make it easier to learn and use Apache Spark. is here to provide simple, easy to follow tutorials to help you get up and running quickly. You'll learn the foundational abstractions in Apache Spark from RDDs to DataFrames and MLLib. Start off with some of the articles below.

Building Spark for your Cluster to Support Hive SQL and YARN

This article will walk you through how to build Apache Spark to support the HIVE SQL execution engine as well as YARN. After that it should be ready to get up and running on your hadoop cluster.

Visit Article »

Spark Clusters on AWS EC2 - Reading and Writing S3 Data - Predicting Flight Delays with Spark Part 1

In this tutorial we're gong to set up a complete predictive modeling pipeline in Spark using DataFrames, Pipelines and MLlib. The first part of this tutorial will explain some of the basic concepts that we're going to need to build this model, walk you through how to download the data we'll use, and lastly create our Spark Cluster on Amazon AWS and read and write from AWS S3!

Visit Article »

Spark Broadcast Variables - What are they and how do I use them

In this short article, we'll go over what Broadcast variables are, some of their uses, and how you should try and leverage them in your projects. We'll be covering topics like the broadcast join to keep your cluster from having to do too much work!

Visit Article »