All Tutorials

On this page you'll find a list of all of the tutorials on this site.

A Simple Scala Spark Project Template and Guide

So far I haven't found a good project template for Apache Spark, and getting the setup right has taken repeated attempts. In this tutorial, I walk through a simple project template that I created to help others get started with Apache Spark in Scala.

Read Article

Analyzing Flight Data: A Gentle Introduction to GraphX in Spark

Graphs are a simple way of representing relationships in data and Apache Spark provides a simple way of creating and manipulating them. This tutorial will walk you through the basics of GraphX in Apache Spark using Scala. You'll analyze flight data from 2008 and run algorithms like PageRank to better understand all the flights that took place!

Read Article

Building Apache Spark on your Local Machine

This article will walk you through how to build Apache Spark for use on your local machine. After that, you'll be able to create Spark clusters or try out Spark on your local computer.

Read Article

Building Spark for your Cluster to Support Hive SQL and YARN

This article will walk you through how to build Apache Spark to support the Hive SQL execution engine as well as YARN. After that, it should be ready to run on your Hadoop cluster.

Read Article

Getting Started with Apache Spark DataFrames in Python and Scala

In this easy-to-follow tutorial, learn the basics of Spark DataFrames, how they're composed of RDDs, and what they allow you to do in Scala. They're a similar abstraction to pandas DataFrames or R's data frames.

Read Article

Getting Started with Apache Spark RDDs

This introductory tutorial will walk you through the basic RDD abstraction in Spark. It has code samples in both Scala and Python (PySpark). We'll answer the question: what is an RDD?

Read Article

Opening CSV Files in Apache Spark - The Spark Data Sources API and Spark-CSV

This guide will show you how to read CSV files in Apache Spark. We'll walk through how to use the Spark-CSV package in both Python and Scala.

Read Article

Reading and Writing S3 Data with Apache Spark

In this tutorial we're going to show you how to read from and write to Amazon S3.

Read Article

Setup Your Zeppelin Notebook For Data Science in Apache Spark

Notebooks are quickly becoming the go-to way of running and developing code in data science. They're not the only way, but they're certainly popular, and Zeppelin is an Apache Incubating Project. In this tutorial, we'll walk through how to get a Zeppelin notebook set up on your machine or cluster for data science development.

Read Article

Spark Broadcast Variables - What are they and how do I use them

In this short article, we'll go over what broadcast variables are, some of their uses, and how you can leverage them in your projects. We'll cover topics like the broadcast join, which keeps your cluster from having to do too much work!

Read Article

Spark Clusters on AWS EC2 - Reading and Writing S3 Data - Predicting Flight Delays with Spark Part 1

In this tutorial we're going to set up a complete predictive modeling pipeline in Spark using DataFrames, Pipelines, and MLlib. The first part of this tutorial will explain some of the basic concepts that we'll need to build this model, walk you through how to download the data we'll use, and lastly create our Spark cluster on AWS and read from and write to S3!

Read Article

Spark MLLib - Predict Store Sales with ML Pipelines

In this tutorial we're going to build a full-stack machine learning project, going all the way from data manipulation to feature creation and finally to serving predictions.

Read Article

Spark Will Not Start with Spark Error-java.lang.OutOfMemoryError PermGen space

This article will walk you through how to resolve the java.lang.OutOfMemoryError: PermGen space exception that can occur when you're trying to start Spark.

Read Article

Spark Will Not Start with Spark Error-java.net.BindException: Address already in use

This article will walk you through how to resolve the somewhat common java.net.BindException: Address already in use exception that can occur when you're trying to start Spark.

Read Article

The Simplest Explanation of and Approaches to Optimizing Spark Shuffles

This post will dive into some of the details of the Spark Shuffle and what it means for you while using Apache Spark to perform your data analysis in a cluster setting.

Read Article

Using SparkSQL UDFs to Create Date Times in Apache Spark

In this article we're going to create some date times using some new SQL functions in Spark.

Read Article