Set Up Your Zeppelin Notebook For Data Science in Apache Spark

What are Zeppelin Notebooks?

Zeppelin notebooks are similar to the Jupyter (formerly known as IPython) Notebook that has been extremely popular in the Python community. However, Zeppelin lacks a lot of the maturity that you'll find in the IPython community or in the Databricks notebook. If you're here to learn how to use Apache Spark, I'd focus on using the Databricks notebook, as it's much simpler to set up and get started with.

How do they compare to IPython/Jupyter Notebooks?

As someone who is a huge fan of Jupyter notebooks, this was my initial question: how do they compare to Jupyter Notebooks? Frankly, they're pretty similar. Both have a similar coding style, can embed images, and can run different programming languages. However, there are some weaknesses with the Jupyter Notebook. Taking a look at iScala, the Scala engine for Jupyter notebooks, it's unfortunate to see that the project seems abandoned: documentation is out of date, and issues aren't touched, accepted, or acknowledged. This makes it a risk if you're running anything besides Python.

Advantages of Zeppelin Notebooks

First, it's easier to mix languages in the same notebook. You can do some SQL, then Scala, then Markdown to document it all together. You can also easily convert your notebook into a presentation style - maybe for presenting to management or for use in dashboards.
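As a sketch of what that mixing looks like: a single Zeppelin note is made of paragraphs, and a % prefix on the first line selects the interpreter for that paragraph (no prefix means the default Scala interpreter). The `bank` table name below is hypothetical - you'd need to register it from Scala or load it beforehand.

```
%md
## Exploring the bank table
Some Markdown documentation alongside the analysis.

%sql
SELECT age, COUNT(1) AS people FROM bank GROUP BY age ORDER BY age

// a paragraph with no % prefix runs in the default Scala interpreter
println(sc.version)
```

Each of those would be a separate paragraph in the note, and the SQL one can be rendered as a chart with a click.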

Advantages of Jupyter Notebooks

Jupyter notebooks are a lot more mature in their abilities and utility. The shortcuts work extremely well, and it's easy to keep your hands on the keyboard. GitHub also allows you to embed them in repositories, which makes them easy to share statically. Frankly, the notebook technology is just better. However, it doesn't really have a strong Scala story: the iScala notebook environment does exist, but it doesn't seem to be actively maintained. At the end of the day, they're tools, and each is going to have some disadvantages as well as advantages. I'm a fan of using different tools to perform different tasks, but I'd love to hear what you think of the Zeppelin Notebook vs. Jupyter Notebook comparison. Feel free to post a comment!

Building Zeppelin

Now, to use Zeppelin, you're going to have to build it yourself. This is a bit of an arduous process, and if you want to use a specialized version of Spark, it could take a lot of work on your part. It actually isn't too different from building Apache Spark (feel free to read my tutorials on that very subject - you can build Spark for a cluster). There are a couple of things you might run into that I've enumerated below.

Clone the Github Repository

First you're going to need to clone the GitHub repository.
git clone git@github.com:apache/incubator-zeppelin.git
If you don't have SSH keys set up with GitHub, you can clone over HTTPS instead:
git clone https://github.com/apache/incubator-zeppelin.git

Set Environment Variables

Once that downloads, go ahead and cd into the directory.
cd incubator-zeppelin/
Now you're going to have to set some environment variables so that Maven doesn't throw a permgen error.
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"
You're also going to need to set your JAVA_HOME variable. On a Mac, it's as easy as running the following, though this assumes you've got Java version 1.7 installed.
export JAVA_HOME=$(/usr/libexec/java_home -v 1.7)
You can run both those commands right in the terminal and don't need to add them to your bash profile unless you want to. Once that's done, you're ready to build!
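Taken together, the environment setup can be sketched as a small shell snippet. The Linux fallback path for JAVA_HOME is my own hypothetical addition (the java_home helper is Mac-only) - adjust it to wherever your JDK actually lives:

```shell
# Maven memory settings so the Zeppelin build doesn't hit a permgen error
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"

# JAVA_HOME: /usr/libexec/java_home exists only on Mac; on Linux, point
# JAVA_HOME at your JDK install directory instead (example path is hypothetical)
if [ -x /usr/libexec/java_home ]; then
  export JAVA_HOME=$(/usr/libexec/java_home -v 1.7)
else
  export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java-7-openjdk}"
fi

echo "MAVEN_OPTS=$MAVEN_OPTS"
echo "JAVA_HOME=$JAVA_HOME"
```

Running this in your current shell session is enough for the build; it only needs to go in your bash profile if you want it to persist.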

Build for Local Testing

Building for local testing is a bit simpler because you don't need to specify the Hadoop version or anything like that. Just go ahead and run:
mvn clean install -DskipTests
That should do it!

Build for Your Cluster

Building for your cluster can be a bit more complicated, just as it is with building Spark. Be sure to set the Spark and Hadoop versions correctly!
mvn install -DskipTests -Dspark.version=1.1.0 -Dhadoop.version=2.2.0

Building for Custom Built Spark

If you're looking to build it for custom built Spark - you probably know what you're doing already! Read the documentation :).

Using Zeppelin Notebooks

Now that it's all built, let's get started on how to use it!

Starting and Stopping the Notebook

Starting and stopping the notebook daemon is super simple.
bin/zeppelin-daemon.sh start
After running that, just navigate to localhost:8080 and you should see your notebook. When you're done, you can stop the daemon just as easily:
bin/zeppelin-daemon.sh stop
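Once the notebook loads, create a new note and try a first paragraph to confirm Spark is wired up. This is a sketch of a notebook paragraph, not a standalone program - it assumes the SparkContext is available as sc, which Zeppelin provides in its Scala paragraphs by default:

```scala
// A quick sanity check inside a Zeppelin paragraph:
// count the even numbers from 1 to 1000
val nums = sc.parallelize(1 to 1000)
println(nums.filter(_ % 2 == 0).count())  // prints 500
```

If that paragraph runs and prints a result, your build and the Spark interpreter are working.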

Conclusion and Next Steps

Now we've got our Zeppelin Notebook all set up. Go ahead and start one up and start executing code. Personally, I don't like them as much as I like working with Databricks notebooks or the plain REPL environment. I've liked the IPython notebook in the past, especially when working in Python, but I personally haven't liked the Zeppelin notebook very much. It's not user friendly, doesn't have a lot of the keyboard shortcuts, and is still finicky. Give it a try, see if you like it, and let me know what you think in the comments. While I've kept an eye on the project, I haven't found it to be something that I'll follow as closely as I have (and do) the Jupyter Notebook - the technology just isn't there to integrate into my workflow quite yet.

Questions or comments?
