Setting Up a Virtual Machine with SparkR

It is fairly safe to say at this point that Spark is the heir apparent standard for advanced analytics applications on big data. As things stand, it is a relatively friendly environment for data scientists and analysts whose preferred languages are Scala, Python, or Java. In contrast, support for the R language, arguably the most commonly used open source language for advanced analytics, is in its comparative infancy in the Spark environment. However, active development, aimed at making the R language a “first class citizen” in Spark, is under way as part of the SparkR Project.

Setting up a system to do development as part of contributing to the SparkR project, or simply to experiment with SparkR, involves a bit of effort (there are a fair number of prerequisites for building SparkR). Rather than creating a SparkR-ready environment directly on your own computer, an easier approach is to start with a virtual machine that has many of the prerequisites already installed, and then customize it by installing the remaining software. In this post, I provide a recipe for doing so, using Cloudera’s CDH 5.2 Quick Start VM for Oracle’s VirtualBox virtualization software package. The Cloudera Quick Start VM comes with much of the needed software installed (Spark with SparkSQL, Hadoop, Maven, Git, and Java). It also comes with R, but not the most recent release as of early December 2014 (the VM was released shortly before R version 3.1.2, so it comes with R 3.1.1). I’ve written a shell script that automates most of the task of installing the most recent version of R; it also installs R’s rJava package (currently needed for SparkR) and the Rserve package (needed for the work of Alteryx’s SparkR development team). For good measure, the script also installs Scala (version 2.10.4).

In what follows, I first describe how to set up the Cloudera Quick Start VM under VirtualBox, then how to install and configure the remaining needed software, and finally how to configure Git and install SparkR. Hopefully this post will enable you to get up and running with SparkR quickly.

Getting Started with the Cloudera Quick Start VM

The first thing that needs to be done is to prepare the VM. If you are familiar with working with VirtualBox VMs, you can skip over much of this section. The following steps provide a recipe for getting the VM ready for the installation of the additional needed software:

Figure 1

Figure 2

Figure 3

Installing and Configuring the Additional Software

As indicated in the introduction, the installation of R and Scala is largely automated through a shell script. The script itself is fairly well commented, and you will probably want to look at it before running it in order to get a sense of what it is doing. While running the script to install the software involves a single user command, configuring everything afterward is a bit more involved, but fairly straightforward. Here are the steps:

- Open a terminal window (press the console icon on the upper task bar of the VM).
- We will use the wget utility to download the installation script via the terminal command line interface by entering the command

sudo sh

Figure 4

gedit .bash_profile
export SCALA_HOME=/usr/local/share/scala
PATH=$PATH:$SCALA_HOME/bin
export PATH

Figure 5

source .bash_profile
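After reloading the profile, you can confirm that the Scala bin directory has actually landed on your PATH. The following sketch reproduces the .bash_profile additions in an ordinary shell session; the SCALA_HOME path is the one used in this post, and the grep check is purely illustrative:

```shell
# Reproduce the .bash_profile additions and confirm the Scala bin
# directory ends up on the PATH (illustrative check, safe in any shell)
export SCALA_HOME=/usr/local/share/scala
PATH=$PATH:$SCALA_HOME/bin
export PATH
echo "$PATH" | grep -o "$SCALA_HOME/bin"
# prints: /usr/local/share/scala/bin
```

If the grep prints nothing, the profile edits did not take effect and the `scala` command will not be found.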

Configuring Git and Building SparkR

The SparkR Project is hosted on GitHub. If you want to contribute to SparkR, you will need a GitHub account. In addition, the project has both a dedicated JIRA and a developer’s mailing list on Google Groups. In this post I only cover project infrastructure as it relates to GitHub. In what follows, it is assumed that you have a GitHub account.

git config --global user.name "<handle or name>"
git config --global user.email "<email address>"
git config --global credential.helper cache
git config --global credential.helper 'cache --timeout=3600'
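The second credential.helper line sets the cache timeout to 3600 seconds (one hour), so you won't be re-prompted for credentials during a work session. If you want to see the effect of these commands without touching your real ~/.gitconfig, you can apply them against a throwaway HOME directory; the name and email below are placeholders, not values from the post:

```shell
# Sketch: apply the same Git configuration against a temporary HOME so
# it can be inspected safely. Name/email are placeholders.
export HOME=$(mktemp -d)
git config --global user.name "Jane Doe"
git config --global user.email "jane@example.com"
git config --global credential.helper 'cache --timeout=3600'
git config --global user.name
# prints: Jane Doe
```

Running `git config --global --list` at this point shows all four settings in one place.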

Figure 6

git clone https://github.com/<your GitHub user ID>/SparkR-pkg
cd SparkR-pkg
SPARK_HADOOP_VERSION=2.5.0-mr1-cdh5.2.0 ./install-dev.sh
sudo -E R CMD javareconf
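Note that SPARK_HADOOP_VERSION is given as an inline assignment, which makes it visible to the build script but leaves the parent shell untouched. A small illustrative sketch of that behavior (the version string is the one from this post):

```shell
# Sketch: a variable assigned inline before a command is exported to
# that command only, not to the parent shell
SPARK_HADOOP_VERSION=2.5.0-mr1-cdh5.2.0 sh -c 'echo "child sees: $SPARK_HADOOP_VERSION"'
echo "parent sees: '$SPARK_HADOOP_VERSION'"
# prints: child sees: 2.5.0-mr1-cdh5.2.0
# prints: parent sees: ''
```

This is also why `sudo -E` is used for `R CMD javareconf`: the `-E` flag tells sudo to preserve the invoking user’s environment (including any Java-related variables) rather than resetting it.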

Figure 7
