<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
      <title>Recent Content on Adventures in Advanced Analytics</title>
      <generator uri="https://hugo.spf13.com">Hugo</generator>
    <link>http://adventures.putler.org/</link>
    <language>en-us</language>
    <author>Dan Putler</author>
    <copyright>This work is licensed under a Creative Commons Attribution 4.0 International License</copyright>
    <updated>Sun, 25 Jan 2015 15:05:21 PST</updated>
    
    <item>
      <title>Viva la Revolution Analytics</title>
      <link>http://adventures.putler.org/blog/2015/01/25/Viva-la-Revolution-Analytics/</link>
      <pubDate>Sun, 25 Jan 2015 15:05:21 PST</pubDate>
      <author>Dan Putler</author>
      <guid>http://adventures.putler.org/blog/2015/01/25/Viva-la-Revolution-Analytics/</guid>
      <description>

&lt;p&gt;Last Friday was a very busy day for several of us at Alteryx in the wake of the &lt;a href=&#34;http://blogs.microsoft.com/blog/2015/01/23/microsoft-acquire-revolution-analytics-help-customers-find-big-data-value-advanced-statistical-analysis/&#34;&gt;announcement&lt;/a&gt; that Microsoft had agreed to acquire Revolution Analytics. In this post I won&amp;rsquo;t go into the Alteryx angle of this story, other than to say we think this is a net positive. Instead, I want to offer a few words of appreciation for what Revolution Analytics has done for R-based technology, and for its non-technology contributions to the R community, since its &lt;a href=&#34;http://en.wikipedia.org/wiki/Revolution_Analytics&#34;&gt;creation (as REvolution Computing) in 2007&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&#34;toc_0&#34;&gt;Contributions to R-Based Technology&lt;/h2&gt;

&lt;p&gt;Revolution Analytics has long been at the forefront of efforts to scale R for applications involving large amounts of data. They have approached this problem using both coarse-grained parallel and streaming computing approaches. At present, considerably more effort is going into coarse-grained parallel computing (with Hadoop being the most well publicized of these efforts), but streaming approaches can be very effective in scaling predictive analytics on more limited hardware resources. The most impressive methods for doing this that I have seen are the streaming linear model and generalized linear model methods in Revolution Analytics&amp;rsquo; proprietary RevoScaleR package (which also makes use of Intel&amp;rsquo;s multi-threaded linear algebra libraries, and comes with their Revolution R Enterprise product). We have found that with moderate data volumes they are faster than the comparable open source R functions, and they can easily scale to millions of records on a common business laptop configuration (e.g., 8 GB of memory and a modern multicore CPU). That same configuration can estimate the same type of model with at most 100,000 to 200,000 records and fewer than 10 predictors using open source R&amp;rsquo;s lm or glm functions. What Lee Edlefsen and the engineering team at Revolution Analytics have done in this area represents the state of the art (they clearly outshine comparable methods from SAS and IBM/SPSS), and will likely remain an important point of comparison for others developing streaming algorithms for a long time to come.&lt;/p&gt;

&lt;p&gt;While they have kept their streaming methods proprietary, they have given back to the R community much of the technology they have developed in the area of coarse-grained parallel computing in R. Chief among these are the &lt;a href=&#34;http://cran.r-project.org/web/packages/foreach/index.html&#34;&gt;foreach&lt;/a&gt; and &lt;a href=&#34;http://cran.r-project.org/web/packages/iterators/index.html&#34;&gt;iterators&lt;/a&gt; packages. In academia, one thing professors are judged on in tenure and promotion decisions is how many other published articles cite their articles (a number of broad, discipline-oriented citation indexes, such as the &lt;a href=&#34;http://thomsonreuters.com/social-sciences-citation-index/&#34;&gt;Social Science Citation Index&lt;/a&gt;, provide this information, and a lot of attention is now being paid to &lt;a href=&#34;http://scholar.google.com&#34;&gt;Google Scholar&lt;/a&gt; citations, which are often more interdisciplinary in nature). The &lt;a href=&#34;http://www.r-project.org&#34;&gt;R Project&lt;/a&gt; originated in academia, and still has a very academic feel to it. As a result, the package archive for the project (CRAN, or the Comprehensive R Archive Network) provides something very similar to a citation index. Specifically, for every CRAN package there is an indication of how other CRAN packages make use of it. There are three levels of this: a &amp;ldquo;reverse dependency&amp;rdquo; (the package is absolutely necessary to install another package); a &amp;ldquo;reverse imports&amp;rdquo; (the package is a critical component of another package, but that package can be installed without it); and a &amp;ldquo;reverse suggests&amp;rdquo; (the package provides additional, less central, functionality to another package).
The foreach package has (as I write this) a reverse status on the part of 111 other CRAN packages, while the iterators package has a reverse status on the part of 34. Only three of these represent &amp;ldquo;vanity reverses&amp;rdquo; (i.e., a package that makes use of another package written by the same author), and the vast majority are of the more important &amp;ldquo;depends&amp;rdquo; or &amp;ldquo;imports&amp;rdquo; variety. In both cases this is an extraordinarily high number of reverse status packages (the count for foreach is extreme). Put another way, if there were a University of R, Revolution Analytics would hold the rank of Full Professor.&lt;/p&gt;

&lt;p&gt;More recently, Revolution Analytics has made an effort to address issues that come up with R packages. The first of these efforts is the &lt;a href=&#34;http://cran.r-project.org/web/packages/miniCRAN/index.html&#34;&gt;miniCRAN&lt;/a&gt; package, which allows an organization with strict firewall rules to create an internal, selective archive of R packages that members of the organization can access. The second is the &lt;a href=&#34;http://cran.r-project.org/web/packages/checkpoint/index.html&#34;&gt;checkpoint&lt;/a&gt; package, which is closely linked with Revolution Analytics&amp;rsquo; &amp;ldquo;Managed R Archive Network&amp;rdquo;, or MRAN. Together, checkpoint and MRAN address a common problem in reproducing R-based research results: changes in contributed R packages. R consists of three components: a small set of &amp;ldquo;base&amp;rdquo; packages that provide basic functionality; a still small, but somewhat larger, set of &amp;ldquo;recommended&amp;rdquo; packages that provide additional core functionality; and a huge set (nearly 5000 as of this writing) of contributed packages. R&amp;rsquo;s base and recommended packages are shipped with R&amp;rsquo;s installer from CRAN, and are very stable. The same cannot be said of all of R&amp;rsquo;s contributed packages. We at Alteryx have never had issues migrating to the base or recommended packages of a new version of R, but we have experienced a few hiccups migrating to new versions of contributed packages that we use and bundle with our Predictive Plug-in (yes, regression testing is a useful thing). It turns out we are not alone, and in some cases (particularly in clinical trial settings for new drugs or medical devices) such changes can make research results difficult to reproduce.
The problems can be due to changes in the API of a package (which can cause R analysis scripts to break) or changes in the underlying methods used by a package (which can change the nature of the results in marginally significant cases). The goal of the checkpoint package / MRAN combination is to allow researchers to &amp;ldquo;freeze&amp;rdquo; on a particular vintage of R packages, so that past research results can be replicated in a setting that takes changes in the underlying packages out of the picture. I view this as a very selfless move on Revolution Analytics&amp;rsquo; part, since it is a technology that is likely to be extremely useful to portions of the R community, takes real resources to implement, and seems difficult for them to monetize.&lt;/p&gt;

&lt;h2 id=&#34;toc_1&#34;&gt;Non-Technology Contributions to the R Community&lt;/h2&gt;

&lt;p&gt;Revolution Analytics has consistently given back to the R community on a non-technological basis in three ways. First, it has been a primary sponsor of the annual international R user group conference (UseR!) since 2008, longer than any other software vendor; only the book publishers CRC Press and Springer have sponsored the conference for more years. Since Revolution Analytics was only founded in 2007, the length of time it has been a primary sponsor of UseR! is remarkable.&lt;/p&gt;

&lt;p&gt;The second way Revolution Analytics has given back to the R community in a non-technical way is by helping to sponsor local R user groups through their &lt;a href=&#34;http://www.revolutionanalytics.com/news-events/revolution-analytics-launches-sponsorship-program-r-user-groups&#34;&gt;R User Group Sponsorship Program&lt;/a&gt;. I am a member of the Bay Area R Users Group, which Revolution Analytics sponsors, and Joe Rickert of Revolution Analytics acts as its primary organizer. Revolution Analytics provided &lt;a href=&#34;http://blog.revolutionanalytics.com/2014/12/revolution-analytics-2015-r-user-group-support-program-is-underway.html&#34;&gt;financial support to 51 local R user groups in 2014&lt;/a&gt; (all local R user groups are eligible for sponsorship, but not all apply). In addition, it supports all 150 local R user groups via the &lt;a href=&#34;http://blog.revolutionanalytics.com/local-r-groups.html&#34;&gt;Local R User Group Directory&lt;/a&gt;, the &lt;a href=&#34;http://blog.revolutionanalytics.com/calendar.html&#34;&gt;R Community Calendar&lt;/a&gt;, and the @inside_r Twitter channel.&lt;/p&gt;

&lt;p&gt;The third way they give back to the community is the &lt;a href=&#34;http://blog.revolutionanalytics.com/&#34;&gt;Revolutions blog&lt;/a&gt;, one of the longest-running blogs covering topics relevant to the R community. Most company blogs serve specific, very narrow marketing or product education purposes. This is not the case with the Revolutions blog, which strives to cover all topics relevant to the R community, even new R-based technologies that represent, at least to my mind, a potential competitive threat to them.&lt;/p&gt;

&lt;h2 id=&#34;toc_2&#34;&gt;Going Forward&lt;/h2&gt;

&lt;p&gt;What exactly the longer-term future holds for Revolution Analytics as they become part of the Microsoft family is unknown at this point. However, I believe the assessment of &lt;a href=&#34;http://blog.revolutionanalytics.com/2015/01/revolution-acquired.html&#34;&gt;David Smith&lt;/a&gt; (Revolution Analytics&amp;rsquo; Chief Community Officer) that&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For our users and customers, nothing much will change with the acquisition. We’ll continue to support and develop the Revolution R family of products — including non-Windows platforms like Mac and Linux. The free Revolution R Open project will continue to enhance open source R. We’ll continue to offer expert technical support for R with Revolution R Plus subscriptions from the same team of R experts. We’ll continue to advance the big data and enterprise integration capabilities of Revolution R Enterprise. And we’ll continue to offer expert technical training and consulting services.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;is correct. Moreover, the financial backing of Microsoft will likely provide a strong tailwind to help several of the initiatives that Revolution Analytics started move forward more rapidly.&lt;/p&gt;

&lt;p&gt;As part of Alteryx&amp;rsquo;s partnership with them, I&amp;rsquo;ve had the opportunity and pleasure to interact with many people at Revolution Analytics, and I wish them the best of luck in the next part of their journey.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Setting Up a Virtual Machine with SparkR</title>
      <link>http://adventures.putler.org/blog/2014/12/08/Setting-Up-a-Virtual-Machine-with-SparkR/</link>
      <pubDate>Mon, 08 Dec 2014 21:09:05 -0700</pubDate>
      <author>Dan Putler</author>
      <guid>http://adventures.putler.org/blog/2014/12/08/Setting-Up-a-Virtual-Machine-with-SparkR/</guid>
      <description>

&lt;p&gt;It is fairly safe to say at this point that Spark is the heir apparent standard for advanced analytics applications on big data. As things stand, it is a relatively friendly environment for data scientists and analysts whose preferred languages are Scala, Python, or Java. In contrast, support for the R language, arguably the most commonly used open source language for advanced analytics, is in its comparative infancy in the Spark environment. However, active development aimed at making R a &amp;ldquo;first class citizen&amp;rdquo; in Spark is under way as part of the &lt;a href=&#34;http://amplab-extras.github.io/SparkR-pkg/&#34; title=&#34;SparkR Project page&#34;&gt;SparkR Project&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Setting up a system to do development as part of contributing to the SparkR project, or to easily experiment with SparkR, takes a bit of effort (there are a fair number of prerequisites for building SparkR). Rather than creating a SparkR-ready environment directly on your own computer, an easier approach is to start with a virtual machine that has many of the prerequisites already installed, and then customize it by installing the remaining software. In this post, I provide a recipe for doing so, using &lt;a href=&#34;http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-2-x.html&#34; title=&#34;Cloudera&#39;s CDH 5.2 Quick Start VM page&#34;&gt;Cloudera&amp;rsquo;s CDH 5.2 Quick Start VM&lt;/a&gt; for Oracle&amp;rsquo;s &lt;a href=&#34;https://www.virtualbox.org/&#34; title=&#34;The VirtualBox home page&#34;&gt;VirtualBox&lt;/a&gt; virtualization software package. The Cloudera Quick Start VM comes with much of the needed software installed (Spark with SparkSQL, Hadoop, Maven, Git, and Java). It also comes with R, but not the most recent release as of early December 2014 (the VM was released shortly before R version 3.1.2, so it comes with R 3.1.1). I&amp;rsquo;ve written a &lt;a href=&#34;http://adventures.putler.org/SparkR_prep-0.1.sh&#34; title=&#34;Link to the installer script&#34;&gt;shell script&lt;/a&gt; to automate most of the task of installing the most recent version of R, which also installs R&amp;rsquo;s rJava package (currently needed for SparkR) and the Rserve package (needed for Alteryx&amp;rsquo;s SparkR development team&amp;rsquo;s work). The script also installs Scala (version 2.10.4) for its option value.&lt;/p&gt;

&lt;p&gt;In what follows, I first describe how to set up the Cloudera Quick Start VM in VirtualBox, then how to install and configure the remaining needed software, and finally how to configure Git and build SparkR. Hopefully this post will enable you to get up and running with SparkR quickly.&lt;/p&gt;

&lt;h2 id=&#34;toc_0&#34;&gt;Getting Started with the Cloudera Quick Start VM&lt;/h2&gt;

&lt;p&gt;The first thing that needs to be done is to prepare the VM. If you are familiar with working with VirtualBox VMs, you can skip over much of this section. The following steps provide a recipe for getting the VM ready for the installation of the additional needed software:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-2-x.html&#34; title=&#34;Cloudera&#39;s CDH 5.2.0 Quick Start download page&#34;&gt;Download the Quick Start VM from Cloudera&lt;/a&gt;. The VM uses CentOS 6.2, and appears to have been created with VirtualBox 4.3.10, an older version (it was released back in March, while the current version of VirtualBox is 4.3.20); this difference in versions has some implications for preparing the VM. The password for the VM is &amp;ldquo;&lt;em&gt;cloudera&lt;/em&gt;&amp;rdquo; and can be used for superuser authentication as well.&lt;/li&gt;
&lt;li&gt;The Quick Start VM comes as a 7-zip archive, and you will need to unzip the archive to proceed. To do this, you will need to have 7-zip available. 7-zip is open source, and versions can be &lt;a href=&#34;http://www.7-zip.org/download.html&#34; title=&#34;The 7-zip download page&#34;&gt;downloaded&lt;/a&gt; for all major operating systems. The archive consists of a folder (cloudera-quickstart-vm-5.2.0-0-virtualbox) that contains two files.&lt;/li&gt;
&lt;li&gt;If you do not already have VirtualBox, and your operating system is &lt;em&gt;not&lt;/em&gt; Windows 7, &lt;a href=&#34;https://www.virtualbox.org/wiki/Downloads&#34; title=&#34;The VirtualBox download page&#34;&gt;download&lt;/a&gt; and install it. There are binary installers/packages available for Windows, OS X, Solaris, and all major Linux distributions. If you are on Windows 7, the most recent version of VirtualBox may well work (it does on my Win 7 machine), but we (and many other VirtualBox users) have encountered Win 7 machines where recent versions of VirtualBox cannot load a VM. The VirtualBox site does maintain an &lt;a href=&#34;https://www.virtualbox.org/wiki/Download_Old_Builds_4_3&#34; title=&#34;Older Version 4.3 series builds of VirtualBox&#34;&gt;archive of older VirtualBox releases&lt;/a&gt;. Locally we have found that prior versions as recent as the 4.3.12 release do not exhibit the problem.&lt;/li&gt;
&lt;li&gt;Load the Cloudera VM into VirtualBox by using the VirtualBox pull-down menu option &lt;strong&gt;File &amp;gt; Import Appliance&amp;hellip;&lt;/strong&gt; and then navigate to the cloudera-quickstart-vm-5.2.0-0-virtualbox folder, where you will be able to select the file cloudera-quickstart-vm-5.2.0-0-virtualbox.ovf (the only visible file) and begin importing the VM.&lt;/li&gt;
&lt;li&gt;Once the VM has been imported, highlight the &amp;ldquo;cloudera-quickstart-vm-5.2.0-0-virtualbox&amp;rdquo; VM and press the green start button to launch it. This initial launch of the VM is done just to make sure all is well with the initial setup. If you are on Win 7, installed VirtualBox 4.3.20, and see the error message in Figure 1, you will need to install an older version of VirtualBox.&lt;/li&gt;
&lt;/ul&gt;

&lt;figure &gt;
    
        &lt;img src=&#34;/img/shot_1_1.png&#34;  /&gt;
    
    
    &lt;figcaption&gt;
        &lt;h4&gt;Figure 1&lt;/h4&gt;
        
    &lt;/figcaption&gt;
    
&lt;/figure&gt;

&lt;ul&gt;
&lt;li&gt;When the VM has loaded, you will want to sync the VM&amp;rsquo;s &amp;ldquo;Guest Additions&amp;rdquo; to the version of VirtualBox you are using. To do this, use VirtualBox&amp;rsquo;s pull-down menu option &lt;strong&gt;Devices &amp;gt; Insert Guest Additions CD image&amp;hellip;&lt;/strong&gt;, which is illustrated in Figure 2. What you are likely thinking at the moment is &amp;ldquo;I don&amp;rsquo;t have a Guest Additions CD!&amp;rdquo; Don&amp;rsquo;t worry, you have a virtual copy of it. When you select this option, a dialog box will pop up indicating that you have &amp;ldquo;just inserted a medium&amp;hellip;&amp;rdquo;. Press the &lt;strong&gt;OK&lt;/strong&gt; button, without changing any of the options, which will result in another dialog box appearing. Press the &lt;strong&gt;OK&lt;/strong&gt; button in the second dialog box, which will bring up a third (and final) dialog box, into which you want to enter &amp;ldquo;cloudera&amp;rdquo; as the root password; then press the &lt;strong&gt;Authenticate&lt;/strong&gt; button. A terminal window will appear to start the update process. When the process is complete, click on the terminal window to make it active and press the &lt;strong&gt;Enter&lt;/strong&gt; key on your keyboard to close the window.&lt;/li&gt;
&lt;/ul&gt;

&lt;figure &gt;
    
        &lt;img src=&#34;/img/shot_1_2.png&#34;  /&gt;
    
    
    &lt;figcaption&gt;
        &lt;h4&gt;Figure 2&lt;/h4&gt;
        
    &lt;/figcaption&gt;
    
&lt;/figure&gt;

&lt;ul&gt;
&lt;li&gt;One important benefit of syncing the VM&amp;rsquo;s Guest Additions to the version of VirtualBox you are using is that you can take advantage of a shared, bidirectional clipboard that enables you to copy and paste between your machine and the VM. One immediate benefit is that you will be able to copy commands from these instructions and paste them into a terminal window on the VM. To enable the bidirectional clipboard, use VirtualBox&amp;rsquo;s pull-down menu option &lt;strong&gt;Devices &amp;gt; Shared Clipboard &amp;gt; Bidirectional&lt;/strong&gt;. You can also enable bidirectional drag and drop via the Drag&amp;rsquo;n&amp;rsquo;Drop options of the &lt;strong&gt;Devices&lt;/strong&gt; menu.&lt;/li&gt;
&lt;li&gt;At this point you will want to shutdown the VM using VirtualBox&amp;rsquo;s pull-down menu option &lt;strong&gt;Machine &amp;gt; ACPI Shutdown&lt;/strong&gt; (illustrated in Figure 3).&lt;/li&gt;
&lt;/ul&gt;

&lt;figure &gt;
    
        &lt;img src=&#34;/img/shot_1_3.png&#34;  /&gt;
    
    
    &lt;figcaption&gt;
        &lt;h4&gt;Figure 3&lt;/h4&gt;
        
    &lt;/figcaption&gt;
    
&lt;/figure&gt;

&lt;ul&gt;
&lt;li&gt;After the VM has shut down, you need to alter the VM in VirtualBox, both to give it the resources that will likely be needed for SparkR development, and to enable the VM to directly connect to the Internet. One potential issue here is that we are increasing the VM&amp;rsquo;s RAM from 4 GB to 6 GB. As a result, your computer should have at least 8 GB of memory, and preferably more, since VirtualBox will immediately grab 6 GB of memory from your computer when the VM is launched. The steps needed to prepare the VM are:

&lt;ul&gt;
&lt;li&gt;Click on the cloudera-quickstart-vm-5.2.0-0-virtualbox icon to highlight the VM.&lt;/li&gt;
&lt;li&gt;Under &lt;strong&gt;System &amp;gt; Motherboard&lt;/strong&gt; increase the base memory to 6144 MB.&lt;/li&gt;
&lt;li&gt;Under &lt;strong&gt;System &amp;gt; Processor&lt;/strong&gt; increase the processor(s) from 1 to 2 CPUs.&lt;/li&gt;
&lt;li&gt;If you will want to access this VM from other machines within your local area network, then under &lt;strong&gt;Network &amp;gt; Adapter 1&lt;/strong&gt; change the &amp;ldquo;Attached Adapter&amp;rdquo; to &amp;ldquo;Bridged Adapter&amp;rdquo;.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Start the VM again.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&#34;toc_1&#34;&gt;Installing and Configuring the Additional Software&lt;/h2&gt;

&lt;p&gt;As indicated in the introduction, the installation of R and Scala is a fairly automated process done through a shell script. The script itself is fairly well commented, and you will probably want to take a look at it prior to running it in order to get a sense of what it is doing. While running the script involves a single user command, configuring everything is a bit more involved, but fairly straightforward. Here are the steps:
- Open a terminal window (press the console icon on the upper task bar of the VM)
- We will use the wget utility to download the installation script via the terminal command line by entering the command&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;wget http://adventures.putler.org/SparkR_prep-0.1.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;Once the script has downloaded, issue the command below at the terminal prompt to start the installation process.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;sudo sh SparkR_prep-0.1.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;You will be prompted once during the installation process to OK the installation of the dependencies required to build R from source. Figure 4 shows this prompt; type the letter &amp;ldquo;y&amp;rdquo; followed by Enter.&lt;/li&gt;
&lt;/ul&gt;

&lt;figure &gt;
    
        &lt;img src=&#34;/img/shot_1_4.png&#34;  /&gt;
    
    
    &lt;figcaption&gt;
        &lt;h4&gt;Figure 4&lt;/h4&gt;
        
    &lt;/figcaption&gt;
    
&lt;/figure&gt;

&lt;ul&gt;
&lt;li&gt;After the script has run, Scala needs to be permanently added to the user cloudera&amp;rsquo;s path. To do this, edit cloudera&amp;rsquo;s .bash_profile file. The instructions use gedit, but all the common Unix/Linux text editors (vi, nano, Emacs) are installed on the VM. To start editing the file, use the following command&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;gedit .bash_profile
&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;The lines that need to be added/altered come after the &amp;ldquo;User specific environment and startup programs&amp;rdquo; comment, and are given below. When you are done editing, the .bash_profile file should look like the one shown in Figure 5. Once it does, save the file and close gedit.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;export SCALA_HOME=/usr/local/share/scala

PATH=$PATH:$HOME/bin:$SCALA_HOME/bin

export PATH
&lt;/code&gt;&lt;/pre&gt;

&lt;figure &gt;
    
        &lt;img src=&#34;/img/shot_1_5.png&#34;  /&gt;
    
    
    &lt;figcaption&gt;
        &lt;h4&gt;Figure 5&lt;/h4&gt;
        
    &lt;/figcaption&gt;
    
&lt;/figure&gt;
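If you would rather not open an editor at all, the same lines can be appended from the terminal with a heredoc. This is just a sketch of an alternative to the gedit step; it assumes Scala was installed to /usr/local/share/scala, as the script above does:

```shell
# Append the Scala environment settings to cloudera's .bash_profile.
# Assumes Scala lives in /usr/local/share/scala (the script's install location).
cat >> "$HOME/.bash_profile" <<'EOF'
export SCALA_HOME=/usr/local/share/scala
PATH=$PATH:$HOME/bin:$SCALA_HOME/bin
export PATH
EOF
# Show the lines that were just added
tail -n 3 "$HOME/.bash_profile"
```

The quoted 'EOF' delimiter keeps $PATH and $HOME from being expanded while the lines are written, so the file ends up with the literal text shown in Figure 5.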

&lt;ul&gt;
&lt;li&gt;The last thing that needs to be done with respect to software installation and configuration is to have the environment variable changes recognized by the current shell session. To do this, enter the command below at the terminal command line&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;source .bash_profile
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;toc_2&#34;&gt;Configuring Git and Building SparkR&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&#34;http://amplab-extras.github.io/SparkR-pkg/&#34; title=&#34;The SparkR Project page&#34;&gt;SparkR Project&lt;/a&gt; is hosted on &lt;a href=&#34;http://en.wikipedia.org/wiki/GitHub&#34; title=&#34;The GitHub Wikipedia page&#34;&gt;GitHub&lt;/a&gt;. If you want to contribute to SparkR, you will need a GitHub &lt;a href=&#34;http://www.github.com&#34; title=&#34;The main GitHub page&#34;&gt;account&lt;/a&gt;. In addition, the project has both a dedicated &lt;a href=&#34;https://sparkr.atlassian.net/browse/SPARKR&#34; title=&#34;The SparkR JIRA&#34;&gt;JIRA&lt;/a&gt; and a developer&amp;rsquo;s mailing list on Google Groups. In this post I only cover project infrastructure as it relates to GitHub. In what follows, it is assumed that you have a GitHub account.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We start by configuring Git on the Quick Start VM (the VM comes with the Git client pre-installed) by entering the commands below in a terminal window. You will want to replace &lt;code&gt;&amp;lt;handle or name&amp;gt;&lt;/code&gt; with your own handle (mine is dputler) or name, and &lt;code&gt;&amp;lt;email address&amp;gt;&lt;/code&gt; with your own email address. The handle/name and email address values need to be in quotes. The last two commands increase the amount of time you can interact with a GitHub repository before you need to re-authenticate.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;git config --global user.name &amp;quot;&amp;lt;handle or name&amp;gt;&amp;quot;
git config --global user.email &amp;quot;&amp;lt;email address&amp;gt;&amp;quot;
git config --global credential.helper cache
git config --global credential.helper &#39;cache --timeout=3600&#39;
&lt;/code&gt;&lt;/pre&gt;
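If you want to confirm the settings were stored, you can read each value back with the same git config subcommand. A quick sketch, using my handle (dputler) and a placeholder email address; substitute your own values:

```shell
# Hypothetical example values; substitute your own handle and email address.
git config --global user.name "dputler"
git config --global user.email "dan@example.com"
# Read the values back to confirm they were stored
git config --global user.name     # prints: dputler
git config --global user.email    # prints: dan@example.com
```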

&lt;ul&gt;
&lt;li&gt;Instead of developing against the main repository, the SparkR project uses what is known as a &lt;a href=&#34;https://help.github.com/articles/using-pull-requests/&#34; title=&#34;Collaboration model article on GitHub help&#34;&gt;&amp;ldquo;fork &amp;amp; pull&amp;rdquo;&lt;/a&gt; collaboration model, whereby a developer makes changes to a personal fork of the project, and then makes a pull request to have those changes integrated into the main repository. As a result, we are going to make a fork of the SparkR project. To do this, go to the SparkR GitHub repository at &lt;a href=&#34;https://github.com/amplab-extras/SparkR-pkg&#34;&gt;https://github.com/amplab-extras/SparkR-pkg&lt;/a&gt; and click on the &lt;strong&gt;Fork&lt;/strong&gt; button, which is shown (boxed in red) in Figure 6. Pressing this button will result in a fork of SparkR being linked to your GitHub account. The URL for your personal copy of SparkR should be &lt;code&gt;https://github.com/&amp;lt;your GitHub user ID&amp;gt;/SparkR-pkg&lt;/code&gt;, where &lt;code&gt;&amp;lt;your GitHub user ID&amp;gt;&lt;/code&gt; is your actual GitHub user ID.&lt;/li&gt;
&lt;/ul&gt;

&lt;figure &gt;
    
        &lt;img src=&#34;/img/shot_1_6.png&#34;  /&gt;
    
    
    &lt;figcaption&gt;
        &lt;h4&gt;Figure 6&lt;/h4&gt;
        
    &lt;/figcaption&gt;
    
&lt;/figure&gt;

&lt;ul&gt;
&lt;li&gt;The next step of the process is to &amp;ldquo;clone&amp;rdquo; (create a local copy on the VM) and build SparkR from your forked version of the repository. The terminal commands below will allow you to do this (you will need to replace &lt;code&gt;&amp;lt;your GitHub user ID&amp;gt;&lt;/code&gt; with your actual ID).&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;git clone https://github.com/&amp;lt;your GitHub user ID&amp;gt;/SparkR-pkg
cd SparkR-pkg
SPARK_HADOOP_VERSION=2.5.0-mr1-cdh5.2.0 ./install-dev.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;At this point we need to properly connect Java and R, which is accomplished with the command&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;sudo -E R CMD javareconf
&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;At this point you hopefully have a successful build of SparkR, which you can begin to explore by bringing up the SparkR console (shown in Figure 7) via the command:&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;./sparkR
&lt;/code&gt;&lt;/pre&gt;

&lt;figure &gt;
    
        &lt;img src=&#34;/img/shot_1_7.png&#34;  /&gt;
    
    
    &lt;figcaption&gt;
        &lt;h4&gt;Figure 7&lt;/h4&gt;
        
    &lt;/figcaption&gt;
    
&lt;/figure&gt;

&lt;ul&gt;
&lt;li&gt;Once you are done exploring SparkR, you can again use the VirtualBox pull-down menu option &lt;strong&gt;Machine &amp;gt; ACPI Shutdown&lt;/strong&gt; to shutdown your SparkR development VM.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>About</title>
      <link>http://adventures.putler.org/about/</link>
      <pubDate>Mon, 08 Dec 2014 02:29:12 -0700</pubDate>
      <author>Dan Putler</author>
      <guid>http://adventures.putler.org/about/</guid>
      <description>

&lt;h2 id=&#34;toc_0&#34;&gt;The Blog&lt;/h2&gt;

&lt;p&gt;The Adventures in Advanced Analytics blog is focused on topics in predictive and spatial analytics technologies that impact business organizations. A particular focus is R, Spark, and emerging storage technologies such as Hadoop and Cassandra, and how they can be enhanced via the Alteryx platform.&lt;/p&gt;

&lt;p&gt;The blog is created using the &lt;a href=&#34;http://gohugo.io/&#34; title=&#34;The Hugo static site generator home page&#34;&gt;Hugo&lt;/a&gt; static site generator, and uses Andrei Mihu&amp;rsquo;s &lt;a href=&#34;https://github.com/zyro/hyde-x&#34; title=&#34;The Hyde-X GitHub page&#34;&gt;Hyde-X&lt;/a&gt; theme. The site logo was created by Tara McCoy Giovenco (many thanks Tara), who maintains the copyright, and is used with her permission.&lt;/p&gt;

&lt;p&gt;All opinions expressed in this blog are those of the author and do not necessarily reflect those of &lt;a href=&#34;http://www.alteryx.com/&#34; title=&#34;The Alteryx home page&#34;&gt;Alteryx, Inc.&lt;/a&gt;, its employees, or its partners.&lt;/p&gt;

&lt;h2 id=&#34;toc_1&#34;&gt;The Author&lt;/h2&gt;

&lt;p&gt;Dan Putler is Alteryx&amp;rsquo;s Chief Scientist, and leads product strategy and development for Alteryx&amp;rsquo;s R-based predictive analytics offering. He is also the co-author (with Robert Krider) of the book &lt;a href=&#34;http://customeranalyticsbook.com/&#34; title=&#34;The book web site&#34;&gt;Customer and Business Analytics: Applied Data Mining for Business Decision Making Using R&lt;/a&gt;, and has 30 years of experience with predictive analytics, conducting projects across a wide range of industry verticals. Prior to joining Alteryx, Dan spent 20 years as a professor of marketing and marketing research at both the Sauder School of Business at the University of British Columbia and the Krannert School of Management at Purdue University.&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;img src=&#34;/img/logo.png&#34; alt=&#34;Site logo&#34; /&gt;
&lt;/center&gt;&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>