Shifting sands: March 2015

More and more frequently I come across people who express an interest in R, and I thought I would share some advice to help people decide if R is something they should use, as well as some high level advice on getting started.

Most of these people are developers with at least few years experience writing code for a living, and this advice is directed to people like that. Perhaps you have a specific project in mind, or perhaps you’ve just seen the increasing amount of hype around data science and are wondering if you should take a look.

R has some great strengths, but there are some definite quirks, and I hope I can provide some advice and perhaps avoid some pitfalls along the way.

I first picked up R around 2007 or 2008, and have grown to use it more or less daily over the last 3-4 years.

As a generalisation I find statisticians a generally friendly bunch of people, very open to discussions on all things and somewhat introspective. They are also generally very smart. Statistics can be quite an intricate field, especially as you get deeper into it.

Just as some background on me, I have worked as a developer for roughly 15 years now, and the R world makes a change from angry UNIX programmers and aggressive investment banking types that I spent the first 10 years or so working with.

I am sure some people will disagree with at least some of my views here. There may or may not be a technically correct or true answer, instead it is a subjective outlook, so keep in mind this is just one persons view.

A brief history of R, and why it has become so popular in certain circles

R is certainly becoming more and more popular, and seems to have found widespread adoption within many statistical research communities.

This is a great thing as it means as new statistical methods or practices come out of the research world, they are often implemented and available in R. In many cases they have been written by the person who “wrote the book” (or paper) on a given topic.

Why has it become so popular? I read a story somewhere about life in the academic research world.

In the past someone would do some cool research and publish their results, but they used proprietary statistical packages that might cost thousands of dollars.

This meant it was very hard for other researchers to replicate and carry further the work, and it is easy to see this would be a real pain.

People realized that by adopting R they could make their work freely available to others. This resolved an issue that was arguably a material roadblock to advances in the field.

R has really opened up a new world of statistical computing. It can do some very advanced things, and is all freely available.

Lucky us!

I never did any statistical work before R, and academic statistics is not really my world so I am a little hesitant to write on this topic. If you are interested in this subject, you might also like to read Why Use the R Language.

Should I use R ?

I made a small flowchart on draw.io to assist with this question

Where “statistics” can mean machine learning, predictive analytics, data science, anything that falls under a rather broad umbrella.

I view R as a statistical computing environment, versus a general purpose programming language like C++.

You probably don’t want to choose R if you are developing cloud based microservices, low latency/high availability trading systems, desktop applications etc.

I am not saying it is technically impossible to do these things with R, but for a variety of reasons, other languages are likely a better choice.

However, if you have a big blob of data you want to understand better, glean insight from, or use for predication, R is an excellent choice.

This is the second requirement before deciding R is the right choice, that is you actually have some data you wish to analyze or model.

I would actually take this a step further and say your data is in a tabular format, i.e. you can represent it with columns and rows. If not, you can of course still use R, but you will be doing a lot of legwork that may be overwhelming.

R is designed for statistics, and a big part of applied statistics involves analyzing observations. Usually this manifests as a table like data structure such as a matrix or data frame in R.

Much of R is built around the assumption you are working with a table-like data structure.

This is different from conventional software development where you think about the task at hand, design a bunch of data structures/objects then define relations and operations between them to achieve the desired outcome.

Now again, it’s not that you can’t do this in R, and I will talk about OO a little later, but you might be better off doing heavy lifting in some other language and saving R for just the statistics parts.

It's just not really the sort of language you pick up and write a twitter clone/text editor/mp3 player or whatever your favourite "get to know a language" project might be.

If you can agree with this sentence: “I have some data that makes sense to represent in a tabular like structure, and I want to do some cool statistics stuff with it” R is definitely a good choice.

From here on in I will just provide some general advice as bullet points. I could write at length about all of them, and am happy to expand on any of them, but I will try to keep it brief.

General Points

Get some books. The two I would recommend are the R Cookbook by Paul Teetor and Advanced R by Hadley Wickham. The latter in particular is the type of documentation people with experience programming will value. If you try to get yourself up to speed with just the built in documentation you are making things a lot harder for yourself.

The principle of least surprise does not apply. The type system and scoping rules takes some getting used to, and the way R works might be quite different from what you are used to.
Don’t assume anything and be sure to check details. If you do any serious work with R you will likely run into weird type or scope related errors at some point. Hadley’s Advanced R book is very valuable here.

R will not teach you statistics. For example if you don’t know what variance is or what homoscedasticity is and why it matters (like me when I was starting out), you will find limited value in R or any other statistics/machine learning software.
I had the luxury of being able to go off and do an MSc in Math/Stats, which I whole-heartedly recommend if possible. Nowadays there are some great books and MOOCs. There’s no quick or easy answer to this one unfortunately. You can learn statistics with R, but you can’t expect R to fill in the blanks for you.

Sometimes R tries to be helpful. Statisticians probably appreciate this but programmers probably wont. For example it will automatically convert strings to the factor data type by default when reading in data. It may also do implicit type conversion where other languages would generate an error. Understand the basic types and how to check your assumptions wrt type. I wrote about this a bit here.

Defensive programming is not common. Many packages do not check data is in the format or type it expects. This can lead to weird and/or inscrutable error messages.
There are some built in debugging tools that can help, you might need to dive into the source to find out what’s going on now and then. I describe some here, and there is some official documentation here.

Think in terms of tables, vectors and lists. At a high level you are doing statistics on data, and this is generally what R is expecting. I found working with R got a bit easier once I made this cognitive shift.

Learn to use apply and friends. I wrote an intro to this here. It is a more “functional” way of working. You have your data as a table or whatever, and you apply functions to the rows and columns.

People use ‘.’ in variable names. This is annoying but the best advice is to just get over it. I believe this is becoming less prevalent, but you will likely still come across it.

The OO system is a bit of a mess. There are three different types of OO in R. Yes really. The whole way it works is a bit smoke and mirrors and basically I am not really a fan. Reference classes are probably the closest conceptually to OO in other languages.
I personally would not develop a big OO system in R, however other people certainly have and been successful with it. Again, Hadley’s book is a great resource.

Read R Bloggers. This is a great resource where people write and share code about all sorts of cool stuff. Search for packages names or technologies (like “hadoop” or “ec2”) to get going.

Packages

The package system is really good. There are all sorts of packages that do all sorts of things. In general, the distribution and update mechanisms all work pretty well.
Packaging is a hard and thankless task, the big linux distros might have hundreds of people solely dedicated to packaging and distribution, and R does a very good job of it with relatively few people. The task views provide an overview of packages available for various types of tasks (time series, machine learning).

Often the answer to “how do I do x” is “install package y to do it for you.” Sometimes the way base R works can seem a bit convoluted or difficult. Usually, someone has written a package that makes it a lot easier. Just install the package, use it and move on.

You will end up using a lot of packages. This used to really bother me. In the enterprise world, using third party packages usually required a lengthy approval process as lawyers checked the licensing and potential IP conflicts, as well as a separate process to get things installed and deployed to production by the admin teams.
If you are working in such an environment, be aware you are likely going to end up using a bunch of packages. Thankfully the packages mostly use standard free licenses, which may ease the process somewhat.

Another consideration is that this can make dependency management a bit involved. Typically though, most packages are quite small and do only a few specific things.

Package code quality can be variable. Many of these are developed by academics/academic statisticians. As a rule these are smart people but their code might not live up to your internal standards for software engineering practice. Many of the packages are excellent, and most that you are likely to use are very stable.
This can become more of an issue as you get into the more obscure areas of statistics, at which point you might want to look over the package source before committing to it.

I do not want to come across as dismissive here as there are people working on R who have been doing statistics on computers since before I was born, and I'm not that young.

They do generally know what is what and in some ways R is constrained as it is broadly an implementation of the S language, created in 1976. However if you find yourself struggling to use a package that has three functions and 17 phd students listed as authors maybe it's not you, its them.

Embrace the “Hadleyverse.” Hadley Wickham has written a bunch of great packages that are very useful when working with R day to day. Someone else wrote more about the Hadleyverse here.

If you are doing machine learning, use caret. In a nutshell it provides an abstraction over the various machine learning algorithms, and a whole bunch of useful stuff for model building, tuning and evaluation. It has good docs and there is also a great book Applied Predictive Modelling in R you should check out if you are doing ML in R, I wrote a small review here

Rcpp lets you easily use snippets of C++ in R. It’s really cool. I wrote a cheatsheet for common linear algebra related R operations here, and there are more detailed/complete/practical examples available in the Rcpp Gallery.
Some people say R is slow, and there is a small element of truth there relatively speaking, but in general I feel if someone complains that a language is slow they should probably write better programs and/or buy a faster computer. Rcpp can help though. There is a good book by the package author available as well.

Outro

R is a niche language, but if your work falls within that niche it is a wonderful tool. There are some great packages, cutting edge methods, and a generally enthusiastic and welcoming community.

You can also find many of these things in other languages like Python and Java, but I still find myself turning to R when I have a bunch of data I want to explore and just try out a bunch of different things.

Some other people have written on this topic, if you would like to read more see R Language for Programmers, R The Good Parts & Why R is hard to learn.

[Update] A previous version of this post said Hadley Wickham and Dirk Eddelbuettel were R Core Developers, which is not the case. They are members of the R Foundation.

As always, you can find me on twitter here.

Shifting sands

Pages

Tuesday, March 24, 2015

Simulation and relative performance

Saturday, March 14, 2015

Adopting R for experienced developers

A brief history of R, and why it has become so popular in certain circles

Should I use R ?

General Points

Packages

Outro