Tuesday, March 24, 2015
Saturday, March 14, 2015
I am sure some people will disagree with at least some of my views here. There may not be a technically correct answer; much of this is a subjective outlook, so keep in mind this is just one person's view.
A brief history of R, and why it has become so popular in certain circles
Should I use R?
I made a small flowchart on draw.io to assist with this question.
- Get some books. The two I would recommend are the R Cookbook by Paul Teetor and Advanced R by Hadley Wickham. The latter in particular is the type of documentation people with programming experience will value. If you try to get yourself up to speed with just the built-in documentation, you are making things a lot harder for yourself.
- The principle of least surprise does not apply. The type system and scoping rules take some getting used to, and the way R works might be quite different from what you are used to.
Don’t assume anything and be sure to check details. If you do any serious work with R you will likely run into weird type- or scope-related errors at some point. Hadley’s Advanced R book is very valuable here.
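As a rough illustration of the scoping behaviour that tends to surprise people, here is a minimal sketch in base R. All the names (`f`, `bump`, `make_counter`, `next_id`) are made up for the example.

```r
# Free variables are looked up when the function runs, not when it is defined.
x <- 1
f <- function() x + 1
x <- 10
result_late <- f()    # uses the current value of x, so this is 11

# Assigning inside a function creates a local variable; the global is untouched.
counter <- 0
bump <- function() {
  counter <- counter + 1
  counter
}
bumped <- bump()      # 1, while the global counter is still 0

# A closure keeps its enclosing environment; `<<-` modifies a variable there.
make_counter <- function() {
  n <- 0
  function() {
    n <<- n + 1
    n
  }
}
next_id <- make_counter()
first <- next_id()
second <- next_id()   # 2, because n persists between calls
```

If any of these results are not what you expected, the chapters on functions and environments in Advanced R are the place to look.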
- R will not teach you statistics. For example, if you don’t know what variance is, or what homoscedasticity is and why it matters (like me when I was starting out), you will find limited value in R or any other statistics/machine learning software.
I had the luxury of being able to go off and do an MSc in Math/Stats, which I wholeheartedly recommend if possible. Nowadays there are some great books and MOOCs. There’s no quick or easy answer to this one, unfortunately. You can learn statistics with R, but you can’t expect R to fill in the blanks for you.
- Sometimes R tries to be helpful. Statisticians probably appreciate this but programmers probably won’t. For example, it will automatically convert strings to the factor data type by default when reading in data. It may also do implicit type conversion where other languages would generate an error. Understand the basic types and how to check your assumptions about types. I wrote about this a bit here.
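A small sketch of the kind of checks this means in practice, using only base R. Note that since R 4.0.0 the default is `stringsAsFactors = FALSE`, so the example sets it to `TRUE` explicitly to reproduce the older behaviour. The data frame and variable names are made up.

```r
# Reproduce the old default: strings become factors.
df <- data.frame(id = c("10", "20", "30"), stringsAsFactors = TRUE)
id_class <- class(df$id)                   # "factor"

# Classic trap: converting a factor straight to numeric gives the level codes.
wrong <- as.numeric(df$id)                 # 1 2 3
right <- as.numeric(as.character(df$id))   # 10 20 30

# Implicit coercion: mixing types in c() silently promotes to character.
mixed <- c(1, "a", TRUE)                   # "1" "a" "TRUE"

# Logicals are quietly treated as numbers in arithmetic.
total <- TRUE + TRUE                       # 2

# str() is a quick way to see what types you actually have.
str(df)
```

Calling `class()`, `str()` or the `is.*` functions on inputs before doing anything else catches most of these problems early.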
- Defensive programming is not common. Many packages do not check that data is in the format or type they expect. This can lead to weird and/or inscrutable error messages.
There are some built-in debugging tools that can help, but you might need to dive into the source now and then to find out what’s going on. I describe some here, and there is some official documentation here.
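If you write your own functions, you can at least be defensive yourself. The following is a sketch using base R only; `column_mean` is a hypothetical helper invented for this example, not something from a package.

```r
# Hypothetical helper: validate the inputs before doing any work.
column_mean <- function(df, col) {
  stopifnot(
    is.data.frame(df),
    is.character(col),
    length(col) == 1,
    col %in% names(df)
  )
  if (!is.numeric(df[[col]])) {
    stop("column '", col, "' must be numeric, got ", class(df[[col]])[1])
  }
  mean(df[[col]], na.rm = TRUE)
}

ok <- column_mean(mtcars, "mpg")

# tryCatch() turns the error into a value, so a script can report it and carry on.
msg <- tryCatch(
  column_mean(iris, "Species"),
  error = function(e) conditionMessage(e)
)

# In an interactive session, traceback() after an error shows the call stack,
# and debug(column_mean) steps through the function on its next call.
```

The error message now names the offending column and its type, which is far easier to act on than a failure from deep inside `mean()`.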
- Think in terms of tables, vectors and lists. At a high level you are doing statistics on data, and this is generally what R is expecting. I found working with R got a bit easier once I made this cognitive shift.
- Learn to use apply and friends. I wrote an intro to this here. It is a more “functional” way of working. You have your data as a table or whatever, and you apply functions to the rows and columns.
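To give a flavour of apply and friends, here is a short sketch in base R. The matrix and list are made-up examples.

```r
# apply() works over the rows (1) or columns (2) of a matrix.
m <- matrix(1:6, nrow = 2)        # filled column by column: rows are 1,3,5 and 2,4,6
row_sums <- apply(m, 1, sum)
col_max <- apply(m, 2, max)

# lapply() applies a function to each element of a list and returns a list.
scores <- list(a = c(1, 2, 3), b = c(10, 20))
means_list <- lapply(scores, mean)

# sapply() does the same but simplifies the result to a vector where it can.
means_vec <- sapply(scores, mean)

# vapply() is like sapply() but you state the expected type, which is safer.
lengths_vec <- vapply(scores, length, integer(1))

# A data frame is a list of columns, so these work on it too.
col_classes <- sapply(iris, class)
```

Compared with an explicit `for` loop, there is no index bookkeeping and no result vector to preallocate, which is most of the appeal.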
- People use ‘.’ in variable names. This is annoying but the best advice is to just get over it. I believe this is becoming less prevalent, but you will likely still come across it.
- The OO system is a bit of a mess. There are three different types of OO in R (S3, S4 and reference classes). Yes, really. The whole way it works is a bit smoke and mirrors and basically I am not really a fan. Reference classes are probably the closest conceptually to OO in other languages.
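For comparison, here is roughly what the three systems look like side by side. This is a minimal sketch; the class and method names (`dog`, `speak`, `Person`, `greet`, `Account`) are invented for the example.

```r
library(methods)  # normally attached already; explicit here for safety

# S3: a class is just an attribute, and methods are functions named generic.class.
dog <- structure(list(name = "Rex"), class = "dog")
speak <- function(x, ...) UseMethod("speak")
speak.dog <- function(x, ...) paste(x$name, "says woof")
s3_out <- speak(dog)

# S4: formal class definitions with typed slots and explicit generics.
setClass("Person", slots = c(name = "character", age = "numeric"))
setGeneric("greet", function(obj) standardGeneric("greet"))
setMethod("greet", "Person", function(obj) paste("Hello,", obj@name))
s4_out <- greet(new("Person", name = "Ann", age = 30))

# Reference classes: mutable objects that carry their own methods.
Account <- setRefClass("Account",
  fields = list(balance = "numeric"),
  methods = list(
    deposit = function(x) {
      balance <<- balance + x   # modifies the object in place
      invisible(.self)
    }
  )
)
acct <- Account$new(balance = 100)
acct$deposit(50)
rc_out <- acct$balance
```

Note that only the reference class object is modified in place; S3 and S4 objects follow R's usual copy-on-modify semantics.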
- Read R Bloggers. This is a great resource where people write and share code about all sorts of cool stuff. Search for package names or technologies (like “hadoop” or “ec2”) to get going.
- The package system is really good. There are all sorts of packages that do all sorts of things. In general, the distribution and update mechanisms all work pretty well.
Packaging is a hard and thankless task. The big Linux distros might have hundreds of people solely dedicated to packaging and distribution, and R does a very good job of it with relatively few people. The task views provide an overview of packages available for various types of tasks (time series, machine learning).
- Often the answer to “how do I do x” is “install package y to do it for you.” Sometimes the way base R works can seem a bit convoluted or difficult. Usually, someone has written a package that makes it a lot easier. Just install the package, use it and move on.
- You will end up using a lot of packages. This used to really bother me. In the enterprise world, using third-party packages usually required a lengthy approval process as lawyers checked the licensing and potential IP conflicts, as well as a separate process to get things installed and deployed to production by the admin teams.
If you are working in such an environment, be aware you are likely going to end up using a bunch of packages. Thankfully the packages mostly use standard free licenses, which may ease the process somewhat.
Another consideration is that this can make dependency management a bit involved. Typically though, most packages are quite small and do only a few specific things.
- Package code quality can be variable. Many packages are developed by academic statisticians. As a rule these are smart people, but their code might not live up to your internal standards for software engineering practice. Many of the packages are excellent, and most that you are likely to use are very stable.
This can become more of an issue as you get into the more obscure areas of statistics, at which point you might want to look over the package source before committing to it.
I do not want to come across as dismissive here, as there are people working on R who have been doing statistics on computers since before I was born, and I'm not that young.
They do generally know what is what, and in some ways R is constrained, as it is broadly an implementation of the S language, created in 1976. However, if you find yourself struggling to use a package that has three functions and 17 PhD students listed as authors, maybe it's not you, it's them.
- Embrace the “Hadleyverse.” Hadley Wickham has written a bunch of great packages that are very useful when working with R day to day. Someone else wrote more about the Hadleyverse here.
- If you are doing machine learning, use caret. In a nutshell, it provides an abstraction over the various machine learning algorithms, plus a whole bunch of useful stuff for model building, tuning and evaluation. It has good docs, and there is also a great book, Applied Predictive Modeling, which you should check out if you are doing ML in R. I wrote a small review here.
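To show what the caret abstraction looks like, here is a minimal sketch of a typical workflow. It assumes the caret and rpart packages are installed (the guard skips the example otherwise). The built-in `iris` data, the decision tree method and 5-fold cross-validation are arbitrary choices for illustration.

```r
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  library(caret)
  set.seed(42)

  # Stratified 80/20 split into training and test sets.
  in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
  training <- iris[in_train, ]
  testing <- iris[-in_train, ]

  # The same train() call works across models; only `method` changes.
  ctrl <- trainControl(method = "cv", number = 5)
  fit <- train(Species ~ ., data = training, method = "rpart", trControl = ctrl)

  # Predict on held-out data and summarise the results.
  preds <- predict(fit, newdata = testing)
  print(confusionMatrix(preds, testing$Species))
}
```

Swapping `method = "rpart"` for another supported model name is usually all it takes to try a different algorithm, provided the package behind it is installed.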
- Rcpp lets you easily use snippets of C++ in R. It’s really cool. I wrote a cheatsheet for common linear algebra related R operations here, and there are more detailed, complete and practical examples available in the Rcpp Gallery.
Some people say R is slow, and relatively speaking there is a small element of truth there, but in general I feel that if someone complains that a language is slow they should probably write better programs and/or buy a faster computer. Rcpp can help though. There is also a good book by the package author.
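As a taste of what using Rcpp looks like, here is a minimal sketch. It assumes the Rcpp package and a working C++ toolchain are installed. `sum_squares` is a made-up function for the example.

```r
if (requireNamespace("Rcpp", quietly = TRUE)) {
  # cppFunction() compiles the C++ source and exposes it as an R function.
  Rcpp::cppFunction('
    double sum_squares(NumericVector x) {
      double total = 0;
      for (int i = 0; i < x.size(); ++i) {
        total += x[i] * x[i];
      }
      return total;
    }
  ')

  cpp_result <- sum_squares(c(1, 2, 3))
  r_result <- sum(c(1, 2, 3)^2)   # the plain R equivalent, for comparison
}
```

For something this simple the vectorised R version is already fast. Rcpp earns its keep on loops that cannot easily be vectorised.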