Thursday, September 5, 2013

Type conversion and you (or and R)

Types and type conversion can be a tricky and intricate topic, and sometimes can lead to some real head-scratcher issues in R. Hence a somewhat confusing title.

This is for people still relatively new to R, and I will skip some gory details. Actually, I will skip most of them; the canonical sources for type and conversion information are the official R documentation and the help pages for the functions at hand.

Instead I thought I would walk through some examples of when the type engine can behave in seemingly odd ways, and take a look at what is going on when mysterious errors arise and what can be done to track down their source.

What are types?


Types describe the nature of data R is dealing with, at least as far as R cares. If you want to use the + operator, R needs to look at the data type on either side and work out if + is defined for those, and what type the result should be.

For example 1 + 1 will behave as you might expect, summing the two numbers and returning a numeric result, but what should R do if you add a character to a matrix? It uses types to answer these questions, and gives an error message if no answer can be found.
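As a rough sketch of that last question (the matrix m and the other variable names are made up for illustration), typeof() shows what R sees on each side of +, and tryCatch() captures the error message rather than halting the script:

```r
# Hypothetical example values, not from the post
m <- matrix(1:4, nrow = 2)

typeof(m)    # "integer"
typeof("a")  # "character"

# + is defined for a matrix and a number ...
m_plus_one <- m + 1

# ... but not for a matrix and a string, so R signals an error.
# tryCatch() returns the message instead of stopping the script.
msg <- tryCatch(m + "a", error = function(e) conditionMessage(e))
msg
#[1] "non-numeric argument to binary operator"
```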

Let’s get started. First we will create a vector with some numeric data:

a <- c(1,2,3)
sum(a)
#[1] 6
a
#[1] 1 2 3

Nothing too exciting, we created a vector with three numeric values, and the builtin sum function behaves as we would expect. Let’s see what happens when we try to mix types in the vector:

a <- c(a, "hi")
a
#[1] "1"  "2"  "3"  "hi"

When we look at the contents of our vector, the numeric values have been converted to strings. This is because a vector is for homogeneous storage, that is, it can only hold data of the same type. When we appended the string to it, R converted the numeric values to strings.

This is an example of implicit type conversion, aka type coercion. R knows the vector can only hold data of the same type, and since it does not know how to turn ‘hi’ into a numeric value, it has turned the numeric values into strings. This was done magically (or implicitly) in the background.
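A quick way to watch the coercion happen is to check typeof() before and after appending the string. This is only a sketch with throwaway variable names; note that c(1, 2, 3) creates doubles, which is what R means by "numeric" here:

```r
before <- c(1, 2, 3)
typeof(before)
#[1] "double"

after <- c(before, "hi")
typeof(after)
#[1] "character"

# The original numbers are still recoverable, but only by explicit conversion
as.numeric(after[1:3])
#[1] 1 2 3
```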

What happens when we try to sum our vector?

sum(a)
#Error in sum(a) : invalid 'type' (character) of argument

We get an error, because R does not know how to add strings, at least in the conventional numerical sense of addition.

The general rule of thumb for conversion is “the bigger type wins.” Strings can represent more data than numeric values, so the numeric values are converted to strings. For atomic vectors the order runs logical, integer, double, complex, character. There aren’t that many data types in R, so often you will end up with strings when types are being mixed.

Type conversion is not particularly smart, which is generally a good idea in programming languages. If, instead of “hi”, we had passed in “1”, the character representation of the number one, it still would have converted the existing numeric values to strings, even though in theory the string “1” could be converted to a numeric type. The type engine just sees strings and numerics; the bigger type wins and the numerics are coerced.
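The “bigger type wins” rule can be checked step by step with typeof(). The values below are arbitrary and only meant as a sketch of the idea:

```r
typeof(c(TRUE, 1L))    # "integer":   logical gives way to integer
typeof(c(1L, 2.5))     # "double":    integer gives way to double
typeof(c(2.5, "x"))    # "character": double gives way to character

# A string that looks like a number is still just a string to c()
mixed <- c(1, 2, 3, "1")
typeof(mixed)          # "character"

# Converting back has to be asked for explicitly
total <- sum(as.numeric(mixed))
total
#[1] 7
```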

Type conversion and apply


Sometimes when using apply and friends, you will get errors that at first glance must clearly be bugs in R. Let’s take a look at another example, this time using a data frame.

Unlike a vector or a matrix, a data frame is a heterogeneous type container; it is possible for it to store columns with different types. We will have five rows, a categorical factor column and two columns of numeric data.

n <- 5
#make some dummy data
df <- data.frame(cbind(rbinom(n, 1, 0.5), rnorm(n, 10, 5), rnorm(n, 20, 10)))
#make the first column a factor
df[,1] <- as.factor(df[,1])
head(df)
#  X1        X2       X3
#1  1  8.911567 27.28325
#2  1  9.933021 13.74879
#3  0 10.177231 20.65490
#4  0  6.368177 27.10183
#5  1 12.084135 14.54369

Now let’s try and sum the two numeric columns using apply

apply(df, 1, function(x) x[2] + x[3] )
#Error in x[2] + x[3] : non-numeric argument to binary operator

That’s weird. I’m pretty sure they are numeric, so addition should work

df[,2] + df[,3]
#[1] 36.19481 23.68181 30.83213 33.47001 26.62783

And it does. What gives?

We can see what datatype R thinks it is working with by using the mode() function.

apply(df, 1, function(x) mode(x))
#[1] "character" "character" "character" "character" "character"

Each row is being passed as character types, not the mixed data types we were expecting (i.e. factor, numeric, numeric).

In a nutshell, apply will first convert our data frame to a matrix before passing the rows to the defined function. The help page says as much, but I never took any notice of it until things started getting weird.

The matrix container wants to store data all of the same type, and much like our initial vector example, our numeric values are being coerced into strings. Only once this coercion has taken place do the rows of the matrix get passed to the function we supplied apply with.

This is why we see our “non-numeric argument to binary operator” error message. What we thought was numeric data has been converted to character data, which we subsequently try to add.

Just to really drive this home, let’s look at the first row of our data frame:

a <- df[1,]
a
#  X1       X2       X3
#1  1 8.911567 27.28325

Now let’s see what happens when we convert it to a matrix as apply does:

b <- as.matrix(a)
b
#  X1  X2         X3      
#1 "1" "8.911567" "27.28325"
b[2] + b[3]
#Error in b[2] + b[3] : non-numeric argument to binary operator

We can see it has been converted to strings, and we end up with the same error message we saw when we used apply.

In this case the resolution is quite simple: we use the index operator to pass only the numeric columns in our call to apply

apply(df[,2:3], 1, function(x) x[1] + x[2])
#[1] 36.19481 23.68181 30.83213 33.47001 26.62783

Which works as we expect. Note in our function we changed the indexes to 1 and 2 as we are now only passing two columns of data to apply.

So, remember that in many cases apply will convert what you pass it to a matrix. The matrix wants all data to be of the same type, and will coerce as required. For the specific conditions see the help page for apply.
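If the goal is just row totals, one way to sidestep the character conversion is to pick out the numeric columns first and let a vectorised function do the work. The sketch below is an illustration under assumptions rather than the approach used in this post: the data frame demo and its column names group, x and y are invented, and set.seed() is only there to make the run repeatable.

```r
set.seed(42)  # arbitrary seed, only for repeatability
n <- 5
demo <- data.frame(
  group = factor(rbinom(n, 1, 0.5)),
  x = rnorm(n, 10, 5),
  y = rnorm(n, 20, 10)
)

# is.numeric() is FALSE for the factor column, TRUE for the other two
num_cols <- sapply(demo, is.numeric)

# Only numeric columns reach the matrix conversion, so nothing becomes character
row_totals <- rowSums(demo[, num_cols])

# Plain vectorised addition gives the same totals
all.equal(unname(row_totals), demo$x + demo$y)
#[1] TRUE
```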

A second look


I’d like to take a look at another example. This time instead of a factor, we will have one column as a string type, and two numeric columns as before:

df <- data.frame(cbind(paste("subject", 1:n, sep=''), rnorm(n, 10, 5), rnorm(n, 20, 10)))
df
#        X1               X2               X3
#1 subject1 14.6619839711866 6.94472759446703
#2 subject2  11.603910222178 27.6225162121889
#3 subject3 5.21881004622993 20.3409476386206
#4 subject4 16.3574724782284 39.0904723579448
#5 subject5 9.35407053787977 23.8568796326835

We know that apply will coerce its input data to a matrix, so pass in only the numeric columns:

apply(df[,2:3], 1, function(x) x[1] + x[2])
#Error in x[1] + x[2] : non-numeric argument to binary operator

Uh oh!

Let's take a look at the data we are passing in:

a <- df[,2:3]
mode(a[,1])
#[1] "numeric"
as.matrix(a)
#     X2                  X3              
#[1,] "-3.89274205212847" "12.7336046818466"
#[2,] "12.3494043977024"  "17.9329667214396"
#[3,] "4.7419241278816"   "16.0664073330786"
#[4,] "8.50784944656814"  "8.65139145569206"
#[5,] "9.56191506080518"  "21.2114650777001"

What. Why are you strings? mode says you're numeric!

a[,1]
#[1] -3.89274205212847 12.3494043977024  4.7419241278816   8.50784944656814  9.56191506080518
#Levels: -3.89274205212847 12.3494043977024 4.7419241278816 8.50784944656814 9.56191506080518

Why has our numeric data turned into a factor??

Generally at this point, I take a quiet moment to reflect on what I’m doing with my life and why I'm not living in a yurt on a mountain somewhere, I mean people do it and they seem happy enough? Could it really be that bad? But where would I charge my laptop? Oh well, back to the task at hand. I bet the coffee would be terrible as well.

What is going on here? We created a data frame with a column of strings and two numeric columns. Why have we ended up with factors?

Let's take a look at our string column:

df[,1]
#[1] subject1 subject2 subject3 subject4 subject5
#Levels: subject1 subject2 subject3 subject4 subject5

The strings were converted to factors. Checking the data.frame help page, we see a stringsAsFactors option. A likely culprit; let's see how we go:

df <- data.frame(cbind(paste("subject", 1:n, sep=''), rnorm(n, 10, 5), rnorm(n, 20, 10)), stringsAsFactors=FALSE)
as.matrix(df[,2:3])
#     X2                 X3              
#[1,] "7.19530271823023" "26.4186991862312"
#[2,] "13.6715492467442" "25.452128137706"
#[3,] "8.89363806613213" "20.1618970554355"
#[4,] "16.296512734304"  "16.2581582721134"
#[5,] "11.6454577442585" "17.5241594066948"

Why are they still strings?

In this case, the culprit is actually the cbind call, which builds a matrix and so coerces our mixed strings and numerics into strings. When cbind finishes, the resulting character matrix is passed to data.frame, and in our first attempt the now string data got converted to factors due to stringsAsFactors being TRUE by default. With stringsAsFactors=FALSE the columns simply stay as strings.

cbind(paste("subject", 1:n, sep=''), rnorm(n, 10, 5), rnorm(n, 20, 10))
#     [,1]       [,2]                [,3]            
#[1,] "subject1" "14.0342542696833"  "30.5672885598002"
#[2,] "subject2" "8.44141744459018"  "35.1337567509022"
#[3,] "subject3" "11.6550656524794"  "10.1554349193507"
#[4,] "subject4" "18.0303214118231"  "14.9638066872277"
#[5,] "subject5" "0.180686583194847" "11.7124424267387"

Dropping the cbind gives us what we are after

df <- data.frame(paste("subject", 1:n, sep=''), rnorm(n, 10, 5), rnorm(n, 20, 10))
as.matrix(df[,2:3])
#     rnorm.n..10..5. rnorm.n..20..10.
#[1,]       13.557665        33.519719
#[2,]       15.086483        41.457651
#[3,]        7.010492         1.757224
#[4,]       11.008779        29.707944
#[5,]       15.777351        10.280138
apply(df[,2:3], 1, function(x) x[1] + x[2])
#[1] 47.077384 56.544134  8.767716 40.716723 26.057489

Phew.

The cbind call is not necessary when creating a data frame, which is designed to take a variable number of columns as arguments. Its excessive use is a bad habit I picked up learning to navigate the waters of R; I must be sure to use it only when necessary.
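For what it is worth, here is a sketch of the same data frame built with named columns (id, x and y are made-up names). stringsAsFactors is passed explicitly because its default differs between R versions (it was TRUE before R 4.0.0). The last lines show one common way to rescue numbers that have already been turned into a factor; the factor f is a toy example.

```r
n <- 5
# Named arguments become column names; each column keeps its own type
subjects <- data.frame(
  id = paste("subject", 1:n, sep = ""),
  x = rnorm(n, 10, 5),
  y = rnorm(n, 20, 10),
  stringsAsFactors = FALSE
)
class(subjects$id)  # "character"
class(subjects$x)   # "numeric"

# Rescuing numbers stored as a factor (toy data)
f <- factor(c("10.5", "3.2", "7"))
codes <- as.numeric(f)                  # the internal level codes, not the values
values <- as.numeric(as.character(f))   # the values themselves: 10.5, 3.2, 7
```

Going through as.character() first matters because as.numeric() on a factor returns the level codes.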

Summary


As you have seen, type conversion and coercion happen quite frequently, and often you will not even realize it has happened. At least until mysterious error messages start appearing.

When they do, using mode() is a quick first check of what is going on, though as we saw it reports “numeric” for a factor. If you are after specific type information, I am wary of the is.* family: they are useful, but tend to be fairly generous when you are after specifics.
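To compare what the different inspection functions report, a small helper can show them side by side. describe() below is a hypothetical helper written for this sketch, not a base R function:

```r
# Hypothetical helper: collect mode, typeof and class for one object
describe <- function(x) {
  c(mode = mode(x),
    typeof = typeof(x),
    class = paste(class(x), collapse = "/"))
}

f <- factor(c("a", "b", "a"))
describe(f)         # mode "numeric", typeof "integer", class "factor"
describe(c(1, 2))   # mode "numeric", typeof "double",  class "numeric"
describe("hi")      # all three report "character"

is.numeric(f)       # FALSE, even though mode(f) is "numeric"
is.factor(f)        # TRUE
```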

Also be sure to check the help pages for the functions you are using, they do usually note when conversion takes place.

A good rule of thumb is vector and matrix are used for data of the same type, list and data frame are used for data of mixed types.

The reason mode prints “numeric” for factors is that internally, factors are numeric: they are stored as integer codes. They just come with a vector of strings (the factor levels), which are printed out in lieu of the internal numeric representation. You might also notice that calling mode on a data frame tells you it is a list, at which point I make a dry coughing sound and busy myself with a yurt catalogue.
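A short sketch makes the internal representation visible. The factor f is a made-up example; unclass() strips the class so the underlying integer codes are shown:

```r
f <- factor(c("subject1", "subject2", "subject1"))

codes <- as.integer(f)   # 1 2 1: positions in the levels vector
lev <- levels(f)         # "subject1" "subject2"

# Indexing the levels by the codes rebuilds the original strings
rebuilt <- lev[codes]

unclass(f)               # the codes, with the levels attached as an attribute

# A data frame is stored as a list of columns, which is what mode() reports
mode(data.frame(x = 1:2))
#[1] "list"
```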

You can find technical details about types here http://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects

Finally, it is possible to force R to use a specific mode using the storage.mode() function. In this case, values that cannot be represented in the given mode will be converted to NA:

a <- c(1:3, "hi")
storage.mode(a)
#[1] "character"
storage.mode(a) <- 'integer'
#Warning message:
#In storage.mode(a) <- "integer" : NAs introduced by coercion
a
#[1]  1  2  3 NA
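When the NA values need tracking down, one approach (a sketch, with raw as a stand-in for your own data) is to convert explicitly and then ask which entries became NA:

```r
raw <- c("1", "2", "3", "hi")

# suppressWarnings() hides the "NAs introduced by coercion" warning
converted <- suppressWarnings(as.integer(raw))

# Entries that were not NA before conversion but are NA afterwards
bad <- is.na(converted) & !is.na(raw)
which(bad)
#[1] 4
raw[bad]
#[1] "hi"
```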

Code in one file is here.

7 comments:

  1. Have you tried to use lapply() instead of apply() on your examples?

    1. I haven't really as it does not seem to make intuitive sense to do so. I have data in a tabular format (e.g. matrix, data frame), which I want to traverse row-wise and perform operations on subsets of each row.

      It seems to me apply is the right choice, lapply would be a more linear traversal. I don't know though, how would you use lapply in this case?

  2. In general, storage mode is a bad way to check objects. You want to check the attributes of the object instead. For example, structure(a[,1]) in example 2 ("A Second Look") tells you that the columns are factors. attr(a[,1], "class") tells you that the variable is a factor, and attributes(a[,1]) will tell you both the levels and the class.

    1. Hey thanks, I do agree, especially that there are more direct ways in the case of example 2. I was not aware of structure() either, i usually use str() but it can be a little verbose, so thanks for mentioning it. There always seems to be new things to discover in R.

  3. Thanks for this wonderful post. I was stuck with the "non-numeric argument to binary operator" error for some time. The culprit were some string columns as you suggested. Now I can sleep in peace!

  4. Thanks, this post was invaluable to me today. Particularly the bit about questioning life choices, because I was doing the same just minutes ago!

    1. Haha! Thanks for the comment, glad it helped!
