Pages

Monday, September 23, 2013

Building models over rolling time periods

Often I have some idea for a trading system that is of the form “does some particular aspect of the last n periods of data have any predictive use for subsequent periods.”

I generally like to work with nice units of time, such as 4 weeks or 6 months, rather than 30 or 126 days. It probably doesn’t make a meaningful difference in most cases, but it’s nice to have the option.

At first this seemed like something rollapply() from the zoo package could help with, but there are a number of preconditions that need to be met and frankly I find them to be a bit of a pain.

In a nutshell I have not been able to find a nice way for it to apply a function to a rolling subset of time series data nicely aligned to weekly or monthly boundaries.

All is not lost, there is a neat function in xts called endpoints(), which takes a time series and a period (e.g. “weeks”, “months”) and returns the indexes into that time series for the corresponding periods.

Using this information it becomes easy to subset the time set data using normal row subset operations.

The xts package also has period.apply but it runs on non-overlapping intervals, which is close but still not quite what I want.

In the script for this post there are 4 or so functions of note.

The main one is roll_model, which takes a set of data to be subsetted and passed to the model, the size of per model training and test sets and the period size to split things up, which is anything valid for use with endpoints().

A utility function is train_test_split which also uses endpoints() to split a subset of data into 2 sets, one for training the model, one for testing. In practice it needs to be the same period type as you expect to use with roll_model.

The function that actually builds the model and returns some results is run_model(), which calls train_test_split to get the training and test set, builds a model using ksvm in this example, and sees how it goes based on the test set.

Another utility function is called before that, data_prep which builds the main data object to be passed to roll_model. In this example it takes a set of log closes to close returns, sets Y to be the return at time t, X1 the log return at t-1, X2 at t-2 and so on.


The example model is not a particularly useful way of looking at things, which is not surprising given close to close returns are effectively random noise. But perhaps the script itself is useful for other ideas, and if anyone knows better/easier/alternate ways of doing the same thing I would love to hear about them.

The script is available here.

Sunday, September 8, 2013

Snapshot of the global zeitgeist



John le Carré, author of Tinker Tailor Solider Spy (among many others) and Miley Cyrus give a timely reminder that Australia is a small fish in a big pond.

Thursday, September 5, 2013

Type conversion and you (or and R)

Types and type conversion can be a tricky and intricate topic, and sometimes can lead to some real head-scratcher issues in R. Hence a somewhat confusing title.

This is for people still relatively new to R, and I will skip some gory details. Actually I will skip most of them, the canonical source for type and conversion information is the official R documentation, and the help pages for the functions at hand.

Instead I thought I would walk through some examples of when the type engine can behave in seemingly odd ways, and take a look at what is going on when mysterious errors arise and what can be done to track down their source.

What are types. 


Types describe the nature of data R is dealing with, at least as far as R cares. If you want to use the + operator, R needs to look at the data type on either side and work out if + is defined for those, and what type the result should be.

For example 1 + 1 will behave as you might expect, summing the two number and returning a numeric result, but what should R do if you add a character to a matrix? It uses types to answer these questions, and give an error message if no answer can be found.

Let’s get started, first we will create a vector with some numeric data

a <- c(1,2,3)
sum(a)
#[1] 6
a
#[1] 1 2 3

Nothing too exciting, we created a vector with three numeric values, and the builtin sum function behaves as we would expect. Let’s see what happens when we try to mix types in the vector:

a <- c(a, "hi")
a
#[1] "1"  "2"  "3"  "hi"

When we look at the contents of our vector, the numeric values have been converted to strings. This is because a vector is for homogenous storage, that is, it can only hold data of the same type. When we appended the string to it, R converted the numeric values to strings.

This is an example of implicit type conversion aka type coercion. R knows the vector can only hold data of the same type, and since it does not know how to turn ‘hi’ into a numeric value, it has turned the integers into strings. This has done magically (or implicitly) in the background.

What happens when we try to sum our vector?

sum(a)
#Error in sum(a) : invalid 'type' (character) of argument

We get an error, because R does not know how to add strings, at least in the conventional numerical sense of addition.

The general rule of thumb for conversion is “the bigger type wins.” Strings can represent more data than numeric values, so the numeric values are converted to strings. There aren’t that many data types in R, so often you will end up with strings when types are being mixed.

Type conversion is not particularly smart, which is generally a good idea in programming languages. If, instead of “hi”, we passed in the character representation of a numeric one “1”, it still would have converted the existing numeric values to strings, even though in theory the literal 1 could be converted to a numeric type. The type engine just sees strings and numerics, the biggest wins and the numerics are coerced.

Type conversion and apply


Sometimes when using apply and friends, you will get errors that at first glance must clearly be bugs in R. Let’s take a look at another example, this time using a data frame.

Unlike a vector or a matrix, a data frame is a heterogeneous type container; it is possible for it to store columns with different types. We will have five rows, a categorical factor column and two columns of numeric data.

n <- 5
#make some dummy data
df <- data.frame(cbind(rbinom(n, 1, 0.5), rnorm(n, 10, 5), rnorm(n, 20, 10)))
#make the first column a factor
df[,1] <- as.factor(df[,1])
head(df)
#  X1        X2       X3
#1  1  8.911567 27.28325
#2  1  9.933021 13.74879
#3  0 10.177231 20.65490
#4  0  6.368177 27.10183
#5  1 12.084135 14.54369

Now let’s try and sum the two numeric columns using apply

apply(df, 1, function(x) x[2] + x[3] )
#Error in x[2] + x[3] : non-numeric argument to binary operator

That’s weird. I’m pretty sure they are numeric, so addition should work

df[,2] + df[,3]
#[1] 36.19481 23.68181 30.83213 33.47001 26.62783

And it does. What gives?

We can see what datatype R thinks it is working with by using the mode() function.

apply(df, 1, function(x) mode(x))
#[1] "character" "character" "character" "character" "character"

Each row is being passed as character types, not the mixed data types we were expecting (i.e. factor, numeric, numeric).

In a nutshell, apply will first convert our data frame to a matrix before passing the rows to the defined function. Which it says in the help page, and I never took any notice of until things started getting weird.

The matrix container wants to store data all of the same type, and much like our initial vector example, our numeric values are being coerced into strings. Only once this coercion has taken place do the rows of the matrix get passed to the function we supplied apply with.

This is why we see our “non-numeric argument to binary operator” error message. What we thought was numeric data has been converted to character data, which we subsequently try to add.

Just to really drive this home, let’s look at the first row of our data frame:

a <- df[1,]
a
#  X1       X2       X3
#1  1 8.911567 27.28325

Now let’s see what happens when we convert it to a matrix as apply does:

b <- as.matrix(a)
b
#  X1  X2         X3      
#1 "1" "8.911567" "27.28325"
b[2] + b[3]
#Error in b[2] + b[3] : non-numeric argument to binary operator

We can see it has been converted to strings, and we end up with the same error message we saw when we used apply.

In this case the resolution is quite simple, we use the index operator in our call to apply

apply(df[,2:3], 1, function(x) x[1] + x[2])
#[1] 36.19481 23.68181 30.83213 33.47001 26.62783

Which works as we expect. Note in our function we changed the indexes to 1 and 2 as we are now only passing two columns of data to apply.

So, remember that in many cases apply will convert what you pass it to a matrix. The matrix wants all data to be of the same type, and will coerce as required. For the specific conditions see the help page for apply.

A second look


I’d like to take a look at another example. This time instead of a factor, we will have one column as a string type, and two numeric columns as before:

df <- data.frame(cbind(paste("subject", 1:n, sep=''), rnorm(n, 10, 5), rnorm(n, 20, 10)))
#        X1               X2               X3
#1 subject1 14.6619839711866 6.94472759446703
#2 subject2  11.603910222178 27.6225162121889
#3 subject3 5.21881004622993 20.3409476386206
#4 subject4 16.3574724782284 39.0904723579448
#5 subject5 9.35407053787977 23.8568796326835

We know that apply will coerce its input data to a matrix, so pass in only the numeric columns:

apply(df[,2:3], 1, function(x) x[1] + x[2])
#Error in x[1] + x[2] : non-numeric argument to binary operator

Uh oh!

Lets take a look at the data we are passing in

a <- df[,2:3]
mode(a[,1])
#[1] "numeric"
as.matrix(a)
#     X2                  X3              
#[1,] "-3.89274205212847" "12.7336046818466"
#[2,] "12.3494043977024"  "17.9329667214396"
#[3,] "4.7419241278816"   "16.0664073330786"
#[4,] "8.50784944656814"  "8.65139145569206"
#[5,] "9.56191506080518"  "21.2114650777001"

What. Why are you strings? mode says you're numeric!

a[,1]
#[1] -3.89274205212847 12.3494043977024  4.7419241278816   8.50784944656814  9.56191506080518
#Levels: -3.89274205212847 12.3494043977024 4.7419241278816 8.50784944656814 9.56191506080518

Why has our numeric data turned into a factor??

Generally at this point, I take a quiet moment to reflect on what I’m doing with my life and why I'm not living in a yurt on a mountain somewhere, I mean people do it and they seem happy enough? Could it really be that bad? But where would I charge my laptop? Oh well, back to the task at hand. I bet the coffee would be terrible as well.

What is going on here? We created a data frame with a column of strings and two numeric columns. Why have we ended up with factors?

Lets take a look at our string column

df[,1]
#[1] subject1 subject2 subject3 subject4 subject5
#Levels: subject1 subject2 subject3 subject4 subject5

The strings were converted to factors, checking the data.frame help page we see a stringsAsFactors option. A likely culprit, lets see how we go

df <- data.frame(cbind(paste("subject", 1:n, sep=''), rnorm(n, 10, 5), rnorm(n, 20, 10)), stringsAsFactors=FALSE)
as.matrix(df[,2:3])
#     X2                 X3              
#[1,] "7.19530271823023" "26.4186991862312"
#[2,] "13.6715492467442" "25.452128137706"
#[3,] "8.89363806613213" "20.1618970554355"
#[4,] "16.296512734304"  "16.2581582721134"
#[5,] "11.6454577442585" "17.5241594066948"

Why are they still strings?

In this case, the culprit is actually the cbind call, which is coercing our mixed strings and numerics into strings. When cbind finishes, the resulting matrix is passed to data.frame and our now string data gets converted to factors as due to stringsAsFactors being TRUE.

cbind(paste("subject", 1:n, sep=''), rnorm(n, 10, 5), rnorm(n, 20, 10))
#     [,1]       [,2]                [,3]            
#[1,] "subject1" "14.0342542696833"  "30.5672885598002"
#[2,] "subject2" "8.44141744459018"  "35.1337567509022"
#[3,] "subject3" "11.6550656524794"  "10.1554349193507"
#[4,] "subject4" "18.0303214118231"  "14.9638066872277"
#[5,] "subject5" "0.180686583194847" "11.7124424267387"

Dropping the cbind gives us what we are after

df <- data.frame(paste("subject", 1:n, sep=''), rnorm(n, 10, 5), rnorm(n, 20, 10))
as.matrix(df[,2:3])
#     rnorm.n..10..5. rnorm.n..20..10.
#[1,]       13.557665        33.519719
#[2,]       15.086483        41.457651
#[3,]        7.010492         1.757224
#[4,]       11.008779        29.707944
#[5,]       15.777351        10.280138
apply(df[,2:3], 1, function(x) x[1] + x[2])
#[1] 47.077384 56.544134  8.767716 40.716723 26.057489

Phew.

The cbind call is not necessary when creating a data frame, which is designed to take a variable amount of data. It's excessive use is a bad habit I picked up learning to navigate the waters of R, I must be sure to use it only when necessary.

Summary


As you have seen, type conversion and coercion happens quite frequently, and usually you may not even realize it has happened. At least until mysterious error messages start appearing.

When they do, using mode() is the best way to see what is going on. If you are after specific type information I am wary of the is.* family, they are useful, but tend to be fairly generous when you are after specifics.

Also be sure to check the help pages for the functions you are using, they do usually note when conversion takes place.

A good rule of thumb is vector and matrix are used for data of the same type, list and data frame are used for data of mixed types.

The reason mode prints “numeric” for factors is that internally, factors are numeric. They just come with a list of strings (the factor levels), which are printed out in lieu of the internal numeric representation. You might also notice that calling mode on a matrix tells you it is a list, at which point I make a dry coughing sound and busy myself with a yurt catalogue.

You can find technical details about types here http://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects

Finally, it is possible to force R to use a specific mode using the storage.mode() function. In this case values that cannot be represented in the given mode will be converted to NA

a <- c(1:3, "hi")
storage.mode(a)
#[1] "character"
storage.mode(a) <- 'integer'
#Warning message:
#In storage.mode(a) <- "integer" : NAs introduced by coercion
a
#[1]  1  2  3 NA

Code in one file is here.