Friday, October 25, 2013

Another Rcpp/RcppArmadillo success story

I finally had an excuse to try out RcppArmadillo and was amazed at just how well it worked out.

The exact details are a bit beyond a blog post, but basically I had a matrix of roughly 80,000x2 double valued samples (call it A) and needed to find where in another, much smaller matrix (call it B) each of the samples was closest too.

There was some structure to the B matrix, which is 500x2, so it wasn't necessary to iterate through all 500, each lookup jumped around a bit and took 8 iterations.

The pseduocode in R looked a bit like this

find_closest <- function(A, B) {
    for(i in 1:8) {
        ...voodoo...
    }
    return index into B that A is closest too
}

calc_distance <- function(A, B)  apply(A, 1, find_closest, B=B) 

All up it took about 9.5 seconds on my system, and 15 seconds on another.

The first pass I implemented find_closest using RcppArmadillo, and saw a healthy speed up to around 400 ms, which was a big improvement.

Then I realised I might as well do calc_distance/the apply in C++ as it was just a simple vector of integers being returned.

This gave an amazing performance leap, the function now takes around 12 milliseconds all up, down from 9.5 seconds. On another machine it would take 15 seconds, and ended up taking 10 milliseconds.

I was very surprised at this. I haven't have a chance to dig into the details, but I am assuming there is a reasonable amount of overhead passing the data from R to RcppArmadillo. In the case of apply, this would be incurred for every row/column the apply was running find_closest on. By moving the apply to C++, all the data was passed from R to C++ only once, giving the large speedup. Or so I guess.

I tried two versions, one that traversed A & B row wise, and the other by column. The column one generally seemed faster, 12 ms vs 19 ms for rows. According to the Arma docs it stores matrices in column-major order, which might explain that difference.

Would appreciate any comments or pointers to documentation so I can better understand what is going on under the hood there.



The case for data snooping

When we are backtesting automated trading systems, accidental data snooping or look forward errors are an easy mistake to make. The nature of the error in this context is making our predictions using the data we are trying to predict. Typically, it comes from a mistake with our calculations of time offsets somewhere.

However, it can be a useful tool. If we give our system perfect forward knowledge:

1) We establish an upper bound for performance.
2) We can get a quick read if something is worth pursuing further, and
3) It can help highlight other coding errors.

The first two are pretty closely related. If our wonderful model is built using the values it is trying to predict, and still performs no better than random guessing, it’s probably not worth the effort trying to salvage it.

The flip side is when it performs well, that will be as good as it will ever get.

There are two main ways it can help identifying errors. Firstly, if our subsequent testing on non-snooped data provides comparable performance, we probably have another look ahead bug lurking somewhere.


Secondly, things like having amazing accuracy yet still performing poorly is another sign of a bug lurking somewhere.

Example

I wanted to compare SVM models when trained with actual prices vs a series of log returns, using the rolling model code I put up earlier. As a baseline, I also added in a 200 day simple moving average model.

(S) Indicates snooped data

A few things strike me about this.

For the SMA system, peeking ahead by a day only provides a small increase in accuracy. Given the longer-term nature of the 200 day SMA this is probably to be expected.

For the SVM trained systems, the results are somewhat contradictory.

For the look forward models, training on price data had much lower accuracy than the log returns, and the log return model performed much better. Note that both could have achieved 100% accuracy by predicting its first column of training data.

However, when not snooping, the models trained on closing prices did much better than those trained on returns. I’m not 100% sure there isn’t still some bug lurking somewhere, but hey if the code was off it would’ve shown up in the forward tested results no?

Feel free to take a look and have a play around with the code, which is up here.