Friday, October 25, 2013

Another Rcpp/RcppArmadillo success story

I finally had an excuse to try out RcppArmadillo and was amazed at just how well it worked out.

The exact details are a bit beyond a blog post, but basically I had a matrix of roughly 80,000x2 double valued samples (call it A) and needed to find where in another, much smaller matrix (call it B) each of the samples was closest too.

There was some structure to the B matrix, which is 500x2, so it wasn't necessary to iterate through all 500, each lookup jumped around a bit and took 8 iterations.

The pseduocode in R looked a bit like this

find_closest <- function(A, B) {
    for(i in 1:8) {
    return index into B that A is closest too

calc_distance <- function(A, B)  apply(A, 1, find_closest, B=B) 

All up it took about 9.5 seconds on my system, and 15 seconds on another.

The first pass I implemented find_closest using RcppArmadillo, and saw a healthy speed up to around 400 ms, which was a big improvement.

Then I realised I might as well do calc_distance/the apply in C++ as it was just a simple vector of integers being returned.

This gave an amazing performance leap, the function now takes around 12 milliseconds all up, down from 9.5 seconds. On another machine it would take 15 seconds, and ended up taking 10 milliseconds.

I was very surprised at this. I haven't have a chance to dig into the details, but I am assuming there is a reasonable amount of overhead passing the data from R to RcppArmadillo. In the case of apply, this would be incurred for every row/column the apply was running find_closest on. By moving the apply to C++, all the data was passed from R to C++ only once, giving the large speedup. Or so I guess.

I tried two versions, one that traversed A & B row wise, and the other by column. The column one generally seemed faster, 12 ms vs 19 ms for rows. According to the Arma docs it stores matrices in column-major order, which might explain that difference.

Would appreciate any comments or pointers to documentation so I can better understand what is going on under the hood there.


  1. If you really need to speed it up, you could first sort both dataframes. And then look up the nearest values by binary search.
    After that, compare only some values (not all the dataframe).

    1. Yes those are very sensible suggestions. As it is there was a bit more going on than I really wrote about, which for various reasons made those approaches a bit more difficult.