Monday, December 17, 2012

Density Plot with ggplot

This is a follow on from the post Using apply sapply and lappy in R.

The dataset we are using was created like so:

m <- matrix(data=cbind(rnorm(30, 0), rnorm(30, 2), rnorm(30, 5)), nrow=30, ncol=3)

Three columns of 30 observations, normally distributed with means of 0, 2 and 5. We want a density plot to compare the distributions of the three columns using ggplot.

First let's give our matrix some column names:

colnames(m) <- c('method1', 'method2', 'method3')
#         method1    method2  method3
#[1,]  0.06288358  2.7413567 4.420209
#[2,] -0.11240501  3.4126550 4.827725
#[3,]  0.02467713  1.0868087 4.044101

ggplot has a nice function to display just what we were after geom_density and it's counterpart stat_density which has more examples. 

ggplot likes to work on data frames and we have a matrix, so let's fix that first

df <-
#       method1    method2  method3
#1   0.06288358  2.7413567 4.420209
#2  -0.11240501  3.4126550 4.827725
#3   0.02467713  1.0868087 4.044101
#4  -0.73854932 -0.4618973 3.668004

Enter stack

What we would really like is to have our data in 2 columns, where the first column contains the data values, and the second column contains the method name. 

Enter the base function stack, which is a great little function giving just what we need:

dfs <- stack(df)
#        values     ind
#1   0.06288358 method1
#2  -0.11240501 method1
#88  5.55704736 method3
#89  6.40128267 method3
#90  3.18269138 method3

We can see the values are in one column named values, and the method names (the previous column names) are in the second column named ind. We can confirm they have been turned into a factor as well:

#[1] TRUE

stack has a partner in crime, unstack, which does the opposite:

#       method1    method2  method3
#1   0.06288358  2.7413567 4.420209
#2  -0.11240501  3.4126550 4.827725
#3   0.02467713  1.0868087 4.044101
#4  -0.73854932 -0.4618973 3.668004

Back to ggplot

So, lets try plot our densities with ggplot:

ggplot(dfs, aes(x=values)) + geom_density()

The first argument is our stacked data frame, and the second is a call to the aes function which tells ggplot the 'values' column should be used on the x-axis.

However, our plot is not quite looking how we wish:


We want to group the values by each method used. To do this we will use the 'ind' column, and we tell ggplot about this by using aes in the geom_density call:

ggplot(dfs, aes(x=values)) + geom_density(aes(group=ind))

This is getting closer, but it's not easy to tell each one apart. Let's try colour the different methods, based on the ind column in our data frame.

ggplot(dfs, aes(x=values)) + geom_density(aes(group=ind, colour=ind))

Looking better. I'd like to have the density regions stand out some more, so will use fill and an alpha value of 0.3 to make them transparent.

ggplot(dfs, aes(x=values)) + geom_density(aes(group=ind, colour=ind, fill=ind), alpha=0.3)

That is much more in line with what I wanted to see. Note that the alpha argument is passed to geom_density() rather than aes().

That's all for now.

Using apply, sapply, lapply in R

This is an introductory post about using apply, sapply and lapply, best suited for people relatively new to R or unfamiliar with these functions. There is a part 2 coming that will look at density plots with ggplot, but first I thought I would go on a tangent to give some examples of the apply family, as they come up a lot working with R.

I have been comparing three methods on a data set. A sample from the data set was generated, and three different methods were applied to that subset. I wanted to see how their results differed from one another.

I would run my test harness which returned a matrix. The columns values were the metric used for evaluation of each method, and the rows were the results for a given subset. We have three columns, one for each method, and lets say 30 rows, representing 30 different subsets that the three methods were applied to. 

It looked a bit like this

        method1  method2    method3 
[1,] 0.05517714 0.014054038 0.017260447
[2,] 0.08367678 0.003570883 0.004289079
[3,] 0.05274706 0.028629661 0.071323030
[4,] 0.06769936 0.048446559 0.057432519
[5,] 0.06875188 0.019782518 0.080564474 
[6,] 0.04913779 0.100062929 0.102208706

We can simulate this data using rnorm, to create three sets of observations. The first has mean 0, second mean of 2, third of mean of 5, and with 30 rows.

m <- matrix(data=cbind(rnorm(30, 0), rnorm(30, 2), rnorm(30, 5)), nrow=30, ncol=3)


When do we use apply? When we have some structured blob of data that we wish to perform operations on. Here structured means in some form of matrix. The operations may be informational, or perhaps transforming, subsetting, whatever to the data.

As a commenter pointed out, if you are using a data frame the data types must all be the same otherwise they will be subjected to type conversion. This may or may not be what you want, if the data frame has string/character data as well as numeric data, the numeric data will be converted to strings/characters and numerical operations will probably not give what you expected. 

Needless to say such circumstances arise quite frequently when working in R, so spending some time getting familiar with apply can be a great boon to our productivity.

Which actual apply function and which specific incantion is required depends on your data, the function you wish to use, and what you want the end result to look like. Hopefully the right choice should be a bit clearer by the end of these examples.

First I want to make sure I created that matrix correctly, three columns each with a mean 0, 2 and 5 respectively. We can use apply and the base mean function to check this. 

We tell apply to traverse row wise or column wise by the second argument. In this case we expect to get three numbers at the end, the mean value for each column, so tell apply to work along columns by passing 2 as the second argument. But let's do it wrong for the point of illustration:

apply(m, 1, mean)
# [1] 2.408150 2.709325 1.718529 0.822519 2.693614 2.259044 1.849530 2.544685 2.957950 2.219874
#[11] 2.582011 2.471938 2.015625 2.101832 2.189781 2.319142 2.504821 2.203066 2.280550 2.401297
#[21] 2.312254 1.833903 1.900122 2.427002 2.426869 1.890895 2.515842 2.363085 3.049760 2.027570

Passing a 1 in the second argument, we get 30 values back, giving the mean of each row. Not the three numbers we were expecting, try again. 

apply(m, 2, mean)
#[1] -0.02664418  1.95812458  4.86857792

Great. We can see the mean of each column is roughly 0, 2, and 5 as we expected.

Our own functions

Let's say I see that negative number and realise I wanted to only look at positive values. Let's see how many negative numbers each column has, using apply again:

apply(m, 2, function(x) length(x[x<0]))
#[1] 14  1  0

So 14 negative values in column one, 1 negative value in column two, and none in column three. More or less what we would expect for three normal distributions with the given means and sd of 1. 

Here we have used a simple function we defined in the call to apply, rather than some built in function. Note we did not specify a return value for our function. R will magically return the last evaluated value. The actual function is using subsetting to extract all the elements in x that are less than 0, and then counting how many are left are using length

The function takes one argument, which I have arbitrarily called x. In this case x will be a single column of the matrix. Is it a 1 column matrix or a just a vector? Let's have a look:

apply(m, 2, function(x) is.matrix(x))

Not a matrix. Here the function definition is not required, we could instead just pass the is.matrix function, as it only takes one argument and has already been wrapped up in a function for us. Let's check they are vectors as we might expect.

apply(m, 2, is.vector)

Why then did we need to wrap up our length function? When we want to define our own handling function for apply, we must at a minimum give a name to the incoming data, so we can use it in our function.

apply(m, 2, length(x[x<0]))
#Error in : object 'x' not found

We are referring to some value x in the function, but R does not know where that is and so gives us an error. There are other forces at play here, but for simplicity just remember to wrap any code up in a function. For example, let's look at the mean value of only the positive values:

apply(m, 2, function(x) mean(x[x>0]))
#[1] 0.4466368 2.0415736 4.8685779

Using sapply and lapply

These two functions work in a similar way, traversing over a set of data like a list or vector, and calling the specified function for each item.

Sometimes we require traversal of our data in a less than linear way. Say we wanted to compare the current observation with the value 5 periods before it. Use can probably use rollapply for this (via quantmod), but a quick and dirty way is to run sapply or lapply passing a set of index values.

Here we will use sapply, which works on a list or vector of data. 

sapply(1:3, function(x) x^2)
#[1] 1 4 9

lapply is very similar, however it will return a list rather than a vector:

lapply(1:3, function(x) x^2)
#[1] 1
#[1] 4
#[1] 9

Passing simplify=FALSE to sapply will also give you a list:

sapply(1:3, function(x) x^2, simplify=F)
#[1] 1
#[1] 4
#[1] 9

And you can use unlist with lapply to get a vector.

unlist(lapply(1:3, function(x) x^2))
#[1] 1 4 9

However the behviour is not as clean when things have names, so best to use sapply or lapply as makes sense for your data and what you want to receive back. If you want a list returned, use lapply. If you want a vector, use sapply.

Dirty Deeds

Anyway, a cheap trick is to pass sapply a vector of indexes and write your function making some assumptions about the structure of the underlying data. Let's look at our mean example again:

sapply(1:3, function(x) mean(m[,x]))
[1] -0.02664418  1.95812458  4.86857792

We pass the column indexes (1,2,3) to our function, which assumes some variable m has our data. Fine for quickies but not very nice, and will likely turn into a maintainability bomb down the line. 

We can neaten things up a bit by passing our data in an argument to our function, and using the special argument which all the apply functions have for passing extra arguments:

sapply(1:3, function(x, y) mean(y[,x]), y=m)
#[1] -0.02664418  1.95812458  4.86857792

This time, our function has 2 arguments, x and y. The x variable will be as it was before, whatever sapply is currently going through. The y variable we will pass using the optional arguments to sapply

In this case we have passed in m, explicitly naming the y argument in the sapply call. Not strictly necessary but it makes for easier to read & maintain code. The y value will be the same for each call sapply makes to our function. 

I don't really recommend passing the index arguments like this, it is error prone and can be quite confusing to others reading your code. 

I hope you found these examples helpful. Please check out part 2 where we create a density plot of the values in our matrix.

If you are working with R, I have found this book very useful day-to-day R Cookbook (O'Reilly Cookbooks)

Tuesday, August 28, 2012

Generative Sphere 2

Hey all, thought I would take my fancy new hardware for a spin and made this

For those interested, it uses 2 GLSL Shaders, one that takes the spectral curve as input and uses a ping-pong FBO to calculate the offsets, and the other to turn those offsets into displacement, colour etc.

Tuesday, July 31, 2012

Apple is really cool

My secret shame

I am a hardware luddite, and have basically zero interest in gadgets for gadgets sake. Only in February this year did I upgrade from my Nokia 6600. I am also terrible at assembling things. Once I bought a plastic peg basket consisting of three separate parts, two sides and a flexible body bit you snapped in to the sides. Despite the apparent simplicity I somehow managed to fail while attempting its assembly. Got it the second time though. In my high school woodshop class, my final project got an A for the idea and a C for the execution ... it's just not my forte.

My new macbook finally arrived. When I am looking for hardware, I try to do the bare minimum to ensure what that I buy will do what I actually need, and that's it. I only really upgraded as I needed some specific hardware for some graphics stuff.

So for a one paragraph review, and as someone who doesn't really care about hardware, it is amazing. It is light, quiet and incredibly fast. The retina display is markedly better than anything I have seen, it is more or less impossible to not notice the difference, despite how little one thinks they care about such matters.


One of the things that has engaged my time the last month or so was visiting my parents and cleaning out a lot of my old junk, which has lain dormant at their house for who knows how many years. Although it pained me greatly at the time, I threw out the following machines:

These are all UNIX workstations. If you don't know what that is don't worry, they are all horribly obsolete and all of the companies that made them are either bust, bought out or have abandoned the ideas behind them. 

The second one from the bottom of that stack is a HP Visualize C200 running HP-UX. It's huge and clocks in at 17 kg, all for a 200 MHz cpu. As a ballpark reference the A5 cpu in the iPhone can run at 800 - 1000 Mhz. The blue machines come from SGI, and behind them are 2 CRT monitors also from SGI which weigh about 20kg each. All up it's probably over 100 kg of hardware.

Having spent so much time with these machines, this macbook sitting next to me is seems other worldly. It is orders of magnitude faster than all of those machines combined. Due to the larger screen, it is bigger than my current laptop yet lighter, and height wise it's about half the width of my thumb. I can't help but be impressed by it's engineering, how did they get so much stuff in something so small and light?   

Apple is cool for other reasons

I spent some time the last few days seeing just how much of an outlier Apple really is relative to other companies. It can be a little hard to get your head around why that may be, but this is how I think of it, and why I think they will continue to be successful for the foreseeable future.

Take the microcosm of a free to air TV station as an example. In a nutshell, a production company will make a show and license it to the station for a fee. The station will show it for free and sell advertising to companies wanting to reach the viewers who watch it.

Now instead, imagine that the production company bought the equipment to make the show from the station, paid the station to broadcast its show, and gives the station a cut of any revenue it generates. That revenue is generated from consumers who also bought their TV from station. Throw some advertisers giving the station money in the mix there as well. Actual TV is not yet where it could be, but for software and music, that's how it works for Apple. I can't help but find it kinda cool. TV and movies are there in some form, but so far only nerds and youngish people watch that stuff on their computer.

This also says nothing about Apple's control over supply of the components used to make their hardware, as well as their patents. There's a reason why the trackpad on every other brand of laptop sucks. The success could easily breed complacency, but that hasn't happened yet. They make good products that do what they say on the box, which is something that seems simple but so many other companies fail at. 

No company is infallible, but they have a very strong position and have made a great success out of a new way to do things. I don't know of any other company that has been so successful while bringing about such a huge change (and it is a huge change). Amazon and Google have done a lot too, but I think Apple has gone further. I would love to hear of any others.

Monday, July 30, 2012

Multidimensional Scaling and Company Similarity

Background and idea

Often we are looking at a particular sector, and want to get a quick overview of a group of companies relative to one another. I thought I might apply Multidimensional Scaling (MDS) to various financial ratios and see if it gave us anything useful.

The premise is that companies in similar industries should all have a degree of sameness, so MDS might be useful to highlight the companies that stand out from the crowd, perhaps in some literal sense ...


I mostly use the data functions from quantmod to retrieve the financial statements from Google Finance. As always with free data, the quality is variable, but good enough for our purpose today. We need to do a bit of dancing to get the market price at the time the results were released, and this uses data from Yahoo Finance. It was a little bit more work to implement, but worth it so we can include P/E in the comparison.

I looked at two groups of companies, tech stocks and financials/banks.

For the tech stocks I used ROE, EPS, P/E, Operating Margin, Current Ratio, Gearing, Asset Turnover and Debt Ratio. For the financials, I used ROE, EPS, P/E, Gearing and Debt Ratio, mainly because the data available did not have the line items required to calculate the other ratios. 

The data from Google gives the last four periods, with the most recent coming first. It also gives Annual and Quarterly data and the charts below use the annual results. Annual Period 1 means the most recent results. Due to the scaling function, the actual scales on the graphs are not particularly meaningful, so I took them out. 


These are the charts for the most recent results (so end of year 2011). Overall, I am quite pleased with the results. We can see how most of the companies cluster together, while a few seem to be quite different. This shows at a glance the companies that might be worthy of further investigation. 

Tech Stocks



Code is up here MDS Company Similarity with R, it should hopefully be documented enough for others to mess around with. Any questions, comments or suggestions are very much appreciated as always.

As an aside, this is the first R program I wrote devoid of any for loops. I finally feel I am coming to grips with the language.

Tuesday, July 24, 2012

Particle swarm and Perlin noise

It has been a little too quiet here for my liking. For better or worse I have just been caught up with real life matters, mostly boring stuff like studying and moving and helping people move, so not even any juicy gossip to share.

Actually last weekend my credit card number somehow got stolen, which meant the new laptop I had on order has been postponed while I wait for my new card. I had been planning to do a bunch more stuff once it arrived as my current one is showing its age a bit, but I ordered it in June and there was a 4 week wait till it shipped, then the card thing happened the day before it was going to ship! But that's about it.

I did manage to get this done over the weekend, it's 150k particles traversing a vector field generated from Perlin Noise. When the new machine arrives I want to delve into GPU programming and 1 million particles is the goal.

Anyway, it's a bit cold and wet here right now, hope you're somewhere sunny and warm.

Music is Olson by Boards of Canada.

Sunday, May 20, 2012

Another cut at market randomness

I have some background in computer security and one day found myself tasked with assessing the quality of randomness for session id tokens generated by popular web frameworks (namely Java and .NET). As it turns out, NIST have developed a series of tests for just this purpose detailed here.

As a non-believer in the absolute randomness of markets, I thought I might take a look at the series of returns from SPX/GSPC to see how they held up.

About the tests

The NIST tests are designed to be run on bit streams from a random number generator, so some liberties had to be taken. I am mainly interested in the predictability of up days vs down days, so I first generated a series of returns using quantmod then encoded them using a 1 for an up day and -1 for a down day and run the tests on the encoded stream. The NIST tests use -1 in place of a 0.

The main test is called the monobit test, to quote from the document:
The focus of the test is the proportion of zeroes and ones for the entire sequence.  The purpose of this test is to determine whether the number of ones and zeros in a sequence are approximately the same as would be expected for a truly random sequence.  The test assesses the closeness of the fraction of ones to ½, that is, the number of ones and zeroes in a sequence should be about the same.  
The tests are done using a significance level of 0.01, so come with a good degree in confidence assuming the underlying method is sound. 

One caveat is the length of the runs compared and how it relates to the distributions used to model the results. For the monobit test, the suggested input size is 100, requiring 101 days of data to determine 100 up or down days. If we were looking at 32 bit integers, 100 bits would only be 3 "full" random numbers, so arguably we would want to look at shorter time periods (e.g. 3-5 days of data). Given the difficulties around distributions which require a large n, I thought I would vary the significance level instead of a lower n, as our requirements are not as stringent as those for cryptographic random numbers.   


At a basic level, this series does appear to be random, at least the vast majority of the time with n = 100 and alpha = 0.01. My confirmation bias was very upset with this. 

However, if we plot the proportion of runs deemed random vs the significance level, we see the proportion rising as one might expect. One thing that remains unexplained is why this appears to rise in steps rather than something more linear, though I expect this to be a side affect of either methodology or the normalisation done by the tests. I also took a look at the weekly data, which tends to a greater proportion of non-random runs quicker than daily data.

I am interested in the applications of machine learning to financial markets. Close to close returns that we have been looking at here are not the only information we have available, nor are they what I trade on a personal level. Also this is only one price series, and one could argue that in practise it is not actually a tradable series. 

Close to close returns are very useful in lots of applications, but if we are trying to build some predictive model we might need to look for more predictable pastures. Machine learning algorithms are great, but can only do so much. Finding some better potential inputs is what I will take a look at next.

The test code is available on github here: R Monobit test. Would be very interested to hear if anyone else takes a look.

I also took a visual and binomial look at randomness of the series in this post: A visual look at market randomness.

Oh and in case you were wondering, the web session tokens all turned out to be very strong.

A visual look at market randomness

I recently did some statistical testing to see if markets were random (details in the post Another cut at market randomness). It turns out they were, at least close to close returns for SPX/GSPC. My confirmation bias wasn't going to stand for that, so I thought about taking a different look.

Two things interested me. Firstly, I am looking at up vs down (i.e a higher or lower close than previous), rather than trying to predict an exact price. If markets are random then up or down have a probability of 0.5 each and are independent. A run of 5 consecutive ups or downs has a probability of 0.03125 or roughly 3%. How would that pan out looking at historical data?

Secondly, how could one visualise seemingly random data without it ending up looking like noise?

I came up with the following chart:

Each square represents one week and each line represents one year. If the close was higher than the previous week, it is blue, otherwise it is red. As the count of successive higher or lower weeks rise, the boxes get deeper in colour, up to a maximum of 5. As a side effect of date calculations and the definition of "week" some years have 53 weeks, which is why some lines are longer than others.

In total there were 138 runs of 5 weeks in the same direction out of 2208 samples, or around 6%, roughly double what we might expect.

Looking at that, I wondered what it would look like comparing weeks across years, comparing week 1 of year n with week 1 of year n + 1. That lead to the second chart:

This time we had 175 runs of length 5 out of 2208, just under 8%, again quite a bit more than the 3% we were expecting.  

That is all well and good, but these charts only represent the direction of the week to week moves, not the magnitude of the moves which is probably more important. Finally I took a look at the return over 5 periods. 

Again if it is positive the squares are blue, negative they are red. The colours are scaled as a proportion of the largest positive and negative returns for blue and red squares respectively. The very pale squares are where the returns were proportionally so close to zero they would not otherwise be visible, so I set a minimum level to ensure they displayed.

We can see that positive returns tend to follow positive returns and vice versa, at least for this 5 week look back period. This is somewhat deceptive as a negative return, though negative, may still be higher than the previous one implying a loss. 

What does all this mean? Not too much in practise, as it is another thing to know in advance if a series will have consecutive up days or down days. In this case a tradable edge is not so easily won.

However, it does reflect my understanding of how prices move a little better, in that they trend for a while then range for a while and vice versa, and things may not be as random as we might expect. My confirmation bias somewhat sated. 

The charts were done in Processing using the free weekly data from Yahoo! finance for GSPC. If you would like a chart for a given ticker, let me know.

Sunday, April 22, 2012

Friday, April 13, 2012

Mebane Faber Tactical Asset Allocation in R

In 2006 Mebane Faber published a great piece of research detailing an asset allocation system that was both very easy to understand and implement, as well as carrying very respectable risk adjusted returns.

The details are available in his paper A Quantitative Approach to Tactical Asset Allocation and were further expanded on in his book The Ivy Portfolio both of which are must reads.

The short version is to use diversified asset classes, long only, and only long when the price is above the 10 month simple moving average (approx 200 day). The assets he tests are U.S. Stocks, International Stocks, U.S. Government Bonds, Commodities and Real Estate, accessible via ETFs.

A rotational extension can also be added by investing only in the top 1-3 asset classes showing some degree of relative strength, which is defined as the average of 3, 6 and 12 month returns. They must also be over the 10 month SMA to be candidates.

The system updates monthly at the end of the month, it is about as hands off as you can get for active management.

There is an ETF for those so inclined, GTAA, but I am experimenting with a put selling implementation, which I might start tracking here month to month. I wrote a small R script using quantmod to display the relevant information for given symbols, which should be available here: Tactical Asset Allocation R script

The output looks like this:

  Sym         R3m         R6m        R12m  Close     AvgRet OverMA
4 VNQ  0.09295631  0.22412597  0.08488552  63.65 0.13398927   TRUE
1 VTI  0.11671109  0.22466699  0.05037598  72.26 0.13058469   TRUE
2 VEU  0.10908623  0.13282091 -0.10915250  44.22 0.04425155   TRUE
5 DBC  0.07048208  0.11194076 -0.05767911  28.80 0.04158124   TRUE
3 IEF -0.02193049 -0.01718305  0.10473673 103.28 0.02187440   TRUE

Let me know if you have any comments or find any bugs. 

Sunday, February 12, 2012

Machine Learning Examples in R

This is a post that has been a long time in the making. Following on from the excellent Stanford Machine Learning Course I have made examples of the main algorithms covered in R.

We have Linear Regression

Followed by Neural Networks
And Support Vector Machines

One remaining item is Logistic Regression, I am yet to find a library in R that behaves as I want, so that will come at some future date. I've been sitting on this post for ages and got sick of waiting. As an aside I find the documentation in R to be variable at best, which can make it somewhat of a pain to work with. When it is good, yes it can be very good but often it is quite poor ...

R is great for data analysis and exploration, but I have found myself moving back to python for many more mundane tasks.

Anyway for those interested in the code, I have put it on Github. The data is from an exercise in the Stanford course, and by tweaking the parameters I really got a good feel for how the various algorithms work in practise.

Once I finish my backtesting engine I will probably put it up on Github as well, and then I can start digging into the applications of ML techniques for trading systems.

Friday, February 10, 2012

A short diversion

I have been busy learning machine learning techniques, writing a market data replay/limit book backtesting framework in python, and messing around with the Processing graphics environment.

More to come on the first two later, but here is a sketch of something I made in Java/Processing