Shifting sands: Density Plot with ggplot

This is a follow-on from the post Using apply sapply and lapply in R.

The dataset we are using was created like so:

m <- matrix(data=cbind(rnorm(30, 0), rnorm(30, 2), rnorm(30, 5)), nrow=30, ncol=3)

Three columns of 30 observations, normally distributed with means of 0, 2 and 5. We want a density plot to compare the distributions of the three columns using ggplot.
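Because rnorm draws fresh random numbers on every run, the values printed in this post will not match yours. Here is a minimal sketch of a reproducible version, assuming an arbitrary seed of 42 (the seed and the names m_wrapped and m_direct are illustrative, not from the original code). It also suggests that cbind already returns a 30 x 3 matrix, so the surrounding matrix() call is optional:

```r
# The seed value is an arbitrary choice for illustration
set.seed(42)
m_wrapped <- matrix(data = cbind(rnorm(30, 0), rnorm(30, 2), rnorm(30, 5)),
                    nrow = 30, ncol = 3)

# Re-seed so the same random numbers are drawn again
set.seed(42)
m_direct <- cbind(rnorm(30, 0), rnorm(30, 2), rnorm(30, 5))

dim(m_direct)               # 30 rows, 3 columns
all(m_wrapped == m_direct)  # TRUE: both constructions hold the same values
colMeans(m_direct)          # sample means, roughly 0, 2 and 5
```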

First let's give our matrix some column names:

colnames(m) <- c('method1', 'method2', 'method3')

head(m)
#          method1   method2  method3
#[1,]  0.06288358 2.7413567 4.420209
#[2,] -0.11240501 3.4126550 4.827725
#[3,]  0.02467713 1.0868087 4.044101

ggplot has a nice function that displays just what we are after: geom_density, and its counterpart stat_density, which has more examples.

ggplot likes to work on data frames and we have a matrix, so let's fix that first:

df <- as.data.frame(m)
head(df, 4)
#      method1    method2  method3
#1  0.06288358  2.7413567 4.420209
#2 -0.11240501  3.4126550 4.827725
#3  0.02467713  1.0868087 4.044101
#4 -0.73854932 -0.4618973 3.668004

Enter stack

What we would really like is to have our data in 2 columns, where the first column contains the data values, and the second column contains the method name.

Enter the base function stack, which is a great little function giving just what we need:

dfs <- stack(df)
dfs
#        values     ind
#1   0.06288358 method1
#2  -0.11240501 method1
#…
#88  5.55704736 method3
#89  6.40128267 method3
#90  3.18269138 method3

We can see the values are in one column named values, and the method names (the previous column names) are in the second column named ind. We can confirm they have been turned into a factor as well:

is.factor(dfs[,2])
#[1] TRUE
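To make clear what stack is doing, here is a rough hand-built equivalent. It is a sketch on a small made-up data frame (toy, manual and stacked are illustrative names), not a description of how stack is implemented internally:

```r
# Small made-up data frame standing in for df
toy <- data.frame(method1 = c(0.1, -0.2),
                  method2 = c(2.3, 1.9),
                  method3 = c(5.1, 4.8))

# Build the long format by hand: all values in one column,
# plus a factor recording which column each value came from
manual <- data.frame(
  values = unlist(toy, use.names = FALSE),
  ind    = factor(rep(names(toy), each = nrow(toy)), levels = names(toy))
)

stacked <- stack(toy)

all(manual$values == stacked$values)                        # TRUE
all(as.character(manual$ind) == as.character(stacked$ind))  # TRUE
```

The values are laid out column by column, so the first nrow(toy) rows belong to method1, the next to method2, and so on.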

stack has a partner in crime, unstack, which does the opposite:

unstack(dfs)
#      method1    method2  method3
#1  0.06288358  2.7413567 4.420209
#2 -0.11240501  3.4126550 4.827725
#3  0.02467713  1.0868087 4.044101
#4 -0.73854932 -0.4618973 3.668004
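A quick way to convince yourself that unstack really undoes stack is a round trip. This sketch uses a small made-up data frame (toy and round_trip are illustrative names) and compares the result with the input column by column:

```r
# Small made-up data frame with equal-length columns
toy <- data.frame(method1 = c(0.1, -0.2, 0.4),
                  method2 = c(2.3, 1.9, 2.2),
                  method3 = c(5.1, 4.8, 5.5))

# Stack into long format, then unstack back into wide format
round_trip <- unstack(stack(toy))

# Same column names in the same order
identical(names(round_trip), names(toy))

# Same values in every column
all(mapply(function(a, b) isTRUE(all.equal(a, b)), round_trip, toy))
```

The round trip gives back a data frame here because every method has the same number of observations; with unequal group sizes unstack returns a list instead.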

Back to ggplot

So, let's try plotting our densities with ggplot:

library(ggplot2)
ggplot(dfs, aes(x=values)) + geom_density()

The first argument is our stacked data frame, and the second is a call to the aes function which tells ggplot the 'values' column should be used on the x-axis.

However, the plot is not quite what we want, because all 90 values are pooled into a single density curve:

Hmm.

We want to group the values by each method used. To do this we will use the 'ind' column, and we tell ggplot about this by using aes in the geom_density call:

ggplot(dfs, aes(x=values)) + geom_density(aes(group=ind))

This is getting closer, but it's not easy to tell each one apart. Let's try colouring the different methods, based on the ind column in our data frame.

ggplot(dfs, aes(x=values)) + geom_density(aes(group=ind, colour=ind))
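As far as I understand ggplot2's grouping rules, mapping a discrete variable such as ind to colour already splits the data into groups, so the explicit group=ind is optional at this point. Here is a sketch that checks this by inspecting the computed layer data, assuming ggplot2 is installed and using made-up data (demo, p_explicit and p_implicit are illustrative names):

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, for reproducibility only
demo <- stack(data.frame(method1 = rnorm(30, 0),
                         method2 = rnorm(30, 2),
                         method3 = rnorm(30, 5)))

p_explicit <- ggplot(demo, aes(x = values)) +
  geom_density(aes(group = ind, colour = ind))
p_implicit <- ggplot(demo, aes(x = values)) +
  geom_density(aes(colour = ind))

# layer_data() returns the data ggplot computed for a layer,
# including the group id assigned to each row
groups_explicit <- unique(layer_data(p_explicit)$group)
groups_implicit <- unique(layer_data(p_implicit)$group)

length(groups_explicit)  # one group per method
length(groups_implicit)  # the same number of groups without group = ind
```

Keeping group=ind does no harm, and it makes the intent explicit.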

Looking better. I'd like the density regions to stand out some more, so I will use fill and an alpha value of 0.3 to make them transparent.

ggplot(dfs, aes(x=values)) + geom_density(aes(group=ind, colour=ind, fill=ind), alpha=0.3)

That is much more in line with what I wanted to see. Note that the alpha argument is passed to geom_density() rather than aes().
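The distinction here is between setting an aesthetic to a constant and mapping it to data. Here is a sketch, again assuming ggplot2 is installed and using made-up data (demo and p_set are illustrative names), showing that an alpha given outside aes() is applied unchanged to every density:

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, for reproducibility only
demo <- stack(data.frame(method1 = rnorm(30, 0),
                         method2 = rnorm(30, 2),
                         method3 = rnorm(30, 5)))

# alpha is *set* to a constant because it sits outside aes();
# colour and fill are *mapped* to the ind column inside aes()
p_set <- ggplot(demo, aes(x = values)) +
  geom_density(aes(colour = ind, fill = ind), alpha = 0.3)

# Every row of the computed layer data carries the constant we set
unique(layer_data(p_set)$alpha)  # 0.3
```

Writing aes(alpha=0.3) instead would ask ggplot to treat 0.3 as data and pass it through an alpha scale, which also adds a legend entry. That is rarely what is wanted for a fixed transparency.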

That's all for now.

4 comments:

dM/December 19, 2012 at 3:32 AM
great post. remember when i first used ggplot, the idea of needing to stack initially seemed odd. then i likened it to created flat table data appropriate for pivot tables, and all conceptual difficulties with it passed.
Robert Adams December 21, 2012 at 3:30 AM
Nicely explained. I wondered how the stack function differs from the melt function in reshape / reshape2 package. Is there any known difference or are both equally to be used? Here is an example:

library(reshape2)
melt(df)

should be the same (although different column names) as stack(df).

Also the concept of pivot, as dM mentioned, could be done with the cast function. Anyway, it is nice to see that there is no need to load additional packages - wasn't aware of that before.

Monday, December 17, 2012
