Violin plots in ggplot2

Use geom_violin() to quickly plot a visual summary of variables, using the Boston dataset, MASS library.

Use geom_violin() to quickly plot a visual summary of variables, using the Boston dataset from the MASS library.

1. Upload the relevant libraries:


2. Load data and use the tidyr package to transform wide into long format:

dt.long <- gather(Boston, "variable",
"value", crim:medv)

3. Create some color palettes:

col <- colorRampPalette(c("red", "blue"))(14)
# col.bp <- brewer.pal(9, "Set1") # brewer.pal only has a max of 9 colors
col.rc <- as.vector(distinctColorPalette(14))

4. Plot(s):

  • With the standard colors produced by ggplot2:
ggplot(dt.long,aes(factor(variable), value))+
geom_boxplot(alpha=0.3, color="black", width=.1)+
labs(x = "", y = "")+
theme(legend.title = element_blank())+
facet_wrap(~variable, scales="free")


  • With the color palette produced by colorRampPalette:
ggplot(dt.long,aes(factor(variable), value))+
geom_boxplot(alpha=0.3, color="black", width=.1)+
labs(x = "", y = "")+
scale_fill_manual(values = col, name="")+
facet_wrap(~variable, scales="free")


  • With the color palette produced by randomcoloR library:
ggplot(dt.long,aes(factor(variable), value))+
geom_boxplot(alpha=0.3, color="black", width=.1)+
labs(x = "", y = "")+
scale_fill_manual(values = col.rc, name="")+
facet_wrap(~variable, scales="free")


Plot maps with base mapping tools and ggmap in R

Plot maps with ‘base’ mapping tools in R

Understanding what kind of data you have (polygons or points?) and what you want to map is pivotal to start your mapping.

  1. First you need a shapefile of the area you want to plot, such as metropolitan France. There are various resources where to get them from: DIVA-GIS and EUROSTAT are those that I use the most. It’s always important to have a .prj file included, as your final map ‘should’ be projecte. I say “should” as sometimes it is just not possible, especially if you work with historical maps.
  2. Upload libraries

Load and prepare data

fr.prj <- readOGR(".", "FRA_adm2")
## OGR data source with driver: ESRI Shapefile
## Source: ".", layer: "FRA_adm2"
## with 96 features
## It has 18 fields
## NOTE: rgdal::checkCRSArgs: no proj_defs.dat in PROJ.4 shared files
## Warning in SpatialPolygons2map(database, namefield = namefield): database
## does not (uniquely) contain the field 'name'.

##   ID_0 ISO NAME_0 ID_1    NAME_1  ID_2         NAME_2   VARNAME_2
## 0   76 FRA France  989    Alsace 13755       Bas-Rhin  Unterelsaá
## 1   76 FRA France  989    Alsace 13756      Haut-Rhin   Oberelsaá
## 2   76 FRA France  990 Aquitaine 13757       Dordogne        <NA>
## 3   76 FRA France  990 Aquitaine 13758        Gironde Bec-D'Ambes
## 4   76 FRA France  990 Aquitaine 13759         Landes      Landas
## 5   76 FRA France  990 Aquitaine 13760 Lot-Et-Garonne        <NA>
## 0      <NA>  FR.BR <NA> Département Department  17900226   Unknown
## 1      <NA>  FR.HR <NA> Département Department  17900226   Unknown
## 2      <NA>  FR.DD <NA> Département Department  17900226   Unknown
## 3      <NA>  FR.GI <NA> Département Department  17900226   Unknown
## 4      <NA>  FR.LD <NA> Département Department  17900226   Unknown
## 5      <NA>  FR.LG <NA> Département Department  17900226   Unknown
##   REMARKS_2 Shape_Leng Shape_Area
## 0      <NA>   4.538735  0.5840273
## 1      <NA>   3.214178  0.4198797
## 2      <NA>   5.012795  1.0389622
## 3      <NA>   9.200047  1.1489822
## 4      <NA>   5.531231  1.0372815
## 5      <NA>   4.489830  0.6062017
# load or create data
myvar <- rnorm(1:96)
# manipulate data for the plot
france.geodata  <- data.frame(id=rownames(fr.prj@data), mapvariable=myvar)
##   id mapvariable
## 1  0  1.12200636
## 2  1  0.05912043
## 3  2 -1.05873510
## 4  3 -1.31513865
## 5  4  0.32392954
## 6  5  0.09152878

Use ggmap

# fortify prepares the shape data for ggplot
france.dataframe <- fortify(fr.prj) # convert to data frame for ggplot
## Regions defined for each Polygons
##       long      lat order  hole piece id group
## 1 7.847912 49.04728     1 FALSE     1  0   0.1
## 2 7.844539 49.04495     2 FALSE     1  0   0.1
## 3 7.852439 49.04510     3 FALSE     1  0   0.1
## 4 7.854333 49.04419     4 FALSE     1  0   0.1
## 5 7.855955 49.04431     5 FALSE     1  0   0.1
## 6 7.856299 49.03776     6 FALSE     1  0   0.1
#now combine the values by id values in both dataframes
france.dat <- join(france.geodata, france.dataframe, by="id")
##   id mapvariable     long      lat order  hole piece group
## 1  0    1.122006 7.847912 49.04728     1 FALSE     1   0.1
## 2  0    1.122006 7.844539 49.04495     2 FALSE     1   0.1
## 3  0    1.122006 7.852439 49.04510     3 FALSE     1   0.1
## 4  0    1.122006 7.854333 49.04419     4 FALSE     1   0.1
## 5  0    1.122006 7.855955 49.04431     5 FALSE     1   0.1
## 6  0    1.122006 7.856299 49.03776     6 FALSE     1   0.1
# Plot 3
p <- ggplot(data=france.dat, aes(x=long, y=lat, group=group))
p <- p + geom_polygon(aes(fill=mapvariable)) +
       geom_path(color="white",size=0.1) +
       coord_equal() +
       scale_fill_gradient(low = "#ffffcc", high = "#ff4444") +
       labs(title="Our map",fill="My variable")
# plot the map


Use plot basic

nclassint <- 5 #number of colors to be used in the palette
cat <- classIntervals(myvar, nclassint,style = "jenks") #style refers to how the breaks are created
colpal <- brewer.pal(nclassint,"RdBu")
color <- findColours(cat,rev(colpal)) #sequential
bins <- cat$brks
lb <- length(bins)
plot(fr.prj, col=color,border=T)
legend("bottomleft",fill=rev(colpal),legend=paste(round(bins[-length(bins)],1),":",round(bins[-1],1)),cex=1, bg="white")


Find color breaks for mapping (fast)

I’ve stumbled upon a little trick to compute jenks breaks faster than with the classInt package, just be sure to use n+1 instead of n as the breaks are computed a little bit differently. That is to say, if you want 5 breaks, n=6, no biggie there.

For more on the Bayesian Analysis of Macroevolutionary Mixtures see BAMMtools library

system.time(getJenksBreaks(mydata$myvar, 6))
> user system elapsed
> 0.970 0.001 0.971

On the other hand this takes way more time with large datasets
system.time(classIntervals(mydata$myvar, n=5, style="jenks"))
> Timing stopped at: 1081.894 1.345 1083.511

Upload files in R

Upload files from Excel, STATA, SAS, SPSS and text

First set the working directory (or check it)

getwd() # get working directory
 [1] "/Users/me/My Folder/"
 setwd("./My Subfolder/") # set working directory

1. .csv and .txt files

the read.csv function has many options, some of them are header=T which sets the first line as column names, sep=“,” the field separator character (in this case the semicolon), dec=“.” decimal sep character, skip=2 number of lines to skip (in this case 2).

read.csv2 is identical to read.csv except it assumes commas to be the decimal operators and semicolon as field separator

read.table works similarly to read.csv, but reads text files.

 mydata <- read.csv("mydata.csv", header=T)

When importing data in R, if any column’s name is a number, R will add an X to it (as in general it is a very bad idea to have numbers for column names, but can be handy). You can replace column names with:

 colnames(mydata) <- c("name1", "name2", "name3", "2017", "2018", "2019")

If you change or add anything to your data and want to save it then ( write.table for txt output):

 write.csv(mydata, "mydata.csv", row.names=FALSE)

2. STATA files .dta

 write.dta(mydata, "mydata.dta")

3. SPSS files .sav

use.value.labels by default is TRUE and converts value labels into factors. The mydata.txt is the name for data output, while the mydata.sps is the code output.

 mydata <- read.spss("mydata",, use.value.labels = FALSE)
 write.foreign(, "mydata.txt", "mydata.sps", package="SPSS")

4. SAS files .sas

Note that by default it converts value labels into factors

## to read from SAS
 mydata <- sasxport.get("mydata.xpt")

## to save in SAS format
 write.foreign(, "mydata.txt", "", package="SAS")

5. Excel spreadsheet

# library(xlsx)
 mydata <- read.xlsx("c:/myexcel.xlsx", 1) # 1 refers to the first worksheet-page altrenatively...
 mydata <- read.xlsx("c:/myexcel.xlsx", sheetName="Data input")
 write.xlsx(mydata, "mydata.xlsx")
# library(readxl)
mydata <-system.file("mypath/myexcel.xlsx", package = "readxl")
mydata <- read_excel(mydata, 1)

(A few) quick tricks

# head(mydata, n=10) # first 10 rows
 tail(mydata, n=10) # last 10 rows
 mydata[1,1:10] # print first row and first 10 columns
 names(mydata) # variable names
 nrow(mydata) # number of rows
 ncol(mydata) # number of columns
 str(mydata) # list structure of data
 class(mydata) # class of data
 view(mydata) # opens viewer window

A map of the US election results

  1. Upload libraries:
rm(list = ls(all=T)) #clear workspace
library(maptools) #to read shapefiles

2. Download the data files (note they are not ready for use but need some cleaning as there are more areas in the excel files than polygons in the shape file). I copy here the code as I have used it in my script but it’s available at RPubs thanks to David Robinson.

download.file("", "LND01.xls")
download.file("", "POP01.xls")

according to metadata, this is Land Area in 2010 and resident population in 2010:

us_county_area <- read_excel("LND01.xls") 
transmute(CountyCode = as.character(as.integer(STCOU)), Area = LND110210D)

us_county_population <- read_excel("POP01.xls") 
transmute(CountyCode = as.character(as.integer(STCOU)),Population = POP010210D)

3. Adjust data

election_url <- ""
county_data <- read_csv(election_url) 
group_by(CountyCode = as.character(fips)) 
mutate(TotalVotes = sum(votes)) 
mutate(name = str_replace(name, ".\\. ", "")) 
filter(name %in% c("Trump", "Clinton", "Johnson", "Stein")) 
transmute(County = str_replace(geo_name, " County", ""),
State = state,
CountyCode = as.character(fips),
Candidate = name,
Percent = vote_pct / 100,
spread(Candidate, Percent, fill = 0) 
inner_join(us_county_population, by = "CountyCode") 
inner_join(us_county_area, by = "CountyCode")

you can save the data into a csv file:

# write_csv(county_data, "county_election_2016.csv")

You can download the cleaned datafile here: data_election_2016_by_county

4. Upload data and shape files

dt <- read.csv("new_county_election_2016.csv", header=T)
us <- readShapePoly("./USA_adm/USA_adm2.shp")
us0 <- readShapePoly("./USA_adm/USA_adm0.shp")
us.m <- us[-c(which(us$NAME_1=="Alaska")),] #get rid of Alaska
us.d <- us.m[-c(67:71),]

5. Prepare the color palette(s)

nclassint <- 5 #number of colors to be used in the palette
cat.T <- classIntervals(dt$Trump[-c(67:71)], nclassint,style = "jenks") #style refers to how the breaks are created
colpal.T <- brewer.pal(nclassint,"Reds")
color.T <- findColours(cat.T,colpal.T) #sequential
bins.T <- cat.T$brks
lb.T <- length(bins.T)

5. Plot the maps with map basic

# pdf("Where are the trump voters.pdf")
# plot(us.d, col=color.T, border=F)
# plot(us0,add=T, lwd=0.1)
# legend("bottomleft",fill=colpal.T,legend=paste(round(bins[-length(bins.T)],1),":",round(bins.T[-1],1)),cex=1, bg="white")
% Votes for Clinton
% Votes for Trump

… or ggplot2


ggplot(county_data, aes(Population / Area, Trump)) +
  geom_point() +
  geom_point(data=county_data[which(county_data$State=="Texas"),], aes(x=Population/Area, y=Trump), colour="red")+
  scale_x_log10() +
  scale_y_continuous(labels = percent_format()) +
  xlab("Population density (ppl / square mile)") +
  ylab("% of votes going to Trump") +
  geom_text(aes(label = County), vjust = 1, hjust = 1, check_overlap = TRUE) +
  geom_smooth(method = "lm") +
  ggtitle("Population density vs Trump voters by county (Texas Counties in red)")

This is the code to plot in red points according to State (in red) and to add red labels to those points. The check_overlap=T avoids overlapping labels.

# ggplot(county_data, aes(Population / Area, Trump)) +
#   geom_point() +
#   geom_point(data=county_data[which(county_data$State=="California"),], aes(x=Population/Area, y=Trump), colour="red")+
#   scale_x_log10() +
#   scale_y_continuous(labels = percent_format()) +
#   xlab("Population density (ppl / square mile)") +
#   ylab("% of votes going to Trump") +
#   geom_text(data=county_data[which(county_data$State=="California"),], aes(label = ifelse(Trump&amp;gt;.5, as.character(dt$County), "" )), color= "red",size=5,vjust = 1, hjust = 1, check_overlap = TRUE) +
#   geom_smooth(method = "lm") +
#   ggtitle("Population density vs Trump voters by county (California in red)")



Saving graphics in R

A brief minimal guide on saving graphics in R

This is intended to be a minimalistic guide on how to save graphics in an R environment giving tips on formats and codes.

1. What format?

a. Vector files like PDF, EPS, PS, SVG: high quality, easility resizable and works in any anvironment. In particular, I find PDF to work great with LaTeX, ppt, and word. pdf(“mygraph.pdf”).
b. WMF: easily resizable but works only in a Windows environment. I don’t own or work with Windows, so I have never used this format. The general command is win.metafile(“mygraph.wmf”). I personally despise word as a writing tool, I wrote my master thesis in it and it was a nightmare, but if you really need to use it… If you have a MAC (and you are still using word) I recommend you take a look at this website for inspiration. If you work in a Windows environment free alternatives here, and here (mostly for reports or lecture notes, but I know people who write entire articles), not free here, …And LaTeX for All. If you work with Linux you’re porbably laughing.
c. JPG   –> never use jpg formats
d. PNG, TIFF are bitmap (or raster) formats, preferable for raster graphics, such as photos. png(“myplot.png”) or tiff(“myplot.tiff”). Good to know: to make more than one page of  graphs add the -%d. as in png(“plot-%d.png”) see example 3.
e. svg is another vector format, like pdf or eps. Default settings for svg() does not allow for multiple pages in a single file

f. one extra mention for the .eps format, the one I normally use and that I find the most practical. I use it to store all graphs for the most disparate purposes: to include them into a LaTeX document (it will just transform your .eps files into .pdf(s) and add them to your library), for presentations in ppt, keynote or LaTeX (again) and publications. Windows usually does not visualize authomatically encapsulated scripts **but** if you own a Windows machine, you can always download a program such as Ghostscript ,GIMP , Photoshop , or EPS viewer


2. How to use it?

This works with most plotting libraries: (1) first call your format saving line (e.g. pdf, png, jpeg, postscript…), (2) plot commands, (3) tells to stop saving whatever you are plotting, meaning that if you don’t call it you may end up with a bunch of graphs on the same page.

example 1:
plot(x, y) closes the graphics device, it stops the saving of any further plotting commands, so be sure to add it when you are done with plotting

Alternatively you can use the dev.print command, which produces postscript prints:

example 2:
plot(x, y)
dev.print(pdf, "myplot.pdf") #here I use pdf, but it can be any other format... see this link ()

example 3:  Multiple pages: 1 plot per page…
plot(x, y)

plot(x, y)

3. What if I am using ggplot2 or ggmap?
ggsave You only need to specify the filename, it’s very convenient for quick plots. It saves the last plot that you displayed, the default size is the size of the current graphics device, unless otherwise specified in height and width and the unit measure can be in cm, inches or mm units =(“cm”, “inches”, “mm”) . It guesses the type of graphics device from the extension: see this for more details
Of course, you need plot in ggplot2 to use ggsave…

4. What size?
The codes above format plots according to the size in which they are diplayed in R or by default values (in inches). Sizing can be controlled via width, height.

pdf("myplot.pdf", width=10, height=5)
a # I am using code from this post

5. How to customize graphsThis works with pdf() and postscript() -I always use postscript…-
As usual, like most things in R, everything is highly customizable, for instance you can:
1. Define the font family to be used via the family option. The default is Helvetica but you can find an exhaustive list of fonts here
2. bg changes the background color (I usually set it to transparent bg= “transparent”, so that I don’t have problems when using graphs for presentations, especially if they are in powerpoint or keynote)
3. horizontal direction of the printed image, if set to FALSE it’s vertical

… and much more… Those three are the ones that I most commonly use.


To paraphrase Röyksopp, what else is there?

Well, a lot…

  1. the Cairo package (see this link)  to export anti-aliased, high resolution plots in R for Windows
  2. I have purposedly avoided mentioning the lattice package, mostly because I don’t use it. Lattice is a trellis graphics system that exists in parallel with the normal R graphics system and the graph exporting system is a bit different from that of other environments. A great intro for anyone who wants a go at lattice is this set of slides. In general, when plotting in R you have plenty of choice and usually one environment (either ggplot, plot basic, or lattice just to mention some) is enough to do everything.
  3. Some journals have strict requirements for  graphs quality, and the .eps format called via postscript(“myveryniceplot.epc”,  paper= “special” , onefile= FALSE) seems to do the trick
  4. The whole world of interactive plotting: see ggvis, plotly, htmlwidgets, googleVis, and shiny (just to mention a few)

Arranging ggplot2 graphs on a page

How to arrange graphs in ggplot2 without the help of the layout matrix

How do you arrange non-simmetric plots in ggplot2?
With the print command:

After installing these two packages: install.packages(“grid”, “ggplot2”), load the  libraries:

The data and code for the three graphs is taken from this website:

# create factors with value labels
mtcars$gear <- factor(mtcars$gear,levels=c(3,4,5), labels=c("3gears","4gears","5gears"))
mtcars$am <- factor(mtcars$am,levels=c(0,1), labels=c("Automatic","Manual"))
mtcars$cyl <- factor(mtcars$cyl,levels=c(4,6,8), labels=c("4cyl","6cyl","8cyl"))

# Kernel density plots for mpg
# grouped by number of gears (indicated by color)
a <- qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5),
main="Distribution of Gas Milage", xlab="Miles Per Gallon",

# Scatterplot of mpg vs. hp for each combination of gears and cylinders
# in each facet, transmittion type is represented by shape and color
b <- qplot(hp, mpg, data=mtcars, shape=am, color=am,
facets=gear~cyl, size=I(3),
xlab="Horsepower", ylab="Miles per Gallon")

c <- qplot(gear, mpg, data=mtcars, geom=c("boxplot", "jitter"),
fill=gear, main="Mileage by Gear Number",
xlab="", ylab="Miles per Gallon")

a, b, and c are our graphs. Here we decide how to place the plots on the plotting surface:

grid.newpage() # Open a new page on grid device
pushViewport(viewport(layout = grid.layout(3, 1))) #this can really be anything... just remember to change accordingly the print commands below
print(a, vp = viewport(layout.pos.row = 1, layout.pos.col = 1:1))
print(b, vp = viewport(layout.pos.row = 2, layout.pos.col = 1:1))
print(c, vp = viewport(layout.pos.row = 3, layout.pos.col = 1:1))

The layout=grid.layout is the command dividing the plotting surface, in the example I have divided it into three rows and one column, hence the layout.pos.row = 1, 2, 3 and the layout.pos.row = 1:1 equal for all three plots.


What if I need something asymmetrical? For instance two small plots on one column and one taking up more space… The reasoning is very similar to that of the layout matrix: divide the space into 4 squares grid.layout(2, 2) and then plot the third graph over two rows layout.pos.row=1:2

grid.newpage() # Open a new page on grid device
pushViewport(viewport(layout = grid.layout(2, 2))) #this can really be anything... just remember to change accordingly the print commands below
print(a, vp = viewport(layout.pos.row = 1, layout.pos.col = 1:1))
print(b, vp = viewport(layout.pos.row = 2, layout.pos.col = 1:1))
print(c, vp = viewport(layout.pos.row = 1:2, layout.pos.col = 2:2))