I will present my WorldPop mapping of demographic indicators in low-income settings at PAA in Chicago: “Advances in Mathematical, Spatial, and Small-Area Demography”, Thursday, April 27, 2017, 10:15 AM – 11:45 AM, Hilton, Joliet Room.
Use geom_violin() to quickly plot a visual summary of variables, using the Boston dataset from the MASS library.
1. Load the relevant libraries:
require(tidyr)
require(ggplot2)
require(RColorBrewer)
require(randomcoloR)
require(MASS)
2. Load data and use the tidyr package to transform wide into long format:
data(Boston)
dt.long <- gather(Boston, "variable", "value", crim:medv)
3. Create some color palettes:
col <- colorRampPalette(c("red", "blue"))(14)
# col.bp <- brewer.pal(9, "Set1") # brewer.pal only has a max of 9 colors
col.rc <- as.vector(distinctColorPalette(14))
- With the standard colors produced by ggplot2:
ggplot(dt.long, aes(factor(variable), value)) +
  geom_violin(aes(fill=factor(variable))) +
  geom_boxplot(alpha=0.3, color="black", width=.1) +
  labs(x = "", y = "") +
  theme_bw() +
  theme(legend.title = element_blank()) +
  facet_wrap(~variable, scales="free")
- With the color palette produced by colorRampPalette:
ggplot(dt.long, aes(factor(variable), value)) +
  geom_violin(aes(fill=factor(variable))) +
  geom_boxplot(alpha=0.3, color="black", width=.1) +
  labs(x = "", y = "") +
  scale_fill_manual(values = col, name="") +
  theme_bw() +
  facet_wrap(~variable, scales="free")
- With the color palette produced by the randomcoloR library:
ggplot(dt.long, aes(factor(variable), value)) +
  geom_violin(aes(fill=factor(variable))) +
  geom_boxplot(alpha=0.3, color="black", width=.1) +
  labs(x = "", y = "") +
  scale_fill_manual(values = col.rc, name="") +
  theme_bw() +
  facet_wrap(~variable, scales="free")
Plot maps with ‘base’ mapping tools in R
Understanding what kind of data you have (polygons or points?) and what you want to map is pivotal before you start mapping.
- First you need a shapefile of the area you want to plot, such as metropolitan France. There are various resources to get them from: DIVA-GIS and EUROSTAT are the ones I use the most. It’s always important to have a .prj file included, as your final map ‘should’ be projected. I say “should” because sometimes it is just not possible, especially if you work with historical maps.
- Load libraries
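The library list isn’t shown here; judging from the functions used below (readOGR, join, fortify, classIntervals, brewer.pal), a plausible set is:

library(rgdal)        # readOGR() to read shapefiles
library(plyr)         # join() to merge values onto polygons
library(ggplot2)      # fortify() and plotting
library(classInt)     # classIntervals() for the base-plot version
library(RColorBrewer) # brewer.pal() color palettes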
Load and prepare data
setwd(mypath)
fr.prj <- readOGR(".", "FRA_adm2")
## OGR data source with driver: ESRI Shapefile
## Source: ".", layer: "FRA_adm2"
## with 96 features
## It has 18 fields
## NOTE: rgdal::checkCRSArgs: no proj_defs.dat in PROJ.4 shared files
## Warning in SpatialPolygons2map(database, namefield = namefield): database
## does not (uniquely) contain the field 'name'.
##   ID_0 ISO NAME_0 ID_1    NAME_1  ID_2         NAME_2   VARNAME_2
## 0   76 FRA France  989    Alsace 13755       Bas-Rhin  Unterelsaß
## 1   76 FRA France  989    Alsace 13756      Haut-Rhin   Oberelsaß
## 2   76 FRA France  990 Aquitaine 13757       Dordogne        <NA>
## 3   76 FRA France  990 Aquitaine 13758        Gironde Bec-D'Ambes
## 4   76 FRA France  990 Aquitaine 13759         Landes      Landas
## 5   76 FRA France  990 Aquitaine 13760 Lot-Et-Garonne        <NA>
##   NL_NAME_2 HASC_2 CC_2      TYPE_2  ENGTYPE_2 VALIDFR_2 VALIDTO_2
## 0      <NA>  FR.BR <NA> Département Department  17900226   Unknown
## 1      <NA>  FR.HR <NA> Département Department  17900226   Unknown
## 2      <NA>  FR.DD <NA> Département Department  17900226   Unknown
## 3      <NA>  FR.GI <NA> Département Department  17900226   Unknown
## 4      <NA>  FR.LD <NA> Département Department  17900226   Unknown
## 5      <NA>  FR.LG <NA> Département Department  17900226   Unknown
##   REMARKS_2 Shape_Leng Shape_Area
## 0      <NA>   4.538735  0.5840273
## 1      <NA>   3.214178  0.4198797
## 2      <NA>   5.012795  1.0389622
## 3      <NA>   9.200047  1.1489822
## 4      <NA>   5.531231  1.0372815
## 5      <NA>   4.489830  0.6062017
# load or create data
set.seed(100)
myvar <- rnorm(1:96)
# manipulate data for the plot
france.geodata <- data.frame(id=rownames(fr.prj@data), mapvariable=myvar)
head(france.geodata)
##   id mapvariable
## 1  0  1.12200636
## 2  1  0.05912043
## 3  2 -1.05873510
## 4  3 -1.31513865
## 5  4  0.32392954
## 6  5  0.09152878
# fortify prepares the shape data for ggplot
france.dataframe <- fortify(fr.prj) # convert to data frame for ggplot
head(france.dataframe)
## Regions defined for each Polygons
##       long      lat order  hole piece id group
## 1 7.847912 49.04728     1 FALSE     1  0   0.1
## 2 7.844539 49.04495     2 FALSE     1  0   0.1
## 3 7.852439 49.04510     3 FALSE     1  0   0.1
## 4 7.854333 49.04419     4 FALSE     1  0   0.1
## 5 7.855955 49.04431     5 FALSE     1  0   0.1
## 6 7.856299 49.03776     6 FALSE     1  0   0.1
# now combine the values by id in both dataframes (join() comes from plyr)
france.dat <- join(france.geodata, france.dataframe, by="id")
head(france.dat)
##   id mapvariable     long      lat order  hole piece group
## 1  0    1.122006 7.847912 49.04728     1 FALSE     1   0.1
## 2  0    1.122006 7.844539 49.04495     2 FALSE     1   0.1
## 3  0    1.122006 7.852439 49.04510     3 FALSE     1   0.1
## 4  0    1.122006 7.854333 49.04419     4 FALSE     1   0.1
## 5  0    1.122006 7.855955 49.04431     5 FALSE     1   0.1
## 6  0    1.122006 7.856299 49.03776     6 FALSE     1   0.1
# Plot 3
p <- ggplot(data=france.dat, aes(x=long, y=lat, group=group))
p <- p + geom_polygon(aes(fill=mapvariable)) +
  geom_path(color="white", size=0.1) +
  coord_equal() +
  scale_fill_gradient(low = "#ffffcc", high = "#ff4444") +
  labs(title="Our map", fill="My variable")
# plot the map
p
Use base plot
nclassint <- 5 # number of colors to be used in the palette
cat <- classIntervals(myvar, nclassint, style = "jenks") # style refers to how the breaks are created
colpal <- brewer.pal(nclassint, "RdBu")
color <- findColours(cat, rev(colpal)) # sequential
bins <- cat$brks
lb <- length(bins)
plot(fr.prj, col=color, border=T)
legend("bottomleft", fill=rev(colpal),
       legend=paste(round(bins[-length(bins)],1), ":", round(bins[-1],1)),
       cex=1, bg="white")
I’ve stumbled upon a little trick to compute Jenks breaks faster than with the classInt package; just be sure to use n+1 instead of n, as the breaks are computed a little differently. That is to say, if you want 5 classes, set n=6. No biggie there.
For more on Bayesian Analysis of Macroevolutionary Mixtures, see the BAMMtools library.
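The timed call isn’t shown in the original; presumably it is getJenksBreaks() from BAMMtools, along these lines (mydata$myvar stands for a large numeric vector):

library(BAMMtools)
# k = 6 break points, i.e. the equivalent of 5 classes (the n+1 rule above)
system.time(getJenksBreaks(mydata$myvar, k = 6))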
> user system elapsed
> 0.970 0.001 0.971
On the other hand, this takes way more time with classInt on large datasets:
system.time(classIntervals(mydata$myvar, n=5, style="jenks"))
> Timing stopped at: 1081.894 1.345 1083.511
Import files from Excel, Stata, SAS, SPSS and text
First set the working directory (or check it)
getwd() # get working directory
 "/Users/me/My Folder/"
setwd("./My Subfolder/") # set working directory
1. .csv and .txt files
The read.csv function has many options. Some of them: header=T sets the first line as column names; sep="," is the field separator character (here a comma); dec="." is the decimal separator character; skip=2 is the number of lines to skip before reading data (here 2).
read.csv2 is identical to read.csv except that it assumes a comma as the decimal separator and a semicolon as the field separator.
read.table works similarly to read.csv, but reads text files.
mydata <- read.csv("mydata.csv", header=T)
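For instance, a hypothetical file with two lines of junk above the header could be read using the options described above:

# hypothetical file: skip 2 lines, comma-separated fields, "." decimals
mydata <- read.csv("mydata.csv", header=TRUE, sep=",", dec=".", skip=2)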
When importing data into R, if any column name is a number, R will prepend an X to it (in general it is a very bad idea to use numbers as column names, though it can be handy). You can replace column names with:
colnames(mydata) <- c("name1", "name2", "name3", "2017", "2018", "2019")
If you change or add anything to your data and want to save it (use write.table for txt output):
write.csv(mydata, "mydata.csv", row.names=FALSE)
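And the write.table equivalent for a text file (a sketch; the tab separator is just one choice):

write.table(mydata, "mydata.txt", row.names=FALSE, sep="\t")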
2. STATA files .dta
library(foreign)
write.dta(mydata, "mydata.dta")
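Reading a .dta file works symmetrically (a sketch, using the same hypothetical file name):

library(foreign)
mydata <- read.dta("mydata.dta")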
3. SPSS files .sav
use.value.labels is TRUE by default and converts value labels into factors. mydata.txt is the name of the data output, while mydata.sps is the code output.
library(foreign)
mydata <- read.spss("mydata.sav", to.data.frame=T, use.value.labels = FALSE)
write.foreign(as.data.frame(mydata), "mydata.txt", "mydata.sps", package="SPSS")
4. SAS files .xpt
Note that by default it converts value labels into factors
## to read from SAS
library(Hmisc)
mydata <- sasxport.get("mydata.xpt")
## to save in SAS format
library(foreign)
write.foreign(as.data.frame(mydata), "mydata.txt", "mydata.sas", package="SAS")
5. Excel spreadsheet
library(xlsx)
mydata <- read.xlsx("c:/myexcel.xlsx", 1) # 1 refers to the first worksheet
# alternatively, by sheet name...
mydata <- read.xlsx("c:/myexcel.xlsx", sheetName="Data input")
write.xlsx(mydata, "mydata.xlsx")
library(readxl)
mydata <- read_excel("mypath/myexcel.xlsx", 1) # 1 refers to the first sheet
(A few) quick tricks
head(mydata, n=10)  # first 10 rows
tail(mydata, n=10)  # last 10 rows
mydata[1, 1:10]     # print first row and first 10 columns
names(mydata)       # variable names
length(mydata)      # number of variables
nrow(mydata)        # number of rows
ncol(mydata)        # number of columns
str(mydata)         # list structure of data
class(mydata)       # class of data
View(mydata)        # opens viewer window
1. Load libraries:
rm(list = ls(all=T)) # clear workspace
library(dplyr)
library(readr)
library(stringr)
library(tidyr)
library(readxl)
library(classInt)
library(RColorBrewer)
library(maptools) # to read shapefiles
2. Download the data files (note they are not ready for use; they need some cleaning, as there are more areas in the Excel files than polygons in the shapefile). I copy the code here as I used it in my script, but it’s available at RPubs thanks to David Robinson.
download.file("http://www2.census.gov/prod2/statcomp/usac/excel/LND01.xls", "LND01.xls")
download.file("http://www2.census.gov/prod2/statcomp/usac/excel/POP01.xls", "POP01.xls")
According to the metadata, this is land area in 2010 and resident population in 2010:
us_county_area <- read_excel("LND01.xls") %>%
  transmute(CountyCode = as.character(as.integer(STCOU)), Area = LND110210D)

us_county_population <- read_excel("POP01.xls") %>%
  transmute(CountyCode = as.character(as.integer(STCOU)), Population = POP010210D)
3. Adjust data
election_url <- "https://raw.githubusercontent.com/Prooffreader/election_2016_data/master/data/presidential_general_election_2016_by_county.csv"

county_data <- read_csv(election_url) %>%
  group_by(CountyCode = as.character(fips)) %>%
  mutate(TotalVotes = sum(votes)) %>%
  ungroup() %>%
  mutate(name = str_replace(name, ".\\. ", "")) %>%
  filter(name %in% c("Trump", "Clinton", "Johnson", "Stein")) %>%
  transmute(County = str_replace(geo_name, " County", ""),
            State = state,
            CountyCode = as.character(fips),
            Candidate = name,
            Percent = vote_pct / 100,
            TotalVotes) %>%
  spread(Candidate, Percent, fill = 0) %>%
  inner_join(us_county_population, by = "CountyCode") %>%
  inner_join(us_county_area, by = "CountyCode")
You can save the data into a csv file:
# write_csv(county_data, "county_election_2016.csv")
You can download the cleaned datafile here: data_election_2016_by_county
4. Load data and shape files
dt <- read.csv("new_county_election_2016.csv", header=T)
us <- readShapePoly("./USA_adm/USA_adm2.shp")
us0 <- readShapePoly("./USA_adm/USA_adm0.shp")
us.m <- us[-c(which(us$NAME_1=="Alaska")),] # get rid of Alaska
us.d <- us.m[-c(67:71),]
5. Prepare the color palette(s)
nclassint <- 5 # number of colors to be used in the palette
cat.T <- classIntervals(dt$Trump[-c(67:71)], nclassint, style = "jenks") # style refers to how the breaks are created
colpal.T <- brewer.pal(nclassint, "Reds")
color.T <- findColours(cat.T, colpal.T) # sequential
bins.T <- cat.T$brks
lb.T <- length(bins.T)
6. Plot the maps with base plot…
# pdf("Where are the trump voters.pdf") # plot(us.d, col=color.T, border=F) # plot(us0,add=T, lwd=0.1) # legend("bottomleft",fill=colpal.T,legend=paste(round(bins[-length(bins.T)],1),":",round(bins.T[-1],1)),cex=1, bg="white") # dev.off()
… or ggplot2
library(ggplot2)
library(scales)
theme_set(theme_bw())
ggplot(county_data, aes(Population / Area, Trump)) +
  geom_point() +
  geom_point(data=county_data[which(county_data$State=="Texas"),],
             aes(x=Population/Area, y=Trump), colour="red") +
  scale_x_log10() +
  scale_y_continuous(labels = percent_format()) +
  xlab("Population density (ppl / square mile)") +
  ylab("% of votes going to Trump") +
  geom_text(aes(label = County), vjust = 1, hjust = 1, check_overlap = TRUE) +
  geom_smooth(method = "lm") +
  ggtitle("Population density vs Trump voters by county (Texas counties in red)")
This is the code to color points red according to State and to add red labels to those points; check_overlap=T avoids overlapping labels.
# ggplot(county_data, aes(Population / Area, Trump)) +
#   geom_point() +
#   geom_point(data=county_data[which(county_data$State=="California"),],
#              aes(x=Population/Area, y=Trump), colour="red") +
#   scale_x_log10() +
#   scale_y_continuous(labels = percent_format()) +
#   xlab("Population density (ppl / square mile)") +
#   ylab("% of votes going to Trump") +
#   geom_text(data=county_data[which(county_data$State=="California"),],
#             aes(label = ifelse(Trump > .5, as.character(County), "")),
#             color="red", size=5, vjust = 1, hjust = 1, check_overlap = TRUE) +
#   geom_smooth(method = "lm") +
#   ggtitle("Population density vs Trump voters by county (California in red)")
A brief minimal guide on saving graphics in R
This is intended to be a minimalistic guide on how to save graphics in an R environment, with tips on formats and code.
1. What format?
a. Vector files like PDF, EPS, PS, SVG: high quality, easily resizable, and they work in any environment. In particular, I find PDF works great with LaTeX, ppt, and Word. pdf("mygraph.pdf").
b. WMF: easily resizable but works only in a Windows environment. I don’t own or work with Windows, so I have never used this format. The general command is win.metafile("mygraph.wmf"). I personally despise Word as a writing tool (I wrote my master’s thesis in it and it was a nightmare), but if you really need to use it… If you have a Mac (and you are still using Word) I recommend you take a look at this website for inspiration. If you work in a Windows environment there are free alternatives here and here (mostly for reports or lecture notes, but I know people who write entire articles in them), a non-free one here, …and LaTeX for all. If you work with Linux you’re probably laughing.
c. JPG –> never use jpg for graphs: the lossy compression blurs lines and text.
d. PNG, TIFF: bitmap (or raster) formats, preferable for raster graphics such as photos. png("myplot.png") or tiff("myplot.tiff"). Good to know: to produce more than one page of graphs, add -%d to the file name, as in png("plot-%d.png"); see example 3.
e. SVG is another vector format, like PDF or EPS. The default settings for svg() do not allow multiple pages in a single file.
f. One extra mention for the .eps format, the one I normally use and the one I find most practical. I use it to store all my graphs for the most disparate purposes: to include them in a LaTeX document (it will just transform your .eps files into .pdf(s) and add them to your library), for presentations in ppt, Keynote or LaTeX (again), and for publications. Windows usually does not display Encapsulated PostScript files automatically, **but** if you own a Windows machine you can always download a program such as Ghostscript, GIMP, Photoshop, or EPS Viewer.
2. How to use it?
This works with most plotting libraries: (1) first call the line that opens the file in your chosen format (e.g. pdf, png, jpeg, postscript…), (2) then your plot commands, (3) then dev.off().
dev.off() tells R to stop saving whatever you are plotting, meaning that if you don’t call it you may end up with a bunch of graphs in the same file.
dev.off() # closes the graphics device: it stops the saving of any further plotting commands, so be sure to add it when you are done plotting
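Putting the three steps together (a minimal sketch; the file name and plot are placeholders):

pdf("myplot.pdf")  # (1) open the device / start saving
plot(rnorm(100))   # (2) plotting commands
dev.off()          # (3) close the device and write the file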
Alternatively you can use the dev.print command, which copies the current plot to a file (PostScript by default):
dev.print(pdf, "myplot.pdf") #here I use pdf, but it can be any other format... see this link
Example 3: multiple pages, one plot per page…
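The example code isn’t shown in the original; a minimal sketch using the -%d trick from point d. above (the plots are placeholders):

png("plot-%d.png") # %d is replaced by the page number
plot(rnorm(100))   # written to plot-1.png
hist(rnorm(100))   # written to plot-2.png
dev.off()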
3. What if I am using ggplot2 or ggmap?
ggsave: you only need to specify the file name, which makes it very convenient for quick plots. It saves the last plot that you displayed; the default size is the size of the current graphics device, unless otherwise specified via height and width, and the unit of measure can be inches, cm or mm (units = "in", "cm" or "mm"). It guesses the type of graphics device from the extension: see this for more details.
Of course, you need to plot with ggplot2 to use ggsave…
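A quick sketch (mtcars and the file name are placeholders):

library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) + geom_point() # the plot to be saved
ggsave("myplot.pdf", width = 10, height = 5, units = "in") # saves the last plot displayed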
4. What size?
The commands above size plots according to the size at which they are displayed in R, or by default values (in inches). Sizing can be controlled via width and height:
pdf("myplot.pdf", width=10, height=5)
# I am using code from this post
5. How to customize graphs? The following works with pdf() and postscript() (I always use postscript…).
As usual, like most things in R, everything is highly customizable, for instance you can:
1. Define the font family to be used via the family option. The default is Helvetica but you can find an exhaustive list of fonts here
2. bg changes the background color. I usually set it to transparent (bg="transparent") so that I don’t have problems when using graphs in presentations, especially in PowerPoint or Keynote.
3. horizontal sets the orientation of the printed image to landscape; if set to FALSE, the orientation is portrait (vertical).
… and much more… Those three are the ones I most commonly use; see the sketch below.
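A sketch combining those three options in a single postscript() call (file name, font, and sizes are placeholders):

postscript("myplot.eps", family = "Times", bg = "transparent",
           horizontal = FALSE, paper = "special",
           width = 8, height = 6, onefile = FALSE)
plot(rnorm(100))
dev.off()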
To paraphrase Röyksopp, “what else is there?”
Well, a lot…
- the Cairo package (see this link) to export anti-aliased, high resolution plots in R for Windows
- I have purposely avoided mentioning the lattice package, mostly because I don’t use it. Lattice is a trellis graphics system that exists in parallel with the normal R graphics system, and its graph-exporting workflow is a bit different from that of other environments. A great intro for anyone who wants to have a go at lattice is this set of slides. In general, when plotting in R you have plenty of choice, and usually one environment (ggplot2, base plot, or lattice, just to mention some) is enough to do everything.
- Some journals have strict requirements for graph quality, and the .eps format called via postscript("myveryniceplot.eps", paper="special", onefile=FALSE) seems to do the trick
- The whole world of interactive plotting: see ggvis, plotly, htmlwidgets, googleVis, and shiny (just to mention a few)