A match made in R: checking the order of geographical areas in shape files and in your data frames

Not every shape file is as nice as those provided in libraries. Sometimes we have to deal with historical maps, which have been hand-drawn, re-touched and what not. To work with geo-referenced data it is essential that the shape file and the data frame share a variable holding a unique code for each area, covering exactly the same areas in the same order in both files.

A quick way to check if shapefile and dataframe have the same number of areas:

nrow(df) == length(shape.file$Code)
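The comparison above only checks the counts. A slightly stricter sketch, on invented toy codes standing in for shape.file$Code and df$Code, also checks that the codes are unique and that both sides contain the same set:

```r
# Toy stand-ins for shape.file$Code and df$Code (values are invented)
shape_codes <- c("15078", "15001", "15002", "15003")
df_codes    <- c("15001", "15002", "15003", "15078")

same_n      <- length(shape_codes) == length(df_codes)   # same number of areas
no_dupes    <- !anyDuplicated(shape_codes) &&
               !anyDuplicated(df_codes)                  # every code appears once
same_set    <- setequal(shape_codes, df_codes)           # same codes, in any order
only_in_shp <- setdiff(shape_codes, df_codes)            # codes missing from the data frame
only_in_df  <- setdiff(df_codes, shape_codes)            # codes missing from the shapefile
```

If only_in_shp or only_in_df is not empty, the two files do not describe the same areas and no reordering will fix that.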

In the shapefile, one can also select a couple of areas big enough so that they can easily be located, and plot them as “control” areas.
For instance, I want to select the area with code “15078” in the shapefile:
> which(shape.file$Code == "15078")
[1] 271

which is the area in the 271st position (in the same way, shape.file$Code[271] gives the code of area 271).

This is an easy way to locate your “control” area(s).
Ideally, you should have some variable that is identical to the one in the shapefile, a codification of some sort, providing a unique Code, the name of the area or some factors that allow you to locate the area in space.
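To look up several “control” areas at once, something along these lines may help. The codes are invented, and the plotting calls in the comment assume shape.file is an sp polygons object:

```r
# Invented codes standing in for shape.file$Code
shape_codes <- c("15001", "15078", "15002", "15300", "15003")
control     <- c("15078", "15300")            # the areas we want to find

# Position of each control code in the shapefile, in the order of `control`
pos <- match(control, shape_codes)

# Reading the codes back from those positions should return the controls
found <- shape_codes[pos]

# With a real sp object the controls could then be highlighted, e.g.
#   plot(shape.file); plot(shape.file[pos, ], col = "red", add = TRUE)
```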

An easy way to check if both shape file and data frame have the same ordering of geographical areas is to test it:
> code.sh <- cbind(c(1:length(shape.file$Code)), as.vector(shape.file$Code))
> code.df <- cbind(c(1:nrow(df)), as.vector(df$Code)) #as.vector() so that a factor is compared by its labels, not by its internal integer codes
[,1]  [,2]
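Rather than comparing the two tables by eye, the ordering can be tested directly. A minimal sketch on invented codes, where the vectors stand in for shape.file$Code and df$Code:

```r
# Invented codes: one data frame in the right order, one with two areas swapped
shape_codes  <- c("15001", "15002", "15078", "15003")
df_codes_ok  <- c("15001", "15002", "15078", "15003")
df_codes_bad <- c("15001", "15078", "15002", "15003")

# as.character() guards against one side being a factor and the other not
same_order_ok  <- identical(as.character(shape_codes), as.character(df_codes_ok))
same_order_bad <- identical(as.character(shape_codes), as.character(df_codes_bad))

# Positions at which the two orderings disagree
mismatch_at <- which(as.character(shape_codes) != as.character(df_codes_bad))
```

Here same_order_ok is TRUE, same_order_bad is FALSE, and mismatch_at points at the rows that need attention.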

What if it’s not?
First option: the inelegant solution
Manually change the order of the areas in a csv file so that it follows the exact order they have in the shape file. This is easy: create an ordinal index for the shapefile codes, paste it into Excel, and assign it to the data with a VLOOKUP function.
Second option: the smart R match
In R there is a function called match that returns a vector of the positions of first matches of the first argument in its second:
> my.match <- match(df$Code, shape.file$Code)
NB: to use match, the two variables providing the code for the areas must contain exactly the same set of unique codes. Otherwise funny stuff happens: match returns NA for any code it cannot find, and only the first position for a duplicated one. To check that everything is in its right place, you can plot the two “control” spatial polygons we chose in the beginning, using their position in the dataframe rather than in the shapefile.
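A minimal sketch of the reordering step itself, on an invented toy data frame. Note the argument order: match(df$Code, shape.file$Code) gives, for each row of the data frame, its position in the shapefile, whereas sorting the data frame into shapefile order needs the row of the data frame that belongs to each shapefile area:

```r
# Invented codes and values; shape_codes stands in for shape.file$Code
shape_codes <- c("15078", "15001", "15003", "15002")
df <- data.frame(Code = c("15001", "15002", "15003", "15078"),
                 TFR  = c(1.2, 1.5, 1.1, 1.8),
                 stringsAsFactors = FALSE)

# For each shapefile area, the row of df that holds its data
idx <- match(shape_codes, df$Code)
stopifnot(!anyNA(idx))            # every shapefile code must exist in df

# Reorder the data frame so that row i describes shapefile area i
df_sorted <- df[idx, ]
rownames(df_sorted) <- NULL

aligned <- identical(df_sorted$Code, shape_codes)
```

After this step, aligned should be TRUE and the values in df_sorted can be attached to the polygons by position.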

Game of Thrones maps in R…

The map of GOT world with rivers, roads, lakes, the Wall, and main cities:


Neighborhood relations according to Sphere of Influence pretty much coincide with roads and rivers (package spdep):


Paste some images to locate the (surviving) Stark family members, reading them with readPNG from the png library and drawing them with rasterImage:


Creating neighborhood matrices for Spatial Polygons

One of the first steps in spatial analysis is to create a neighborhood matrix, that is to say create a relationship/connection between each and (ideally!) every polygon. Why? Well, given that the premise for spatial analysis is that neighboring locations are more similar than far away locations, we need to define what is “near”, a set of neighbors for each location capturing such dependence.

There are many ways to define neighbors, and usually, they are not interchangeable, meaning that one neighborhood definition will capture spatial autocorrelation differently from another.

In R, the package spdep lets you create a neighbor matrix according to a wide range of definitions: contiguity, radial distance, graph based, and triangulation (and more). The three most commonly used families of neighbors are: 1) contiguity based, of order 1 or higher, 2) distance based, and 3) graph based.

Install and load the maptools and spdep libraries, then read the shapefile of North Carolina counties:

> NC <- readShapePoly(system.file("shapes/sids.shp", package="maptools")[1], IDvar="FIPSNO", proj4string=CRS("+proj=longlat +ellps=clrk66"))

1. Contiguity based relations are the most used in the presence of irregular polygons with varying shape and surface, since contiguity ignores distance and focuses instead on the location of an area. The function poly2nb creates two types of first order contiguity relations, from which higher orders can be derived:

  1. First Order Queen Contiguity defines a neighbor when at least one point on the boundary of one polygon is shared with at least one point of its neighbor (common border or corner);
    > nb.FOQ <- poly2nb(NC, queen=TRUE, row.names=NC$FIPSNO) #row.names refers to the unique names of each polygon
    Calling nb.FOQ gives a summary of the neighbor matrix, including the total number of areas/counties and the average number of links:

    Neighbour list object:
    Number of regions: 100
    Number of nonzero links: 490
    Percentage nonzero weights: 4.9
    Average number of links: 4.9
  2. First Order Rook Contiguity does not include corners, only borders, thus comprising only polygons sharing more than one boundary point;
    > nb.RK <- poly2nb(NC, queen=FALSE, row.names=NC$FIPSNO)
    > nb.RK
    Neighbour list object:
    Number of regions: 100
    Number of nonzero links: 462
    Percentage nonzero weights: 4.62
    Average number of links: 4.62
    NB: if there is a region without any link, there will be a message like this:
    Neighbour list object:
    Number of regions: 910
    Number of nonzero links: 4906
    Percentage nonzero weights: 0.5924405
    Average number of links: 5.391209
    10 regions with no links:
    1014 3507 3801 8245 9018 10037 22125 30005 390299 390399

    Here you can identify the regions with no links (1014, 3507,…). In R you can either connect them manually or switch to a neighbor definition that guarantees them a link (such as graph based neighbors).
  3. Higher order neighbors are useful when looking at the effect of lags on spatial autocorrelation and in spatial autoregressive models like SAR with a more global spatial autocorrelation:

> nb.FOQ <- poly2nb(NC, queen=TRUE, row.names=NC$FIPSNO) #first define the first order queen to get to further lags
> # Second Order Queen
> nb.SOQ <- nblag(nb.FOQ, 2) #2 is the maximum lag; for 6th order neighbors use nblag(nb.FOQ, 6). The result is a list with one neighbor object per lag, so nb.SOQ[[2]] holds the second order neighbors
> nb.RK <- poly2nb(NC, queen=FALSE, row.names=NC$FIPSNO) #same here
> # Second Order Rook
> nb.SRC <- nblag(nb.RK, 2)
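To make the contiguity definitions concrete, here is a small base R sketch on a 3 x 3 grid of square cells. It is a toy illustration of the definitions, not how poly2nb or nblag work internally. Rook neighbors share an edge, queen neighbors share an edge or a corner, and second order neighbors are those reached in exactly two steps:

```r
# 3 x 3 grid of square cells, numbered row by row (toy example, not the NC data)
cells <- expand.grid(col = 1:3, row = 1:3)
dx <- abs(outer(cells$col, cells$col, "-"))   # horizontal offset between cells
dy <- abs(outer(cells$row, cells$row, "-"))   # vertical offset between cells

rook  <- (dx + dy) == 1       # shared edge only
queen <- pmax(dx, dy) == 1    # shared edge or corner

# Second order rook neighbors: reachable in two rook steps,
# excluding the cell itself and its first order neighbors
rook_n      <- rook * 1                        # logical -> numeric for %*%
two_steps   <- (rook_n %*% rook_n) > 0
second_rook <- two_steps & !rook & (row(rook) != col(rook))

links_rook   <- sum(rook)          # counts each pair twice, like "nonzero links"
links_queen  <- sum(queen)
links_second <- sum(second_rook)
```

The central cell (number 5) has 4 rook neighbors and 8 queen neighbors, which is the difference the two NC summaries above reflect on real polygons.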

2. Distance based neighbors define a set of connections between polygons based either on (1) a given Euclidean distance between centroids (dnearneigh) or (2) a fixed number of nearest neighbors (knn2nb, e.g. the 5 nearest neighbors);

> coordNC <- coordinates(NC) #get centroids coordinates
> d05m <- dnearneigh(coordNC, d1=0, d2=0.5, row.names=NC$FIPSNO) #neighbors at a distance between d1 and d2; 0.5 is in the units of the coordinates (decimal degrees for this longlat shapefile, not miles)
> nb.5NN <- knn2nb(knearneigh(coordNC, k=5), row.names=NC$FIPSNO) #set the number of neighbors (here 5)


A little trick: if you want information on neighbor distances, whatever the type of neighborhood may be:
> distance <- unlist(nbdists(nb.5NN, coordNC))
> summary(distance)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.1197  0.3323  0.3956  0.4095  0.4716  0.9327
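The two distance based definitions behave quite differently, which a toy base R sketch can show. The four points are invented, and this only mimics the definitions behind dnearneigh and knn2nb rather than calling them:

```r
# Four invented points on a line; the last one is far from the others
pts <- cbind(x = c(0, 1, 2, 5), y = c(0, 0, 0, 0))
d   <- as.matrix(dist(pts))            # pairwise Euclidean distances

# (1) Distance band: neighbors are all points within 1.5 units
band   <- d > 0 & d <= 1.5
n_band <- unname(rowSums(band))        # number of neighbors per point

# (2) k nearest neighbors: each point gets exactly k neighbors
k   <- 2
knn <- t(apply(d, 1, function(dist_row) order(dist_row)[2:(k + 1)]))
# order(dist_row)[1] is the point itself (distance 0), so it is skipped

# The band relation is symmetric; the knn relation need not be:
# point 4 lists point 3 as a neighbor, but point 3 does not list point 4
knn_4_has_3 <- 3 %in% knn[4, ]
knn_3_has_4 <- 4 %in% knn[3, ]
```

With a distance band the remote point ends up with no neighbors at all, while k nearest neighbors always gives it links, at the price of an asymmetric relation.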

3. Graph based

  1. Delaunay triangulation (tri2nb) constructs neighbors from the Delaunay triangles (the dual of the Voronoi diagram), such that each centroid is a triangle node. As a consequence, DT ensures that every polygon has a neighbor, even in the presence of islands. The “problem” with this specification is that it treats our area of study as if it were an island itself, without any neighbors (as if North Carolina were an island with no Virginia or South Carolina)… Therefore, distant points that would otherwise not be neighbors (such as Cherokee and Brunswick counties) become neighbors;
  2. Gabriel Graph (gabrielneigh) is a subgraph of the DT, where two points/centroids a and b are neighbors if no other point/centroid lies inside the circle with diameter ab;
  3. Sphere of Influence (soi.graph): two points a and b are SOI neighbors if the circles centered on a and b, each with radius equal to that point's nearest neighbor distance, intersect twice. It is a sort of Delaunay triangulation without the longest connections;
  4. Relative Neighbors (relativeneigh) is a subgraph of the GG. An edge ab belongs to the RN graph if the intersection of the two circles centered on a and b with radius ab does not contain any other point.

> IDs <- row.names(as(NC, "data.frame")) #create a vector with the names of each polygon, here NC$FIPSNO
> delTrinb <- tri2nb(coordNC, row.names = IDs) #Delaunay triangulation
> summary(unlist(nbdists(delTrinb, coordNC)))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.1197  0.3473  0.4154  0.5673  0.5187  5.5830
> SOInb <- graph2nb(soi.graph(delTrinb, coordNC), row.names = IDs) #Sphere of influence
> summary(unlist(nbdists(SOInb, coordNC)))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1197 0.3257 0.3919 0.3958 0.4629 0.6460
> GGnb <- graph2nb(gabrielneigh(coordNC), row.names = IDs) #Gabriel graph
> summary(unlist(nbdists(GGnb, coordNC)))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1197 0.3191 0.3715 0.3813 0.4364 0.6777
> RNnb <- graph2nb(relativeneigh(coordNC), row.names = IDs) #Relative neighbor (or relative graph)
> summary(unlist(nbdists(RNnb, coordNC)))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1197 0.2984 0.3464 0.3382 0.3815 0.5187
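The Gabriel and relative neighbor conditions are easy to confuse, so here is a base R sketch that applies both definitions by brute force to three invented points. It illustrates the geometry only; gabrielneigh and relativeneigh are the functions to use on real data:

```r
pts <- rbind(c(0, 0), c(2, 0), c(1, 1.2))   # invented toy coordinates
d2  <- as.matrix(dist(pts))^2               # squared Euclidean distances
n   <- nrow(pts)

gabriel  <- matrix(FALSE, n, n)
relative <- matrix(FALSE, n, n)
for (a in seq_len(n - 1)) {
  for (b in (a + 1):n) {
    others <- setdiff(seq_len(n), c(a, b))
    # Gabriel: no other point strictly inside the circle with diameter ab
    # (a point p is inside when |pa|^2 + |pb|^2 < |ab|^2)
    in_circle <- d2[others, a] + d2[others, b] < d2[a, b]
    # Relative neighbor: no other point closer to both a and b
    # than a and b are to each other
    in_lune <- pmax(d2[others, a], d2[others, b]) < d2[a, b]
    gabriel[a, b]  <- gabriel[b, a]  <- !any(in_circle)
    relative[a, b] <- relative[b, a] <- !any(in_lune)
  }
}

n_gabriel  <- sum(gabriel) / 2    # number of Gabriel edges
n_relative <- sum(relative) / 2   # number of relative neighbor edges
```

In this toy case the edge between points 1 and 2 survives in the Gabriel graph but is dropped from the relative neighbor graph, because point 3 lies outside the circle with diameter 1-2 yet is closer to both endpoints than they are to each other. This is the same ordering visible in the summaries above, where the maximum link distance shrinks from Delaunay to Gabriel to relative neighbors.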


A space-time box plot of Spain’s TFR for 910 comarcas.

The idea behind spatial analysis is that space matters and near things are more similar: a variable measured in city A is (ideally) different from the same variable measured in city B. A simple way to get a feeling for this hypothesis and to represent it is through graphical visualization, usually one or more maps.


However, when dealing with time series, maps become cumbersome and some information is sometimes lost, such as the national average or the convergence path. Box plots are a simple yet very effective way to synthesize a lot of information in one graph. The following plot depicts TFR over a 30-year period for 910 Spanish areas with respect to the national central value (the thick black line in the middle of each box, which a box plot draws at the median across areas).

library(ggplot2)
p <- ggplot(dat, aes(x=factor(YEAR), y=TFR)) #refer to columns by name inside aes(), not as dat$TFR
p <- p + geom_boxplot()
p <- p + scale_y_continuous(limits=c(0,2.5)) + scale_x_discrete("YEAR", breaks=seq(1981,2011,by=5))
p #print the plot


A ggmap of 2015 Israeli elections by city

The recent Israeli elections are a reminder of how Demography and Space play a crucial role in the outcome of the 20th Knesset. For more insight, read the full Demotrends blog post by Ashira Menashe-Oren on the demographics of the Israeli electorate. The map has been done using ggmap and ggplot, two simple mapping tools I really like. If you are interested in the code, below you can find the relevant syntax and data.

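To start, load the libraries:

library(maptools) #tools for handling the shape file
library(rgdal) #readOGR() reads the shape file
library(ggplot2) #plotting layers
library(ggmap) #get_map() and ggmap()
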
Download the shape file (I normally use the DIVA-GIS website) and read it:

map.ogr<- readOGR(".","ISR_adm1")

Data set:

df <- structure(list(lon = c(35.148529, 35.303546, 34.753934, 34.781768,34.989571, 34.824785, 34.808871, 34.883879, 34.844675, 34.90761, 35.010397, 34.871326, 35.21371, 34.655314, 34.887762, 34.792501, 34.574252, 34.791462, 34.748019, 34.787384, 34.853196, 34.811272, 34.919652, 34.888075, 35.098051, 35.119773, 34.872938, 34.835226, 34.988099, 35.002462), lat = c(32.517127, 32.699635, 31.394548, 32.0853, 32.794046, 32.068424, 32.072176, 32.149961, 32.162413, 32.178195, 31.890267, 32.184781, 31.768319, 31.804381, 32.084041, 31.973001, 31.668789, 31.252973, 32.013186, 32.015833, 32.321458, 31.892773, 32.434046, 31.951014, 33.008536, 32.809144, 31.931566,32.084932, 31.747041, 31.90912), City = structure(c(30L, 19L,24L, 29L, 9L, 25L, 7L, 11L, 10L, 14L, 16L, 23L, 13L, 1L, 21L,28L, 2L, 4L, 3L, 12L, 20L, 27L, 8L, 15L, 18L, 22L, 26L, 6L, 5L, 17L), .Label = c("Ashdod", "Ashkelon", "Bat yam", "Beersheva",  "Beit  Shemesh", "Bnei brak", "Giv'atayim", "Hadera", "Haifa",  "Herzliyya", "Hod HaSharon", "Holon", "Jerusalem", "Kefar Sava",  "Lod", "Modi'in - Makkabbim - Re'ut", "Modi'in Illit", "Nahariyya", "Nazareth ", "Netanya", "Petach Tikva", "Qiryat Atta", "Ra'annana",  "Rahat", "Ramat gan", "Ramla", "Rehovot", "Rishon", "Tel-Aviv",  "Umm Al-Fahm"), class = "factor"), most.votes = c(96.28, 91.41,  87.62, 34.03, 24.98, 30.93, 40.1, 38.77, 34.2, 34.66, 28.95,  32.75, 23.9, 30.96, 27.87, 29.78, 39.31, 37.17, 32.88, 30.86,  33.14, 26.95, 31.77, 32.22, 34.25, 35.01, 39.1, 57.56, 27.89,  71.63), party = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,  2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("joint list", "labour", "likud", "yahadut hatora"), class = "factor")), .Names = c("lon", "lat",  "City", "most.votes", "party"), class = "data.frame", row.names = c(NA,  -30L))

Get the map using “get_map”:

gmap <- get_map(location=c(34.2,29.4,36,33.5),zoom=7,source="stamen",maptype="watercolor")

and plot the map:


ggmap(gmap) + #start from the base map downloaded above
  geom_polygon(aes(x = long, y = lat, group = id), data = map.ogr, color = "blue", fill = "white", alpha = .8, size = .4) +
  geom_point(aes(x = lon, y = lat, color = party, size = most.votes), data = df) +
  scale_colour_discrete("Coalition", labels = c("Joint List", "Labour", "Likud", "United Torah Judaism"), breaks = c("joint list", "labour", "likud", "yahadut hatora")) +
  scale_size_continuous(range = c(10, 15), guide = "none") + #point size encodes the winning vote share; no legend for it
  theme(axis.text = element_text(size = 18), plot.title = element_text(size = rel(3)), legend.key = element_rect(fill = "white"), legend.background = element_rect("white"), legend.text = element_text(size = 25), legend.title = element_text(size = 25)) +
  guides(colour = guide_legend(override.aes = list(size = 8))) +
  labs(x = "", y = "")

If you want to add city names you can use the “annotate” option, adding the code below after guides(...)+. I have modified the coordinates to avoid overlapping labels and colored the names to match the color of the winning party.

annotate("text", x=c(35.14853+0.2, 35.21371+0.15, 35.00246+0.15, 34.79146+0.15, 34.98957-0.08, 34.78177-0.14), y=c(32.51713, 31.76832, 31.90912, 31.25297, 32.79405, 32.08530), size=5, fontface="italic", label=c("Umm Al-Fahm", "Jerusalem", "Modi'in Illit", "Beersheva", "Haifa", "Tel Aviv"), color=c("darkred", "blue4", "deeppink4", "blue4", "springgreen4", "green4"))+

For beginners I highly recommend the ggplot2 mailing list, a great and shame-free place to learn.