set.seed(123)
library(tidyverse)
library(sf)
library(spdep)
library(INLA)INLA model Tutorial
Overview
This tutorial will cover setting up, running, and interpreting a spatial Bayesian INLA model for small area estimation.
Step 1 — Load packages
Visit https://www.r-inla.org/download/index.html if issues arise dl INLA package
Step 2 — Import data
This tutorial will utilize fake data - census tracts from the state of GA. Below, we can see this data contains 2,791 observations (GA tracts) identified by column GEOID. There is also an outcome column - “Count”, two fake covariate columns, one continuous and one categorical - “SES_INDEX” and “RURAL”. We also have the column “POP_EST” which is the tract population according to the 2023 ACS.
GA <- st_read("GA_inla_example.shp")Reading layer `GA_inla_example' from data source
`C:\Users\steph\OneDrive\Documents\inla\GA_inla_example.shp'
using driver `ESRI Shapefile'
Simple feature collection with 2791 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -85.60516 ymin: 30.35785 xmax: -80.84038 ymax: 35.00124
Geodetic CRS: NAD83
glimpse(GA)Rows: 2,791
Columns: 7
$ GEOID <chr> "13101880100", "13271950500", "13185011100", "13277960700", …
$ NAME <chr> "8801", "9505", "111", "9607", "15", "104.02", "232.12", "20…
$ Count <int> 0, 5, 6, 1, 5, 10, 2, 3, 3, 1, 4, 7, 5, 8, 1, 4, 7, 6, 10, 7…
$ POP_EST <dbl> 1358, 2897, 3643, 4639, 1803, 2988, 3542, 3546, 4144, 5141, …
$ SES_INDEX <dbl> 46.56105, 62.73499, 57.07618, 53.68892, 76.19922, 57.75992, …
$ Rural <chr> "Urban", "Urban", "Rural", "Urban", "Rural", "Urban", "Rural…
$ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-83.05618 3..., MULTIPOLYGON ((…
Step 3a - Check Possible Population Issues
INLA can be vague if it errors out. This is a good point to ensure 1) no geography has a population is zero, 2) missing, or 3) negative.
GA %>%
summarise(missing_pop = sum(is.na(POP_EST)),
zero_pop = sum(POP_EST == 0, na.rm=TRUE),
negative_pop= sum(POP_EST<0, na.rm=TRUE))Simple feature collection with 1 feature and 3 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -85.60516 ymin: 30.35785 xmax: -80.84038 ymax: 35.00124
Geodetic CRS: NAD83
missing_pop zero_pop negative_pop geometry
1 0 7 0 MULTIPOLYGON (((-81.27939 3...
Step 3b - Clean Population Issues
From the above output, we have 1 tract missing population, and 7 tracts with 0 population. These will need to be removed.
GA <- GA %>%
filter(!is.na(POP_EST), POP_EST!=0)Step 4 - Apply New ID
INLA requires each geography to have a unique ID that can be match to a neighbor matrix. We can also take the oppurtunity to transform the Rural column into a factor variable.
GA <- GA %>%
arrange(GEOID) %>%
mutate(new_id = row_number(),
Rural = as.factor(Rural))Step 5 - Neighbor Mesh Graph
A neighbor mesh or graph is similar to a spatial matrix. It allows INLA to discern what geographies are neighbors. The below uses a simple queen neighbor definition and transforms it so it can be used by INLA. However, because we likely have isolates due to removal of zero population tracts, were going to assign these isolates a nearest neighbor.
nb <- poly2nb(GA, queen=TRUE)
islands <- which(card(nb)==0)
coords <- st_coordinates(st_centroid(GA))
knn1 <- knearneigh(coords, k=1)
nb_knn1<- knn2nb(knn1)
for(i in islands){
nb[[i]] <- nb_knn1[[i]]
}
nb <- make.sym.nb(nb)
graph <- nb2INLA("graph.adj", nb)Step 6 - Formula
Now we can build our formula for our model and select prior parameters.
model_formula <- Count ~ SES_INDEX + Rural + f(new_id,
model = "bym2",
graph = "graph.adj",
scale.model=T, constr=T,
hyper=list(phi=list(prior="pc", param=c(0.5, 0.5), initial = 1),
prec=list(prior="pc.prec", param=c(1, 0.1), initial =1)))There’s a lot going on in the above formula so let’s review. The first line is a standard model formula, where “Count” is being predicted by “SES_INDEX” and “Rural” variables. In addition, we also have the “+f”… this is the spatial random effect where we identify locations according to the “new_id” we created. Next, were using a “bym2” framework which is the spatial mixed model for INLA and telling INLA where/what the neighbor graph is. Scale,model and constr functions mean we want to standardize the spatial effect and constrain it from 0 to 1 to make it easier to interpret.
The next lines are the Bayesian “meat and potatoes” - the priors. For INLA, the priors consist of two hyperparameters: 1) phi and 2) precision. Phi is the probability variation is due to spatial structure, compared to random noise. So in the above (0.5, 0.5) were saying there is a 50% prior probability that less than half of the variance is spatially structured, or the median prior value of phi is 0.5. Precision, or prec, is the inverse of that variance and thus controls the amount of smoothing of the random effect, or how much variability is allowed in the spatial effect. Above, we have (1, 0.1) which means we estimate there is only a 10% prior probability that the random-effect standard deviation exceeds 1 and thus estimate a smaller variability. Lastly, “pc” and “pc.prec” are a type of hyperparamter penalization that is standard with bym2 models. These are priors that encourage simpler models.
Step 7 - Model
Next, we use our formula to run our model. We input the formula above and select to offset our modek vt population since our objective is to calculate rates.
model <- inla(model_formula,
data = GA,
family="poisson",
offset = log(POP_EST),
control.predictor = list(compute=TRUE), verbose=TRUE)Step 8 - Model Summary
After the model finishes, we can interpret results.
summary(model)Time used:
Pre = 19.6, Running = 1.93, Post = 0.445, Total = 22
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) -8.894 0.053 -8.998 -8.894 -8.789 -8.894 0
SES_INDEX 0.040 0.001 0.038 0.040 0.042 0.040 0
RuralUrban -0.005 0.020 -0.045 -0.005 0.035 -0.005 0
Random effects:
Name Model
new_id BYM2 model
Model hyperparameters:
mean sd 0.025quant 0.5quant 0.975quant mode
Precision for new_id 775.519 1630.182 54.938 350.982 4237.348 124.205
Phi for new_id 0.227 0.231 0.006 0.137 0.836 0.011
Marginal log-Likelihood: -4366.48
is computed
Posterior summaries for the linear predictor and the fitted values are computed
(Posterior marginals needs also 'control.compute=list(return.marginals.predictor=TRUE)')
According to our model results, SES_INDEX appears to be correlated with the outcome of Count since the 95% CrI is all on one side of zero (0.038 to 0.042). To expand on this relationship we need to exponeniate the mean effect of 0.04 where e^(0.04) = 1.04. This means for every 1 unit increase in SES_INDEX, the outcome, or Count, is likely to increase by 4%. Comparatively, the factor variable Rural does not appear to be associated with the outcome since the 95% CrI does cross zero (-0.045 to 0.035).
Next, we can examine our random effects. First, notice the high mean value for precision. A high precision suggests a low variance, stronger smoothing, and relatively stable spatial structure. However, the 95% CrI for precision is extremely wide (54.938 to 4237.348) indicating the variability is not estimated well and may warrent rexamination of priors. Phi tells us the amount of residuals due to spatial structure compared to random noise. The phi in the model is fairly low, 0.227. This suggest after acounting for covariates, only 23% of the “noise” is due to the spatial structure. This may be due to the covariates explaining much of the variance or it could indicate Count is not very spatially dependent.