User Tools

Site Tools


wiki:spdistr1

Species Distribution Modeling I

In this session we will:

  1. Read in species occurrence and range data
  2. Pre-process environmental data
  3. Fit a generalized linear model to estimate the species distribution
  4. Predict across the landscape and write the results to disk (for use in GIS, etc.)


R code courtesy of Adam M. Wilson ( original version) and adapted by Marta Jarzyna, October 2015

SDM_I_code_final.R

 wget http://www.spatial-ecology.net/ost4sem/exercise/SDM/SDM_I_code_final.R
 R
 # Set working directory
 rm(list=ls(all=TRUE))
 wdir <- "/media/sf_LVM_share" 
 setwd(wdir)
 
 # Define data directotry - this is where we store the data, but also where we will save the output to
 datadir <- "/media/sf_LVM_share/data" 
 
 # if your shared folder is not writable, create a different folder for the output
 if(!file.exists("~/sdm_out")) dir.create("~/sdm_out", recursive=T)
 outdir <- "~/sdm_out"
 
 # Loading libraries
 require(rgdal)
 require(lattice)
 require(ggplot2)
    
 packages=c("hSDM","dismo","maptools","sp","maps","coda","rgdal","rgeos","doParallel","reshape",
         "ggplot2","knitr","rasterVis","texreg")
         
 l=lapply(packages, library, character.only=T,quietly=T)
 
 rasterOptions(chunksize=1000,maxmemory=1000)

We will use Montane woodcreper (Lepidocolaptes lacrymiger) as example species. This species has a large range, occurring from the coastal cordillera of Venezuela along the Andes south to south-east Peru and central Bolivia.

 # Read in species occurrence data and species range
 # Species: Montane woodcreper, Lepidocolaptes lacrymiger
 
 # Point occurrence data
 points <- as.data.frame(read.csv("/media/sf_LVM_share/data/occurrence/Lepidocolaptes_lacrymiger_allpoints.csv", header=TRUE))
 head(points)
 
 # indicate that these data are presences
 presence <- matrix(1,nrow(points),1)
 points <- cbind(points,presence)
 
 # retrieving spatial coordinates from a Spatial object
 points <- SpatialPointsDataFrame(points[,c(1,2)], points, coords.nrs = numeric(0), 
              proj4string = CRS(as.character(NA)), match.ID = TRUE)
 # assign projection
 projection(points)="+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
 points
 plot(points)
 
 # Species' range
 range <- readOGR(".", "cartodb-query")
 range
 plot(range, add=TRUE)

Loading eBird sampling dataset, in order to obtain “absence” data

 # link to global sampling raster
 gsampling=raster(file.path(datadir,"eBirdSampling_filtered.tif"))
 
 # crop to species range to create modelling domain
 sampling=crop(gsampling,range,file.path(outdir,"sampling.grd"),overwrite=T)   
 
 # assign projection
 projection(sampling)="+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
 
 # convert to points within data region
 samplingp=as(sampling,"SpatialPointsDataFrame")
 samplingp=samplingp[samplingp$eBirdSampling_filtered>0,]
 
 # edit column names to allow aligning with presence observations
 colnames(samplingp@data)=c("observation")
 samplingp$presence=0
 plot(samplingp, col="red")#absences
 plot(points, add=TRUE)#presences
 plot(range, col="blue",add=TRUE)#species range

Create a combined presence-nondetection point dataset

 pdata <- rbind(points[,"presence"],samplingp[,"presence"])
 pdata@data[,c("lon","lat")]=coordinates(pdata)
 table(pdata$presence)

Environmental data processing

 # you can add a lot of other variables, but in the interest of time we'll focus on these 4
 fenv <- c(cld="data/cloud/meanannual.tif", cld_intra="data/cloud/intra.tif", elev="data/elevation_mn_GMTED2010_mn.tif", forest="data/tree_mn_percentage_GFC2013.tif")

To facilitate modelling, let's crop the global rasters to a smaller domain. We can use the extent of the expert range and the crop() function in raster package.

 # crop to species domain and copy to project folder 
 
 env <- foreach(i=1:length(fenv))%do%{
 fo=file.path(outdir,paste0(names(fenv)[i],"_clipped.grd"))
 crop(raster(file.path(wdir,fenv[i])),range,file=fo,overwrite=T)   
 }

Read the environmental data in as a raster stack and plot layers

 env <- stack(list.files(path = outdir, pattern="*_clipped.grd$" , full.names = TRUE ))
 env
 
 # rename layers for convenience
 names(env)=names(fenv)
 
 # mask by elevation to set ocean to 0
 env=mask(env,env[["elev"]],maskvalue=0)
 
 # check out the plot
 plot(env)

Variable selection is tricky business and we're not going to dwell on it here… We'll use the following variables.

 vars=c("cld","cld_intra","elev","forest")

Scaling and centering the environmental variables to zero mean and variance of 1, using the scale function is typically a good idea. However, with so many people using this node at the same time, we'll skip this memory intensive step and use the unstandardized variables. The downside of this is the regression coefficients are more difficult to interpret.

 #senv=scale(env[[vars]])
 senv=env[[vars]]

Annotate the point records with the scaled environmental data; i.e., Add the (scaled) environmental data to each point

 pointsd=raster::extract(senv,pdata,sp=T) 
 pointsd=na.exclude(pointsd)
 
 # Look at the data table:
 kable(head(pointsd))

Explore the data

 # Plotting the response (presence/absence data) and the predictors:
 # convert to 'long' format for easier plotting
 pointsdl=reshape::melt(pointsd@data,id.vars=c("lat","lon","presence"),variable.name="variable"
 
 ggplot(pointsdl,aes(x=value,y=presence))+facet_wrap(~variable)+
 geom_point()+
 stat_smooth(method = "lm", formula = y ~ x + I(x^2), col="red")+
 geom_smooth(method="gam",formula=y ~ s(x, bs = "cs"))

Model Fitting. Here, we will fit a simple glm to the data. Choosing terms to include in a parametric model can be challenging, especially given the large number of possible interactions, etc. In this example we'll keep it fairly simple and include only a quadratic term for elevation (as suggested by the above plot). (Feel free to try various model formulas (adding or removing terms) and see how the model performs.)

 head(pointsd)
 
 #model 1
 m1 <- glm(presence~cld+elev,data=pointsd,family=binomial(logit))
 summary(m1)
 
 #model 2
 m2 <- glm(presence~cld+cld_intra+elev*I(elev^2)+forest,data=pointsd,family=binomial(logit))
 summary(m2)

Prediction; i.e., calculating estimates of probability of occurrence of Montane woodcreper for each cell. We can use the predict function in the raster package to make the predictions across the full raster grid and save the output.

    #predictions based on model 1
 p1 <- raster::predict(senv,m1,type="response", 
 file=file.path(outdir,"prediction_m1.grd"),overwrite=T)
 
 #predictions based on model 2
 p2 <- raster::predict(senv,m2,type="response",
 file=file.path(outdir,"prediction_m2.grd"),overwrite=T)
 
 p <- stack(p1,p2); names(p)=c("Model 1","Model 2")

Plot results of predictions on a map

 gplot(p,max=1e5)+geom_tile(aes(fill=value))+
 facet_wrap(~variable)+
 scale_fill_gradientn(colours=c("blue","green","yellow","orange","red"),
 na.value = "transparent")+
 geom_polygon(aes(x=long,y=lat,group=group),
 data=fortify(range),fill="transparent",col="darkred")+
 geom_point(aes(x = lon, y = lat), data = points@data,col="black",size=1)+
 coord_equal()

Model evaluation. gplot(p,max=1e5)+geomtile(aes(fill=value))+ facetwrap(~variable)+

scale_fill_gradientn(
  colours=c("blue","green","yellow","orange","red"),
  na.value = "transparent")+
geom_polygon(aes(x=long,y=lat,group=group),
             data=fortify(range),fill="transparent",col="darkred")+
geom_point(aes(x = lon, y = lat), data = points@data,col="black",size=1)+
coord_equal()

Model Evaluation. In general, it is a good idea to use k-fold data partitioning instead of using the data used for fitting. There is a function in the dismo package called kfold that makes this convenient. But for now, we'll just evaluate on the same data used for fitting.

 # Summarize model output using `screenreg` to print a more visually pleasing summary.
 screenreg(list(m1,m2),digits = 7,doctype=FALSE,align.center=TRUE)
 #htmlreg(list(m1,m2),digits = 7,doctype=FALSE,align.center=TRUE)

Caveats of SDMs using glms:

(1) In this example we treated eBird non-detections as absences when the probability of detection given presence can be much less than zero. What are the chances that an observer would see a species in a 1km grid cell if it were present there?
(2) We ignored the spatial autocorrelation in species presences and treated each observation as an independent sample. How can we account for this in SDMs?

Summary:

In this script we have illustrated a complete workflow, including:

(1) Reading in species data

(2) Pre-processing large spatial datasets for analysis

(3) Running a (simple) logistic GLM Species Distribution Model to make a prediction of p(occurrence|environment)

(4) Writing results to disk.

wiki/spdistr1.txt · Last modified: 2017/12/05 22:53 (external edit)