2.6. Afroditi Grigoropoulou: Species Distribution Model with Random Forest

Video recording

2.6.1. Taxon: genus Prebaetodes (Mayfly), occurring in a drainage basin in Colombia

2.6.2. Goal: Predict which subcatchments provide suitable habitats for this genus

2.6.3. Classify the subcatchments of the basin as 1 or 0, based on whether their habitat is suitable or not

2.6.4. Random forest classification

2.6.4.1. 1. Extract predictors for each subcatchment of this drainage basin

2.6.4.2. 2. Extract the predictors for the subcatchments where the genus occurs, plus for 10,000 random subcatchments that will serve as pseudoabsence data

2.6.4.3. 3. Run SDM with random forest

picture 8
(https://ucmp.berkeley.edu/arthropoda/uniramia/ephemeroptera/mayfly.jpg)

2.6.5. 1. Extract predictors for each subcatchment of this drainage basin

Set the paths of the output (working) directory and the input directory holding the computational-unit data

export WDIR=xx     # working/output directory
export CUDIR=xx    # input directory with the CU_* folders
mkdir $WDIR/subc

Visualize species dataset

 head -5 $WDIR/taxa/Prebaetodes_points.csv
# order_ID dataset_ID   occurrence_ID   genus       family       longitude       latitude country year    subcatchment_ID basin_ID  CU     elevation
# E       051     000647545       Prebaetodes     Baetidae        -73.847944      6.0235  COL     2016    426156707       514490  33      1833.1
# E       051     000668399       Prebaetodes     Baetidae        -72.9787        5.94842 COL     2018    426173837       514490  33      3193.5
# E       051     000668413       Prebaetodes     Baetidae        -72.97908       5.94863 COL     2018    426173837       514490  33      3193.5
# E       051     001118376       Prebaetodes     Baetidae        -75.48049       4.45153 COL     2013    426478538       514490  33      2389.9

Define Computational Unit (CU)

export CU=$(awk -F'\t' 'FNR>1 {print $12}' $WDIR/taxa/Prebaetodes_points.csv  | sort | uniq)
echo CU $CU 
# CU 33

2.6.6. Computational unit

picture 1

Define basin

export BASIN=$(awk -F'\t' 'FNR>1 {print $11}' $WDIR/taxa/*.csv  | sort | uniq)
echo BASIN $BASIN
# BASIN 514490

2.6.7. Basin

picture 2

echo CU $CU BASIN $BASIN
# CU 33 BASIN 514490

2.6.8. Subcatchments

picture 3

Inspect bioclim statistics file

head -5 $CUDIR/CU_${CU}/out/stats_${CU}_bio5.txt
# subcID min max range mean sd
# 425395915 3018 3018 0 3018 0
# 425395916 3018 3018 0 3018 0
# 425395917 3018 3018 0 3018 0
# 425395918 3017 3019 2 3017.5393258427 0.520504935526613

Each row corresponds to a subcatchment. We need to extract only the subcatchments belonging to basin 514490.

Inspect file with subcatchment and basin IDs

 head -2 $CUDIR/CU_${CU}/out/stats_${CU}_BasinsIDs.txt
# subcID        X                   Y               MacrobasinID
# 425395915  -74.8245833333333 11.0579166666667      514490
# 425395916  -74.82125          11.0570833333333     514490

tail -2 $CUDIR/CU_${CU}/out/stats_${CU}_BasinsIDs.txt
# 428235083 -75.4554166666667 -14.99375             504121
# 428235084 -75.4479166666667 -14.9929166666667     485074

A computational unit includes many basins; we need to extract the subcatchments of a single basin.

First, extract all the subcatchment IDs of the given basin, based on the basin ID

grep -w $BASIN  $CUDIR/CU_${CU}/out/stats_${CU}_BasinsIDs.txt | awk '{print $1}'  \
    > $WDIR/subc/CU_${CU}_basin_${BASIN}_subc.txt

See the output file of subcatchment IDs

head -5 $WDIR/subc/CU_${CU}_basin_${BASIN}_subc.txt
# 425395915
# 425395916
# 425395917
# 425395918
# 425395919

Count the number of subcatchments in basin 514490

wc -l $WDIR/subc/CU_${CU}_basin_${BASIN}_subc.txt
# 1278276 

Define the path of the file containing the subcatchment IDs as a variable. We are going to use this file to extract the predictors for all the subcatchments of the Magdalena basin

export SUBCFILE=$WDIR/subc/CU_${CU}_basin_${BASIN}_subc.txt

2.6.9. Extract BIOCLIM data

See files

ls $CUDIR/CU_${CU}/out/stats*bio*.txt   | xargs -n 1 basename | head
# stats_33_bio10.txt
# stats_33_bio11.txt
# stats_33_bio12.txt
# stats_33_bio13.txt
# stats_33_bio14.txt
# stats_33_bio15.txt
# stats_33_bio16.txt
# stats_33_bio17.txt
# stats_33_bio18.txt
# stats_33_bio19.txt

Create output directory

mkdir $WDIR/bioclim

grep -w matches whole words only, -F treats the patterns as fixed strings (not regular expressions), and -f reads the patterns from the given file.
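
A toy example of why -w matters when the patterns are numeric IDs: without it, a pattern that is a prefix of a longer ID also matches (throwaway /tmp files, for illustration only).

printf '425395915\n' > /tmp/ids.txt
printf '425395915 3018\n4253959150 9999\n' | grep -Ff /tmp/ids.txt
# 425395915 3018
# 4253959150 9999
printf '425395915 3018\n4253959150 9999\n' | grep -wFf /tmp/ids.txt
# 425395915 3018

Now extract the five temperature variables (bio5, bio6, bio9, bio10 and bio11) for the basin subcatchments: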

for ibio in 5 6 9 10 11 ; do
    ## insert the header, prefixing each statistic with the variable name
    echo "subcID bio${ibio}_min bio${ibio}_max bio${ibio}_range bio${ibio}_mean bio${ibio}_sd" \
        > $WDIR/bioclim/stats_CU_${CU}_basin_${BASIN}_bio${ibio}.txt
    ## extract the rows of the basin subcatchments
    grep -wFf  $SUBCFILE   $CUDIR/CU_${CU}/out/stats_${CU}_bio${ibio}.txt  \
        >> $WDIR/bioclim/stats_CU_${CU}_basin_${BASIN}_bio${ibio}.txt
done

Check the output file

head -5 $WDIR/bioclim/stats_CU_${CU}_basin_${BASIN}_bio5.txt
# subcID bio5_min bio5_max bio5_range bio5_mean bio5_sd
# 425395915 3018 3018 0 3018 0
# 425395916 3018 3018 0 3018 0
# 425395917 3018 3018 0 3018 0
# 425395918 3017 3019 2 3017.5393258427 0.520504935526613

Same procedure, but parallelized with xargs: -n 1 passes one variable per bash invocation, -P 5 runs five of them in parallel, and the trailing _ fills $0 so that each variable arrives as $1

for ibio in 5 6 9 10 11 ; do
    echo  $ibio 
done   > $WDIR/biovars_xargs.txt

cat $WDIR/biovars_xargs.txt
# 5
# 6
# 9
# 10
# 11
cat $WDIR/biovars_xargs.txt | xargs -n 1 -P 5  bash -c $'
ibio=$1

    echo "subcID" bio${ibio}_min bio${ibio}_max bio${ibio}_range bio${ibio}_mean bio${ibio}_sd >  $WDIR/bioclim/stats_CU_${CU}_basin_${BASIN}_bio${ibio}.txt  

    grep -wFf $SUBCFILE $CUDIR/CU_${CU}/out/stats_${CU}_bio${ibio}.txt >> $WDIR/bioclim/stats_CU_${CU}_basin_${BASIN}_bio${ibio}.txt

' _

wc -l $WDIR/bioclim/* 
#   1278277 /home/ag2688/project/GC_2022//bioclim/stats_CU_33_basin_514490_bio10.txt
#   1278277 /home/ag2688/project/GC_2022//bioclim/stats_CU_33_basin_514490_bio11.txt
#   1278277 /home/ag2688/project/GC_2022//bioclim/stats_CU_33_basin_514490_bio5.txt
#   1278277 /home/ag2688/project/GC_2022//bioclim/stats_CU_33_basin_514490_bio6.txt
#   1278277 /home/ag2688/project/GC_2022//bioclim/stats_CU_33_basin_514490_bio9.txt

2.6.10. Extract land cover mean proportion per subcatchment

Inspect file

head -5  $CUDIR/CU_${CU}/out/mean_LCprop_${CU}.txt
# subcID c100 c10 c110 c120 c130 c140 c150 c160 c170 c180 c190 c200 c20 c210 c220 c30 c40 c50 c60 c70 c80 c90
# 425395915 0 0 0 0 0 0 0 0 0.256098 0 0 0 0 0.365854 0 0 0.0121951 0.365854 0 0 0 0
# 425395916 0 0 0 0 0 0 0 0 0.509524 0 0 0 0 0.128571 0 0 0.0904762 0.271429 0 0 0 0
# 425395917 0 0 0 0 0 0 0 0 0.333333 0 0 0.0641026 0 0.602564 0 0 0 0 0 0 0 0
# 425395918 0 0 0 0 0 0 0 0 0.91469 0 0 0 0 0.011236 0 0 0.0740741 0 0 0 0 0

Create output directory

mkdir $WDIR/LC

Create the output file by first inserting the header

echo $(head -1 $CUDIR/CU_${CU}/out/mean_LCprop_${CU}.txt) > $WDIR/LC/stats_CU_${CU}_basin_${BASIN}_mean_LCprop.txt

Extract the land cover values per subcatchment

grep -wFf  $SUBCFILE  $CUDIR/CU_${CU}/out/mean_LCprop_${CU}.txt  \
    >> $WDIR/LC/stats_CU_${CU}_basin_${BASIN}_mean_LCprop.txt
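
As a quick check, the line count should match the bioclim tables above (header plus one row per subcatchment of the basin):

wc -l $WDIR/LC/stats_CU_${CU}_basin_${BASIN}_mean_LCprop.txt
# 1278277 expected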

2.6.11. Extract mean and sd of elevation, flow accumulation and slope gradient per subcatchment

Inspect tables

head -5  $CUDIR/CU_${CU}/out/stats_${CU}_streamorder.txt
# subcID next_stream prev_str01 prev_str02 prev_str03 prev_str04 strahler horton shreve hack topo_dim scheidegger drwal_old length stright sinosoid cum_length flow_accum out_dist source_elev outlet_elev elev_drop out_drop gradient
# 425395915 425395926 0 0 0 0 1 1 1 2 6 2 1 1514.850301 1233.716941 1.227875 1514.850301 59446.808594 4630.377188 3.5 0.1 3.4 0 0.002244
# 425395916 425395933 0 0 0 0 1 1 1 2 7 2 1 1532.600832 1303.739763 1.175542 1532.600832 34791.675781 4739.185951 3.3 0.2 3.2 0.1 0.002088
# 425395917 425395921 0 0 0 0 1 1 1 2 2 2 1 1607.382511 1445.505848 1.111986 1607.382511 57713.539062 2827.177338 2 0 2 0 0.001244
# 425395918 425395964 0 0 0 0 1 1 1 3 19 2 1 2050.53521 1751.405198 1.170794 2050.53521 0.983474 8825.73779 7.2 0.5 6.7 0 0.003267

For loop

for VAR in elev slopgrad flow chandistdwseg chandistupseg streamorder; do 

    # create directory for the output
    [ ! -d $WDIR/$VAR ] && mkdir $WDIR/$VAR
    # delete the output file if it already exists
    [ -s $WDIR/$VAR/stats_CU_${CU}_basin_${BASIN}_${VAR}.txt ] && rm $WDIR/$VAR/stats_CU_${CU}_basin_${BASIN}_${VAR}.txt

    if [ $VAR != "streamorder" ] ; then 

        # create output file by first inserting the header
        echo "subcID ${VAR}_mean ${VAR}_sd" > $WDIR/$VAR/stats_CU_${CU}_basin_${BASIN}_${VAR}.txt 

        # extract the mean ($5) and sd ($6) per subcatchment
        grep -wFf  $SUBCFILE  <(awk '{print $1, $5, $6}' $CUDIR/CU_${CU}/out/stats_${CU}_${VAR}.txt)  \
            >> $WDIR/$VAR/stats_CU_${CU}_basin_${BASIN}_${VAR}.txt

    fi

    if [ $VAR == "streamorder" ] ; then 

        # create output file by first inserting the header
        echo "subcID strahler" > $WDIR/$VAR/stats_CU_${CU}_basin_${BASIN}_${VAR}.txt 

        # find the subcID and strahler columns by name and extract them for each subcatchment
        grep -wFf  $SUBCFILE \
        <(awk -v col1=subcID -v col2=strahler 'NR==1{for(i=1;i<=NF;i++){if($i==col1)c1=i; if ($i==col2)c2=i;}} {print $c1 " " $c2}' \
        $CUDIR/CU_${CU}/out/stats_${CU}_${VAR}.txt ) >> $WDIR/$VAR/stats_CU_${CU}_basin_${BASIN}_${VAR}.txt
       
    fi

done

Same procedure but with xargs

for VAR in elev slopgrad flow chandistdwseg chandistupseg streamorder ; do
    echo  $VAR 
done   > $WDIR/vars_xargs.txt

cat $WDIR/vars_xargs.txt
# elev
# slopgrad
# flow
# chandistdwseg
# chandistupseg
# streamorder
cat $WDIR/vars_xargs.txt | xargs -n 1 -P 6  bash -c $'
VAR=$1

    [ -s $WDIR/$VAR/stats_CU_${CU}_basin_${BASIN}_${VAR}.txt ] && rm $WDIR/$VAR/stats_CU_${CU}_basin_${BASIN}_${VAR}.txt

    if [ $VAR != "streamorder" ] ; then 

        [ ! -d $WDIR/$VAR ] && mkdir $WDIR/$VAR

        echo "subcID ${VAR}_mean ${VAR}_sd" > $WDIR/$VAR/stats_CU_${CU}_basin_${BASIN}_${VAR}.txt 

        grep -wFf  $SUBCFILE  <(awk \'{print $1, $5, $6}\' $CUDIR/CU_${CU}/out/stats_${CU}_${VAR}.txt)  >> $WDIR/$VAR/stats_CU_${CU}_basin_${BASIN}_${VAR}.txt
    fi


    if [ $VAR == "streamorder" ] ; then 
        [ ! -d $WDIR/$VAR ] && mkdir $WDIR/$VAR

        echo "subcID strahler" > $WDIR/$VAR/stats_CU_${CU}_basin_${BASIN}_${VAR}.txt 

        grep -wFf  $SUBCFILE <(awk -v col1=subcID -v col2=strahler \'NR==1{for(i=1;i<=NF;i++){if($i==col1)c1=i; if ($i==col2)c2=i;}} {print $c1 " " $c2}\' $CUDIR/CU_${CU}/out/stats_${CU}_${VAR}.txt ) >> $WDIR/$VAR/stats_CU_${CU}_basin_${BASIN}_${VAR}.txt
       
    fi
    
    echo $CU $BASIN $VAR complete


' _

2.6.12. Join all predictors in R

R
path <- "xx"   # set to your working directory ($WDIR)
setwd(path)
library(dplyr)
library(data.table)
library(tidyverse)

# all directories in the data folder
dirnames <- list.dirs(path, full.names = F, recursive = F) 
dirnames
#  [1] "bioclim"       "chandistdwseg" "chandistupseg" "elev"         
#  [5] "flow"          "LC"            "slopgrad"      "streamorder"  
#  [9] "subc"          "taxa"  

# Loop through all the directories, importing and joining the files of each
files_all <- list()
i <- 0
for(d in dirnames){
    i <- i + 1
    filenames <- list.files(d, full.names = T, all.files = T, pattern = "stats")
    if(!identical(filenames, character(0))) {
        # Import the files
        files <- lapply(filenames,  fread, header = T, stringsAsFactors = F, quote = "", keepLeadingZeros = T)
        # join based on subcID
        filesj <- files %>% reduce(left_join, by = "subcID")

        files_all[[i]] <- filesj
    }
}

# drop empty list slots (directories without "stats" files, e.g. subc), then join everything
files_all <- compact(files_all)
files_allj <- files_all %>% reduce(left_join, by = "subcID")
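
Quick sanity check before exporting: the row count should equal the number of subcatchment IDs counted with wc -l above, i.e. the left joins should not have duplicated or dropped any subcatchments.

dim(files_allj)
# expect 1278276 rows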

# export table with all the predictors for all the subcatchments of the basin
fwrite(files_allj, "stats_CU_33_basin_514490_all.txt", sep=" ", row.names=F, quote=F)

q()

Visualize selected columns in the final file

awk '{print $1, $2, $3, $25, $30, $NF}'  $WDIR/stats_CU_33_basin_514490_all.txt  | head -5
# subcID bio10_min bio10_max bio9_mean chandistupseg_sd strahler
# 425395915 3007 3007 2994 424.215512470263 1
# 425395916 3006 3007 2994 436.726188032742 1
# 425395917 3007 3007 2994 451.771830664826 1
# 425395918 3006 3007 2994 577.236937482746 1

2.6.13. 2. Extract the predictors for the subcatchments where the genus occurs, plus for 10,000 random subcatchments that will serve as pseudoabsence data

Write the IDs of the subcatchments where the genus occurs to a file

awk -F'\t' 'NR>1 {print $10}' $WDIR/taxa/Prebaetodes_points.csv | sort | uniq > $WDIR/taxa/Prebaetodes_subc.txt

head -5 $WDIR/taxa/Prebaetodes_subc.txt
# 426156707
# 426173837
# 426475409
# 426475754
# 426475915

Get the subcIDs of 10,000 random subcatchments that will serve as pseudoabsences

cat $SUBCFILE | shuf -n 10000 > $WDIR/taxa/pseudo_subc.txt
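
Note that shuf draws a different sample on each run. With GNU shuf you can make the draw reproducible by fixing the random source, e.g. (the seed value is arbitrary):

cat $SUBCFILE | shuf -n 10000 --random-source=<(yes 42) > $WDIR/taxa/pseudo_subc.txt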

Extract the predictors for the subcatchments of the presence file

for TAXON in Prebaetodes ; do 

    ## add header
    head -1  $WDIR/stats_CU_${CU}_basin_${BASIN}_all.txt \
        > $WDIR/taxa/stats_CU_${CU}_basin_${BASIN}_${TAXON}.txt

    awk  'NR==FNR{searchstr[$1]; next} $1 in searchstr' $WDIR/taxa/${TAXON}_subc.txt  $WDIR/stats_CU_${CU}_basin_${BASIN}_all.txt \
        >> $WDIR/taxa/stats_CU_${CU}_basin_${BASIN}_${TAXON}.txt
    
done
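
The awk one-liner above uses the two-file idiom: while reading the first file (NR==FNR), it stores the IDs in the array searchstr; for the second file it prints only the rows whose first column is in the array. A toy run with throwaway files:

printf '2\n3\n' > /tmp/want.txt
printf '1 a\n2 b\n3 c\n' > /tmp/table.txt
awk 'NR==FNR{searchstr[$1]; next} $1 in searchstr' /tmp/want.txt /tmp/table.txt
# 2 b
# 3 c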

Extract the predictors for the subcatchments of the pseudoabsence file

head -1  $WDIR/stats_CU_${CU}_basin_${BASIN}_all.txt \
    > $WDIR/taxa/stats_CU_${CU}_basin_${BASIN}_pseudo.txt    # Add header
awk  'NR==FNR{searchstr[$1]; next} $1 in searchstr' $WDIR/taxa/pseudo_subc.txt $WDIR/stats_CU_${CU}_basin_${BASIN}_all.txt \
    >> $WDIR/taxa/stats_CU_${CU}_basin_${BASIN}_pseudo.txt

2.6.14. For the prediction, we need the subcatchment raster to be cropped to the extent of the basin

Load GRASS78

source /home/jg2657/bin/grass78m

Set paths to hydrography data folder and the subcatchment data of the CU

export DATA=/gpfs/gibbs/pi/hydro/hydro/dataproces/MERIT_HYDRO
export SUBCATCHMENT=$DATA/CompUnit_basin_lbasin_clump_reclas/basin_lbasin_clump_${CU}.tif 

Extract basin of interest

grass78  -f -text --tmp-location  -c $DATA/lbasin_compUnit_tiles/bid${CU}.tif  <<EOF
r.external  input=$DATA/lbasin_compUnit_tiles/bid${CU}.tif  output=CU   --overwrite

r.external  input=$SUBCATCHMENT   output=subc   --overwrite

# mask out basins that don't have the desired basinID
r.mapcalc " CU_$BASIN = if (CU == $BASIN , $BASIN , null() ) "
g.region -a zoom=CU_$BASIN --o  # with -a, all four boundaries are adjusted to even multiples of the
                                # resolution, aligning the region to the resolution supplied by the user
                                # (the default aligns the region resolution to the region boundaries)

# repeat for the subcatchments
r.mapcalc " SUBC_$BASIN = if (CU_$BASIN == $BASIN , subc , null() ) "

# export
r.out.gdal --o -f -c -m createopt="COMPRESS=DEFLATE,ZLEVEL=9,INTERLEAVE=BAND,TILED=YES"  nodata=0   type=UInt32    format=GTiff input=SUBC_$BASIN output=$WDIR/subcatchment_${BASIN}.tif 


EOF

2.6.15. 3. Run SDM with random forest

R
path <- "xx"   # set to the directory with the taxa tables ($WDIR/taxa)

setwd(path)
pacman::p_load(data.table,ranger,parallel,doParallel,vip,raster,stringr,dplyr)


# Import presence subcatchments with their predictors
pres <- fread("stats_CU_33_basin_514490_Prebaetodes.txt", header=T, keepLeadingZeros=T, stringsAsFactors = F, sep=" ") 
head(pres)

# Import pseudoabsence subcatchments with their predictors
pseudo <- fread("stats_CU_33_basin_514490_pseudo.txt", header=T)
head(pseudo)

# Add a column that indicates presence or absence of the genus (0|1)
pres$occurrence <- rep(1, nrow(pres))
pseudo$occurrence <- rep(0, nrow(pseudo))

# join presences-absences
rf_data <- rbind.data.frame(pres, pseudo)

# Exclude some predictor columns
rf_data <- rf_data %>% select(!contains(c("min", "max", "range")))

# Rescale
# Define rescaling functions
slope_scale <- function(x, na.rm = F) (x*0.000001)
clim_scale <- function(x, na.rm = F) (x * 0.1)
offset <- function(x, na.rm = F) (x - 273.15)

# Apply rescaling functions
rf_data_rescale <- rf_data  %>%
    mutate(across(contains("slopgrad_mean"), slope_scale)) %>%
    mutate(across(matches("bio[0-9]+_mean"), clim_scale))  %>%
    mutate(across(matches("bio[0-9]+_mean"), offset))


rf_data <- rf_data_rescale
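
The rescaling can be sanity-checked: the scale and offset above imply that the raw temperature values are stored as Kelvin × 10, so a raw bio5 of 3018 becomes 3018 × 0.1 − 273.15 ≈ 28.7 °C, a plausible maximum temperature for this basin:

summary(rf_data$bio5_mean)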

## Convert occurrence column to factor to run the random forest
rf_data$occurrence <- as.factor(rf_data$occurrence)


## Split data into train and test
set.seed(37)
# train.idx <- sample(nrow(rf_data), 2/3 * nrow(rf_data))
# data_train <- rf_data[train.idx, ]
# data_test <- rf_data[-train.idx, ]


# In random forests there is no strict need for a separate test set to validate the results.
# Performance is estimated internally, during the run, as follows:
# as the forest is built on the training data, each tree is tested on the ~1/3 of the
# samples (36.8%) not used in building that tree (similar to a validation set).
# This is the out-of-bag (OOB) error estimate - an internal error estimate of a random
# forest as it is being constructed.
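
# Where does the 36.8% come from? Each tree is grown on n rows drawn with replacement,
# so a given row is left out of a tree with probability (1 - 1/n)^n, which approaches
# exp(-1) ~ 0.368 for large n:
n <- nrow(rf_data)
(1 - 1/n)^n
exp(-1)
# both ~ 0.368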


# For parallel processing
cl<-makePSOCKcluster(6)
registerDoParallel(cl)


## Down-sampling method for presence-only data
# calculate sub-samples
presNum <- nrow(rf_data[rf_data$occurrence==1,])


# Ranger with down-sampling: per tree, sample an equal fraction (= the share of presences)
# from each class, so the 11 presences are not swamped by the 10,000 pseudoabsences
spsize <- c("0" = presNum/nrow(rf_data), "1" = presNum/nrow(rf_data)) 
rg <- ranger(occurrence ~ ., 
                 data = rf_data[, 2:ncol(rf_data)],   # drop the subcID column
                 num.trees = 1000,
                 mtry = 6,
                 replace = T,
                 sample.fraction = spsize,
                 oob.error = T,
                 keep.inbag = TRUE,
                 num.threads = 6,
                 importance = 'impurity', 
                 probability = T) 
rg
# Type:                             Probability estimation 
# Number of trees:                  1000 
# Sample size:                      10011 
# Number of independent variables:  43 
# Mtry:                             6 
# Target node size:                 10 
# Variable importance mode:         impurity 
# Splitrule:                        gini 
# OOB prediction error (Brier s.):  0.09524617 


(mtry: the number of variables considered as split candidates at each node; 6 ≈ √43, i.e. ranger's default of ⌊√p⌋ for classification)


rg$confusion.matrix
#         predicted
# true      0    1
#    0    6025  642
#    1      2    5



# Variable importance plot
v1 <- vip(rg, num_features = 30, horizontal = T)
v1

picture 7

2.6.16. Predict with fitted model

subc_data <- fread("../stats_CU_33_basin_514490_all.txt", header=T)
# Exclude some columns 
subc_data <- subc_data %>% select(!contains(c("min", "max", "range")))

# Predict for every subcatchment of the basin (drop the subcID column)
pred <- predict(rg, data = subc_data[, -1])
# with probability = T, pred$predictions is a matrix of class probabilities;
# keep the probability of presence (class "1")
prediction <- data.frame(subc_data[, 1], pred_occur = pred$predictions[, "1"])
head(prediction)
#     subcID    pred_occur
# 1 425395915          0
# 2 425395916          0
# 3 425395917          0
# 4 425395918          0
# 5 425395919          0
# 6 425395920          0


## Reclassify subcatchment raster in GRASS 
## create file with reclassification rules
setDT(prediction)
prediction$reclass <- str_c(prediction$subcID," = ", round(prediction$pred_occur,2)*100)
prediction <- prediction[,'reclass']
prediction <- rbind.data.frame(as.data.frame(prediction), "* = 0")
head(prediction)

fwrite(prediction, "reclass_rules_pred_prob.txt", sep="\t", row.names=F, quote=F, col.names=F)    
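
Inspect the first rules: each line maps a subcatchment ID to its predicted probability × 100, and the final catch-all "* = 0" zeroes all remaining raster values (the values below follow from the head() shown above):

cat(head(readLines("reclass_rules_pred_prob.txt"), 3), sep = "\n")
# 425395915 = 0
# 425395916 = 0
# 425395917 = 0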

q()

2.6.17. Reclassify subcatchment raster based on predicted values

export MBFILE=$WDIR/subcatchment_${BASIN}.tif   # the raster exported earlier (basin 514490)


grass78 -f  --tmp-location  -c $MBFILE   <<EOF

    export filename=$(basename $MBFILE .tif  )
    r.in.gdal --o input=$MBFILE  output=subc    --overwrite

    # Reclassify
    r.reclass input=subc output=pred rules=$WDIR/taxa/reclass_rules_pred_prob.txt --overwrite

    # export maps
    r.out.gdal --o -f -m -c createopt="COMPRESS=DEFLATE,ZLEVEL=9" type=Int32  format=GTiff nodata=-9999 input=pred output=$WDIR/taxa/pred_prob.tif

EOF
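
Optionally verify the exported map with GDAL; the computed min/max should lie within 0-100 (probability × 100):

gdalinfo -mm $WDIR/taxa/pred_prob.tif | grep -i "Min/Max"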

2.6.18. Predicted habitat suitability

picture 9