# Gidske L. Andersen: Topographic and hydrological influence on vegetation in an arid environment

[Video recording](https://youtu.be/fM4FdKlbu1I)

## Background

Arid lands are highly variable environments due to their water dependency and the high spatio-temporal variability of precipitation. But water is not only available to plants as run-off precipitation; it is also available as soil moisture (shallow and deep) and groundwater. These water resources are redistributed in and by the landscape. Vegetation has adapted to the variability of water resources in different ways and is found at places where water conditions are optimal, either in a short-term perspective (ephemeral species, where runoff is important) or in a long-term perspective (shrubs and trees, where soil moisture is stable).

In this case study I look into which topographical and hydrological parameters influence vegetation patterns and to what degree they can model these patterns. Landsat data is used to extract vegetation indices (the vegetation pattern) and SRTM data is used to derive hydrological and topographical predictors.

The pictures below show the area for the case study in the Red Sea Hills of Sudan (an overview and a close-up showing how vegetation can grow in the wadis).

```python
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib import rcParams
%matplotlib inline

# figure size in inches (optional)
rcParams['figure.figsize'] = 11, 8

# read images
img_A = mpimg.imread("area.jpg")
img_B = mpimg.imread("area_detail.jpg")

# display images
fig, ax = plt.subplots(1, 2)
ax[0].imshow(img_A)
ax[1].imshow(img_B)
```

![png](output_2_1.png)

**flow chart of processing**

```python
from IPython.display import Image
Image("figure_flowchart.png", width = 300)
```

![png](output_4_0.png)

## 1 GLAD ARD Landsat images

The Landsat Analysis Ready Data (GLAD ARD), developed by the Global Land Analysis and Discovery (GLAD) team, provides spatially and temporally consistent inputs for land cover mapping and change detection. The GLAD ARD represent a 16-day time series of tiled Landsat normalized surface reflectance from 1997 to the present (source: https://glad.umd.edu/book/overview). The GLAD ARD data can be downloaded from https://glad.umd.edu/dataset/landsat_v1.1/ after authentication.

In my project I'm using imagery from the Red Sea Hills, Sudan, a portion of tile 037E_18N. Since I am working on the virtual machine, I only selected one year of data, i.e. **2018**.

The temporal series for each tile is labeled following this numerical nomenclature:
- the first column is the year
- the remaining columns are the names of the 16-day time-series images

```python
!head -43 geodata/glad_ard/16d_intervals.csv | tail -6
```

2017 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874
2018 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897
2019 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920
2020 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943
2021 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966
2022 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989

### Data download

Data for 2018 was downloaded with this code, where user:passwd should be replaced with your own credentials:

```bash
%%bash
for INTER in $(grep -e ^2018 geodata/glad_ard/16d_intervals.csv | awk '{ $1=""; print $0}') ; do
    curl --connect-timeout 600 -u user:passwd -X GET https://glad.umd.edu/dataset/landsat_v1.1/18N/037E_18N/$INTER.tif -o project/glad_ard/$INTER.tif
    gdal_translate -co COMPRESS=DEFLATE -co ZLEVEL=9 -projwin 37 18.86666 37.4 18.63333 project/glad_ard/$INTER.tif project/glad_ard/${INTER}_crop.tif
done
```

The images have 8 bands, where band 3 (red) and band 4 (NIR) can be used to calculate vegetation indices. Band 8, BQA, is used for masking pixels with low quality (https://glad.umd.edu/ard/landsat-ard-methodology). In addition to clouds, there can also be low-quality pixels seen as stripes (caused by sensor problems; see figure below). Due to the arid characteristics of the area, the Soil Adjusted Vegetation Index (SAVI) is used:

SAVI = ((NIR - Red) / (NIR + Red + 0.5)) * 1.5

```python
%%capture
# converting to png for display
!gdal_translate -b 1 -b 2 -b 3 project/glad_ard/875_crop.tif project/glad_ard/875_crop.png
```

```python
import rasterio
from rasterio.plot import show

src = rasterio.open("project/glad_ard/875_crop_m.tif")
show(src, cmap='pink')
```

![png](output_10_0.png)

### Calculate vegetation index

```bash
%%bash
for file in project/glad_ard/*_crop.tif; do
    filename=$(basename $file .tif)
    #echo $file

    # simple ratio VI
    gdal_calc.py --type=Float32 --overwrite -A $file --A_band=3 -B $file --B_band=4 \
        --co COMPRESS=DEFLATE --co ZLEVEL=9 --calc="(B.astype(float))/(A.astype(float))" --outfile=project/glad_ard/${filename}_VI.tif \
        --quiet

    # NDVI
    gdal_calc.py --type=Float32 --overwrite -A $file --A_band=3 -B $file --B_band=4 \
        --co COMPRESS=DEFLATE --co ZLEVEL=9 --calc="(B.astype(float)- A.astype(float))/(B.astype(float) + A.astype(float))" \
        --outfile=project/glad_ard/${filename}_NDVI.tif \
        --quiet

    # SAVI
    gdal_calc.py --type=Float32 --overwrite -A $file --A_band=3 -B $file --B_band=4 \
        --co COMPRESS=DEFLATE --co ZLEVEL=9 --calc="((B.astype(float)- A.astype(float))/(B.astype(float) + A.astype(float) + 0.5)) * 1.5" \
        --outfile=project/glad_ard/${filename}_SAVI.tif \
        --quiet
done
```

### Masking and statistics

Only BQA values 1 and 2 are kept in the masking. Since band counting starts from 0, the BQA band is #7. The nodata value is set to the UInt16 maximum (65535), since using 0 would cause errors when estimating SAVI. All images were masked in the same way after calculating SAVI (since the input nodata value could not be set in this version of gdal_calc.py).
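For reference, the per-pixel logic implemented by the pktools commands below can be sketched in plain numpy/rasterio. This is a simplified illustration under the assumptions above (band 3 = red, band 4 = NIR, band 8 = BQA), not the pipeline actually used:

```python
import numpy as np
import rasterio

# read one cropped GLAD ARD tile (example path from the download step)
with rasterio.open("project/glad_ard/875_crop.tif") as src:
    red = src.read(3).astype(float)  # band 3: red
    nir = src.read(4).astype(float)  # band 4: NIR
    bqa = src.read(8)                # band 8: quality flags

# SAVI with soil adjustment factor L = 0.5
savi = (nir - red) / (nir + red + 0.5) * 1.5

# keep only BQA values 1 and 2; flag everything else as nodata
savi[~np.isin(bqa, [1, 2])] = -99
```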
```python
%%capture
%%bash
for file in project/glad_ard/*_crop.tif; do
    filename=$(basename $file .tif)
    echo $file
    echo $filename
    pkgetmask --overwrite -i $file -o project/glad_ard/${filename}_mask.tif -min 1 -max 2 -b 7
    pksetmask -i project/glad_ard/${filename}_SAVI.tif -m project/glad_ard/${filename}_mask.tif -msknodata 0 -nodata -99 \
        -o project/glad_ard/${filename}_SAVI_m.tif \
        --quiet
done
```

**Checking nodata and statistics for the VIs**

```bash
%%bash
gdalinfo project/glad_ard/875_crop_SAVI_m.tif | grep -E 'NoData'
```

NoData Value=-99

```bash
%%bash
for file in project/glad_ard/*crop_SAVI_m.tif; do
    pkinfo -i $file -stats | awk '{print $0}'
done
```

-min -0.211417 -max 0.611614 --mean 0.12386 --stdDev 0.0465675
-min 0.0631748 -max 0.210848 --mean 0.134271 --stdDev 0.0229662
-min -0.188215 -max 0.772617 --mean 0.126266 --stdDev 0.0579652
-min -0.31516 -max 0.715421 --mean 0.126566 --stdDev 0.0513284
-min -0.216388 -max 0.600698 --mean 0.124193 --stdDev 0.0397357
-min -0.274006 -max 0.610294 --mean 0.124246 --stdDev 0.0368988
-min -0.227176 -max 0.60059 --mean 0.124319 --stdDev 0.0350922
-min -0.0308992 -max 0.572593 --mean 0.123665 --stdDev 0.0308838
-min -0.197234 -max 0.620289 --mean 0.128406 --stdDev 0.0353955
-min -0.0114275 -max 0.636907 --mean 0.125323 --stdDev 0.0319995
-min 0.063293 -max 0.226574 --mean 0.11956 --stdDev 0.0187572
-min 0.07592 -max 0.250311 --mean 0.122435 --stdDev 0.019505
-min -0.00347078 -max 0.500646 --mean 0.127181 --stdDev 0.0314407
-min 0.0270167 -max 0.394247 --mean 0.128143 --stdDev 0.0302831
-min -0.0294599 -max 0.542999 --mean 0.125114 --stdDev 0.0315137
-min 0.0247256 -max 0.229192 --mean 0.109571 --stdDev 0.0199349
-min -0.234899 -max 0.79372 --mean 0.125687 --stdDev 0.0335479
-min -0.526694 -max 0.911127 --mean 0.126048 --stdDev 0.0363183
-min -0.105182 -max 0.686265 --mean 0.132825 --stdDev 0.030873
-min -0.307845 -max 0.969288 --mean 0.132675 --stdDev 0.0423124
-min -0.169741 -max 0.699102 --mean 0.135448 --stdDev 0.0472854
-min -0.41313 -max 0.955517 --mean 0.168209 --stdDev 0.0908494
-min -0.0180593 -max 0.620982 --mean 0.134354 --stdDev 0.0493728
-min -0.186645 -max 0.349589 --mean 0.1139 --stdDev 0.037978

**Extracting histograms for SAVI**

```python
%%capture
%%bash
for file in project/glad_ard/*_SAVI_m.tif; do
    filename=$(basename $file .tif)
    echo $filename
    pkstat -i $file -hist -nbin 20 -nodata -99 > project/glad_ard/${filename}_h.txt
done
```

```python
!more project/glad_ard/897_crop_SAVI_m_h.txt
```

-0.002083286643 7
0.02986876667 154
0.06182081997 6017
0.09377287328 44899
0.1257249266 33275
0.1576769799 15828
0.1896290332 7138
0.2215810865 3958
0.2535331398 1774
0.2854851931 1003
0.3174372464 313
0.3493892998 218
0.3813413531 134
0.4132934064 86
0.4452454597 45
0.477197513 31
0.5091495663 20
0.5411016196 9
0.5730536729 5
0.6050057262 2

## Generate stacks and extract statistics in pyjeo

I wanted to try out pyjeo after Dr. Pieter Kempeneers' presentation (http://spatial-ecology.net/docs/build/html/PKTOOLS/pyjeo_introduction.html). I used it to calculate the yearly mean, median, 90th percentile and standard deviation. There are probably cleaner ways to do this with some more pyjeo practice. One of the workarounds I ended up using was creating a test jim and copying the CRS and geotransform info from it to an "empty" jim that was then "filled" with all my timesteps/planes. Later I used these layers when extracting information for the random points.
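Conceptually, reducing a multi-plane cube with a nodata value (as `reducePlane` does below) is just a per-pixel nan-aware statistic over the plane (time) axis. A small numpy sketch of what `reducePlane(rule='mean', nodata=-99)` computes, using a made-up 3-plane cube:

```python
import numpy as np

# toy cube: 3 planes (time steps) of 2x2 pixels, with -99 marking nodata
cube = np.array([[[0.10, 0.12], [-99.0, 0.20]],
                 [[0.11, -99.0], [0.15, 0.22]],
                 [[0.12, 0.14], [0.16, -99.0]]])

cube[cube == -99] = np.nan       # mask nodata
mean = np.nanmean(cube, axis=0)  # per-pixel mean over the plane axis
print(mean)                      # [[0.11  0.13 ] [0.155 0.21 ]]
```

This is also why the standard deviation and the 90th percentile are computed below with `np.nanstd`/`np.nanpercentile` on `jim.np()` after assigning the nodata pixels to NaN.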
```python
from pathlib import Path
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import pyjeo as pj
import os
```

```python
fn = Path('project/glad_ard/875_crop_SAVI_m.tif')
test = pj.Jim(fn)
```

```python
#del jim  # delete if an old jim exists
# create a jim object with one empty plane
jim = pj.Jim(ncol = 1600, nrow = 933, otype='float32', nodata=-99)
jim.properties.copyGeoTransform(test)  # copy the geotransform from test
print(jim.properties.nrOfCol())
print(jim.properties.getBBox())
print(jim.stats.getStats())
print(jim.properties.nrOfPlane())  # show number of planes
```

1600
[37.0, 18.86675, 37.4, 18.633499999999998]
{'min': 0.0, 'max': 0.0, 'mean': 0.0}
1

```python
# iterate over the SAVI files in the directory
# check the number of files/planes
!ls project/glad_ard/???_crop_SAVI_m.tif | wc -l

# check this sort order
files = sorted(Path('project/glad_ard').glob('???_crop_SAVI_m.tif'))
for file in files:
    jim.geometry.stackPlane(pj.Jim(file))
print(jim.properties.nrOfPlane())  # show number of planes

# remove the first, empty plane; there is probably a better way to do this
jim = jim[[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]]
jim.properties.nrOfPlane()  # should be 23
```

23
24
23

```python
print(jim.stats.getStats())
```

{'min': -99.0, 'max': 0.9692878723144531, 'mean': -25.83536720275879}

```python
# calculate the mean over all planes, honouring nodata ("reduce" as in GEE)
mean = pj.geometry.reducePlane(jim, rule = 'mean', nodata = -99.0)
mean.stats.getStats()
```

{'min': -0.02188333310186863, 'max': 0.5145996809005737, 'mean': 0.12636782228946686}

```python
# calculate the median
median = pj.geometry.reducePlane(jim, rule = 'median', nodata = -99)
median.stats.getStats()
```

{'min': -0.006859873421490192, 'max': 0.5675770044326782, 'mean': 0.12463919818401337}

```python
# calculate the standard deviation
# std.pixops.convert('GDT_Float32')   # converts to float if needed
std = pj.Jim(mean)                    # works as a placeholder for the std to be calculated
jim.np()[jim.np()==-99] = np.nan      # all -99 assigned to nan
stdnp = np.nanstd(jim.np(), axis=0)   # std over the plane axis (axis 0)
std.np()[:] = stdnp
std.stats.getStats()
```

{'min': 0.001729343319311738, 'max': 0.2672453820705414, 'mean': 0.017469022423028946}

```python
# calculate the 90th percentile
per90 = pj.Jim(mean)
jim.np()[jim.np()==-99] = np.nan      # all -99 assigned to nan
per90np = np.nanpercentile(jim.np(), 90, axis=0)
per90.np()[:] = per90np
per90.stats.getStats()
```

{'min': 0.06932799518108368, 'max': 0.9273030161857605, 'mean': 0.14474131166934967}

```python
plt.gray()  # show the filtered result in grayscale
fig = plt.figure(figsize=(20,10))
ax1 = fig.add_subplot(141)
ax2 = fig.add_subplot(142)
ax3 = fig.add_subplot(143)
ax4 = fig.add_subplot(144)

ax1.imshow(mean.np())
ax2.imshow(std.np())
ax3.imshow(per90.np())
ax4.imshow(median.np())
plt.show()
```
![png](output_31_1.png)

```python
# save to files (io from the io module)
mean.io.write('project/glad_ard/mean_pyjeo.tif', co = ['COMPRESS=DEFLATE', 'ZLEVEL=9'])
median.io.write('project/glad_ard/median_pyjeo.tif', co = ['COMPRESS=DEFLATE', 'ZLEVEL=9'])
std.io.write('project/glad_ard/std_pyjeo.tif', co = ['COMPRESS=DEFLATE', 'ZLEVEL=9'])
per90.io.write('project/glad_ard/per90_pyjeo.tif', co = ['COMPRESS=DEFLATE', 'ZLEVEL=9'])
```

## 2 Creating topographical and hydrological layers in GRASS

Requirements for running GRASS:
- make a separate folder under /home/user/ to run GRASS in
- install addons for special algorithms, i.e. r.stream.order. The addons had to be cloned from GitHub and installed locally, because a changed download URL is not yet implemented in the current version of GRASS (7.8.5).

```bash
%%bash
mkdir -p /home/user/my_SE_data/exercise
cp -r /home/user/SE_data/exercise/grassdb /home/user/my_SE_data/exercise
git clone https://github.com/OSGeo/grass-addons.git /home/user/grass_addons

# run GRASS to install the extensions
cd /home/user/my_SE_data/exercise
grass78 -text grassdb/europe/PERMANENT/ << EOF
g.extension extension=r.stream.basins url=/home/user/grass_addons/src/raster/r.stream.basins/
g.extension extension=r.stream.distance url=/home/user/grass_addons/src/raster/r.stream.distance/
EOF
```

### Download SRTM

SRTM data (version 3, 30 m resolution) was downloaded from NASA Earthdata. See https://wiki.earthdata.nasa.gov/display/EL/How+To+Access+Data+With+cURL+And+Wget. Replace "USER" and "PASSWORD" with your own credentials.

```bash
%%bash
cd /home/user/
touch .netrc
echo "machine urs.earthdata.nasa.gov login USER password PASSWORD" > .netrc
chmod 0600 .netrc

#### create cookie ####
cd ~
touch .urs_cookies

##### download several files in one go and move them to a folder ####
curl -O -b ~/.urs_cookies -c ~/.urs_cookies -L -n "https://e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/2000.02.11/N18E0[36-37].SRTMGL1.hgt.zip"
mv N18* project/srtm/
cd project/srtm/
unzip "*.zip"

##### to speed things up, convert to tiff ####
for file in *.hgt; do
    filename=$(basename $file .hgt)
    gdal_translate -co COMPRESS=DEFLATE -co ZLEVEL=9 -co PREDICTOR=2 $file $filename.tif
done
```

### Calculate variables

A number of topographic and hydrological explanatory variables are calculated in GRASS. Since some of the hydrological variables depend on data for complete watersheds, the region is only readjusted to the smaller AOI (from the Landsat SAVI data) when the export is done (g.region). Distances from streams and from stream junctions are also calculated.

```python
%%capture
%%bash
# r.topidx doesn't work on lat/lon data
cp project/srtm/N18E037.tif /home/user/my_SE_data/exercise/
cd /home/user/my_SE_data/exercise
mkdir grassout

# --tmp-location: store a temporary location in the /tmp folder, deleted on exit
# -f: force removal of gislock; --text: text-based GUI
# -c geofile: create the location based on that file
# opens a GRASS text interface where the input is ended by EOF
grass78 -f -text --tmp-location -c /home/user/my_SE_data/exercise/N18E037.tif <<'EOF'
# silence verbose and debug output
g.gisenv set="GRASS_VERBOSE=-1","DEBUG=0"

echo "######## import the layers ########"
# not imported, just registered as a GRASS dataset
r.external input=N18E037.tif output=elv --o --q   # dem
r.external input=landsat/SAVI.tif output=savi --o --q

echo "######## calculate topographic layers ########"
r.slope.aspect elevation=elv slope=slope aspect=aspect pcurvature=pcurv tcurvature=tcurv dx=dx dy=dy dxx=dxx dyy=dyy dxy=dxy --o
#r.topidx input=elv output=topidx --o

echo "######## estimating watershed ########"
# threshold: minimum basin size in number of pixels
r.watershed elevation=elv threshold=2000 \
    accumulation=w_accum \
    tci=w_tci \
    spi=w_spi \
    drainage=w_dr_dir \
    basin=w_basin \
    stream=w_stream_w2k \
    length_slope=w_lSl \
    slope_steepness=w_SlSt \
    --o

echo "######## estimating distance from stream ########"
# input should come from the same function (r.watershed or r.stream.*)
r.stream.distance stream_rast=w_stream_w2k \
    direction=w_dr_dir \
    elevation=elv \
    method=downstream \
    distance=w_str_w2k_dist

echo "######## extract stream vector ########"
# extract the stream vector, including points for confluences
r.stream.extract elevation=elv \
    accumulation=w_accum \
    threshold=2000 \
    stream_vector=streamx_2kv \
    --o

echo "######## info stream vector ########"
# info about the stream vector
db.describe -c table=streamx_2kv
v.to.rast input=streamx_2kv type=point where="type_code = 1" output=str_2kvj_r use=attr attribute_column="type_code"

echo "######## estimating distance from junctions ########"
# -m for meters
r.grow.distance -m input=str_2kvj_r distance=str_2kvj_dist metric=geodesic

echo "######## exporting products ########"
# export all raster layers, with the region reduced to the AOI
g.region raster=savi
for file in $(g.list type=raster ); do
    r.out.gdal --o -f -c -m createopt="COMPRESS=DEFLATE,ZLEVEL=9" type=Float64 nodata=999999999 format=GTiff input=$file output=grassout/${file}.tif
done
EOF
```

## 3 Extract statistics for random points

Random points are created within classes of the mean SAVI, to make sure the whole range of SAVI values is represented. The SAVI classes are based on the histogram with 20 bins. The classification of SAVI is done in GRASS using a file with the recoding rules. Awk is used to generate the rules.txt file, which should have the format from_start:from_end:to_start:to_end (e.g. -0.08375:-0.05125:1:1), where to_start = to_end for classes. The random points are generated with pkextractimg.
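For readers less at home with awk, the rules file can also be built directly in Python. A rough equivalent of the awk/paste pipeline further down (a sketch, not the code actually used), assuming output.txt holds one "bin_value count" pair per line for the 20 bins:

```python
import pandas as pd

# build r.recode rules from the 20-bin histogram file
hist = pd.read_csv("output.txt", sep=" ", header=None, names=["bin", "count"])
with open("rules.txt", "w") as f:
    for i in range(len(hist) - 1):
        lo, hi = hist["bin"][i], hist["bin"][i + 1]
        f.write(f"{lo}:{hi}:{i + 1}:{i + 1}\n")  # from_start:from_end:class:class
```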
```python
#del jim
fn = Path('project/glad_ard/mean_pyjeo.tif')
jim = pj.Jim(fn)
jim.stats.getStats()
```

{'min': -0.02188333310186863, 'max': 0.5159205794334412, 'mean': 0.1263299584388733}

```python
!rm -r output.txt

stats = jim.stats.getStats('histogram', nbin=20, src_min=-0.1, src_max=0.55, nodata=-99)
with open("output.txt", "a") as f:
    for index, bin in enumerate(stats['bin']):
        print(bin, stats['histogram'][index], file=f)
```

```bash
%%bash
# make recode rules from the histogram txt-file
paste -d":" \
    <(awk '{print $1}' output.txt | head -n19) \
    <(awk 'NR>1{print $1}' output.txt) \
    <(printf "%s\n" {1..19}) \
    <(printf "%s\n" {1..19}) \
    > rules.txt
```

```python
!wc -l rules.txt
!more rules.txt
```

19 rules.txt
-0.08375:-0.05125:1:1
-0.05125:-0.01875:2:2
-0.01875:0.01375:3:3
0.01375:0.04625:4:4
0.04625:0.07875:5:5
0.07875:0.11125:6:6
0.11125:0.14375:7:7
0.14375:0.17625:8:8
0.17625:0.20875:9:9
0.20875:0.24125:10:10
0.24125:0.27375:11:11
0.27375:0.30625:12:12
0.30625:0.33875:13:13
0.33875:0.37125:14:14
0.37125:0.40375:15:15
0.40375:0.43625:16:16
0.43625:0.46875:17:17
0.46875:0.50125:18:18
0.50125:0.53375:19:19

**classify SAVI with GRASS**

```python
%%capture
%%bash
cp project/glad_ard/mean_pyjeo.tif /home/user/my_SE_data/exercise/
cd /home/user/my_SE_data/exercise
grass78 -f -text --tmp-location -c /home/user/my_SE_data/exercise/mean_pyjeo.tif <<'EOF'
g.gisenv set="GRASS_VERBOSE=-1","DEBUG=0"
echo "######## import the layers #############"
r.external input=mean_pyjeo.tif output=SAVI --o --q
echo "######## recode SAVI into classes #############"
r.recode input=SAVI output=SAVI_c rules=rules.txt
r.out.gdal --o -f -c -m createopt="COMPRESS=DEFLATE,ZLEVEL=9" format=GTiff input=SAVI_c output=SAVI_c.tif
EOF
```

**create random points**

```python
!pkextractimg -i project/glad_ard/mean_pyjeo.tif \
    -s project/glad_ard/SAVI_c.tif \
    -o project/glad_ard/random.sqlite -t 5
```

```python
!ogrinfo project/glad_ard/random.sqlite
```

INFO: Open of `project/glad_ard/random.sqlite' using driver `SQLite' successful.
1: training data (Point)

### Extract values for response and predictor variables

In this part, values of the predictor and response variables are extracted for all random points. I'm using pyjeo for this part. Since stacking requires the datasets to have the same datatype as well as CRS, the extraction is done separately for the response and the predictors.
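Since mismatched datatypes or projections are the most common stacking pitfall, a quick consistency check can save debugging time. A small optional rasterio sketch (not part of the original workflow):

```python
import rasterio
from pathlib import Path

# all rasters to be stacked should report the same dtype, CRS and dimensions
for f in sorted(Path('/home/user/my_SE_data/exercise/grassout').glob('*.tif')):
    with rasterio.open(f) as src:
        print(f.name, src.dtypes[0], src.crs, src.width, src.height)
```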
**Response variable**

```python
del jim
jim = pj.Jim(ncol = 1600, nrow = 933, otype='float32', nodata=-99)  # create an empty jim object
jim.properties.copyGeoReference(test)
print(jim.properties.nrOfCol())
print(jim.properties.getBBox())
print(jim.stats.getStats())

files = sorted(Path('project/glad_ard').glob('*pyjeo.tif'))
print(files)
for file in files:
    jim.geometry.stackPlane(pj.Jim(file))
jim.properties.nrOfPlane()  # show number of planes
jim = jim[[1,2,3,4]]  # select the last 4 planes
```

1600
[37.0, 18.86675, 37.4, 18.633499999999998]
{'min': 0.0, 'max': 0.0, 'mean': 0.0}
[PosixPath('project/glad_ard/mean_pyjeo.tif'), PosixPath('project/glad_ard/median_pyjeo.tif'), PosixPath('project/glad_ard/per90_pyjeo.tif'), PosixPath('project/glad_ard/std_pyjeo.tif')]

```python
# check the CRS of the layers
!ogrinfo -so project/glad_ard/random.sqlite 'training data' | grep -E 'EPSG'
!gdalinfo project/glad_ard/mean_pyjeo.tif | grep -E 'EPSG'
jim.properties.getProjection()
#test.properties.getProjection()
```

ID["EPSG",4326]]
'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]'

```python
vfn = 'project/glad_ard/random.sqlite'
v = pj.JimVect(vfn)
output = 'project/glad_ard/temp.sqlite'
Path(output).unlink(missing_ok = True)
extracted1 = pj.geometry.extract(v, jim, output=output, srcnodata = -99)
savi = pd.DataFrame(extracted1.dict())
savi.to_csv('project/glad_ard/savi.csv')
```

**Predictor variables**

```python
fn = Path('/home/user/my_SE_data/exercise/grassout/elv.tif')
del jim  # if an old jim exists
jim = pj.Jim(fn)
planes = jim.properties.nrOfPlane()  # number of planes, i.e. timesteps
col = jim.properties.nrOfCol()
row = jim.properties.nrOfRow()
print('no of planes, columns and rows:', planes, col, row)
```

no of planes, columns and rows: 1 1600 933

```python
print(jim.properties.getBBox())
print(jim.stats.getStats())
```

[37.0, 18.86675, 37.4, 18.633499999999998]
{'min': 6.0, 'max': 1530.0, 'mean': 576.3270136655949}

```python
%%capture
%%bash
# number of planes to be added
ls /home/user/my_SE_data/exercise/grassout/*.tif | wc -l

# check that all rasters have the same datatype
for file in /home/user/my_SE_data/exercise/grassout/*.tif; do
    echo $file
    gdalinfo $file -stats | grep -E 'Type' | awk '{print $4}'
done
```

```python
# iterate over the tif files in the directory
files = sorted(Path('/home/user/my_SE_data/exercise/grassout').glob('*.tif'))
print(files)
for file in files:
    jim.geometry.stackPlane(pj.Jim(file))
planes = jim.properties.nrOfPlane()
col = jim.properties.nrOfCol()
row = jim.properties.nrOfRow()
print('no of files, no of planes, columns and rows:', len(files), planes, col, row)
```

[PosixPath('/home/user/my_SE_data/exercise/grassout/aspect.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/dx.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/dxx.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/dxy.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/dy.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/dyy.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/elv.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/pcurv.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/savi.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/slope.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/str_2kvj_dist.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/str_2kvj_r.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/tcurv.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_SlSt.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_accum.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_basin.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_dr_dir.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_lSl.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_spi.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_str_w2k_dist.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_stream_w2k.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_tci.tif')]
no of files, no of planes, columns and rows: 22 23 1600 933

```python
!ogrinfo -so project/glad_ard/random.sqlite 'training data' | grep -E 'EPSG'
# confirm that the CRS is the same
jim.properties.getProjection()
```

ID["EPSG",4326]]
'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]'

```python
vfn = 'project/glad_ard/random.sqlite'
v = pj.JimVect(vfn)
output = 'project/glad_ard/temp1.sqlite'
Path(output).unlink(missing_ok = True)
extracted3 = pj.geometry.extract(v, jim, output=output, srcnodata = 999999999)
```

```python
topo = pd.DataFrame(extracted3.dict())
topo.to_csv('project/glad_ard/predictors.csv')
```

```python
topo.head()
```
```
   fid  label  b  t0b0  t1b0  t2b0  t3b0  t4b0  t5b0  t6b0  ...  t13b0  t14b0  t15b0  t16b0  t17b0  t18b0  t19b0  t20b0  t21b0  t22b0
0  0  11  0.097008  760.0  167.750107  -0.243415  -0.006419  -0.000278  0.052850  -0.003702  ...  -0.003823  16.299999  57.580921  298.0  3.0  0.078407  173.928513  42.452179  1.000000e+09  8.741459
1  1  10  0.082413  793.0  186.597610  -0.456937  -0.003307  -0.001944  -0.052850  -0.005112  ...  -0.004220  16.299999  18.216928  298.0  5.0  0.040093  153.324326  143.446518  1.000000e+09  7.134220
2  2  10  0.082921  808.0  187.867950  -0.529534  -0.002334  0.000556  -0.073177  -0.004231  ...  -0.003833  16.299999  13.026308  298.0  5.0  0.054766  115.282860  199.080109  1.000000e+09  6.180274
3  3  11  0.106816  877.0  135.514297  -0.269037  -0.004474  0.001944  0.264250  -0.000881  ...  -0.000656  0.030000  5.122952  298.0  3.0  0.030000  23.608582  824.232788  1.000000e+09  5.899572
4  4  11  0.114286  898.0  95.767517  -0.042704  -0.003501  0.001667  0.422801  -0.003173  ...  -0.002913  16.299999  2.950633  298.0  2.0  0.037655  23.699802  1175.459473  1.000000e+09  5.373617

5 rows × 26 columns
```
## 4 Modelling vegetation

```python
import pandas as pd
import numpy as np
import rasterio
from rasterio import *  # raster processing
from rasterio.plot import show
from pyspatialml import Raster
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10,6.5)
import seaborn as sns
```

### Prepare table

**import table and rename predictor variables**

```python
# read the data using pandas
predictors = pd.read_csv("project/glad_ard/predictors.csv", sep=",", index_col=False)
pd.set_option('display.max_columns', None)
predictors.columns
```

Index(['Unnamed: 0', 'fid', 'label', 'b', 't0b0', 't1b0', 't2b0', 't3b0', 't4b0', 't5b0', 't6b0', 't7b0', 't8b0', 't9b0', 't10b0', 't11b0', 't12b0', 't13b0', 't14b0', 't15b0', 't16b0', 't17b0', 't18b0', 't19b0', 't20b0', 't21b0', 't22b0'], dtype='object')

```python
!ls /home/user/my_SE_data/exercise/grassout/*.tif | xargs -n1 basename
```

aspect.tif
dx.tif
dxx.tif
dxy.tif
dy.tif
dyy.tif
elv.tif
pcurv.tif
savi.tif
slope.tif
str_2kvj_dist.tif
str_2kvj_r.tif
tcurv.tif
w_accum.tif
w_basin.tif
w_dr_dir.tif
w_lSl.tif
w_SlSt.tif
w_spi.tif
w_stream_w2k.tif
w_str_w2k_dist.tif
w_tci.tif

```python
print(files)
len(files)
```

[PosixPath('/home/user/my_SE_data/exercise/grassout/aspect.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/dx.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/dxx.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/dxy.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/dy.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/dyy.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/elv.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/pcurv.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/savi.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/slope.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/str_2kvj_dist.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/str_2kvj_r.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/tcurv.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_SlSt.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_accum.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_basin.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_dr_dir.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_lSl.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_spi.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_str_w2k_dist.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_stream_w2k.tif'), PosixPath('/home/user/my_SE_data/exercise/grassout/w_tci.tif')]
22

```python
# NB: sorting gave different results for the jim object (Python sorted) and in bash (case sensitivity)
names_list = ['aspect', 'dx', 'dxx', 'dxy', 'dy', 'dyy', 'elv', 'pcurv', 'savi', 'slope',
              'str_2kvj_dist', 'str_2kvj_r', 'tcurv', 'w_SlSt', 'w_accum', 'w_basin',
              'w_dr_dir', 'w_lSl', 'w_spi', 'w_str_w2k_dist', 'w_stream_w2k', 'w_tci']
len(names_list)
print(predictors.columns)
len(predictors.columns)
```

Index(['Unnamed: 0', 'fid', 'label', 'b', 't0b0', 't1b0', 't2b0', 't3b0', 't4b0', 't5b0', 't6b0', 't7b0', 't8b0', 't9b0', 't10b0', 't11b0', 't12b0', 't13b0', 't14b0', 't15b0', 't16b0', 't17b0', 't18b0', 't19b0', 't20b0', 't21b0', 't22b0'], dtype='object')
27

```python
predictors.columns = predictors.columns[:5].tolist() + names_list
print(predictors.columns)
```

Index(['Unnamed: 0', 'fid', 'label', 'b', 't0b0', 'aspect', 'dx', 'dxx', 'dxy', 'dy', 'dyy', 'elv', 'pcurv', 'savi', 'slope', 'str_2kvj_dist', 'str_2kvj_r', 'tcurv', 'w_SlSt', 'w_accum', 'w_basin', 'w_dr_dir', 'w_lSl', 'w_spi', 'w_str_w2k_dist', 'w_stream_w2k', 'w_tci'], dtype='object')

```python
len(predictors)
```

74610

```python
predictors.head()
```
```
   Unnamed: 0  fid  label  b  t0b0  aspect  dx  dxx  dxy  dy  dyy  elv  pcurv  savi  slope  str_2kvj_dist  str_2kvj_r  tcurv  w_SlSt  w_accum  w_basin  w_dr_dir  w_lSl  w_spi  w_str_w2k_dist  w_stream_w2k  w_tci
0  0  0  11  0.097008  760.0  167.750107  -0.243415  -0.006419  -0.000278  0.052850  -0.003702  760.0  -0.005648  0.086638  13.986955  1611.938110  1.000000e+09  -0.003823  16.299999  57.580921  298.0  3.0  0.078407  173.928513  42.452179  1.000000e+09  8.741459
1  1  1  10  0.082413  793.0  186.597610  -0.456937  -0.003307  -0.001944  -0.052850  -0.005112  793.0  -0.002830  0.083820  24.701620  1603.144653  1.000000e+09  -0.004220  16.299999  18.216928  298.0  5.0  0.040093  153.324326  143.446518  1.000000e+09  7.134220
2  2  2  10  0.082921  808.0  187.867950  -0.529534  -0.002334  0.000556  -0.073177  -0.004231  808.0  -0.001522  0.083765  28.127462  1601.273193  1.000000e+09  -0.003833  16.299999  13.026308  298.0  5.0  0.054766  115.282860  199.080109  1.000000e+09  6.180274
3  3  3  11  0.106816  877.0  135.514297  -0.269037  -0.004474  0.001944  0.264250  -0.000881  877.0  -0.003813  0.130181  20.661793  1625.434448  1.000000e+09  -0.000656  0.030000  5.122952  298.0  3.0  0.030000  23.608582  824.232788  1.000000e+09  5.899572
4  4  4  11  0.114286  898.0  95.767517  -0.042704  -0.003501  0.001667  0.422801  -0.003173  898.0  -0.002736  0.139849  23.023153  1658.049072  1.000000e+09  -0.002913  16.299999  2.950633  298.0  2.0  0.037655  23.699802  1175.459473  1.000000e+09  5.373617
```
**Cleaning up the table and calculating some new variables**

Columns that will not be used in the modelling are deleted from the table. Aspect is a circular variable and is linearised by calculating its sine and cosine. w_basin is a categorical variable; to test whether it is treated correctly and is meaningful, its values are also shuffled into a control variable (basin_shuf).

```python
predictors = predictors.drop(['Unnamed: 0', 'b', 't0b0', 'w_stream_w2k', 'savi', 'str_2kvj_r'], axis=1)
predictors.columns
```

Index(['fid', 'label', 'aspect', 'dx', 'dxx', 'dxy', 'dy', 'dyy', 'elv', 'pcurv', 'slope', 'str_2kvj_dist', 'tcurv', 'w_SlSt', 'w_accum', 'w_basin', 'w_dr_dir', 'w_lSl', 'w_spi', 'w_str_w2k_dist', 'w_tci'], dtype='object')

```python
# linearise aspect
import numpy as np
predictors['cos_asp'] = np.cos(np.radians(predictors['aspect']))
predictors['sin_asp'] = np.sin(np.radians(predictors['aspect']))
```

```python
# shuffle the categorical basin variable
predictors['basin_shuf'] = np.random.permutation(predictors['w_basin'].values)
predictors.head()
```
```
   fid  label  aspect  dx  dxx  dxy  dy  dyy  elv  pcurv  slope  str_2kvj_dist  tcurv  w_SlSt  w_accum  w_basin  w_dr_dir  w_lSl  w_spi  w_str_w2k_dist  w_tci  cos_asp  sin_asp  basin_shuf
0  0  11  167.750107  -0.243415  -0.006419  -0.000278  0.052850  -0.003702  760.0  -0.005648  13.986955  1611.938110  -0.003823  16.299999  57.580921  298.0  3.0  0.078407  173.928513  42.452179  8.741459  -0.977232  0.212176  7490.0
1  1  10  186.597610  -0.456937  -0.003307  -0.001944  -0.052850  -0.005112  793.0  -0.002830  24.701620  1603.144653  -0.004220  16.299999  18.216928  298.0  5.0  0.040093  153.324326  143.446518  7.134220  -0.993378  -0.114896  1646.0
2  2  10  187.867950  -0.529534  -0.002334  0.000556  -0.073177  -0.004231  808.0  -0.001522  28.127462  1601.273193  -0.003833  16.299999  13.026308  298.0  5.0  0.054766  115.282860  199.080109  6.180274  -0.990586  -0.136890  6642.0
3  3  11  135.514297  -0.269037  -0.004474  0.001944  0.264250  -0.000881  877.0  -0.003813  20.661793  1625.434448  -0.000656  0.030000  5.122952  298.0  3.0  0.030000  23.608582  824.232788  5.899572  -0.713425  0.700731  1154.0
4  4  11  95.767517  -0.042704  -0.003501  0.001667  0.422801  -0.003173  898.0  -0.002736  23.023153  1658.049072  -0.002913  16.299999  2.950633  298.0  2.0  0.037655  23.699802  1175.459473  5.373617  -0.100492  0.994938  1172.0
```
**import response variable and join to predictors**

```python
# read the response using pandas
savi = pd.read_csv("project/glad_ard/savi.csv", sep=",", index_col=False)
pd.set_option('display.max_columns', None)
print(savi.columns)
savi = savi.rename({'t0b0':'mean', 't1b0':'median', 't2b0':'per90', 't3b0':'std'}, axis='columns')
savi = savi.drop(['Unnamed: 0', 'b', 'label'], axis=1)
savi.head(10)
print(len(savi))
print(savi.columns)
```

Index(['Unnamed: 0', 'fid', 'label', 'b', 't0b0', 't1b0', 't2b0', 't3b0'], dtype='object')
74610
Index(['fid', 'mean', 'median', 'per90', 'std'], dtype='object')

```python
predictors = predictors.merge(savi, left_on = 'fid', right_on='fid', how = 'inner')
```

```python
len(predictors)
```

74610

```python
predictors.head()
```
```
   fid  label  aspect  dx  dxx  dxy  dy  dyy  elv  pcurv  slope  str_2kvj_dist  tcurv  w_SlSt  w_accum  w_basin  w_dr_dir  w_lSl  w_spi  w_str_w2k_dist  w_tci  mean  median  per90  std
0  0  11  167.750107  -0.243415  -0.006419  -0.000278  0.052850  -0.003702  760.0  -0.005648  13.986955  1611.938110  -0.003823  16.299999  57.580921  298.0  3.0  0.078407  173.928513  42.452179  8.741459  0.097008  0.094664  0.117911  0.016844
1  1  10  186.597610  -0.456937  -0.003307  -0.001944  -0.052850  -0.005112  793.0  -0.002830  24.701620  1603.144653  -0.004220  16.299999  18.216928  298.0  5.0  0.040093  153.324326  143.446518  7.134220  0.082413  0.082888  0.101135  0.020149
2  2  10  187.867950  -0.529534  -0.002334  0.000556  -0.073177  -0.004231  808.0  -0.001522  28.127462  1601.273193  -0.003833  16.299999  13.026308  298.0  5.0  0.054766  115.282860  199.080109  6.180274  0.082921  0.078278  0.113270  0.023337
3  3  11  135.514297  -0.269037  -0.004474  0.001944  0.264250  -0.000881  877.0  -0.003813  20.661793  1625.434448  -0.000656  0.030000  5.122952  298.0  3.0  0.030000  23.608582  824.232788  5.899572  0.106816  0.104386  0.123099  0.014474
4  4  11  95.767517  -0.042704  -0.003501  0.001667  0.422801  -0.003173  898.0  -0.002736  23.023153  1658.049072  -0.002913  16.299999  2.950633  298.0  2.0  0.037655  23.699802  1175.459473  5.373617  0.114286  0.117651  0.129462  0.015016
```
```python
predictors.describe()
```
```
       fid  label  aspect  dx  dxx  dxy  dy  dyy  elv  pcurv  slope  str_2kvj_dist  tcurv  w_SlSt  w_accum  w_basin  w_dr_dir  w_lSl  w_spi  w_str_w2k_dist  w_tci  cos_asp  sin_asp  basin_shuf  mean  median  per90  std
count  74610.000000  74610.000000  7.461000e+04  7.461000e+04  7.461000e+04  7.461000e+04  7.461000e+04  7.461000e+04  74610.000000  7.461000e+04  7.461000e+04  74610.000000  7.461000e+04  74610.000000  74610.000000  7.461000e+04  74610.000000  74610.000000  7.461000e+04  7.461000e+04  7.461000e+04  74610.000000  74610.000000  7.461000e+04  74610.000000  74610.000000  74610.000000  74610.000000
mean   37333.886986  11.610481  6.837265e+05  6.835545e+05  6.835545e+05  6.835545e+05  6.835545e+05  6.835545e+05  575.752379  6.835545e+05  6.835639e+05  1244.034957  6.835545e+05  6.898900  156.125340  1.102226e+07  4.388795  0.041021  6.843058e+05  1.101833e+07  6.835616e+05  0.149895  0.070254  1.102226e+07  0.126390  0.124655  0.144773  0.017460
std    21553.426532  0.961089  2.613611e+07  2.613611e+07  2.613611e+07  2.613611e+07  2.613611e+07  2.613611e+07  419.481320  2.613611e+07  2.613611e+07  831.669605  2.613611e+07  8.035924  11065.900632  1.043836e+08  2.454120  0.024518  2.613610e+07  1.043841e+08  2.613611e+07  0.700993  0.693699  1.043836e+08  0.033416  0.031926  0.045378  0.013877
min    0.000000  9.000000  0.000000e+00  -1.344040e+00  -1.182164e-01  -2.553215e-02  -1.341612e+00  -5.341714e-02  8.000000  -3.485471e-02  0.000000e+00  0.000000  -4.527013e-02  0.030000  -518632.187500  2.980000e+02  -4.000000  0.025332  1.446173e-01  0.000000e+00  1.318053e+00  -1.000000  -1.000000  2.980000e+02  0.018881  0.029393  0.071009  0.002483
25%    18669.250000  11.000000  6.651141e+01  -2.985837e-02  -1.551966e-03  -5.554170e-04  -3.658877e-02  -1.410363e-03  171.000000  -1.434810e-03  1.847334e+00  585.213623  -1.201079e-03  0.030000  1.378385  1.468000e+03  2.000000  0.030000  2.121571e+00  2.781997e+02  5.179891e+00  -0.532889  -0.614763  1.468000e+03  0.106657  0.105575  0.119463  0.009773
50%    37334.500000  11.000000  1.623830e+02  1.279920e-02  0.000000e+00  0.000000e+00  4.065462e-03  0.000000e+00  452.000000  -1.332938e-04  4.606714e+00  1060.480957  1.762322e-19  0.030000  2.929744  7.130000e+03  5.000000  0.030000  9.361089e+00  8.213889e+02  6.182429e+00  0.310322  0.123289  7.130000e+03  0.119984  0.119425  0.133374  0.012945
75%    55997.750000  12.000000  2.785243e+02  5.974620e-02  1.359604e-03  5.553937e-04  4.878521e-02  1.234086e-03  1037.000000  1.057664e-03  1.578789e+01  1755.921539  1.408627e-03  16.299999  10.037903  7.732000e+03  7.000000  0.040093  3.752013e+01  1.587121e+03  8.189266e+00  0.826665  0.732041  7.732000e+03  0.139325  0.138246  0.155657  0.019381
max    74664.000000  19.000000  1.000000e+09  1.000000e+09  1.000000e+09  1.000000e+09  1.000000e+09  1.000000e+09  1523.000000  1.000000e+09  1.000000e+09  4256.849121  1.000000e+09  16.299999  207631.109375  1.000000e+09  8.000000  0.409775  1.000000e+09  1.000000e+09  1.000000e+09  1.000000  1.000000  1.000000e+09  0.404548  0.414857  0.674392  0.238861
```
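Note the 1.000000e+09 values in the summary above: this is the nodata value (999999999) set when exporting from GRASS, and the `dropna()` below will not remove these points, since they are not NaN. If they should be excluded, one option (a suggestion, not applied in the original workflow) is to convert them to NaN first:

```python
import numpy as np

# optional: turn the GRASS export nodata value into NaN so that dropna() catches it
# (not applied in the original workflow)
predictors_clean = predictors.replace(999999999, np.nan)
```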
**Check for and remove nan-values**

```python
#predictors.describe()
print(predictors.isnull().values.any())
predictors = predictors.dropna()
predictors.isnull().values.any()
```

False
False

**checking correlations for some variables**

```python
predictors_sample = predictors[["mean", "std", "w_basin", "elv", "str_2kvj_dist"]].sample(5000)
sns.pairplot(predictors_sample, kind="reg", plot_kws=dict(scatter_kws=dict(s=2)))
plt.show()
```

![png](output_85_0.png)

```python
bins = np.linspace(min(predictors['mean']), max(predictors['mean']), 100)
plt.hist((predictors['mean']), bins, alpha=0.8);
```

![png](output_86_0.png)

```python
predictors.columns  # get the names
```

Index(['fid', 'label', 'aspect', 'dx', 'dxx', 'dxy', 'dy', 'dyy', 'elv', 'pcurv', 'slope', 'str_2kvj_dist', 'tcurv', 'w_SlSt', 'w_accum', 'w_basin', 'w_dr_dir', 'w_lSl', 'w_spi', 'w_str_w2k_dist', 'w_tci', 'cos_asp', 'sin_asp', 'basin_shuf', 'mean', 'median', 'per90', 'std'], dtype='object')

```python
predictors.shape
```

(74610, 28)

```python
# remove the SAVI-related columns and str_2kvj_dist's companions; keep the rest as predictors
X = predictors.iloc[:,[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 17, 18, 19, 20, 21, 22, 23]].values
Y = predictors.iloc[:,24].values  # response variable (mean SAVI)
feat = predictors.iloc[:,[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 17, 18, 19, 20, 21, 22, 23]].columns.values
predictors.iloc[:,[24]].columns.values
```

array(['mean'], dtype=object)

```python
print(feat)
```

['aspect' 'dx' 'dxx' 'dxy' 'dy' 'dyy' 'elv' 'pcurv' 'slope' 'str_2kvj_dist' 'tcurv' 'w_accum' 'w_basin' 'w_lSl' 'w_spi' 'w_str_w2k_dist' 'w_tci' 'cos_asp' 'sin_asp' 'basin_shuf']

```python
print(Y.shape)
print(X.shape)
```

(74610,)
(74610, 20)

```python
Y
```

array([0.09700802, 0.08241346, 0.0829206 , ..., 0.13141605, 0.12182923, 0.11685393])

**split data into test and training datasets**

```python
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=24)  # split 50/50 into training and testing
y_train = np.ravel(Y_train)
y_test = np.ravel(Y_test)
```

### Random forest

The random forest is defined and then run with different settings to find the best model.

```python
rf = RandomForestRegressor(random_state = 42)
rf.get_params()  # default parameters - these should be tuned! (not needed in R)
```

{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}

```python
# first tried with this default setting
rfReg = RandomForestRegressor(min_samples_leaf=50, oob_score=True)
# after running the chunks below, these settings seem to fit the data better:
rfReg = RandomForestRegressor(n_estimators=80, min_samples_leaf=20, max_depth=100, max_features=0.5, random_state=42, oob_score=True)
rfReg.fit(X_train, y_train);

dic_pred = {}
dic_pred['train'] = rfReg.predict(X_train)
dic_pred['test'] = rfReg.predict(X_test)
# Pearson correlation between predictions and observations, for training and test data
pearsonr_all = [pearsonr(dic_pred['train'], y_train)[0], pearsonr(dic_pred['test'], y_test)[0]]
pearsonr_all
```

[0.9151073853228917, 0.8930071550321135]

```python
# checking the oob score
rfReg.oob_score_
```

0.7825391349776363

```python
plt.rcParams["figure.figsize"] = (8,6)
plt.scatter(y_train, dic_pred['train'])
plt.xlabel('training SAVI mean (all rows)')
plt.ylabel('training prediction')
ident = [0, 0.5]
plt.plot(ident, ident, 'r--')
```

![png](output_99_1.png)

```python
impt = [rfReg.feature_importances_, np.std([tree.feature_importances_ for tree in rfReg.estimators_], axis=1)]
ind = np.argsort(impt[0])
```

```python
ind
```

array([13, 5, 3, 19, 7, 10, 2, 18, 0, 17, 11, 14, 16, 4, 8, 1, 15, 9, 6, 12])

```python
# feature importance
plt.rcParams["figure.figsize"] = (6,12)
plt.barh(range(len(feat)), impt[0][ind], color="b", xerr=impt[1][ind], align="center")
plt.yticks(range(len(feat)), feat[ind]);
```

![png](output_102_0.png)

```python
src1 = rasterio.open("/home/user/my_SE_data/exercise/grassout/w_basin.tif")
src2 = rasterio.open("/home/user/my_SE_data/exercise/grassout/elv.tif")
src3 = rasterio.open("/home/user/my_SE_data/exercise/grassout/str_2kvj_dist.tif")
src4 = rasterio.open("/home/user/my_SE_data/exercise/landsat/mean_pyjeo.tif")

fig, (src1p, src2p, src3p, src4p) = plt.subplots(1, 4, figsize=(21,7))
show(src1, ax=src1p, title='basins', vmin=0, vmax=10000, cmap='gist_rainbow')
show(src2, ax=src2p, title='elevation', vmin=0, vmax=1500, cmap='gist_rainbow')
show(src3, ax=src3p, title='str_2kvj_dist', vmin=0, vmax=5000, cmap='gist_rainbow')
show(src4, ax=src4p, title='mean SAVI', vmin=0, vmax=1, cmap='gist_rainbow')
```

![png](output_103_1.png)

### Random Forest tuning

* "max_features": the number of features to consider when looking for the best split.
* "max_samples": the number of samples to draw from X to train each base estimator.
* "n_estimators": the number of trees to grow. It must be large enough for the error to stabilize. Default is 100.
* "max_depth": the maximum number of levels in each decision tree.
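The manual sweeps below vary one parameter at a time. The same search could be done jointly with `GridSearchCV`, which is imported above but not used; a minimal sketch on the same training split (an alternative, not the approach taken here):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# joint search over (a subset of) the parameters swept manually below
param_grid = {
    'n_estimators': [50, 80, 100],
    'min_samples_leaf': [20, 50, 100],
    'max_features': [0.33, 0.5],
}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```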
```python
scores = []
for k in range(40, 100):
    rfReg = RandomForestRegressor(n_estimators=k, min_samples_leaf=50, max_features=0.33, max_depth=150, max_samples=0.7, n_jobs=-1, random_state=42, oob_score=True)
    rfReg.fit(X_train, y_train);
    dic_pred = {}
    dic_pred['train'] = rfReg.predict(X_train)
    dic_pred['test'] = rfReg.predict(X_test)
    scores.append([pearsonr(dic_pred['train'], y_train)[0], pearsonr(dic_pred['test'], y_test)[0]])

%matplotlib inline
plt.plot(range(40, 100), scores)
plt.xlabel('Value of n_estimators for the random forest regressor')
plt.ylabel('Prediction Accuracy')
```

Text(0, 0.5, 'Prediction Accuracy')

![png](output_105_1.png)

```python
scores = []
for k in [50, 100, 150]:
    rfReg = RandomForestRegressor(n_estimators=80, min_samples_leaf=50, max_features=0.33, max_depth=k, max_samples=0.7, n_jobs=-1, random_state=42, oob_score=True)
    rfReg.fit(X_train, y_train);
    dic_pred = {}
    dic_pred['train'] = rfReg.predict(X_train)
    dic_pred['test'] = rfReg.predict(X_test)
    scores.append([pearsonr(dic_pred['train'], y_train)[0], pearsonr(dic_pred['test'], y_test)[0]])

%matplotlib inline
plt.plot([50, 100, 150], scores)
plt.xlabel('Value of max_depth for the random forest regressor')
plt.ylabel('Prediction Accuracy')
```

Text(0, 0.5, 'Prediction Accuracy')

![png](output_106_1.png)

```python
scores = []
for k in [10, 20, 40, 50, 75, 100, 125, 150]:
    rfReg = RandomForestRegressor(n_estimators=80, min_samples_leaf=k, max_features=0.33, max_depth=50, max_samples=0.7, n_jobs=-1, random_state=42, oob_score=True)
    rfReg.fit(X_train, y_train);
    dic_pred = {}
    dic_pred['train'] = rfReg.predict(X_train)
    dic_pred['test'] = rfReg.predict(X_test)
    scores.append([pearsonr(dic_pred['train'], y_train)[0], pearsonr(dic_pred['test'], y_test)[0]])

%matplotlib inline
plt.plot([10, 20, 40, 50, 75, 100, 125, 150], scores)
plt.xlabel('Value of min_samples_leaf for the random forest regressor')
plt.ylabel('Prediction Accuracy')
# rfReg = RandomForestRegressor(n_estimators=80, min_samples_leaf=80, max_features=0.33, max_depth=50, max_samples=0.7, n_jobs=-1, random_state=42, oob_score=True)
```

Text(0, 0.5, 'Prediction Accuracy')

![png](output_107_1.png)

```python
scores = []
for k in [0.1, 0.2, 0.3, 0.4, 0.5]:
    rfReg = RandomForestRegressor(n_estimators=80, min_samples_leaf=50, max_features=k, max_depth=50, max_samples=0.7, n_jobs=-1, random_state=42, oob_score=True)
    rfReg.fit(X_train, y_train);
    dic_pred = {}
    dic_pred['train'] = rfReg.predict(X_train)
    dic_pred['test'] = rfReg.predict(X_test)
    scores.append([pearsonr(dic_pred['train'], y_train)[0], pearsonr(dic_pred['test'], y_test)[0]])

%matplotlib inline
plt.plot([0.1, 0.2, 0.3, 0.4, 0.5], scores)
plt.xlabel('Value of max_features for the random forest regressor')
plt.ylabel('Prediction Accuracy')
# rfReg = RandomForestRegressor(n_estimators=80, min_samples_leaf=80, max_features=0.33, max_depth=50, max_samples=0.7, n_jobs=-1, random_state=42, oob_score=True)
```

Text(0, 0.5, 'Prediction Accuracy')

![png](output_108_1.png)

### Predict on the raster using pyspatialml

`pip install pyspatialml`

**prediction with the most important variables**

```python
# keep only the ten most important variables
X = predictors[['elv', 'w_basin', 'str_2kvj_dist', 'w_str_w2k_dist', 'dx', 'dy', 'slope', 'w_tci', 'w_spi', 'w_accum']]
Y = predictors[['mean']]  # response variable
feat = predictors[['elv', 'w_basin', 'str_2kvj_dist', 'w_str_w2k_dist', 'dx', 'dy', 'slope', 'w_tci', 'w_spi', 'w_accum']].columns.values
predictors[['mean']].columns.values
```

array(['mean'], dtype=object)

```python
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=24)  # split 50/50 into training and testing
y_train = np.ravel(Y_train)
y_test = np.ravel(Y_test)
```

```python
rf = RandomForestRegressor(random_state = 42)
```

```python
rfReg = RandomForestRegressor(n_estimators=80, min_samples_leaf=20, max_depth=100, max_features=0.5, random_state=42, oob_score=True)
rfReg.fit(X_train, y_train);

dic_pred = {}
dic_pred['train'] = rfReg.predict(X_train)
dic_pred['test'] = rfReg.predict(X_test)
# Pearson correlation between predictions and observations, for training and test data
pearsonr_all = [pearsonr(dic_pred['train'], y_train)[0], pearsonr(dic_pred['test'], y_test)[0]]
pearsonr_all
```

[0.9133764136610967, 0.8939919955832134]

```python
# checking the oob score
rfReg.oob_score_
```

0.7855745474283174

```python
plt.rcParams["figure.figsize"] = (8,6)
plt.scatter(y_train, dic_pred['train'])
plt.xlabel('training SAVI mean (all rows)')
plt.ylabel('training prediction')
ident = [0, 0.5]
plt.plot(ident, ident, 'r--')
```

![png](output_115_1.png)

```python
impt = [rfReg.feature_importances_, np.std([tree.feature_importances_ for tree in rfReg.estimators_], axis=1)]
ind = np.argsort(impt[0])
```

```python
ind
```

array([8, 9, 7, 5, 6, 4, 3, 2, 0, 1])

```python
# feature importance
plt.rcParams["figure.figsize"] = (6,12)
plt.barh(range(len(feat)), impt[0][ind], color="b", xerr=impt[1][ind], align="center")
plt.yticks(range(len(feat)), feat[ind]);
```

![png](output_118_0.png)

**import and stack datasets for prediction**

```python
# import datasets
elv = rasterio.open("/home/user/my_SE_data/exercise/grassout/elv.tif")
w_basin = rasterio.open("/home/user/my_SE_data/exercise/grassout/w_basin.tif")
str_2kvj_dist = rasterio.open("/home/user/my_SE_data/exercise/grassout/str_2kvj_dist.tif")
w_str_w2k_dist = rasterio.open("/home/user/my_SE_data/exercise/grassout/w_str_w2k_dist.tif")
dx = rasterio.open("/home/user/my_SE_data/exercise/grassout/dx.tif")
dy = rasterio.open("/home/user/my_SE_data/exercise/grassout/dy.tif")
slope = rasterio.open("/home/user/my_SE_data/exercise/grassout/slope.tif")
w_tci = rasterio.open("/home/user/my_SE_data/exercise/grassout/w_tci.tif")
w_spi = rasterio.open("/home/user/my_SE_data/exercise/grassout/w_spi.tif")
w_accum = rasterio.open("/home/user/my_SE_data/exercise/grassout/w_accum.tif")
```

```python
predictors_rasters = [elv, w_basin, str_2kvj_dist, w_str_w2k_dist, dx, dy, slope, w_tci, w_spi, w_accum]
stack = Raster(predictors_rasters)
```

```python
result = stack.predict(estimator=rfReg, dtype='float64', nodata=-1)
```

```python
# plot the regression result
plt.rcParams["figure.figsize"] = (12,12)
result.iloc[0].cmap = "plasma"
result.plot()
plt.show()
```

![png](output_123_0.png)

```python
import rasterio
from matplotlib import pyplot
src = rasterio.open("/home/user/my_SE_data/exercise/landsat/mean_pyjeo.tif")
pyplot.imshow(src.read(1), cmap='plasma')
pyplot.show()
```

![png](output_124_0.png)

## 5 Follow up

This case study was primarily about learning the tools and how bash, Python, GRASS etc. can be combined in one environment. Obviously, many improvements could be made to the case study itself. Some are listed below:

- consider more explanatory variables and remove w_basin
- test deep learning methods
- look into other response variables that better capture the variability of the environment, and try to distinguish between long-lived and short-lived vegetation
- expand the area to include larger environmental variability