# Estimation of tree height using GEDI dataset - Random Forest prediction

In order to run this notebook, please do what follows:

cd  /media/sf_LVM_shared/my_SE_data/exercise
source \$HOME/venv/bin/activate
pip3 install pyspatialml
jupyter lab Tree_Height_03RF_pred

Find a relationship between tree height and enviromental predictors to be able to predict out side the GEDI observation.
The model, in its simplest form, looks like the following:
y ~ f ( a*x1 + b*x2 + c*x3 + …)  + ε,


where y is a response variable, the x’s are predictor variables, and ε is the associated error.

There are many different types of models

• Parametric models – make assumptions about the underlying distribution of the data.

• Maximum likelihood classification

• Discriminant analysis

• General linear models (ex. linear regression)

• Nonparametric models – make no assumptions about the underlying data distributions.

• Support Vector Machine

• Artificial neural networks

• Random Forest

## Random forest

Random forest is a Supervised Machine Learning Algorithm that is used widely in Classification (containing categorical variables) and Regression (response continuous variables) problems. It builds decision trees on different samples and takes their majority vote for classification and average in case of regression.

One of the most important features of the Random Forest Algorithm is that it can handle continuous and categorical variables predictors without any assamption on their data distribution, dimension, and even if the predictors are autocorrelated.

Bagging
Bagging, also known as Bootstrap Aggregation is the ensemble technique used by random forest. Choosesing a random sample from the data set it is creating a tree. Each tree is generated from the samples (Bootstrap Samples) provided by the Original Data with replacement known as row sampling. This step of row sampling with replacement is called bootstrap. Each is trained independently which generates results. The final output is based on majority voting after combining the results of all models. This step which involves combining all the results and generating output based on majority voting is known as tree aggregation.

Steps involved in random forest algorithm:

• Step 1: In Random forest n number of random records are taken from the data set having k number of records.

• Step 2: Individual decision trees are constructed for each sample.

• Step 3: Each decision tree will generate an output.
• Step 4: Final output is considered based on Majority Voting or Averaging for Classification and regression respectively.

[2]:

from IPython.display import Image
Image("../images/Random_Forest.jpg" , width = 700, height = 700)

[2]:

Random Forest exercise
In this exercise we will use Random Forest Regression algorithm with the sklearn python package to estimage the tree height. In last step, we are going to use the Pyspatialml for the spatial prediction on the raster files.
[7]:

import pandas as pd
import numpy as np
import rasterio
from rasterio import *
from rasterio.plot import show
# from pyspatialml import Raster
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.pipeline import Pipeline
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10,6.5)


## Import raw data, extracted predictors and show the data distribution

[4]:

predictors = pd.read_csv("tree_height/txt/eu_x_y_height_predictors_select.txt", sep=" ",  index_col=False)
pd.set_option('display.max_columns',None)
# change column name
predictors = predictors.rename({'dev-magnitude':'devmagnitude'} , axis='columns')

[4]:

ID X Y h BLDFIE_WeigAver CECSOL_WeigAver CHELSA_bio18 CHELSA_bio4 convergence cti devmagnitude eastness elev forestheight glad_ard_SVVI_max glad_ard_SVVI_med glad_ard_SVVI_min northness ORCDRC_WeigAver outlet_dist_dw_basin SBIO3_Isothermality_5_15cm SBIO4_Temperature_Seasonality_5_15cm treecover
0 1 6.050001 49.727499 3139.00 1540 13 2113 5893 -10.486560 -238043120 1.158417 0.069094 353.983124 23 276.871094 46.444092 347.665405 0.042500 9 780403 19.798992 440.672211 85
1 2 6.050002 49.922155 1454.75 1491 12 1993 5912 33.274361 -208915344 -1.755341 0.269112 267.511688 19 -49.526367 19.552734 -130.541748 0.182780 16 772777 20.889412 457.756195 85
2 3 6.050002 48.602377 853.50 1521 17 2124 5983 0.045293 -137479792 1.908780 -0.016055 389.751160 21 93.257324 50.743652 384.522461 0.036253 14 898820 20.695877 481.879700 62
3 4 6.050009 48.151979 3141.00 1526 16 2569 6130 -33.654274 -267223072 0.965787 0.067767 380.207703 27 542.401367 202.264160 386.156738 0.005139 15 831824 19.375000 479.410278 85
4 5 6.050010 49.588410 2065.25 1547 14 2108 5923 27.493824 -107809368 -0.162624 0.014065 308.042786 25 136.048340 146.835205 198.127441 0.028847 17 796962 18.777500 457.880066 85
5 6 6.050014 48.608456 1246.50 1515 19 2124 6010 -1.602039 17384282 1.447979 -0.018912 364.527100 18 221.339844 247.387207 480.387939 0.042747 14 897945 19.398880 474.331329 62
6 7 6.050016 48.571401 2938.75 1520 19 2169 6147 27.856503 -66516432 -1.073956 0.002280 254.679596 19 125.250488 87.865234 160.696777 0.037254 11 908426 20.170450 476.414520 96
7 8 6.050019 49.921613 3294.75 1490 12 1995 5912 22.102139 -297770784 -1.402633 0.309765 294.927765 26 -86.729492 -145.584229 -190.062988 0.222435 15 772784 20.855963 457.195404 86
8 9 6.050020 48.822645 1623.50 1554 18 1973 6138 18.496584 -25336536 -0.800016 0.010370 240.493759 22 -51.470703 -245.886719 172.074707 0.004428 8 839132 21.812290 496.231110 64
9 10 6.050024 49.847522 1400.00 1521 15 2187 5886 -5.660453 -278652608 1.477951 -0.068720 376.671143 12 277.297363 273.141846 -138.895996 0.098817 13 768873 21.137711 466.976685 70
[5]:

len(predictors)

[5]:

1267239


Plotting the data

[6]:

bins = np.linspace(min(predictors['h']),max(predictors['h']),100)
plt.hist((predictors['h']),bins,alpha=0.8);


Select only tree with height less then 70m (highest tree in germany 67m https://visit.freiburg.de/en/attractions/waldtraut-germany-s-tallest-tree)

[7]:

predictors_sel = predictors.loc[(predictors['h'] < 7000)  ].sample(100000)
predictors_sel.insert ( 4, 'hm' ,  predictors_sel['h']/100 ) # add a culumn of heigh in meter
len(predictors_sel)

[7]:

ID X Y h hm BLDFIE_WeigAver CECSOL_WeigAver CHELSA_bio18 CHELSA_bio4 convergence cti devmagnitude eastness elev forestheight glad_ard_SVVI_max glad_ard_SVVI_med glad_ard_SVVI_min northness ORCDRC_WeigAver outlet_dist_dw_basin SBIO3_Isothermality_5_15cm SBIO4_Temperature_Seasonality_5_15cm treecover
211869 211870 6.763981 49.923656 2504.00 25.0400 1533 13 2114 6069 -51.807003 -204555312 1.713994 -0.030101 356.479462 23 168.107666 0.701416 -27.481689 0.001850 15 667874 19.075720 444.208923 95
1243664 1243665 9.833275 49.526917 649.25 6.4925 1534 15 2102 6516 0.582322 -200281328 -0.082161 0.021150 338.562622 23 663.329102 66.410156 208.560791 -0.054791 8 801897 18.012062 487.378571 89
862800 862801 8.747207 49.396667 4145.25 41.4525 1464 13 2770 6187 -5.705454 -193031872 2.207440 0.082736 471.601471 29 682.764160 196.173340 410.016113 -0.105913 9 687733 19.400467 449.129852 85
184466 184467 6.692364 48.311297 902.75 9.0275 1503 16 2458 6333 2.532082 -153840256 -0.687262 -0.040297 326.341187 12 -155.665771 -179.105957 -273.086426 0.005218 11 938821 19.768513 504.342621 86
210193 210194 6.759972 48.427197 1215.75 12.1575 1499 11 2473 6283 -34.008671 -184876336 0.869934 0.029229 331.943359 26 214.386475 39.022705 -19.717529 -0.010323 20 937918 18.850109 450.874207 45
485841 485842 7.421320 48.972783 2629.50 26.2950 1443 13 2366 6249 -12.187319 -237638256 0.527734 0.184359 317.114014 25 55.184082 58.465332 91.532715 -0.045335 10 828316 19.876814 445.643494 73
120379 120380 6.482271 49.914016 2173.00 21.7300 1538 16 1847 6105 -17.632570 -197189520 -1.088760 -0.035505 272.008911 23 -126.860352 -195.746094 75.838867 -0.008322 18 728450 19.971024 459.339722 97
388316 388317 7.193769 48.376309 527.00 5.2700 1460 11 2972 6227 24.105930 -287456672 -1.292012 0.108061 523.706970 11 309.591309 440.787842 600.310059 0.088250 17 870673 20.666460 459.003754 11
825800 825801 8.663616 49.674021 2836.25 28.3625 1533 14 2789 6461 19.265608 -253300688 -0.924922 -0.056384 208.336411 26 155.725586 211.511230 500.423096 0.082024 10 649353 20.394165 502.623260 85
1116257 1116258 9.384341 49.143288 1670.00 16.7000 1527 14 2207 6644 8.773687 -330216832 -0.236958 -0.159152 255.992538 26 154.274658 110.248535 181.868408 0.074600 5 780885 19.781750 518.130005 72

Plotting response variables distribution

[8]:

bins = np.linspace(min(predictors_sel['hm']),max(predictors_sel['hm']),100)
plt.hist((predictors_sel['hm']),bins,alpha=0.8);


The Global Forest Canopy Height, 2019 map has been release in 2020 (scientific publication https://doi.org/10.1016/j.rse.2020.112165). The authors use a regression tree model that was calibrated and applied to each individual Landsat GLAD ARD tile (1 × 1◦) in a “moving window” mode. Such tree height estimation is storede in forestheight.tiff and in the table as forestheight column. A correlation plot and its pearson coefficient is show below.

[9]:

plt.rcParams["figure.figsize"] = (8,6)
plt.scatter(predictors['h']/100,predictors['forestheight'])
plt.xlabel('hm')
plt.ylabel('forestheight')
ident = [0, 50]
plt.plot(ident,ident,'r--')


[9]:

[<matplotlib.lines.Line2D at 0x7fa02257ddb0>]

[10]:

pearsonr_Publication_Estimation = pearsonr(predictors['h'],predictors['forestheight'])[0]
pearsonr_Publication_Estimation

[10]:

0.4527925129990046


We will try to beats such error estimation using a more advance ML tecnques and different enviromental predictors that better express the ecological condition.

## Data set splitting

Split the dataset (predictors_sel) in order to create response variable vs predictors variables (we are excluding forestheight predictors).

[11]:

   X = predictors_sel.iloc[:,[5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23]].values
Y = predictors_sel.iloc[:,4:5].values
feat = predictors_sel.iloc[:,[5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23]].columns.values


Double ceck that we select the right columns

[12]:

feat

[12]:

array(['BLDFIE_WeigAver', 'CECSOL_WeigAver', 'CHELSA_bio18',
'CHELSA_bio4', 'convergence', 'cti', 'devmagnitude', 'eastness',
'outlet_dist_dw_basin', 'SBIO3_Isothermality_5_15cm',
'SBIO4_Temperature_Seasonality_5_15cm', 'treecover'], dtype=object)

[13]:

Y.shape

[13]:

(100000, 1)

[14]:

X.shape

[14]:

(100000, 18)


Create 4 dataset for training and testing the algorithm

[15]:

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=24)
y_train = np.ravel(Y_train)
y_test = np.ravel(Y_test)


## Random Forest default parameters

Random Forest can be implemented using the RandomForestRegressor in scikit-learn

Training Random Forest using default parameters

[16]:

rf = RandomForestRegressor(random_state = 42)
rf.get_params()

[16]:

{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'mse',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': 42,
'verbose': 0,
'warm_start': False}

[17]:

rfReg = RandomForestRegressor(min_samples_leaf=50, oob_score=True)
rfReg.fit(X_train, y_train);
dic_pred = {}
dic_pred['train'] = rfReg.predict(X_train)
dic_pred['test'] = rfReg.predict(X_test)
pearsonr_all = [pearsonr(dic_pred['train'],y_train)[0],pearsonr(dic_pred['test'],y_test)[0]]
pearsonr_all

[17]:

[0.6043347073198263, 0.5189400044676861]

[18]:

# checking the oob score
rfReg.oob_score_

[18]:

0.2651550005190103

Additional resources how to reduce the oob error can found at

### Random Forest tuning

• “max_features”: number of features to consider when looking for the best split.

• “max_samples”: number of samples to draw from X to train each base estimator.

• “n_estimators”: identify the number of trees that must grow. It must be large enough so that the error is stabilized. Defoult 100.

• “max_depth”: max number of levels in each decision tree.

Using this pseudo code tune the parameters in order the improve the algoirithm performance

pipeline = Pipeline([('rf',RandomForestRegressor())])

parameters = {
'rf__max_features':(3,4,5),
'rf__max_samples':(0.5,0.6,0.7),
'rf__n_estimators':(500,1000),
'rf__max_depth':(50,100,200,300)}

grid_search = GridSearchCV(pipeline,parameters,n_jobs=6,cv=5,scoring='r2',verbose=1)
grid_search.fit(X_train,y_train)

rfReg = RandomForestRegressor(n_estimators=5000,max_features=0.33,max_depth=500,max_samples=0.7,n_jobs=-1,random_state=24 , oob_score = True)
rfReg.fit(X_train, y_train);
dic_pred = {}
dic_pred['train'] = rfReg.predict(X_train)
dic_pred['test'] = rfReg.predict(X_test)
pearsonr_all_tune = [pearsonr(dic_pred['train'],y_train)[0],pearsonr(dic_pred['test'],y_test)[0]]
pearsonr_all_tune

grid_search.best_score_

print ('Best Training score: %0.3f' % grid_search.best_score_)
print ('Optimal parameters:') best_par = grid_search.best_estimator_.get_params()
for par_name in sorted(parameters.keys()):
print ('\t%s: %r' % (par_name, best_par[par_name]))


Tracking the error rate trend as in https://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html

[19]:

plt.rcParams["figure.figsize"] = (8,6)
plt.scatter(y_train,dic_pred['train'])
plt.xlabel('training GEDI Height (all rows)')
plt.ylabel('training prediction')
ident = [0, 50]
plt.plot(ident,ident,'r--')

[19]:

[<matplotlib.lines.Line2D at 0x7fa02261eec0>]

[20]:

plt.rcParams["figure.figsize"] = (8,6)
plt.scatter(y_test,dic_pred['test'])
plt.xlabel('testing GEDI Height (all rows)')
plt.ylabel('testing prediction')
ident = [0, 50]
plt.plot(ident,ident,'r--')

[20]:

[<matplotlib.lines.Line2D at 0x7fa03153b3a0>]

[21]:

impt = [rfReg.feature_importances_, np.std([tree.feature_importances_ for tree in rfReg.estimators_],axis=1)]
ind = np.argsort(impt[0])

[22]:

ind

[22]:

array([ 1, 13,  0,  4,  6,  8, 16,  2,  5, 15, 12,  3, 11,  7,  9, 14, 10,
17])

[23]:

plt.rcParams["figure.figsize"] = (6,12)
plt.barh(range(len(feat)),impt[0][ind],color="b", xerr=impt[1][ind], align="center")
plt.yticks(range(len(feat)),feat[ind]);


Most important variable is treecover followed by Spectral Variability Vegetation Index, followed by outlet_dist_dw_basin that is a proxy of water accumulation. Besides eastness and northness describes microclimat condition.

## Base on data quality flag select more reilable tree height.

File storing tree hight (cm) obtained by 6 algorithms, with their associate quality flags. The quality flags can be used to refine and select the best tree height estimation and use it as tree height observation.

• a?_95: tree hight (cm) at 95 quintile, for each algorithm

• min_rh_95: minimum value of tree hight (cm) ammong the 6 algorithms

• max_rh_95: maximum value of tree hight (cm) ammong the 6 algorithms

• BEAM: 1-4 coverage beam = lower power (worse) ; 5-8 power beam = higher power (better)

• digital_elev: digital mdoel elevation

• elev_low: elevation of center of lowest mode

• qc_a?: quality_flag for six algorithms quality_flag = 1 (better); = 0 (worse)

• se_a?: sensitivity for six algorithms sensitivity < 0.95 (worse); sensitivity > 0.95 (beter )

• solar_ele: solar elevation. > 0 day (worse); < 0 night (better)

[24]:

height_6algorithms = pd.read_csv("tree_height/txt/eu_y_x_select_6algorithms_fullTable.txt", sep=" ",  index_col=False)
pd.set_option('display.max_columns',None)

[24]:

ID X Y a1_95 a2_95 a3_95 a4_95 a5_95 a6_95 min_rh_95 max_rh_95 BEAM digital_elev elev_low qc_a1 qc_a2 qc_a3 qc_a4 qc_a5 qc_a6 se_a1 se_a2 se_a3 se_a4 se_a5 se_a6 deg_fg solar_ele
0 1 6.050001 49.727499 3139 3139 3139 3120 3139 3139 3120 3139 5 410.0 383.72153 1 1 1 1 1 1 0.962 0.984 0.968 0.962 0.989 0.979 0 17.7
1 2 6.050002 49.922155 1022 2303 970 872 5596 1524 872 5596 5 290.0 2374.14110 0 0 0 0 0 0 0.948 0.990 0.960 0.948 0.994 0.980 0 43.7
2 3 6.050002 48.602377 380 1336 332 362 1336 1340 332 1340 4 440.0 435.97781 1 1 1 1 1 1 0.947 0.975 0.956 0.947 0.981 0.968 0 0.2
3 4 6.050009 48.151979 3153 3142 3142 3127 3138 3142 3127 3153 2 450.0 422.00537 1 1 1 1 1 1 0.930 0.970 0.943 0.930 0.978 0.962 0 -14.2
4 5 6.050010 49.588410 666 4221 651 33 5611 2723 33 5611 8 370.0 2413.74830 0 0 0 0 0 0 0.941 0.983 0.946 0.941 0.992 0.969 0 22.1
5 6 6.050014 48.608456 787 1179 1187 761 1833 1833 761 1833 3 420.0 415.51581 1 1 1 1 1 1 0.952 0.979 0.961 0.952 0.986 0.975 0 0.2
[25]:

height_6algorithms_sel = height_6algorithms.loc[(height_6algorithms['BEAM'] > 4)
&   (height_6algorithms['qc_a1'] == 1)
&   (height_6algorithms['qc_a2'] == 1)
&   (height_6algorithms['qc_a3'] == 1)
&   (height_6algorithms['qc_a4'] == 1)
&   (height_6algorithms['qc_a5'] == 1)
&   (height_6algorithms['qc_a6'] == 1)
&   (height_6algorithms['se_a1'] > 0.95)
&   (height_6algorithms['se_a2'] > 0.95)
&   (height_6algorithms['se_a3'] > 0.95)
&   (height_6algorithms['se_a4'] > 0.95)
&   (height_6algorithms['se_a5'] > 0.95)
&   (height_6algorithms['se_a6'] > 0.95)
&   (height_6algorithms['deg_fg'] == 0)
&   (height_6algorithms['solar_ele'] < 0)]

[26]:

height_6algorithms_sel

[26]:

ID X Y a1_95 a2_95 a3_95 a4_95 a5_95 a6_95 min_rh_95 max_rh_95 BEAM digital_elev elev_low qc_a1 qc_a2 qc_a3 qc_a4 qc_a5 qc_a6 se_a1 se_a2 se_a3 se_a4 se_a5 se_a6 deg_fg solar_ele
7 8 6.050019 49.921613 3303 3288 3296 3236 3857 3292 3236 3857 7 320.0 297.68533 1 1 1 1 1 1 0.971 0.988 0.976 0.971 0.992 0.984 0 -33.9
11 12 6.050039 47.995344 2762 2736 2740 2747 3893 2736 2736 3893 5 390.0 368.55121 1 1 1 1 1 1 0.975 0.990 0.979 0.975 0.994 0.987 0 -37.3
14 15 6.050046 49.865317 1398 2505 2509 1316 2848 2505 1316 2848 6 340.0 330.40564 1 1 1 1 1 1 0.973 0.990 0.979 0.973 0.994 0.986 0 -18.2
15 16 6.050048 49.050020 984 943 947 958 2617 947 943 2617 6 300.0 291.22598 1 1 1 1 1 1 0.978 0.991 0.982 0.978 0.995 0.988 0 -35.4
16 17 6.050049 48.391359 3362 3332 3336 3351 4467 3336 3332 4467 5 530.0 504.78122 1 1 1 1 1 1 0.973 0.988 0.977 0.973 0.992 0.984 0 -5.1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1267207 1267208 9.949829 49.216272 2160 2816 2816 2104 3299 2816 2104 3299 8 420.0 386.44556 1 1 1 1 1 1 0.980 0.993 0.984 0.980 0.995 0.989 0 -16.9
1267211 1267212 9.949856 49.881190 3190 3179 3179 3171 3822 3179 3171 3822 6 380.0 363.69348 1 1 1 1 1 1 0.968 0.986 0.974 0.968 0.990 0.982 0 -35.1
1267216 1267217 9.949880 49.873435 2061 2828 2046 2024 2828 2828 2024 2828 7 380.0 361.06812 1 1 1 1 1 1 0.967 0.988 0.974 0.967 0.993 0.983 0 -35.1
1267227 1267228 9.949958 49.127182 366 2307 1260 355 3531 2719 355 3531 6 500.0 493.52792 1 1 1 1 1 1 0.973 0.989 0.978 0.973 0.993 0.985 0 -36.0
1267237 1267238 9.949999 49.936763 2513 2490 2490 2494 2490 2490 2490 2513 5 360.0 346.54227 1 1 1 1 1 1 0.968 0.988 0.974 0.968 0.993 0.983 0 -32.2

226892 rows × 28 columns

Calculate the mean height excluidng the maximum and minimum values

[27]:

height_sel =  pd.DataFrame({'ID' : height_6algorithms_sel['ID'] ,
'hm_sel': (height_6algorithms_sel['a1_95'] + height_6algorithms_sel['a2_95'] + height_6algorithms_sel['a3_95'] + height_6algorithms_sel['a4_95']
+ height_6algorithms_sel['a5_95'] + height_6algorithms_sel['a6_95'] - height_6algorithms_sel['min_rh_95'] - height_6algorithms_sel['max_rh_95']) / 400 } )

[28]:

height_sel

[28]:

ID hm_sel
7 8 32.9475
11 12 27.4625
14 15 22.2925
15 16 9.5900
16 17 33.4625
... ... ...
1267207 1267208 26.5200
1267211 1267212 31.8175
1267216 1267217 24.4075
1267227 1267228 16.6300
1267237 1267238 24.9100

226892 rows × 2 columns

Merge the new height with the predictors table, using the ID as Primary Key

[29]:

predictors_hm_sel = pd.merge( predictors ,  height_sel , left_on='ID' ,  right_on='ID' ,  how='right')

[30]:

predictors_hm_sel.head(6)

[30]:

ID X Y h BLDFIE_WeigAver CECSOL_WeigAver CHELSA_bio18 CHELSA_bio4 convergence cti devmagnitude eastness elev forestheight glad_ard_SVVI_max glad_ard_SVVI_med glad_ard_SVVI_min northness ORCDRC_WeigAver outlet_dist_dw_basin SBIO3_Isothermality_5_15cm SBIO4_Temperature_Seasonality_5_15cm treecover hm_sel
0 8 6.050019 49.921613 3294.75 1490 12 1995 5912 22.102139 -297770784 -1.402633 0.309765 294.927765 26 -86.729492 -145.584229 -190.062988 0.222435 15 772784 20.855963 457.195404 86 32.9475
1 12 6.050039 47.995344 2746.25 1523 12 2612 6181 3.549103 -71279992 0.507727 -0.021408 322.920227 26 660.006104 92.722168 190.979736 -0.034787 16 784807 20.798000 460.501221 97 27.4625
2 15 6.050046 49.865317 2229.25 1517 13 2191 5901 31.054762 -186807440 -1.375050 -0.126880 291.412537 7 1028.385498 915.806396 841.586182 0.024677 16 766444 19.941267 454.185089 54 22.2925
3 16 6.050048 49.050020 959.00 1526 14 2081 6100 9.933455 -183562672 -0.382834 0.086874 246.288010 24 -12.283691 -58.179199 174.205566 0.094175 10 805730 19.849365 470.946533 78 9.5900
4 17 6.050049 48.391359 3346.25 1489 19 2486 5966 -6.957157 -273522688 2.989759 0.214769 474.409088 24 125.583008 6.154297 128.129150 0.017164 15 950190 21.179420 491.398376 85 33.4625
5 19 6.050053 49.877876 529.00 1531 12 2184 5915 -24.278454 -377335296 0.265329 -0.248356 335.534760 25 593.601074 228.712402 315.298340 -0.127365 17 764713 19.760756 448.580811 96 5.2900
[31]:

predictors_hm_sel = predictors_hm_sel.loc[(predictors['h'] < 7000) ].sample(100000)

[32]:

predictors_hm_sel

[32]:

ID X Y h BLDFIE_WeigAver CECSOL_WeigAver CHELSA_bio18 CHELSA_bio4 convergence cti devmagnitude eastness elev forestheight glad_ard_SVVI_max glad_ard_SVVI_med glad_ard_SVVI_min northness ORCDRC_WeigAver outlet_dist_dw_basin SBIO3_Isothermality_5_15cm SBIO4_Temperature_Seasonality_5_15cm treecover hm_sel
83357 447901 7.325923 49.296261 1988.00 1504 11 1973 6329 -27.310654 -311542144 1.611667 -0.059867 315.589905 22 99.902344 182.999268 358.212158 -0.027319 13 852606 18.866175 470.286011 75 19.8800
213105 1196478 9.622623 49.833063 2220.25 1520 13 1875 6606 -20.839262 -279513600 -0.550544 0.038211 244.721085 15 112.770508 -38.957520 -107.884766 -0.153021 11 770768 19.954826 507.755493 54 22.2025
58293 313301 7.017546 49.874260 1389.75 1525 14 2127 6080 -16.012159 -287215200 1.066931 -0.218266 424.159332 22 -59.042236 -32.115967 126.038086 0.068847 13 619919 19.761299 465.827271 85 13.8975
54985 297876 6.981116 48.889372 1883.25 1525 13 2161 6330 8.863924 431331168 -1.713272 0.013969 221.787323 23 560.937012 101.161133 27.694336 -0.025228 7 847969 18.343138 464.072418 67 18.8325
18698 98303 6.397619 49.380758 2965.00 1538 11 2367 6042 23.690971 -80957808 -0.130503 -0.035726 276.400085 25 230.774902 120.549805 309.234375 0.012809 13 740733 18.482462 444.627533 78 29.6500
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
60228 322853 7.040689 49.728151 3375.00 1402 12 2362 5810 17.660843 15303386 1.776377 0.010083 517.456360 25 172.217773 256.016113 213.990967 -0.079195 25 663180 21.064062 430.348450 85 33.7500
42226 231457 6.810772 48.268025 2707.00 1409 15 3120 6229 0.276158 -217058432 -0.143515 -0.099901 471.342743 25 -153.505615 -142.315430 -232.547607 -0.198644 29 955008 20.322662 441.391357 85 27.0700
49495 271346 6.916373 49.514878 3097.25 1536 13 2213 6140 -0.606761 -299815488 0.547743 -0.104366 365.085297 24 29.947266 -2.760986 204.209473 0.163875 14 789703 20.452227 469.161865 85 30.9725
11 40 6.050120 48.058094 3012.25 1521 14 2474 6191 -15.363133 -195105152 1.290480 -0.043498 331.926483 25 750.277832 12.481934 206.658447 -0.008241 15 820336 19.240412 474.946625 85 30.1225
185044 1039892 9.187669 49.500969 1117.00 1472 15 2561 6294 10.249912 -141432176 1.770781 -0.046596 483.281250 27 -15.297119 -161.964844 -210.821777 -0.015172 23 761513 19.522852 437.534729 38 11.1700

100000 rows × 24 columns

[33]:

   X = predictors_hm_sel.iloc[:,[4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22]].values
Y = predictors_hm_sel.iloc[:,23:24].values
feat = predictors_hm_sel.iloc[:,[4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22]].columns.values

[34]:

feat

[34]:

array(['BLDFIE_WeigAver', 'CECSOL_WeigAver', 'CHELSA_bio18',
'CHELSA_bio4', 'convergence', 'cti', 'devmagnitude', 'eastness',
'outlet_dist_dw_basin', 'SBIO3_Isothermality_5_15cm',
'SBIO4_Temperature_Seasonality_5_15cm', 'treecover'], dtype=object)

[35]:

Y.shape

[35]:

(100000, 1)

[36]:

X.shape

[36]:

(100000, 18)

[37]:

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=24)
y_train = np.ravel(Y_train)
y_test = np.ravel(Y_test)


## Random Forest default parameters with select row base on quality flag

Training Random Forest using default parameters

[38]:

rfReg = RandomForestRegressor(min_samples_leaf=20, oob_score=True)
rfReg.fit(X_train, y_train);
dic_pred = {}
dic_pred['train'] = rfReg.predict(X_train)
dic_pred['test'] = rfReg.predict(X_test)

[39]:

# checking the oob score
rfReg.oob_score_

[39]:

0.37582774369799943


## Final assesment

Selected data

[40]:

pearsonr_all_sel = [pearsonr(dic_pred['train'],y_train)[0],pearsonr(dic_pred['test'],y_test)[0]]
pearsonr_all_sel

[40]:

[0.747400101639897, 0.615726616845201]


All data

[41]:

pearsonr_all

[41]:

[0.6043347073198263, 0.5189400044676861]


Publication results

[42]:

pearsonr_Publication_Estimation

[42]:

0.4527925129990046


Tone the RF

pipeline = Pipeline([('rf',RandomForestRegressor())])

parameters = {
'rf__max_features':("log2","sqrt",0.33),
'rf__max_samples':(0.5,0.6,0.7),
'rf__n_estimators':(500,1000),
'rf__max_depth':(50,100,200)}

grid_search = GridSearchCV(pipeline,parameters,n_jobs=-1,cv=3,scoring='r2',verbose=1)
grid_search.fit(X_train,y_train)

grid_search.best_score_

print ('Best Training score: %0.3f' % grid_search.best_score_)
print ('Optimal parameters:')
best_par = grid_search.best_estimator_.get_params()
for par_name in sorted(parameters.keys()):
print ('\t%s: %r' % (par_name, best_par[par_name]))

rfReg = RandomForestRegressor(n_estimators=500,max_features='sqrt',max_depth=50,max_samples=0.6,n_jobs=-1,random_state=24)
rfReg.fit(X_train, y_train);
dic_pred = {}
dic_pred['train'] = rfReg.predict(X_train)
dic_pred['test'] = rfReg.predict(X_test)
[pearsonr(dic_pred['train'],y_train)[0],pearsonr(dic_pred['test'],y_test)[0]]

[43]:

plt.rcParams["figure.figsize"] = (8,6)
plt.scatter(y_train,dic_pred['train'])
plt.xlabel('trening GEDI Height (selected rows)')
plt.ylabel('trening prediction')
ident = [0, 50]
plt.plot(ident,ident,'r--')

[43]:

[<matplotlib.lines.Line2D at 0x7fa0315145e0>]

[44]:

plt.rcParams["figure.figsize"] = (8,6)
plt.scatter(y_test,dic_pred['test'])
plt.xlabel('testing GEDI Height (selected rows)')
plt.ylabel('testing prediction')
ident = [0, 50]
plt.plot(ident,ident,'r--')

[44]:

[<matplotlib.lines.Line2D at 0x7fa0226b4d30>]

[45]:

impt = [rfReg.feature_importances_, np.std([tree.feature_importances_ for tree in rfReg.estimators_],axis=1)]
ind = np.argsort(impt[0])

[46]:

ind

[46]:

array([ 1, 13,  0,  6, 16,  4,  8,  5,  3,  2, 15, 11, 12,  9, 14,  7, 10,
17])

[47]:

plt.rcParams["figure.figsize"] = (6,12)
plt.barh(range(len(feat)),impt[0][ind],color="b", xerr=impt[1][ind], align="center")
plt.yticks(range(len(feat)),feat[ind]);


### Predict on the raster using pyspatialml

[54]:

# import satalite indeces

# import climate
CHELSA_bio4 = "tree_height/geodata_raster/CHELSA_bio4.tif"
CHELSA_bio18 = "tree_height/geodata_raster/CHELSA_bio18.tif"

# soil
BLDFIE_WeigAver = "tree_height/geodata_raster/BLDFIE_WeigAver.tif"
CECSOL_WeigAver = "tree_height/geodata_raster/CECSOL_WeigAver.tif"
ORCDRC_WeigAver = "tree_height/geodata_raster/ORCDRC_WeigAver.tif"

# Geomorphological
elev = "tree_height/geodata_raster/elev.tif"
convergence = "tree_height/geodata_raster/convergence.tif"
northness = "tree_height/geodata_raster/northness.tif"
eastness = "tree_height/geodata_raster/eastness.tif"
devmagnitude = "tree_height/geodata_raster/devmagnitude.tif"

# Hydrography
cti = "tree_height/geodata_raster/cti.tif"
outlet_dist_dw_basin = "tree_height/geodata_raster/outlet_dist_dw_basin.tif"

# Soil climate

SBIO3_Isothermality_5_15cm = "tree_height/geodata_raster/SBIO3_Isothermality_5_15cm.tif"
SBIO4_Temperature_Seasonality_5_15cm = "tree_height/geodata_raster/SBIO4_Temperature_Seasonality_5_15cm.tif"

# forest

treecover = "tree_height/geodata_raster/treecover.tif"

[56]:

treecover

[56]:

'tree_height/geodata_raster/treecover.tif'

[59]:

predictors_rasters = [glad_ard_SVVI_min, glad_ard_SVVI_med, glad_ard_SVVI_max,
CHELSA_bio4,CHELSA_bio18,
BLDFIE_WeigAver,CECSOL_WeigAver,ORCDRC_WeigAver,
elev,convergence,northness,eastness,devmagnitude,cti,outlet_dist_dw_basin,
SBIO3_Isothermality_5_15cm,SBIO4_Temperature_Seasonality_5_15cm,treecover]

stack = Raster(predictors_rasters)

[60]:

result = stack.predict(estimator=rfReg, dtype='int16', nodata=-1)

[53]:

# plot regression result
plt.rcParams["figure.figsize"] = (12,12)
result.iloc[0].cmap = "plasma"
result.plot()
plt.show()