Estimation of tree height using GEDI dataset - Support Vector Machine for Regression (SVR) - 2022
Let’s see a quick example of how to use Support Vector Regression (SVR) for tree height estimation.
[2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
We will load the data using Pandas and display a few samples of it.
[3]:
data = pd.read_csv("./tree_height/txt/eu_x_y_height_predictors_select.txt", sep=" ", index_col=False)
pd.set_option('display.max_columns',None)
print(data.shape)
data.head(10)
(1267239, 23)
[3]:
  | ID | X | Y | h | BLDFIE_WeigAver | CECSOL_WeigAver | CHELSA_bio18 | CHELSA_bio4 | convergence | cti | dev-magnitude | eastness | elev | forestheight | glad_ard_SVVI_max | glad_ard_SVVI_med | glad_ard_SVVI_min | northness | ORCDRC_WeigAver | outlet_dist_dw_basin | SBIO3_Isothermality_5_15cm | SBIO4_Temperature_Seasonality_5_15cm | treecover
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 6.050001 | 49.727499 | 3139.00 | 1540 | 13 | 2113 | 5893 | -10.486560 | -238043120 | 1.158417 | 0.069094 | 353.983124 | 23 | 276.871094 | 46.444092 | 347.665405 | 0.042500 | 9 | 780403 | 19.798992 | 440.672211 | 85 |
1 | 2 | 6.050002 | 49.922155 | 1454.75 | 1491 | 12 | 1993 | 5912 | 33.274361 | -208915344 | -1.755341 | 0.269112 | 267.511688 | 19 | -49.526367 | 19.552734 | -130.541748 | 0.182780 | 16 | 772777 | 20.889412 | 457.756195 | 85 |
2 | 3 | 6.050002 | 48.602377 | 853.50 | 1521 | 17 | 2124 | 5983 | 0.045293 | -137479792 | 1.908780 | -0.016055 | 389.751160 | 21 | 93.257324 | 50.743652 | 384.522461 | 0.036253 | 14 | 898820 | 20.695877 | 481.879700 | 62 |
3 | 4 | 6.050009 | 48.151979 | 3141.00 | 1526 | 16 | 2569 | 6130 | -33.654274 | -267223072 | 0.965787 | 0.067767 | 380.207703 | 27 | 542.401367 | 202.264160 | 386.156738 | 0.005139 | 15 | 831824 | 19.375000 | 479.410278 | 85 |
4 | 5 | 6.050010 | 49.588410 | 2065.25 | 1547 | 14 | 2108 | 5923 | 27.493824 | -107809368 | -0.162624 | 0.014065 | 308.042786 | 25 | 136.048340 | 146.835205 | 198.127441 | 0.028847 | 17 | 796962 | 18.777500 | 457.880066 | 85 |
5 | 6 | 6.050014 | 48.608456 | 1246.50 | 1515 | 19 | 2124 | 6010 | -1.602039 | 17384282 | 1.447979 | -0.018912 | 364.527100 | 18 | 221.339844 | 247.387207 | 480.387939 | 0.042747 | 14 | 897945 | 19.398880 | 474.331329 | 62 |
6 | 7 | 6.050016 | 48.571401 | 2938.75 | 1520 | 19 | 2169 | 6147 | 27.856503 | -66516432 | -1.073956 | 0.002280 | 254.679596 | 19 | 125.250488 | 87.865234 | 160.696777 | 0.037254 | 11 | 908426 | 20.170450 | 476.414520 | 96 |
7 | 8 | 6.050019 | 49.921613 | 3294.75 | 1490 | 12 | 1995 | 5912 | 22.102139 | -297770784 | -1.402633 | 0.309765 | 294.927765 | 26 | -86.729492 | -145.584229 | -190.062988 | 0.222435 | 15 | 772784 | 20.855963 | 457.195404 | 86 |
8 | 9 | 6.050020 | 48.822645 | 1623.50 | 1554 | 18 | 1973 | 6138 | 18.496584 | -25336536 | -0.800016 | 0.010370 | 240.493759 | 22 | -51.470703 | -245.886719 | 172.074707 | 0.004428 | 8 | 839132 | 21.812290 | 496.231110 | 64 |
9 | 10 | 6.050024 | 49.847522 | 1400.00 | 1521 | 15 | 2187 | 5886 | -5.660453 | -278652608 | 1.477951 | -0.068720 | 376.671143 | 12 | 277.297363 | 273.141846 | -138.895996 | 0.098817 | 13 | 768873 | 21.137711 | 466.976685 | 70 |
As explained in the previous lecture, ‘h’ is the estimated tree height, so let’s use it as our target.
[3]:
tree_height = data['h'].to_numpy()
data = data.drop(columns='h')  # drop the target column from the predictors
Now we will split the data into training and test sets and normalize the features; note that the scaler is fitted on the training set only and then applied to the test set.
[4]:
X_train, X_test, y_train, y_test = train_test_split(data.to_numpy()[:20000,:],tree_height[:20000], random_state=0)
print('X_train.shape:{}, X_test.shape:{} '.format(X_train.shape, X_test.shape))
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train.shape:(15000, 22), X_test.shape:(5000, 22)
Now, we will build our SVR regressor. For more details on all the parameters it accepts, please check the documentation.
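As a quick orientation before running the cell below, here is the same model with its main hyperparameters written out explicitly. The values are the scikit-learn defaults at the time of writing; treat this only as a sketch and check the documentation of your installed version:
[ ]:
# The same model as SVR(), but with the main knobs spelled out
# (scikit-learn defaults; shown only for orientation).
svr_explicit = SVR(
    kernel='rbf',   # kernel type; alternatives include 'linear', 'poly', 'sigmoid'
    C=1.0,          # regularization strength (smaller C = stronger regularization)
    epsilon=0.1,    # half-width of the epsilon-insensitive tube around the fit
    gamma='scale',  # RBF kernel coefficient; 'scale' uses 1 / (n_features * X.var())
)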
[5]:
svr = SVR()
svr.fit(X_train, y_train) # Fit the SVR model according to the given training data.
print('Accuracy of SVR on training set: {:.5f}'.format(svr.score(X_train, y_train))) # Returns the coefficient of determination (R^2) of the prediction.
print('Accuracy of SVR on test set: {:.5f}'.format(svr.score(X_test, y_test)))
Accuracy of SVR on training set: 0.12321
Accuracy of SVR on test set: 0.12803
[8]:
np.sqrt(0.12803)
[8]:
0.35781280021821465
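The square root of R² is unitless; if you also want errors expressed in the same units as ‘h’, metrics such as RMSE and MAE can be computed on the test predictions. A minimal sketch, using sklearn.metrics (not imported in the cells above):
[ ]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_pred = svr.predict(X_test)                        # predicted tree heights
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # root mean squared error, in units of 'h'
mae = mean_absolute_error(y_test, y_pred)           # mean absolute error, in units of 'h'
print('RMSE: {:.2f}, MAE: {:.2f}'.format(rmse, mae))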
[9]:
svr = SVR(epsilon=0.01)
svr.fit(X_train, y_train) # Fit the SVR model according to the given training data.
print('Accuracy of SVR on training set: {:.5f}'.format(svr.score(X_train, y_train))) # Returns the coefficient of determination (R^2) of the prediction.
print('Accuracy of SVR on test set: {:.5f}'.format(svr.score(X_test, y_test)))
Accuracy of SVR on training set: 0.12322
Accuracy of SVR on test set: 0.12804
[10]:
np.sqrt(0.12804)
[10]:
0.35782677373276583
Exercise: explore the other parameters offered by the SVM library and try to make the model better. Some suggestions:
Better cleaning of the data (follow Peppe’s suggestions)
Stronger regularization might be helpful
Play with different kernels (a grid-search sketch covering these suggestions follows this list)
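As a possible starting point for the last two suggestions, the sketch below runs a small cross-validated grid search over the kernel, C, gamma and epsilon. It assumes the X_train/y_train arrays from above and restricts the search to a subsample to keep the runtime reasonable; widen or refine the grid as needed:
[ ]:
from sklearn.model_selection import GridSearchCV

# Candidate values; refine the grid once you see where the best scores land.
param_grid = {
    'kernel': ['rbf', 'linear'],
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.1, 1],
    'epsilon': [0.01, 0.1, 1],
}

# 3-fold cross-validated grid search on a subsample (SVR scales poorly with sample size).
grid = GridSearchCV(SVR(), param_grid, cv=3, n_jobs=-1)
grid.fit(X_train[:5000], y_train[:5000])

print('Best parameters:', grid.best_params_)
print('Best cross-validated R^2: {:.5f}'.format(grid.best_score_))
print('Test R^2 of the best model: {:.5f}'.format(grid.score(X_test, y_test)))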
For the brave ones, try to implement the SVR algorithm from scratch. As we saw in class, the algorithm is quite simple. Here is a simple sketch of the SVM algorithm; make the appropriate modifications to turn it into a regression (one possible direction is sketched after the code). Let us know if your implementation is better than sklearn’s.
[ ]:
## Support Vector Machine (linear, hinge loss, trained with a simple gradient scheme)
## Sketch only: it assumes a toy dataset `x_train` with two features and labels
## `y_train` encoded as -1 / +1.
import numpy as np

train_f1 = x_train[:, 0]   # first feature
train_f2 = x_train[:, 1]   # second feature

w1, w2 = 0.0, 0.0          # one weight per feature
alpha = 0.0001             # learning rate
epochs = 1

while epochs < 10000:
    y = w1 * train_f1 + w2 * train_f2   # current decision values
    prod = y * y_train                  # functional margin of every sample
    print(epochs)
    count = 0
    for val in prod:
        if val >= 1:
            # Correct side of the margin: only apply the regularization shrinkage.
            w1 = w1 - alpha * (2 * 1 / epochs * w1)
            w2 = w2 - alpha * (2 * 1 / epochs * w2)
        else:
            # Inside the margin or misclassified: hinge-loss gradient step.
            w1 = w1 + alpha * (train_f1[count] * y_train[count] - 2 * 1 / epochs * w1)
            w2 = w2 + alpha * (train_f2[count] * y_train[count] - 2 * 1 / epochs * w2)
        count += 1
    epochs += 1
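To turn the classification update above into a regression, the hinge loss is typically replaced by the epsilon-insensitive loss: samples whose prediction is within epsilon of the target contribute nothing, all others pull the weights toward reducing the residual. Below is a minimal sketch of that modified update, using the same toy variable names as above but with y_train now holding continuous targets; it is only a starting point, not a full solution:
[ ]:
# Epsilon-insensitive update for a linear SVR (sketch, same two-feature toy setup as above).
eps = 0.1         # half-width of the insensitive tube
alpha = 0.0001    # learning rate
lam = 0.01        # regularization strength
w1 = w2 = b = 0.0

for epoch in range(1, 10000):
    for x1, x2, target in zip(train_f1, train_f2, y_train):
        pred = w1 * x1 + w2 * x2 + b
        residual = target - pred
        # Regularization shrinkage at every step ...
        w1 -= alpha * 2 * lam * w1
        w2 -= alpha * 2 * lam * w2
        # ... and a corrective step only when the sample lies outside the epsilon tube.
        if abs(residual) > eps:
            w1 += alpha * np.sign(residual) * x1
            w2 += alpha * np.sign(residual) * x2
            b  += alpha * np.sign(residual)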