Estimation of tree height using GEDI dataset - Support Vector Machine for Regression (SVR) - 2022
Let’s see a quick example of how to use Support Vector Regression (SVR) for tree height estimation.
[2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
We will load the data using Pandas and display a few samples of it.
[3]:
data = pd.read_csv("./tree_height/txt/eu_x_y_height_predictors_select.txt", sep=" ", index_col=False)
pd.set_option('display.max_columns',None)
print(data.shape)
data.head(10)
(1267239, 23)
[3]:
  | ID | X | Y | h | BLDFIE_WeigAver | CECSOL_WeigAver | CHELSA_bio18 | CHELSA_bio4 | convergence | cti | dev-magnitude | eastness | elev | forestheight | glad_ard_SVVI_max | glad_ard_SVVI_med | glad_ard_SVVI_min | northness | ORCDRC_WeigAver | outlet_dist_dw_basin | SBIO3_Isothermality_5_15cm | SBIO4_Temperature_Seasonality_5_15cm | treecover
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 6.050001 | 49.727499 | 3139.00 | 1540 | 13 | 2113 | 5893 | -10.486560 | -238043120 | 1.158417 | 0.069094 | 353.983124 | 23 | 276.871094 | 46.444092 | 347.665405 | 0.042500 | 9 | 780403 | 19.798992 | 440.672211 | 85 |
1 | 2 | 6.050002 | 49.922155 | 1454.75 | 1491 | 12 | 1993 | 5912 | 33.274361 | -208915344 | -1.755341 | 0.269112 | 267.511688 | 19 | -49.526367 | 19.552734 | -130.541748 | 0.182780 | 16 | 772777 | 20.889412 | 457.756195 | 85 |
2 | 3 | 6.050002 | 48.602377 | 853.50 | 1521 | 17 | 2124 | 5983 | 0.045293 | -137479792 | 1.908780 | -0.016055 | 389.751160 | 21 | 93.257324 | 50.743652 | 384.522461 | 0.036253 | 14 | 898820 | 20.695877 | 481.879700 | 62 |
3 | 4 | 6.050009 | 48.151979 | 3141.00 | 1526 | 16 | 2569 | 6130 | -33.654274 | -267223072 | 0.965787 | 0.067767 | 380.207703 | 27 | 542.401367 | 202.264160 | 386.156738 | 0.005139 | 15 | 831824 | 19.375000 | 479.410278 | 85 |
4 | 5 | 6.050010 | 49.588410 | 2065.25 | 1547 | 14 | 2108 | 5923 | 27.493824 | -107809368 | -0.162624 | 0.014065 | 308.042786 | 25 | 136.048340 | 146.835205 | 198.127441 | 0.028847 | 17 | 796962 | 18.777500 | 457.880066 | 85 |
5 | 6 | 6.050014 | 48.608456 | 1246.50 | 1515 | 19 | 2124 | 6010 | -1.602039 | 17384282 | 1.447979 | -0.018912 | 364.527100 | 18 | 221.339844 | 247.387207 | 480.387939 | 0.042747 | 14 | 897945 | 19.398880 | 474.331329 | 62 |
6 | 7 | 6.050016 | 48.571401 | 2938.75 | 1520 | 19 | 2169 | 6147 | 27.856503 | -66516432 | -1.073956 | 0.002280 | 254.679596 | 19 | 125.250488 | 87.865234 | 160.696777 | 0.037254 | 11 | 908426 | 20.170450 | 476.414520 | 96 |
7 | 8 | 6.050019 | 49.921613 | 3294.75 | 1490 | 12 | 1995 | 5912 | 22.102139 | -297770784 | -1.402633 | 0.309765 | 294.927765 | 26 | -86.729492 | -145.584229 | -190.062988 | 0.222435 | 15 | 772784 | 20.855963 | 457.195404 | 86 |
8 | 9 | 6.050020 | 48.822645 | 1623.50 | 1554 | 18 | 1973 | 6138 | 18.496584 | -25336536 | -0.800016 | 0.010370 | 240.493759 | 22 | -51.470703 | -245.886719 | 172.074707 | 0.004428 | 8 | 839132 | 21.812290 | 496.231110 | 64 |
9 | 10 | 6.050024 | 49.847522 | 1400.00 | 1521 | 15 | 2187 | 5886 | -5.660453 | -278652608 | 1.477951 | -0.068720 | 376.671143 | 12 | 277.297363 | 273.141846 | -138.895996 | 0.098817 | 13 | 768873 | 21.137711 | 466.976685 | 70 |
As explained in the previous lecture, ‘h’ is the estimated tree height, so let’s use it as our target.
[3]:
tree_height = data['h'].to_numpy()
data = data.drop(columns='h')  # drop the target column from the predictors
Now we will split the data into training and test sets and normalize the features; note that the scaler is fitted on the training set only and then applied to the test set.
[4]:
X_train, X_test, y_train, y_test = train_test_split(data.to_numpy()[:20000,:],tree_height[:20000], random_state=0)
print('X_train.shape:{}, X_test.shape:{} '.format(X_train.shape, X_test.shape))
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train.shape:(15000, 22), X_test.shape:(5000, 22)
Now, we will build our SVR regressor. For more details on all the parameters it accepts, please check the documentation.
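As a quick orientation before running the cell below, here is the same model with its main hyperparameters written out explicitly. The values are the scikit-learn defaults at the time of writing; treat this only as a sketch and check the documentation of your installed version:
[ ]:
# The same model as SVR(), but with the main knobs spelled out
# (scikit-learn defaults; shown only for orientation).
svr_explicit = SVR(
    kernel='rbf',   # kernel type; alternatives include 'linear', 'poly', 'sigmoid'
    C=1.0,          # regularization strength (smaller C = stronger regularization)
    epsilon=0.1,    # half-width of the epsilon-insensitive tube around the fit
    gamma='scale',  # RBF kernel coefficient; 'scale' uses 1 / (n_features * X.var())
)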
[5]:
svr = SVR()
svr.fit(X_train, y_train) # Fit the SVR model according to the given training data.
print('Accuracy of SVR on training set: {:.5f}'.format(svr.score(X_train, y_train))) # Returns the coefficient of determination (R^2) of the prediction.
print('Accuracy of SVR on test set: {:.5f}'.format(svr.score(X_test, y_test)))
Accuracy of SVR on training set: 0.12321
Accuracy of SVR on test set: 0.12803
[8]:
np.sqrt(0.12803)
[8]:
0.35781280021821465
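The square root of R² is unitless; if you also want errors expressed in the same units as ‘h’, metrics such as RMSE and MAE can be computed on the test predictions. A minimal sketch, using sklearn.metrics (not imported in the cells above):
[ ]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_pred = svr.predict(X_test)                        # predicted tree heights
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # root mean squared error, in units of 'h'
mae = mean_absolute_error(y_test, y_pred)           # mean absolute error, in units of 'h'
print('RMSE: {:.2f}, MAE: {:.2f}'.format(rmse, mae))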
[9]:
svr = SVR(epsilon=0.01)
svr.fit(X_train, y_train) # Fit the SVR model according to the given training data.
print('Accuracy of SVR on training set: {:.5f}'.format(svr.score(X_train, y_train))) # Returns the coefficient of determination (R^2) of the prediction.
print('Accuracy of SVR on test set: {:.5f}'.format(svr.score(X_test, y_test)))
Accuracy of SVR on training set: 0.12322
Accuracy of SVR on test set: 0.12804
[10]:
np.sqrt(0.12804)
[10]:
0.35782677373276583
Exercise: explore the other parameters offered by the SVM library and try to make the model better. Some suggestions:
Better cleaning of the data (follow Peppe’s suggestions)
Stronger regularization might be helpful
Play with different kernels (a grid-search sketch covering these suggestions follows this list)
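As a possible starting point for the last two suggestions, the sketch below runs a small cross-validated grid search over the kernel, C, gamma and epsilon. It assumes the X_train/y_train arrays from above and restricts the search to a subsample to keep the runtime reasonable; widen or refine the grid as needed:
[ ]:
from sklearn.model_selection import GridSearchCV

# Candidate values; refine the grid once you see where the best scores land.
param_grid = {
    'kernel': ['rbf', 'linear'],
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.1, 1],
    'epsilon': [0.01, 0.1, 1],
}

# 3-fold cross-validated grid search on a subsample (SVR scales poorly with sample size).
grid = GridSearchCV(SVR(), param_grid, cv=3, n_jobs=-1)
grid.fit(X_train[:5000], y_train[:5000])

print('Best parameters:', grid.best_params_)
print('Best cross-validated R^2: {:.5f}'.format(grid.best_score_))
print('Test R^2 of the best model: {:.5f}'.format(grid.score(X_test, y_test)))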
For the brave ones, try to implement the SVR algorithm from scratch. As we saw in class, the algorithm is quite simple. Here is a simple sketch of the SVM algorithm; make the appropriate modifications to turn it into a regression (one possible direction is sketched after the code). Let us know if your implementation is better than sklearn’s.
[ ]:
## Support Vector Machine (linear, hinge loss, trained with a simple gradient scheme)
## Sketch only: it assumes a toy dataset `x_train` with two features and labels
## `y_train` encoded as -1 / +1.
import numpy as np

train_f1 = x_train[:, 0]   # first feature
train_f2 = x_train[:, 1]   # second feature

w1, w2 = 0.0, 0.0          # one weight per feature
alpha = 0.0001             # learning rate
epochs = 1

while epochs < 10000:
    y = w1 * train_f1 + w2 * train_f2   # current decision values
    prod = y * y_train                  # functional margin of every sample
    print(epochs)
    count = 0
    for val in prod:
        if val >= 1:
            # Correct side of the margin: only apply the regularization shrinkage.
            w1 = w1 - alpha * (2 * 1 / epochs * w1)
            w2 = w2 - alpha * (2 * 1 / epochs * w2)
        else:
            # Inside the margin or misclassified: hinge-loss gradient step.
            w1 = w1 + alpha * (train_f1[count] * y_train[count] - 2 * 1 / epochs * w1)
            w2 = w2 + alpha * (train_f2[count] * y_train[count] - 2 * 1 / epochs * w2)
        count += 1
    epochs += 1
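To turn the classification update above into a regression, the hinge loss is typically replaced by the epsilon-insensitive loss: samples whose prediction is within epsilon of the target contribute nothing, all others pull the weights toward reducing the residual. Below is a minimal sketch of that modified update, using the same toy variable names as above but with y_train now holding continuous targets; it is only a starting point, not a full solution:
[ ]:
# Epsilon-insensitive update for a linear SVR (sketch, same two-feature toy setup as above).
eps = 0.1         # half-width of the insensitive tube
alpha = 0.0001    # learning rate
lam = 0.01        # regularization strength
w1 = w2 = b = 0.0

for epoch in range(1, 10000):
    for x1, x2, target in zip(train_f1, train_f2, y_train):
        pred = w1 * x1 + w2 * x2 + b
        residual = target - pred
        # Regularization shrinkage at every step ...
        w1 -= alpha * 2 * lam * w1
        w2 -= alpha * 2 * lam * w2
        # ... and a corrective step only when the sample lies outside the epsilon tube.
        if abs(residual) > eps:
            w1 += alpha * np.sign(residual) * x1
            w2 += alpha * np.sign(residual) * x2
            b  += alpha * np.sign(residual)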