{ "cells": [ { "cell_type": "markdown", "id": "62e16498-69a3-4822-aa8f-e3bffc4100d7", "metadata": {}, "source": [ "# Estimation of tree height using GEDI dataset - Support Vector Machine for Regression (SVR) - 2022\n", "\n", "Let's see a quick example of how to use Suppor Vector Regression for tree height estimation" ] }, { "cell_type": "code", "execution_count": 2, "id": "5e40151e-7639-42ae-b65a-a6d3f102a03f", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import MinMaxScaler\n", "from sklearn.svm import SVR" ] }, { "cell_type": "markdown", "id": "9249b3e1-eff7-4f68-bf95-3d5fd82b8f1c", "metadata": {}, "source": [ "We will load the data using Pandas and display few samples of it" ] }, { "cell_type": "code", "execution_count": 3, "id": "beb6a6bf-f7ba-41ae-b83f-826a019264de", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1267239, 23)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDXYhBLDFIE_WeigAverCECSOL_WeigAverCHELSA_bio18CHELSA_bio4convergencectidev-magnitudeeastnesselevforestheightglad_ard_SVVI_maxglad_ard_SVVI_medglad_ard_SVVI_minnorthnessORCDRC_WeigAveroutlet_dist_dw_basinSBIO3_Isothermality_5_15cmSBIO4_Temperature_Seasonality_5_15cmtreecover
016.05000149.7274993139.0015401321135893-10.486560-2380431201.1584170.069094353.98312423276.87109446.444092347.6654050.042500978040319.798992440.67221185
126.05000249.9221551454.751491121993591233.274361-208915344-1.7553410.269112267.51168819-49.52636719.552734-130.5417480.1827801677277720.889412457.75619585
236.05000248.602377853.50152117212459830.045293-1374797921.908780-0.016055389.7511602193.25732450.743652384.5224610.0362531489882020.695877481.87970062
346.05000948.1519793141.0015261625696130-33.654274-2672230720.9657870.067767380.20770327542.401367202.264160386.1567380.0051391583182419.375000479.41027885
456.05001049.5884102065.251547142108592327.493824-107809368-0.1626240.014065308.04278625136.048340146.835205198.1274410.0288471779696218.777500457.88006685
566.05001448.6084561246.5015151921246010-1.602039173842821.447979-0.018912364.52710018221.339844247.387207480.3879390.0427471489794519.398880474.33132962
676.05001648.5714012938.751520192169614727.856503-66516432-1.0739560.002280254.67959619125.25048887.865234160.6967770.0372541190842620.170450476.41452096
786.05001949.9216133294.751490121995591222.102139-297770784-1.4026330.309765294.92776526-86.729492-145.584229-190.0629880.2224351577278420.855963457.19540486
896.05002048.8226451623.501554181973613818.496584-25336536-0.8000160.010370240.49375922-51.470703-245.886719172.0747070.004428883913221.812290496.23111064
9106.05002449.8475221400.0015211521875886-5.660453-2786526081.477951-0.068720376.67114312277.297363273.141846-138.8959960.0988171376887321.137711466.97668570
\n", "
" ], "text/plain": [ " ID X Y h BLDFIE_WeigAver CECSOL_WeigAver \\\n", "0 1 6.050001 49.727499 3139.00 1540 13 \n", "1 2 6.050002 49.922155 1454.75 1491 12 \n", "2 3 6.050002 48.602377 853.50 1521 17 \n", "3 4 6.050009 48.151979 3141.00 1526 16 \n", "4 5 6.050010 49.588410 2065.25 1547 14 \n", "5 6 6.050014 48.608456 1246.50 1515 19 \n", "6 7 6.050016 48.571401 2938.75 1520 19 \n", "7 8 6.050019 49.921613 3294.75 1490 12 \n", "8 9 6.050020 48.822645 1623.50 1554 18 \n", "9 10 6.050024 49.847522 1400.00 1521 15 \n", "\n", " CHELSA_bio18 CHELSA_bio4 convergence cti dev-magnitude eastness \\\n", "0 2113 5893 -10.486560 -238043120 1.158417 0.069094 \n", "1 1993 5912 33.274361 -208915344 -1.755341 0.269112 \n", "2 2124 5983 0.045293 -137479792 1.908780 -0.016055 \n", "3 2569 6130 -33.654274 -267223072 0.965787 0.067767 \n", "4 2108 5923 27.493824 -107809368 -0.162624 0.014065 \n", "5 2124 6010 -1.602039 17384282 1.447979 -0.018912 \n", "6 2169 6147 27.856503 -66516432 -1.073956 0.002280 \n", "7 1995 5912 22.102139 -297770784 -1.402633 0.309765 \n", "8 1973 6138 18.496584 -25336536 -0.800016 0.010370 \n", "9 2187 5886 -5.660453 -278652608 1.477951 -0.068720 \n", "\n", " elev forestheight glad_ard_SVVI_max glad_ard_SVVI_med \\\n", "0 353.983124 23 276.871094 46.444092 \n", "1 267.511688 19 -49.526367 19.552734 \n", "2 389.751160 21 93.257324 50.743652 \n", "3 380.207703 27 542.401367 202.264160 \n", "4 308.042786 25 136.048340 146.835205 \n", "5 364.527100 18 221.339844 247.387207 \n", "6 254.679596 19 125.250488 87.865234 \n", "7 294.927765 26 -86.729492 -145.584229 \n", "8 240.493759 22 -51.470703 -245.886719 \n", "9 376.671143 12 277.297363 273.141846 \n", "\n", " glad_ard_SVVI_min northness ORCDRC_WeigAver outlet_dist_dw_basin \\\n", "0 347.665405 0.042500 9 780403 \n", "1 -130.541748 0.182780 16 772777 \n", "2 384.522461 0.036253 14 898820 \n", "3 386.156738 0.005139 15 831824 \n", "4 198.127441 0.028847 17 796962 \n", "5 480.387939 0.042747 14 897945 \n", "6 160.696777 0.037254 11 908426 \n", "7 -190.062988 0.222435 15 772784 \n", "8 172.074707 0.004428 8 839132 \n", "9 -138.895996 0.098817 13 768873 \n", "\n", " SBIO3_Isothermality_5_15cm SBIO4_Temperature_Seasonality_5_15cm treecover \n", "0 19.798992 440.672211 85 \n", "1 20.889412 457.756195 85 \n", "2 20.695877 481.879700 62 \n", "3 19.375000 479.410278 85 \n", "4 18.777500 457.880066 85 \n", "5 19.398880 474.331329 62 \n", "6 20.170450 476.414520 96 \n", "7 20.855963 457.195404 86 \n", "8 21.812290 496.231110 64 \n", "9 21.137711 466.976685 70 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv(\"./tree_height/txt/eu_x_y_height_predictors_select.txt\", sep=\" \", index_col=False)\n", "pd.set_option('display.max_columns',None)\n", "\n", "print(data.shape)\n", "data.head(10)" ] }, { "cell_type": "markdown", "id": "d3b005fe-c5bf-4399-b692-2c3b5c3dc2ef", "metadata": {}, "source": [ "As explained in the previous lecture, 'h' is the estimated tree heigth. So let's use it as our target." 
] }, { "cell_type": "code", "execution_count": 3, "id": "7560d75b-2bfe-4e65-8079-895d825ff39b", "metadata": {}, "outputs": [], "source": [ "tree_height = data['h'].to_numpy()  # target: GEDI tree height\n", "data = data.drop(columns='h')  # drop the target from the feature table (keyword form, avoids the pandas FutureWarning)" ] },
{ "cell_type": "markdown", "id": "85e7fb69-1866-4009-a954-70394fadf2d0", "metadata": {}, "source": [ "Now we will split the data into training and test sets and normalize the features. Note that the scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into the preprocessing." ] },
{ "cell_type": "code", "execution_count": 4, "id": "0a2c23e6-fcd5-43d4-b330-6e39e8b1d078", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X_train.shape:(15000, 22), X_test.shape:(5000, 22) \n" ] } ], "source": [ "X_train, X_test, y_train, y_test = train_test_split(data.to_numpy()[:20000,:], tree_height[:20000], random_state=0)\n", "print('X_train.shape:{}, X_test.shape:{} '.format(X_train.shape, X_test.shape))\n", "scaler = MinMaxScaler()\n", "X_train = scaler.fit_transform(X_train)\n", "X_test = scaler.transform(X_test)" ] },
{ "cell_type": "markdown", "id": "d064bba1-fbb6-41b3-a799-1007f9686d0e", "metadata": {}, "source": [ "Now we will build our SVR regressor. For more details on all the parameters it accepts, please check the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)." ] },
{ "cell_type": "code", "execution_count": 5, "id": "8d9f7d40-a999-4a3e-95e1-4c9aee9bb3b0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R^2 of SVR on training set: 0.12321\n", "R^2 of SVR on test set: 0.12803\n" ] } ], "source": [ "svr = SVR()\n", "svr.fit(X_train, y_train)  # fit the SVR model on the training data\n", "print('R^2 of SVR on training set: {:.5f}'.format(svr.score(X_train, y_train)))  # score() returns the coefficient of determination (R^2), not a classification accuracy\n", "print('R^2 of SVR on test set: {:.5f}'.format(svr.score(X_test, y_test)))" ] },
{ "cell_type": "code", "execution_count": 8, "id": "241bf481-d6ee-4958-9863-f91ecdbbf7c1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.35781280021821465" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sqrt(0.12803)  # square root of the test R^2" ] },
{ "cell_type": "code", "execution_count": 9, "id": "b8dd204f-fe75-4b6e-8513-06498eeff02a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R^2 of SVR on training set: 0.12322\n", "R^2 of SVR on test set: 0.12804\n" ] } ], "source": [ "svr = SVR(epsilon=0.01)  # narrower epsilon-insensitive tube than the default of 0.1\n", "svr.fit(X_train, y_train)\n", "print('R^2 of SVR on training set: {:.5f}'.format(svr.score(X_train, y_train)))\n", "print('R^2 of SVR on test set: {:.5f}'.format(svr.score(X_test, y_test)))" ] },
{ "cell_type": "code", "execution_count": 10, "id": "e895a720-8d8e-45f0-bf34-0317f2c706f6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.35782677373276583" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sqrt(0.12804)  # square root of the test R^2 for epsilon=0.01" ] },
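{ "cell_type": "markdown", "id": "rmse-mae-metrics-note", "metadata": {}, "source": [ "Since `score()` returns R^2, its square root is at best a correlation-like quantity, not an error in the units of the target. As a complement, here is a minimal sketch of how one might report error metrics in the original units (RMSE and MAE) on the test set with `sklearn.metrics`; it assumes the fitted `svr`, `X_test` and `y_test` from the cells above are still in memory." ] },
{ "cell_type": "code", "execution_count": null, "id": "rmse-mae-metrics-sketch", "metadata": {}, "outputs": [], "source": [ "# Sketch: error metrics in the target's units for the fitted SVR (assumes svr, X_test, y_test exist).\n", "from sklearn.metrics import mean_squared_error, mean_absolute_error\n", "\n", "y_pred = svr.predict(X_test)  # predicted tree heights for the held-out test samples\n", "rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # root mean squared error\n", "mae = mean_absolute_error(y_test, y_pred)  # mean absolute error\n", "print('RMSE: {:.2f}'.format(rmse))\n", "print('MAE: {:.2f}'.format(mae))" ] },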
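{ "cell_type": "markdown", "id": "svr-gridsearch-note", "metadata": {}, "source": [ "Before the exercise below, here is a minimal sketch of one possible way to explore a few SVR hyperparameters (kernel, C, epsilon) jointly with `GridSearchCV`. The grid values are illustrative assumptions rather than recommendations, and the search can be slow on 15,000 training samples." ] },
{ "cell_type": "code", "execution_count": null, "id": "svr-gridsearch-sketch", "metadata": {}, "outputs": [], "source": [ "# Sketch: small grid search over SVR hyperparameters (the grid values are illustrative, not tuned).\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "param_grid = {\n", "    'kernel': ['rbf', 'linear'],\n", "    'C': [0.1, 1, 10],         # regularization strength (larger C = weaker regularization)\n", "    'epsilon': [0.01, 0.1, 1]  # width of the epsilon-insensitive tube\n", "}\n", "grid = GridSearchCV(SVR(), param_grid, cv=3, scoring='r2', n_jobs=-1)\n", "grid.fit(X_train, y_train)  # can take a while on 15,000 samples\n", "print('Best parameters: {}'.format(grid.best_params_))\n", "print('Best cross-validated R^2: {:.5f}'.format(grid.best_score_))\n", "print('Test R^2: {:.5f}'.format(grid.score(X_test, y_test)))" ] },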
{ "cell_type": "markdown", "id": "b6c9b04e-3d17-4df3-81ab-2207ebd20b59", "metadata": {}, "source": [ "Exercise: explore the other parameters offered by the SVM library and try to make the model better. Some suggestions:\n", "- Better cleaning of the data (follow Peppe's suggestions)\n", "- Stronger regularization might be helpful\n", "- Play with different kernels (the grid-search sketch above is a possible starting point)" ] }, { "cell_type": "markdown", "id": "8e80d594-9fd4-4115-9cd4-5a32a68095d5", "metadata": {}, "source": [ "**For the brave ones**, try to implement the SVR algorithm from scratch. As we saw in class, the algorithm is quite simple. Here is a simple sketch of a linear SVM classifier trained with a hinge-loss subgradient descent. Make the appropriate modifications to turn it into a regression. Let us know if your implementation is better than sklearn's." ] }, { "cell_type": "code", "execution_count": null, "id": "b2d5a461-6cfe-476f-9568-197038326bbe", "metadata": {}, "outputs": [], "source": [ "## Support Vector Machine (linear, binary classification) - sketch to adapt into a regression\n", "import numpy as np\n", "\n", "# NOTE: this sketch assumes a small toy dataset x_train (n samples, 2 features)\n", "# with labels y_train in {-1, +1}; it is not the tree-height data used above.\n", "n_samples = x_train.shape[0]\n", "train_f1 = x_train[:, 0].reshape(n_samples, 1)\n", "train_f2 = x_train[:, 1].reshape(n_samples, 1)\n", "train_labels = y_train.reshape(n_samples, 1)  # labels must be -1 or +1\n", "\n", "w1 = 0.0  # weight of the first feature\n", "w2 = 0.0  # weight of the second feature\n", "\n", "epochs = 1\n", "alpha = 0.0001  # learning rate\n", "\n", "while epochs < 10000:\n", "    y = w1 * train_f1 + w2 * train_f2  # current decision values\n", "    prod = y * train_labels            # margins: >= 1 means correctly classified with margin\n", "    if epochs % 1000 == 0:\n", "        print(epochs)\n", "    count = 0\n", "    for val in prod[:, 0]:\n", "        if val >= 1:\n", "            # outside the margin: only the (decaying) regularization term acts\n", "            w1 = w1 - alpha * (2 * 1 / epochs * w1)\n", "            w2 = w2 - alpha * (2 * 1 / epochs * w2)\n", "        else:\n", "            # inside the margin or misclassified: hinge-loss subgradient step\n", "            w1 = w1 + alpha * (train_f1[count, 0] * train_labels[count, 0] - 2 * 1 / epochs * w1)\n", "            w2 = w2 + alpha * (train_f2[count, 0] * train_labels[count, 0] - 2 * 1 / epochs * w2)\n", "        count += 1\n", "    epochs += 1" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }