{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AWK Tutorial\n", "\n", "-----------------------\n", "\n", "[Recorded lecture: 41:30 - 1:24:50](https://youtu.be/gCQ33jknzZY)\n", "\n", "-----------------------\n", "\n", "The [AWK](https://en.wikipedia.org/wiki/AWK) language is a data-driven scripting language, created at Bell Labs in the 1970s. \n", "It consists of a set of actions to be taken against streams of textual data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The AWK syntax\n", "\n", "The awk syntax is:\n", "\n", " awk '[pattern] {action} [pattern]' input.txt > output.txt\n", "\n", "The awk line is always placed between single-quotation (') and not back-quotation (`).\n", "\n", "The awk structure is composed of patterns and actions.\n", "\n", "- The [pattern] (optional feature) can be:\n", "\n", " - BEGIN {action} (actions computed before reading the input).\n", " - The main action ({action}) is computed by reading the input line by line. The actions include commands and operations. \n", " - END {action} (actions computed at the end of the input).\n", "\n", "\n", "Therefore the full awk syntax can be represented in this way:\n", "\n", " awk 'BEGIN {action} {action} END {action}' input.txt > output.txt\n", "\n", "in practice you can read it as:\n", "\n", " awk 'BEGIN {action before the input.txt reading } {action during the input.txt reading} END {action after the input.txt reading}' input.txt > output.txt\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "|Gsim.no|latitude|longitude|MEAN|\n", "|---|---|---|---|\n", "|US_0001971 | 33.79427255 | -84.4743747 | 0.916785714285714 |\n", "|US_0001977 | 33.65666667 | -84.6736111 | 74.5558064516129 |\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Open the awk.ipynb to follow the explanation\n", "\n", "```\n", "cd /media/sf_LVM_shared/my_SE_data/exercise\n", "jupyter-notebook awk.ipynb\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Columns and rows" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NUTS YYYY MM 0 BAREA\n", "BG311 2005 04 2 0.282594\n", "BG311 2006 11 2 0.600812\n", "BG311 2007 01 3 65.8331\n", "BG311 2007 02 3 9.78246\n", "BG311 2007 04 2 44.4997\n", "BG311 2007 06 2 30.5861\n", "BG311 2007 07 2 5534.21\n", "BG312 2005 04 3 10.6419\n", "BG312 2006 10 2 0.293182\n", "--------\n", "YYYY BAREA\n", "2005 0.282594\n", "2006 0.600812\n", "2007 65.8331\n", "2007 9.78246\n", "2007 44.4997\n", "2007 30.5861\n", "2007 5534.21\n", "2005 10.6419\n", "2006 0.293182\n", "--------\n", "5\n", "5\n", "5\n", "5\n", "5\n", "5\n", "5\n", "5\n", "5\n", "5\n", "--------\n", "53459\n", "53460\n", "53461\n", "53462\n", "53463\n", "53464\n", "53465\n", "53466\n", "53467\n", "53468\n" ] } ], "source": [ "%%bash\n", "head txt/aver_month_nuts3_fire.asc\n", "\n", "echo \"--------\"\n", "\n", "awk '{print $2,$5}' txt/aver_month_nuts3_fire.asc | head # print the columns 1 and 2 \n", "\n", "echo \"--------\"\n", "\n", "awk '{ print NF }' txt/aver_month_nuts3_fire.asc | head -10 # print number of column # very usefull to check if all the row have the same number of columns \n", "\n", "echo \"--------\"\n", "\n", "awk '{ print NR }' txt/aver_month_nuts3_fire.asc | tail -10 # print number of row # very usefull to ad ID column e.g. awk '{ print NR , $0 }' ${fIn} " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BEGIN and END" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sum\n", "11467947.039565\n" ] } ], "source": [ "! awk 'BEGIN { print \"sum\" } { sum += $5 } END {printf ( \"%f\\n\", sum)}' txt/aver_month_nuts3_fire.asc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipe " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0005158\n", "0.000521008\n", "0.000522439\n", "0.000552927\n", "0.000556051\n", "0.000558328\n", "0.000578183\n", "0.000581179\n", "0.000603623\n", "0.000604238\n" ] } ], "source": [ "%%bash\n", "awk '{ print $5 }' txt/aver_month_nuts3_fire.asc | sort -k 1,1 | head " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bash variables in AWK" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NUTS\n", "BG311\n", "BG311\n", "---\n", "YYYY\n", "2005\n", "2006\n", "---\n", "MM\n", "04\n", "11\n", "---\n", "0\n", "2\n", "2\n", "---\n" ] } ], "source": [ "%%bash\n", "for (( i=1 ; i<=4 ; i++ )) ; do\n", "awk -v i=$i '{ print $i }' txt/aver_month_nuts3_fire.asc | head -3\n", "echo \"---\"\n", "done " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## AWK Operations" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-1.26374\n", "-0.509473\n", "4.18712\n", "2.28059\n", "3.79548\n", "3.42055\n", "8.6187\n", "2.3648\n", "-1.22696\n", "4.52247\n" ] } ], "source": [ "! awk 'NR>1{ print log($5) }' txt/aver_month_nuts3_fire.asc | head" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NUTS\n", "BG31\n", "BG31\n" ] } ], "source": [ "! awk '{ print substr($1,1,4) }' txt/aver_month_nuts3_fire.asc | head -3" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NUTS YYYY MM 0 BAREA\n", "BG311 2005 04 2 0.282594\n", "BG311 2006 11 2 0.600812\n", "BG311 2007 04 2 44.4997\n", "BG311 2007 06 2 30.5861\n", "BG311 2007 07 2 5534.21\n", "BG312 2005 04 3 10.6419\n", "BG312 2006 10 2 0.293182\n", "BG312 2006 11 5 92.0624\n", "BG312 2006 12 3 29.1471\n" ] } ], "source": [ "! awk '{ if($3>2) print }' txt/aver_month_nuts3_fire.asc | head" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "index NUTS YYYY MM 0 BAREA\n", "1 BG311 2005 04 2 0.282594\n", "2 BG311 2006 11 2 0.600812\n", "3 BG311 2007 01 3 65.8331\n", "4 BG311 2007 02 3 9.78246\n", "5 BG311 2007 04 2 44.4997\n", "6 BG311 2007 06 2 30.5861\n", "7 BG311 2007 07 2 5534.21\n", "8 BG312 2005 04 3 10.6419\n", "9 BG312 2006 10 2 0.293182\n" ] } ], "source": [ "! awk 'NR==1 { print \"index \"$0 } NR>1 { print NR-1,$0 }' txt/aver_month_nuts3_fire.asc | head" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3211\n" ] } ], "source": [ "! awk '{ if (NR>1) {if ($4>max) max= $4 }} END {print max }' txt/aver_month_nuts3_fire.asc" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2789009.889985" ] } ], "source": [ "! awk '{ if($3==7) sum=sum+$5} END {printf (\"%f\",sum) }' txt/aver_month_nuts3_fire.asc" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "122835\n" ] } ], "source": [ "! awk '{ if($3==7 && $2==2005 ) sum=sum+$5} END {print sum }' txt/aver_month_nuts3_fire.asc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Query operation for file management \n", "If we want to count files in a folder tmp of size bigger than 1Mb we can concatenate 3 functions piping each output as input for the next functiont" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5\n" ] } ], "source": [ "! ls -l /tmp/ | awk '{s=$5; if(s>1000000){print$0}}'| wc -l" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Split a large file into small files of 100 lines each" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "! awk 'NR%1000==1{x=\"blockfile\"++i ; }{print > x}' txt/aver_month_nuts3_fire.asc\n", "! mv blockfile* ./txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query a dataset on the basis of the header name" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "fOut=\"./txt/test.csv\"\n", "echo ID,column_1,colum_b,colum_c,TOTAL > ${fOut}\n", "echo 1,3,5,6,14 >> ${fOut}\n", "echo 2,5,4,6,15 >> ${fOut}\n", "echo 3,8,4,6,18 >> ${fOut}" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ID,column_1,colum_b,colum_c,TOTAL\n", "1,3,5,6,14\n", "2,5,4,6,15\n", "3,8,4,6,18\n" ] } ], "source": [ "cat \"./txt/test.csv\"" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1,5\n", "2,4\n", "3,4\n", "---\n", "1,14\n", "2,15\n", "3,18\n", "---\n" ] } ], "source": [ "%%bash\n", "for COLNAME in colum_b TOTAL ; do\n", "awk -F, -v COLNAME=$COLNAME '{ if (NR==1) {for (col=1;col<=NF;col++) {if ($col==COLNAME) {colprint=col}}} else {print $1 \",\" $colprint }}' ./txt/test.csv \n", "echo \"---\"\n", "done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Subset a limited number of column headers and query the dataset on the basis of the header subset selection" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100\n", "100\n", "100\n", "----\n", "1\n", "2\n", "3\n" ] } ], "source": [ "%%bash \n", "fOut=\"./txt/test.csv\"\n", "seq -s \\, 1 1 100 > ${fOut}\n", "seq -s \\, 2 20 2000 >> ${fOut}\n", "seq -s \\, 2 200 20000 >> ${fOut}\n", "awk -F, ' { print NF }' ${fOut} | head -10\n", "echo \"----\"\n", "awk -F, ' { print NR }' ${fOut} " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We would like to select the first 5, the 25th and the 90th to the 94th colums from the total 100 colums (Fields in AWK) in our dataset. \n", "We redirect the list of the selection in a “listofcolumns.csv” file" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "! awk -F, ' { for(i=1;i<6; i++) if(NR==1) printf(\"%s\",$i\" \") ; if(NR==1) printf(\"%s\",$25\" \") ; for(i=90;i<95; i++) if(NR==1) printf(\"%s\",\" \"$i ) }' ./txt/test.csv > ./txt/list_of_columns.csv" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 2 3 4 5 25 90 91 92 93 94" ] } ], "source": [ "! cat ./txt/list_of_columns.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we carry out a query of the original dataset on the basis of the columns previously selected." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "2\n", "3\n", "4\n", "5\n", "25\n", "90\n", "91\n", "92\n", "93\n", "94\n" ] } ], "source": [ "%%bash \n", "for COLNAME in $(cat ./txt/list_of_columns.csv) ; do\n", "echo $COLNAME\n", "awk -F, -v COLNAME=$COLNAME ' { if (NR==1) { for (col=1;col<=NF;col++) { if ($col==COLNAME) {colprint=col}}} else {print $1 \",\" $colprint }} ' ./txt/test.csv > ./txt/OUTPUT_subset_$COLNAME.csv \n", "done" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2,2\n", "2,2\n" ] } ], "source": [ "%%bash \n", "head ./txt/OUTPUT_subset_1.csv " ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2,22\n", "2,202\n" ] } ], "source": [ "%%bash \n", "head ./txt/OUTPUT_subset_2.csv " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }