AWK Tutorial


The AWK language is a data-driven scripting language, created at Bell Labs in the 1970s. It consists of a set of actions to be taken against streams of textual data.

0525d195be6b420e9df90dcb385c2dde

The AWK syntax

The awk syntax is:

awk '[pattern] {action} [pattern]' input.txt > output.txt

The awk line is always placed between single-quotation (’) and not back-quotation (`).

The awk structure is composed of patterns and actions.

  • The [pattern] (optional feature) can be:

    • BEGIN {action} (actions computed before reading the input).

    • END {action} (actions computed at the end of the input).

  • The main action ({action}) is computed by reading the input line by line. The actions include commands and operations.

Therefore the full awk syntax can be represented in this way:

awk 'BEGIN {action} {action} END {action}' input.txt > output.txt

in practice you can read it as:

awk 'BEGIN {action before the input.txt reading } {action during the input.txt reading} END {action after the input.txt reading}' input.txt > output.txt

Columns and rows

[2]:
%%bash
fIn="./txt/aver_month_nuts3_fire.asc"

head ${fIn}

echo "--------"

awk '{print $2,$5}' ${fIn} | head # print the columns 1 and 2

echo "--------"

awk  '{ print NF }'  ${fIn} | head -10     # print number of column # very usefull to check if all the row have the same number of columns

echo "--------"

awk  '{ print NR }' ${fIn}  | tail -10     # print number of row # very usefull to ad ID column e.g. awk  '{ print NR , $0 }' ${fIn}
NUTS YYYY MM 0 BAREA
BG311 2005 04 2 0.282594
BG311 2006 11 2 0.600812
BG311 2007 01 3 65.8331
BG311 2007 02 3 9.78246
BG311 2007 04 2 44.4997
BG311 2007 06 2 30.5861
BG311 2007 07 2 5534.21
BG312 2005 04 3 10.6419
BG312 2006 10 2 0.293182
--------
YYYY BAREA
2005 0.282594
2006 0.600812
2007 65.8331
2007 9.78246
2007 44.4997
2007 30.5861
2007 5534.21
2005 10.6419
2006 0.293182
--------
5
5
5
5
5
5
5
5
5
5
--------
53459
53460
53461
53462
53463
53464
53465
53466
53467
53468

BEGIN and END

[3]:
! awk 'BEGIN { print "sum" } { sum += $5 } END {printf ( "%f\n", sum)}' ./txt/aver_month_nuts3_fire.asc
sum
11467947.039565

Pipe

[5]:
%%bash
awk  '{ print $5 }'  ./txt/aver_month_nuts3_fire.asc  |  sort -k 1,1 | head
0.0005158
0.000521008
0.000522439
0.000552927
0.000556051
0.000558328
0.000578183
0.000581179
0.000603623
0.000604238

Bash variables in AWK

[6]:
%%bash
for (( i=1 ; i<=4 ; i++  )) ; do
awk  -v i=$i  '{ print $i }' ./txt/aver_month_nuts3_fire.asc | head -3
echo "---"
done
NUTS
BG311
BG311
---
YYYY
2005
2006
---
MM
04
11
---
0
2
2
---

Awk Operations

[7]:
! awk  'NR>1{ print log($5) }' ./txt/aver_month_nuts3_fire.asc | head
-1.26374
-0.509473
4.18712
2.28059
3.79548
3.42055
8.6187
2.3648
-1.22696
4.52247
[8]:
! awk  '{ print substr($1,1,4) }' "./txt/aver_month_nuts3_fire.asc" | head -3
NUTS
BG31
BG31
[26]:
! awk  '{ if($3>2) print }'  "./txt/aver_month_nuts3_fire.asc" | head
NUTS YYYY MM 0 BAREA
BG311 2005 04 2 0.282594
BG311 2006 11 2 0.600812
BG311 2007 04 2 44.4997
BG311 2007 06 2 30.5861
BG311 2007 07 2 5534.21
BG312 2005 04 3 10.6419
BG312 2006 10 2 0.293182
BG312 2006 11 5 92.0624
BG312 2006 12 3 29.1471
[30]:
! awk 'NR==1 { print "index "$0 } NR>1 { print NR-1,$0 }' "./txt/aver_month_nuts3_fire.asc" | head
index NUTS YYYY MM 0 BAREA
1 BG311 2005 04 2 0.282594
2 BG311 2006 11 2 0.600812
3 BG311 2007 01 3 65.8331
4 BG311 2007 02 3 9.78246
5 BG311 2007 04 2 44.4997
6 BG311 2007 06 2 30.5861
7 BG311 2007 07 2 5534.21
8 BG312 2005 04 3 10.6419
9 BG312 2006 10 2 0.293182
[37]:
! awk '{ if (NR>1) {if ($4>max) max= $4 }} END {print max }'  "./txt/aver_month_nuts3_fire.asc"
3211
[8]:
! awk '{ if($3==7) sum=sum+$5} END {printf ("%f",sum) }'  "./txt/aver_month_nuts3_fire.asc"
2789009.889985
[40]:
! awk '{ if($3==7 && $2==2005  ) sum=sum+$5} END {print sum }'  "./txt/aver_month_nuts3_fire.asc"
122835

Query operation for file management If we want to count files in a folder tmp of size bigger than 1Mb we can concatenate 3 functions piping each output as input for the next functiont

[9]:
! ls -l /tmp/ | awk '{s=$5; if(s>1000000){print$0}}'| wc -l
0

Split a large file into small files of 100 lines each

[30]:
! awk  'NR%1000==1{x="blockfile"++i ; }{print > x}'  "./txt/aver_month_nuts3_fire.asc"
! mv blockfile* ./txt

Query a dataset on the basis of the header name

[17]:
%%bash
fOut="./txt/test.csv"
echo ID,column_1,colum_b,colum_c,TOTAL > ${fOut}
echo 1,3,5,6,14 >> ${fOut}
echo 2,5,4,6,15  >> ${fOut}
echo 3,8,4,6,18  >> ${fOut}
[18]:
cat "./txt/test.csv"
ID,column_1,colum_b,colum_c,TOTAL
1,3,5,6,14
2,5,4,6,15
3,8,4,6,18
[20]:
%%bash
for COLNAME in  colum_b TOTAL ; do
awk -F, -v COLNAME=$COLNAME '{ if (NR==1) {for (col=1;col<=NF;col++) {if ($col==COLNAME) {colprint=col}}} else  {print $1 "," $colprint }}' ./txt/test.csv
echo "---"
done
1,5
2,4
3,4
---
1,14
2,15
3,18
---

Subset a limited number of column headers and query the dataset on the basis of the header subset selection

[23]:
%%bash
fOut="./txt/test.csv"
echo `seq -s \, 1 1 100`  > ${fOut}
echo `seq -s \, 2 20 2000`  >> ${fOut}
echo `seq -s \, 2 200 20000`  >> ${fOut}
awk -F,  ' { print NF }' ${fOut} | head -1
echo "----"
awk -F,  ' { print NR }' ${fOut}
100
----
1
2
3

We would like to select the first 5, the 25th and the 90th to the 94th colums from the total 100 colums (Fields in AWK) in our dataset. We redirect the list of the selection in a “listofcolumns.csv” file

[26]:
! awk -F, ' { for(i=1;i<6; i++) if(NR==1) printf("%s",$i" ") ; if(NR==1)  printf("%s",$25" ") ; for(i=90;i<95; i++) if(NR==1) printf("%s"," "$i  )   }' ./txt/test.csv  > ./txt/list_of_columns.csv
[27]:
! cat ./txt/list_of_columns.csv
1 2 3 4 5 25  90 91 92 93 94

Finally we carry out a query of the original dataset on the basis of the columns previously selected.

[28]:
%%bash
for COLNAME in  `cat ./txt/list_of_columns.csv` ; do
echo $COLNAME
awk -F, -v COLNAME=$COLNAME ' { if (NR==1)  { for (col=1;col<=NF;col++) { if ($col==COLNAME) {colprint=col}}}  else  {print $1 "," $colprint }} ' ./txt/test.csv > ./txt/OUTPUT_subset_$COLNAME.csv
done
1
2
3
4
5
25
90
91
92
93
94