AWK Tutorial

Recorded lecture: 41:30 - 1:24:50

The AWK language is a data-driven scripting language, created at Bell Labs in the 1970s. It consists of a set of actions to be taken against streams of textual data.

The AWK syntax

The awk syntax is:

awk '[pattern] {action} [pattern]' input.txt > output.txt

The awk line is always placed between single-quotation (’) and not back-quotation (`).

The awk structure is composed of patterns and actions.

The [pattern] (optional feature) can be:
- BEGIN {action} (actions computed before reading the input).
- The main action ({action}) is computed by reading the input line by line. The actions include commands and operations.
- END {action} (actions computed at the end of the input).

Therefore the full awk syntax can be represented in this way:

awk 'BEGIN {action} {action} END {action}' input.txt > output.txt

in practice you can read it as:

awk 'BEGIN {action before the input.txt reading } {action during the input.txt reading} END {action after the input.txt reading}' input.txt > output.txt

Gsim.no	latitude	longitude	MEAN
US_0001971	33.79427255	-84.4743747	0.916785714285714
US_0001977	33.65666667	-84.6736111	74.5558064516129

Open the awk.ipynb to follow the explanation

cd /media/sf_LVM_shared/my_SE_data/exercise
jupyter-notebook awk.ipynb

Columns and rows

[3]:

%%bash
head txt/aver_month_nuts3_fire.asc

echo "--------"

awk '{print $2,$5}' txt/aver_month_nuts3_fire.asc | head # print the columns 1 and 2

echo "--------"

awk  '{ print NF }'  txt/aver_month_nuts3_fire.asc | head -10     # print number of column # very usefull to check if all the row have the same number of columns

echo "--------"

awk  '{ print NR }' txt/aver_month_nuts3_fire.asc  | tail -10     # print number of row # very usefull to ad ID column e.g. awk  '{ print NR , $0 }' ${fIn}

NUTS YYYY MM 0 BAREA
BG311 2005 04 2 0.282594
BG311 2006 11 2 0.600812
BG311 2007 01 3 65.8331
BG311 2007 02 3 9.78246
BG311 2007 04 2 44.4997
BG311 2007 06 2 30.5861
BG311 2007 07 2 5534.21
BG312 2005 04 3 10.6419
BG312 2006 10 2 0.293182
--------
YYYY BAREA
2005 0.282594
2006 0.600812
2007 65.8331
2007 9.78246
2007 44.4997
2007 30.5861
2007 5534.21
2005 10.6419
2006 0.293182
--------
5
5
5
5
5
5
5
5
5
5
--------
53459
53460
53461
53462
53463
53464
53465
53466
53467
53468

BEGIN and END

[4]:

! awk 'BEGIN { print "sum" } { sum += $5 } END {printf ( "%f\n", sum)}' txt/aver_month_nuts3_fire.asc

sum
11467947.039565

Pipe

[5]:

%%bash
awk  '{ print $5 }'  txt/aver_month_nuts3_fire.asc  |  sort -k 1,1 | head

0.0005158
0.000521008
0.000522439
0.000552927
0.000556051
0.000558328
0.000578183
0.000581179
0.000603623
0.000604238

Bash variables in AWK

[6]:

%%bash
for (( i=1 ; i<=4 ; i++  )) ; do
awk  -v i=$i  '{ print $i }' txt/aver_month_nuts3_fire.asc | head -3
echo "---"
done

NUTS
BG311
BG311
---
YYYY
2005
2006
---
MM
04
11
---
0
2
2
---

AWK Operations

[7]:

! awk  'NR>1{ print log($5) }' txt/aver_month_nuts3_fire.asc | head

[8]:

! awk  '{ print substr($1,1,4) }' txt/aver_month_nuts3_fire.asc | head -3

NUTS
BG31
BG31

[9]:

! awk  '{ if($3>2) print }'  txt/aver_month_nuts3_fire.asc | head

NUTS YYYY MM 0 BAREA
BG311 2005 04 2 0.282594
BG311 2006 11 2 0.600812
BG311 2007 04 2 44.4997
BG311 2007 06 2 30.5861
BG311 2007 07 2 5534.21
BG312 2005 04 3 10.6419
BG312 2006 10 2 0.293182
BG312 2006 11 5 92.0624
BG312 2006 12 3 29.1471

[10]:

! awk 'NR==1 { print "index "$0 } NR>1 { print NR-1,$0 }' txt/aver_month_nuts3_fire.asc | head

index NUTS YYYY MM 0 BAREA
1 BG311 2005 04 2 0.282594
2 BG311 2006 11 2 0.600812
3 BG311 2007 01 3 65.8331
4 BG311 2007 02 3 9.78246
5 BG311 2007 04 2 44.4997
6 BG311 2007 06 2 30.5861
7 BG311 2007 07 2 5534.21
8 BG312 2005 04 3 10.6419
9 BG312 2006 10 2 0.293182

[11]:

! awk '{ if (NR>1) {if ($4>max) max= $4 }} END {print max }'  txt/aver_month_nuts3_fire.asc

[12]:

! awk '{ if($3==7) sum=sum+$5} END {printf ("%f",sum) }'  txt/aver_month_nuts3_fire.asc

2789009.889985

[13]:

! awk '{ if($3==7 && $2==2005  ) sum=sum+$5} END {print sum }'  txt/aver_month_nuts3_fire.asc

Query operation for file management If we want to count files in a folder tmp of size bigger than 1Mb we can concatenate 3 functions piping each output as input for the next functiont

[14]:

! ls -l /tmp/ | awk '{s=$5; if(s>1000000){print$0}}'| wc -l

Split a large file into small files of 100 lines each

[15]:

! awk  'NR%1000==1{x="blockfile"++i ; }{print > x}'  txt/aver_month_nuts3_fire.asc
! mv blockfile* ./txt

Query a dataset on the basis of the header name

[16]:

%%bash
fOut="./txt/test.csv"
echo ID,column_1,colum_b,colum_c,TOTAL > ${fOut}
echo 1,3,5,6,14 >> ${fOut}
echo 2,5,4,6,15  >> ${fOut}
echo 3,8,4,6,18  >> ${fOut}

[17]:

cat "./txt/test.csv"

ID,column_1,colum_b,colum_c,TOTAL
1,3,5,6,14
2,5,4,6,15
3,8,4,6,18

[18]:

%%bash
for COLNAME in  colum_b TOTAL ; do
awk -F, -v COLNAME=$COLNAME '{ if (NR==1) {for (col=1;col<=NF;col++) {if ($col==COLNAME) {colprint=col}}} else  {print $1 "," $colprint }}' ./txt/test.csv
echo "---"
done

1,5
2,4
3,4
---
1,14
2,15
3,18
---

Subset a limited number of column headers and query the dataset on the basis of the header subset selection

[22]:

%%bash
fOut="./txt/test.csv"
seq -s \, 1 1 100  > ${fOut}
seq -s \, 2 20 2000  >> ${fOut}
seq -s \, 2 200 20000  >> ${fOut}
awk -F,  ' { print NF }' ${fOut} | head -10
echo "----"
awk -F,  ' { print NR }' ${fOut}

We would like to select the first 5, the 25th and the 90th to the 94th colums from the total 100 colums (Fields in AWK) in our dataset. We redirect the list of the selection in a “listofcolumns.csv” file

[23]:

! awk -F, ' { for(i=1;i<6; i++) if(NR==1) printf("%s",$i" ") ; if(NR==1)  printf("%s",$25" ") ; for(i=90;i<95; i++) if(NR==1) printf("%s"," "$i  )   }' ./txt/test.csv  > ./txt/list_of_columns.csv

[24]:

! cat ./txt/list_of_columns.csv

1 2 3 4 5 25  90 91 92 93 94

Finally we carry out a query of the original dataset on the basis of the columns previously selected.

[25]:

%%bash
for COLNAME in  $(cat ./txt/list_of_columns.csv) ; do
echo $COLNAME
awk -F, -v COLNAME=$COLNAME ' { if (NR==1)  { for (col=1;col<=NF;col++) { if ($col==COLNAME) {colprint=col}}}  else  {print $1 "," $colprint }} ' ./txt/test.csv > ./txt/OUTPUT_subset_$COLNAME.csv
done

[26]:

%%bash
head ./txt/OUTPUT_subset_1.csv

2,2
2,2

[27]:

%%bash
head ./txt/OUTPUT_subset_2.csv

2,22
2,202