The AWK Programming Language: A Tool for Data Extraction

Introducing AWK, a programming language designed for text processing and typically used as a data extraction and reporting tool. This language is a standard feature of most UNIX-like operating systems.

OpenSource For You

By: Neethu C. Sekhar. The author is an open source enthusiast, currently working as an assistant professor in the Department of Computer Science, AmalJyothi College of Engineering, Kerala. She can be reached at nitucskr@gmail.com

AWK, one of the most prominent text-processing utilities on GNU/Linux, takes its name from the initials of its authors: Aho, Weinberger and Kernighan. It is an extremely versatile scripting language that looks a little like C and processes data files, especially text files that are organised in rows and columns.

AWK is a consistent tool with only a few data types: numbers, strings and associative arrays.

Its portability and stability have made it very popular. It's a concise scripting language that can tackle a vast array of problems. It can even be used to implement a database, a parser, an interpreter or a compiler for a small project-specific computer language.

If you are already familiar with regex (regular expressions), it's quite easy to pick up the basics of AWK. This article will be useful for software developers, systems administrators, or any enthusiastic reader inclined to learn how to do text processing and data extraction in UNIX-like environments. Of course, one could use Perl or Python, but AWK often makes the job simpler with a concise single-line command. Also, learning AWK is pretty low cost. You can learn the basics in less than an hour, so it doesn't require as much effort and time as learning most other programming/scripting languages.

The original version of AWK was written in 1977 at AT&T Bell Laboratories. In 1985, a new version made the programming language more powerful, introducing user-defined functions, multiple input streams and computed regular expressions.

Typical applications of AWK include generating reports, validating data, creating small databases, etc. AWK is powerful yet simple: it can solve complex text processing tasks with a few lines of code. Starting with an overview of AWK, its environment, and workflow, this article proceeds to explain its syntax, variables, operators, arrays, loops and functions.

AWK installation

Generally, AWK is available by default on most GNU/Linux distributions. You can use the which command (for example, which awk) to check whether it is present on your system. If it is not, install it on Debian-based GNU/Linux using the Advanced Package Tool (APT) package manager, as follows:

sudo apt-get install gawk

AWK is used for stream processing, where the basic unit is the string. It considers a text file as a collection of fields and records. Each row is a record, and a record is a collection of fields. The syntax of AWK is as follows. On the command line:

awk [options] 'pattern {action}' input_file

As an AWK script:

awk [options] -f script_name input_file

The most commonly used command-line options of awk are -F and -f:

-F : to change the input field separator
-f : to name the script file
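
As a minimal sketch of these two options, the following creates a small colon-separated file and a one-line script file and then runs AWK on them. The file names users.txt and first.awk, and the sample records, are made up purely for illustration.

```shell
# Hypothetical sample data: colon-separated records
printf 'root:x:0\nalice:x:1000\n' > users.txt

# -F sets the input field separator to a colon
by_flag=$(awk -F: '{ print $1, $3 }' users.txt)
echo "$by_flag"

# -f reads the AWK program from a script file instead of the command line
printf '{ print $1 }\n' > first.awk
by_file=$(awk -F: -f first.awk users.txt)
echo "$by_file"
```

The first command prints the first and third fields of each record; the second prints only the first field, using the program stored in the script file.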

A basic AWK program consists of patterns and actions. If the pattern is missing, the action is applied to all lines; if the action is missing, each matched line is printed. There are two types of buffers used in AWK: the record buffer and the field buffer. The latter is denoted as $1, $2 ... $n, where 'n' indicates the field number in the input file, i.e., $ followed by the field number (so $2 indicates the second field). The record buffer is denoted as $0, which indicates the whole record.

For example, to print the first field of every line in a file, use the following command:

awk '{print $1}' filename

To print the third and first fields of every line in a file, use the command given below:

awk '{print $3, $1}' filename
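
The rule about missing patterns and missing actions can be tried out with a short sketch. The file fruits.txt and its contents are hypothetical and exist only for this example.

```shell
# Hypothetical sample data
printf 'apple red 3\nbanana yellow 7\n' > fruits.txt

# Action only: runs on every line and prints the third and first fields
fields=$(awk '{ print $3, $1 }' fruits.txt)
echo "$fields"

# Pattern only: the default action prints each matching line ($0)
matched=$(awk '/banana/' fruits.txt)
echo "$matched"
```

The first command touches every record because no pattern restricts it, while the second prints whole records because no action was given.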

AWK process flow

So how does one write an AWK script?

AWK scripts are divided into the following three parts: BEGIN (pre-processing), body (processing) and END (post-processing).

BEGIN is the part of the AWK script where variables can be initialised and report headings can be created. The body contains the actions that are applied to each input record in turn, much like the body of a loop. END, or the post-processing part, analyses or prints the data that has been processed.

Let's look at an example for finding the total marks and averages of a set of students.

The AWK script is named awscript:

#Begin processing
BEGIN { print "To find the total marks & average" }

#Body processing
{
    tot = $2 + $3 + $4
    avg = tot / 3
    print "Total of " $1 " is :", tot
    print "Average of " $1 " is :", avg
}

#End processing
END { print "---Script Finished---" }

The input file is named awkfile:

Aby 20 21 25

Amy 22 23 20

Running the AWK script as:

awk -f awscript awkfile

Output:

To find the total marks & average
Total of Aby is : 66
Average of Aby is : 22
Total of Amy is : 65
Average of Amy is : 21.6667
---Script Finished---

Classification of patterns

Expressions: AWK provides two operators for matching against regular expressions (regex): match (~) and doesn't match (!~). Regular expressions must be enclosed in /slashes/, as follows:

awk '$0 ~ /^[a-d]/' file1 (Identifies all lines in a file that start with a, b, c or d)

awk '$0 !~ /^[a-d]/' file1 (Identifies all lines in a file that do not start with a, b, c or d)
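
A regular-expression pattern can also be used for counting. The sketch below counts the pattern 'unix' in a file named data in two ways; the file contents are invented for the example, and using gsub's return value to count matches is one possible approach rather than the only one.

```shell
# Hypothetical input file named data
printf 'unix and more unix\nplain linux\nunix shell\n' > data

# Count the lines that contain the pattern
lines=$(awk '/unix/ { n++ } END { print n + 0 }' data)

# Count every occurrence: gsub returns the number of substitutions it made,
# and "&" replaces each match with itself so the text is left unchanged
total=$(awk '{ n += gsub(/unix/, "&") } END { print n + 0 }' data)

echo "$lines $total"
```

Adding 0 in the END block makes AWK print 0 rather than an empty string when nothing matched.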

Another example of an expression pattern is counting the number of occurrences of the pattern 'unix' in the input file 'data'. AWK supports the following operators:

Arithmetic operators: +, -, *, /, %, ^.
Relational operators: >, >=, <, <=, ==, != and…
Logical operators: &&, ||, !.

As an example, consider the file awktest:

1 unix 10 50

2 shell 20 10

3 unix 30 30

4 linux 20 20

• Identify the records whose second field is "unix" and whose fourth field is greater than 40:

awk '$2 == "unix" && $4 > 40 {print}' awktest

1 unix 10 50

• Identify the records where the product of the third and fourth fields is greater than 500:

awk '$3 * $4 > 500 {print}' awktest

3 unix 30 30

If no pattern is given, AWK applies the action to all the lines in the input file.

The system variables used by AWK are listed below.

FS: Field separator (default=whitespace)
RS: Record separator (default=\n)
NF: Number of fields in the current record
NR: Number of the current record
OFS: Output field separator (default=space)
ORS: Output record separator (default=\n)
FILENAME: Current file name

There are more than 12 system variables used by AWK. We can also define our own (user-defined) variables while creating an AWK script.

• awk '{OFS="-"; print $1, $2}' marks

john-85
andrea-89

• awk '{print NR, $1, $3}' marks

1 john cse
2 andrea ece
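
To see several of these variables together, here is a small sketch. It assumes a marks file with the three-column shape implied by the output above (name, mark, department); the exact contents are an assumption.

```shell
# Hypothetical marks file: name, mark, department
printf 'john 85 cse\nandrea 89 ece\n' > marks

# NR is the record number, NF the number of fields, $NF the last field
info=$(awk '{ print NR, NF, $NF }' marks)
echo "$info"

# Setting OFS once in BEGIN applies it to every record
joined=$(awk 'BEGIN { OFS = "-" } { print $1, $2 }' marks)
echo "$joined"
```

Setting OFS in BEGIN rather than in the body avoids reassigning it for every record, although both forms give the same result here.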

Range patterns: These match a range of consecutive input lines:

start-pattern, end-pattern {actions}

The range starts with the record that matches the start pattern and ends with the record that matches the end pattern.

Here is an example that prints the 3rd to 5th lines of the file marks, along with their line numbers:

• awk 'NR==3, NR==5 {print NR, $0}' marks
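
Range patterns are not limited to record numbers; the start and end can also be regular expressions. The following sketch shows both forms on a made-up file, log.txt, whose START and END marker lines are assumptions for the example.

```shell
# Hypothetical input with marker lines
printf 'a\nSTART\nb\nc\nEND\nd\n' > log.txt

# Numeric range: records 2 through 4, with their line numbers
by_number=$(awk 'NR==2, NR==4 { print NR, $0 }' log.txt)
echo "$by_number"

# Regex range: from the line matching START to the line matching END
by_regex=$(awk '/START/, /END/' log.txt)
echo "$by_regex"
```

Both the start line and the end line are included in the range.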

Action statements

Expression statements: An expression is evaluated and returns a value, which can also be tested as true or false. It consists of any combination of numeric and string constants, variables, operators, functions, and regular expressions.

Here is an example:

{$3 = "Hello"}
{sum += ($2+4)}
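
The effect of these two expression statements can be observed with a short sketch. The file scores.txt and its three columns are hypothetical.

```shell
# Hypothetical input: name, score, remark
printf 'amy 10 ok\nben 20 ok\n' > scores.txt

# Assigning to $3 rebuilds the record; sum accumulates across records
out=$(awk '{ $3 = "Hello"; sum += ($2 + 4); print }
           END { print "sum =", sum }' scores.txt)
echo "$out"
```

Each record is printed with its third field replaced, and the END block reports the accumulated total.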

Output statements: There are three output actions in AWK: print, printf and sprintf.

print writes the specified data to the standard output. For example, awk '{print $1, $2, $3}' filename prints the first, second and third columns.

printf uses format specifiers in a format string, each of which requires an argument of a matching type.

sprintf (string printf) returns the formatted text as a string instead of printing it:

str = sprintf("%2d %-12s %9.2f", $1, $2, $3)

As an example, consider the file 'data':

12 abcd 12.2
13 mnop 11.1

• awk '{printf("@%d %-3s %0.3f\n", $1, $2, $3)}' data

The above command puts an @ before the first field, left-aligns the second field in a minimum width of three characters, and prints the third field with three decimal places. The output is:

@12 abcd 12.200
@13 mnop 11.100
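
Field widths make the difference between right and left alignment easier to see. The sketch below reuses the two records of the data file; the "|" separators and the label format are arbitrary choices made for this illustration.

```shell
# Same two records as the data file in the article
printf '12 abcd 12.2\n13 mnop 11.1\n' > data

# %5d right-aligns in 5 columns, %-6s left-aligns in 6, %.1f keeps one decimal
table=$(awk '{ printf("%5d|%-6s|%.1f\n", $1, $2, $3) }' data)
echo "$table"

# sprintf builds a formatted string without printing it
label=$(awk 'NR == 1 { s = sprintf("%s-%03d", $2, $1); print s }' data)
echo "$label"
```

Unlike print, printf adds no newline of its own, which is why the format string ends in \n.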

Decision statements: An if-else decision statement evaluates an expression and takes the corresponding action. if statements can also be nested.

As an example, to print all records with more than three fields, type:

{
    if (NF > 3)
        print $0
    else
        print "3 or fewer fields"
}

To print the marks and grade of a student, type:

BEGIN { print "Mark & grade of a student" }
{
    if ($2 == "Amy")
        S = S + $3
}
END {
    print "Total marks: " S
    if (S > 50)
        print "Grade A"
    else
        print "Grade B"
}
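
Chaining else if gives more than two outcomes. The following sketch assigns one of three grades; the file results.txt and the grade boundaries of 75 and 50 are assumptions chosen only for illustration.

```shell
# Hypothetical input: name and mark
printf 'amy 82\nben 55\ncal 30\n' > results.txt

grades=$(awk '{
    if ($2 >= 75)
        grade = "A"
    else if ($2 >= 50)
        grade = "B"
    else
        grade = "C"
    print $1, grade
}' results.txt)
echo "$grades"
```

The conditions are tested from top to bottom, and only the first branch that is true is executed.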

Loop statements: while, do..while and for are the loop statements in AWK. The AWK while loop checks the condition first, and if the condition is true, it executes the list of actions. This process repeats until the condition becomes false.

Here is an example:

BEGIN {
    print "Display even numbers from 10 to 20"
    I = 10              #initialization
    while (I <= 20)     #loop limit test
    {
        print I         #action
        I += 2          #update
    }
}                       #end script

do..while loop: The AWK do..while loop executes the body once, then repeats the body as long as the condition is true. Here is an example that displays numbers from 1 to 5:

awk 'BEGIN {i=1; do {print i; i++} while (i <= 5)}'

Here is an example of the for loop:

Program name: awkfor

BEGIN { print "Sum of fields in all lines" }
{
    for (i = 1; i <= NF; i++)
    {
        t = t + $i    #running sum of every field
    }
}
END { print "Sum is " t }

Consider the input file, data:

10 30
10 20

Running the script:

awk -f awkfor data

Sum of fields in all lines
Sum is 70

Control statements: next, getline and exit are the control statements in AWK. The 'next' statement alters the flow of the program: it stops processing the current record. The program reads the next line and starts executing the script from the beginning with the new line.

getline also reads the next line, but unlike next it continues executing the script from the current statement. The exit statement causes AWK to immediately stop processing the input, and any remaining lines are ignored (the END block, if there is one, is still run).
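
The three control statements can be compared in a short sketch. The file nums.txt, its comment line and its STOP marker are invented for this example.

```shell
# Hypothetical input with a comment line and a stop marker
printf '# header\n10\n20\nSTOP\n30\n' > nums.txt

# next skips comment lines; exit stops reading input but END still runs
summed=$(awk '/^#/ { next }
              /STOP/ { exit }
              { sum += $1 }
              END { print "sum =", sum }' nums.txt)
echo "$summed"

# getline replaces $0 with the following line and carries on in the same action
following=$(awk '/STOP/ { getline; print }' nums.txt)
echo "$following"
```

In the first command the value 30 is never added because exit is reached first; in the second, the line after STOP is the one that gets printed.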

Mathematical functions in AWK

The various mathematical functions in AWK are:

int(x) -- truncates a floating-point value to an integer
cos(x) -- returns the cosine of x
exp(x) -- returns e^x
log(x) -- returns the natural logarithm of x
sin(x) -- returns the sine of x
sqrt(x) -- returns the square root of x

Here is an example:

BEGIN {
    x = 5.3241
    y = int(x)
    printf "truncated value is %d\n", y
}

Output: truncated value is 5
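
A few of the other functions can be checked with inputs whose results are easy to verify by hand. This is only a sketch; the argument values are chosen for convenience.

```shell
out=$(awk 'BEGIN {
    print int(7.9)              # truncation, not rounding
    print sqrt(16)
    print exp(0)
    print log(1)
    printf("%.4f\n", cos(0))    # fixed number of decimal places
}')
echo "$out"
```

Because everything runs in a BEGIN block, no input file is needed.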

String functions in AWK

1. length(string): Calculates the length of a string.

2. index(string, substring): Returns the first position of the substring within a string. For example, after x = index("programming", "gra"), x has the value 4.

3. substr(): Extracts a substring from a string. The two different ways to use it are: substr(string, position) and substr(string, position, length).

Here is an example:

BEGIN {
    x = substr("methodology", 3)
    y = substr("methodology", 5, 4)
    print "substring starts at " x
    print "substring of length " y
}

The output of the above code is:

substring starts at thodology
substring of length odol

4. sub(regex, replacement string, input string) or gsub(regex, replacement string, input string):
sub(/Ben/, "Ann", $0): Replaces Ben with Ann (first occurrence only).
gsub(/is/, "was", $0): Replaces all occurrences of 'is' with 'was'.

5. match(string, regex): Returns the position at which the regex matches, or 0 if there is no match.

{
    x = match($0, /^[0-9]/)    #test whether the line starts with a digit
    if (x > 0)                 #x is greater than 0 if there's a match
        print $0
}

6. toupper() and tolower(): These are used for convenient conversions of case, as follows:

{
    print toupper($0)    #converts the entire file to uppercase
}
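
The string functions above can be exercised together on one sample string. The string itself is made up; RSTART and RLENGTH, which do not appear in the list above, are the standard AWK variables that match sets to the position and length of the match.

```shell
out=$(awk 'BEGIN {
    s = "this is Ben"
    print length(s)
    print index(s, "is")
    print substr(s, 6, 2)
    t = s; sub(/is/, "was", t); print t      # first match only
    u = s; gsub(/is/, "was", u); print u     # every match
    print toupper(s)
    if (match(s, /B[a-z]+/))
        print RSTART, RLENGTH
}')
echo "$out"
```

Note that sub and gsub modify the variable passed as the third argument, which is why the sketch works on copies (t and u) of the original string.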

User-defined functions in AWK

AWK allows us to define our own functions, which enables reusability of the code. A large program can be divided into functions, and each one can be written/tested independently.

The syntax is:

function function_name(parameter list) {
    function body
}

The following example finds the larger of two numbers. The program's name is awfunct.

{ print large($1, $2) }
function large(m, n) {
    return m > n ? m : n
}

Input file is: doc

100 400

Running the script:

awk -f awfunct doc

400
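
A function can take more than two parameters and can be called several times in one action. In the sketch below, clamp is a made-up helper (not a built-in) that limits a value to a range, and the bounds 5 and 50 are arbitrary.

```shell
# Hypothetical input: two numbers per line
printf '100 400\n7 3\n' > doc

# clamp is an illustrative helper that limits v to the range lo..hi
out=$(awk 'function clamp(v, lo, hi) {
               return v < lo ? lo : (v > hi ? hi : v)
           }
           { print clamp($1, 5, 50), clamp($2, 5, 50) }' doc)
echo "$out"
```

The function definition can appear before or after the rules that call it.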

Associative arrays

AWK allows one-dimensional arrays, and their size and elements need not be declared. An array index can be a number or a string.

The syntax is:

arrayName[index] = value

Each index is associated with an array element, so AWK arrays are known as associative arrays.

For example, dept[$2] refers to the element of the array dept whose index is the value in the second column of the current record.

Program name: awkarray

BEGIN { print "Eg of arrays in awk" }
{ dept[$2] }                  #referencing an element creates it
END {
    for (x in dept)           #x takes each index in turn
        print x
}

Consider the input file, data:

S3 CSE A

S4 ECE B

S4 EEE A

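Running the script:

awk -f awkarray data

Eg of arrays in awk
CSE
ECE
EEE

The order in which a for..in loop visits the indices is not specified, so the three lines may appear in a different order.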
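
A common use of associative arrays is counting how often each value occurs. The sketch below counts the records per value of the first column of the same three-line input; piping the result through sort is an added step to make the order predictable.

```shell
# Same three records as the data file in the article
printf 'S3 CSE A\nS4 ECE B\nS4 EEE A\n' > data

# n[$1]++ creates the element on first use and increments it afterwards
counts=$(awk '{ n[$1]++ } END { for (k in n) print k, n[k] }' data | sort)
echo "$counts"
```

No initialisation is needed because an unset array element behaves as 0 in a numeric context.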

AWK is oriented towards delimited fields on a per-line basis. It has very robust programming constructs including decision statements like if..else, and loops like while, do..while and for.

We can conclude by saying that AWK is another keystone of UNIX shell programming. It really shines when it comes to simplifying things like processing multi-line records and handling multiple files simultaneously. AWK inherits the features of conventional programming languages.

So not only was AWK popular when it was introduced but it has also led to the creation of other popular languages.

More details about this text processing utility can be found in the books 'The AWK Programming Language' by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger (1988), and 'UNIX and Shell Programming' by Behrouz A. Forouzan and Richard F. Gilberg (Cengage Learning).

Figure 4: Regular Expression in AWK

Figure 5: Action statements
