Linux Format

Awk, sed: Better text processing

Pull critical data from log files with our collection of power tips for text processing.

- Andrew Mallett is a Linux trainer with over 700 videos on YouTube (http://bit.ly/UrbPeng). You’ll also find his courses on www.pluralsight.com.

Assuming that you’ve been reading recent issues of Linux Format, you should be familiar with Awk, since Neil Bothwick has already provided a great introduction to the language [see Tutorials, p74, LXF191]. In this article, we will explore how practical it can be for processing server log data and configuration files.

An introduction to text processing

Before we do that, let’s demonstrate the power of text processing with a quick example using the utility tool grep. You probably already know that you can show defined shell functions using the command declare -f. When you use this command, the output lists the complete function definition, including the name. We have the option of using declare -F to list just the function names, but annoyingly, the output includes declare -f preceding each function name. The grep command can filter the output for us. In order to do this, we simply use:

declare -f | grep ^[a-z_]

We take the standard output from declare -f and filter it with grep. The regular expression we use states that we should only display lines that begin with (this is specified by the caret symbol) a lowercase character or an underscore (which is specified within the square brackets). Code within a function will be tabbed in and, as such, will not start with a letter or underscore. Only lines that include the function name match the filter, so the output is exactly as desired – a list of function names.
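
As a quick sanity check, you can define a throwaway function and watch the filter at work (backup_home is purely hypothetical here):

backup_home () { tar -czf /tmp/home.tar.gz "$HOME" ; }
declare -f | grep ^[a-z_]

You should see backup_home () in the list, alongside any other functions your shell has loaded.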

Using sed to power your Dockerfiles

Let’s move on to a more complex example. Rather than diving straight into Awk, we’ll start by using the sed (Stream EDitor) utility to process Dockerfiles.

As Jolyon Brown explained in a previous issue [see Tutorials, p80, LXF191], a Dockerfile can be used to build Docker images. It could start with a base image of Ubuntu and add the SSH server, for example, or start with a CentOS base image and install Apache. However, in both of these cases we will need to edit the configurations of the given service. First, let’s take a look at a Dockerfile that could be used to create an SSH server application container:

FROM ubuntu
RUN apt-get update && apt-get install -y openssh-server
RUN mkdir /var/run/sshd
RUN echo 'root:Password1' | chpasswd
RUN sed -i 's/PermitRootLogin without-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN sed -i 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' /etc/pam.d/sshd
EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]

During the build process of our new custom image we include two RUN lines that execute sed code. Both use the sed command to substitute one text string with another, but the formatting is slightly different.

The first case uses traditional forward slashes to delineate the first string, which it replaces with the second string. The basic syntax that we use to substitute text in a file is:

sed -i 's/String/Replace/' /etc/ssh/sshd_config

Using the option -i allows the file to be edited in place rather than the result being sent to STDOUT. As you can see, we search through the file /etc/ssh/sshd_config and substitute the line that contains PermitRootLogin without-password with PermitRootLogin yes. The original setting is used, by default, to disable password-based login for the root user; in such a case, root may only log in using public-key authentication. We want to be able to log in as root using the password that we set earlier on within the Dockerfile. Using sed in this manner, it’s easy to make the needed change in configuration.
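
If you want to check a substitution like this before committing it, drop the -i so the result goes to STDOUT and the file is left untouched – a harmless dry run:

sed 's/PermitRootLogin without-password/PermitRootLogin yes/' /etc/ssh/sshd_config | grep PermitRootLogin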

In the second example, from the same Dockerfile, we see that we can use delimiters other than the forward slash. In this case we use the @ symbol. We’ve chosen this because the backslashes within the regular expression pattern that makes up the first string make the information easier to read if we use an alternative delimiter. The basic syntax now becomes:

sed -i 's@String@Replace@' /etc/pam.d/sshd

When we take a look at the working example from the Dockerfile, we use a regular expression as the string to be replaced:

session\s*required\s*pam_loginuid.so

The \s matches any whitespace character and we use the * quantifier to indicate that any number of whitespace characters can be used. This takes care of instances where there are, say, two spaces between each word, or other spacing characters, such as tabs. The regular expression will match either, making the number or type of whitespace characters unimportant. The replacement string is more easily read as it’s a standard string with standard spacing. The purpose of the change to the PAM file here is to ensure that we can still successfully connect even if auditing is required. This is a minimal configuration and all the required elements may not be present; setting the module to optional means we do not consider the success or failure of the PAM module.
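
To convince yourself that the pattern copes with mixed whitespace, feed sed a tab-separated version of the line by hand (this assumes GNU sed, where \s is supported):

printf 'session\trequired\tpam_loginuid.so\n' | sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@'

The tabs are swallowed by \s* and the substituted line comes back with standard single spaces.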

We can see that using sed in this instance has provided a relatively simple mechanism to edit configurat­ion files during the build process of Docker images. Where changes are quite minimal this is preferable to uploading completely new configurat­ions during the build process.

Similarly, we can delete lines from files as well as substituting the line contents; it’s just a matter of using the command d (for delete) instead of s (for substitute). However, using the delete command requires that we specify the range of lines to work with, whereas previously we worked with the complete file, line by line. The range is specified before the d with a forward slash at the start and the end of the pattern. These delimiters must be forward slashes, unlike the string delimiters we used previously with the substitute command.

In the following example, we create the Docker image from the CentOS 6 base installation, install the Apache HTTPD server and remove an unneeded module from the web server configuration:

FROM centos:centos6
RUN yum install -y httpd
RUN sed -i '/LoadModule\s*userdir_module/d' /etc/httpd/conf/httpd.conf
RUN echo "Welcome to My Site" > /var/www/html/index.html
EXPOSE 80
ENTRYPOINT ["/usr/sbin/httpd", "-DFOREGROUND"]
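
To try the image out, build and run it with the exposed port mapped to the host (mysite is a hypothetical tag):

docker build -t mysite .
docker run -d -p 8080:80 mysite
curl http://localhost:8080

The curl request should return the Welcome to My Site page created during the build.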

Of course, not all of you will be using Docker – at least, not at the moment. (I am sure that given enough time, we can convince you of the benefits.) However, sed can be used in many other contexts.

One way that I often use sed is to supplement -i, for the in-place edit, with an extension enabling a backup prior to that edit. Many configuration files in Linux are splattered with comments and extra blank lines. Although I am not opposed to comments, this can make understanding the configuration a lot more difficult – and, in some cases, can encourage you to duplicate a setting as it isn’t easy to see where it was previously set.

A simple illustration of this point is the file /etc/ntp.conf. This is the time server configuration, and has 53 lines on my CentOS 6 box; however, only 11 lines actually do anything. While this is not a particularly extreme case, it highlights the problem. I would always create a backup of the file first, which then becomes my commented file while the cleaned original file becomes the working configuration:

sed -i.commented '/^#/d;/^$/d' /etc/ntp.conf

Here, sed uses two expressions, separated using the semicolon (;). The first expression deletes lines that start with a # – that is, commented lines. The second expression deletes blank lines or, as represented by the regular expression ^$, lines that begin with an end-of-line marker. When this is run as the root user, we reduce the contents of ntp.conf to 11 lines and keep the original file. The original file, with all the comments and extra lines intact, is now called /etc/ntp.conf.commented.
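
As with the substitutions earlier, it pays to test a destructive edit without the -i first, so the cleaned output appears on the console and the file stays untouched. A quick pipe through wc confirms what would survive:

sed '/^#/d;/^$/d' /etc/ntp.conf | wc -l

On my box this should report the 11 active lines mentioned above.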

Note the use of the extension that immediately follows the -i option. There can be no extra whitespace between the option and the file extension you wish to add.

Awesome Awk

If sed is grep’s big brother, you could say that Awk is the daddy of them both. In his earlier article [Tutorials, p74, LXF191], Neil provided an introduction to Awk and its capabilities. Here, we’ll put those capabilities to use. First, we will see how we can use Awk to enhance the output of the lastlog command before moving on to processing XML and then large text files to summarise logs.

To start with, we will need to make sure that we are familiar with lastlog. If we use lastlog without arguments, it will display the last login time of all accounts, including service accounts that have never logged in. The output is a little cluttered, to say the least. Or we can use the command with options to display the last login time for just one user: lastlog -u bob, for example. Alternatively, we could display only user accounts that have not logged in within the last 90 days: lastlog -b 90.

This is great but it still displays accounts that have never logged in. Ideally, we would like a report that printed just the account name and last login date, as well as excluding those accounts marked as having never logged in.

Initially, we will simply use Awk to filter out accounts that have never logged in. This achieves little more than we could do with grep, but it illustrates the way in which Awk can be used to invert the search:

lastlog | awk '!/Never/ { print }'

We send the output from lastlog directly to Awk. The Awk statement starts with a range. We specify that range to be the inverse of rows that contain the string Never; in other words, we will exclude rows that include the string Never. The Awk body then just prints each row that matches the range, so we will see all accounts that have logged in at least once.
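
For comparison, the equivalent grep simply inverts the match with -v:

lastlog | grep -v Never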

Alternatively, we could extend the range to exclude the root account and to remove the header line Username:

lastlog | awk '!(/Never/ || /^root/ || /^Username/) { print }'

We use the parentheses here to group the patterns together so they can be negated as one. The two vertical bars (||) mean a logical OR. We don’t process lines that either contain the string Never, or start with root or Username. Even though these multiple exclusions could be written as a grep statement, we have already passed into the area where Awk achieves results more simply.
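
For the record, the grep version needs one -e per pattern on top of the -v inversion, and it is already harder on the eye:

lastlog | grep -v -e Never -e '^root' -e '^Username'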

What we have managed so far is okay for a single line of command-line code, but we are going to need to become a little more adventurous with Awk if we want to create a really desirable result. Let’s start by creating an Awk file that reduces the amount of syntax that we have to type on the command line to make reuse of the code easier. This will demonstrate some valuable Awk techniques. The Awk file we shall be working with is as follows:

BEGIN {
  printf "%8s %11s\n","Username","Login date"
  print "===================="
}
!(/Never logged in/ || /^Username/ || /^root/) {
  cnt++
  if (NF == 8)
    printf "%8s %2s %3s %4s\n", $1,$5,$4,$8
  else
    printf "%8s %2s %3s %4s\n", $1,$6,$5,$9
}
END {
  print "===================="
  print "Total Number of Users Processed : ", cnt
}

To run this, we just need to be in the same folder as the Awk file. Here are two examples of the ways in which we could use it:

lastlog | awk -f lastlog.awk
lastlog -b 60 | awk -f lastlog.awk

The first example processes all users; the second example processes just those who haven’t logged in for the last 60 days. You can start to appreciate the power of Awk and its data processing and formatting abilities when you compare the output of lastlog -b 60 with this example.
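
If the report earns a permanent place in your toolkit, you can make the file directly executable by adding a shebang as its first line (assuming awk lives at /usr/bin/awk on your distribution):

#!/usr/bin/awk -f

Then set the execute bit and drop the explicit awk -f from the command line:

chmod +x lastlog.awk
lastlog -b 60 | ./lastlog.awk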

The Awk file itself contains three sections. The first and last are named rather fittingly: BEGIN and END. The main body section is unnamed. The BEGIN and END sections run just once, whereas the main body runs for each line in the matched range.

The BEGIN section is where we can set variables, such as delimiters, if required; or, as is the case here, print heading information. Using printf rather than just print enables us to format the information as needed.

The END section is used to produce a footer and usually has summary information. Here we print the number of users processed by looking at the value of the cnt variable that’s incremented in the body.

Now comes the main body. Here, we are able to see many elements that the Awk language supplies. The main body itself is defined within the braces (the curly brackets). Immediately prior to these brackets, we define the range in the way that we discussed earlier. The main body only works on the lines that match the criteria we set in the range.

The first line of the body defines and increments the variable cnt. We use this as our counter within the END code. On the first iteration the variable will be undefined and as such will effectively have a value of 0, which we increment to set a value of 1. The next matching row will take us to 2, and so on.

We implemented the if (NF == 8) statement to ensure we print the correct fields. Logins from remote clients include nine fields and those from local consoles only include eight. The number of fields within a row is held in the NF variable. The printf statements then print the required fields depending on whether we have eight or nine fields in a row.
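
You can verify those field counts on your own machine with a quick one-liner that prefixes each matching line with its NF value:

lastlog | awk '!/Never/ { print NF, $0 }'

Local console logins should show 8 at the start of the line and remote logins 9.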

Using Awk to process XML data

Next, we’ll look at accessing XML data with Awk. Along the way, we will discover that although the default record that we look at with Awk is a single line, we can adjust the RS variable to make a record span more than one line.

In this scenario, we are storing Apache web server Virtual Host information within a single configuration file, but we need to be able to print single and complete virtual host records for any given host. A virtual host definition begins with an opening tag similar to <VirtualHost *:80> and closes with the ending tag </VirtualHost>. For the example to work, we need to ensure that we have a blank line between each new Virtual Host and the previous host’s ending. If this is not the case, we can use sed to insert a new blank line after each </VirtualHost>. We will assume that the virtual hosts are all defined in the file virtualhost.conf and that blank lines do not exist after each definition. The following code will edit the file for you, adding the blank lines (the G command appends sed’s hold space, which is empty by default, so each matched line gains a trailing blank line):

sed -i '/<\/VirtualHost>/G' virtualhost.conf

The example virtual host file that we will be working with looks like this:

<VirtualHost *:80>
DocumentRoot /www/example
ServerName www.example.org
# Other directives here
</VirtualHost>

<VirtualHost *:80>
DocumentRoot /www/theurbanpenguin
ServerName www.theurbanpenguin.com
# Other directives here
</VirtualHost>

<VirtualHost *:80>
DocumentRoot /www/linuxformat
ServerName www.linuxformat.com
# Other directives here
</VirtualHost>

Now that we have the correctly formatted file, we can use the following Awk file, vh.awk, to enable us to search for named entries:

BEGIN { FS = "<\/VirtualHost>"; RS = "\n\n" }
$0 ~ searchstring { print }

The BEGIN block defines the field delimiter as the closing Virtual Host tag. This delimits entries in each record. A record is normally represented by a single line, but we change that to end at two consecutive newlines – that is, a blank line. The main block will print records, now defined as the complete Virtual Host definition, by comparing each record ($0) against a variable that we will populate at runtime (searchstring). The Awk code to run this would be similar to this:

awk -f vh.awk searchstring=www.example.org virtualhost.conf

Note that we supply the value to the variable at runtime. The corresponding result should look similar to this:

<VirtualHost *:80>
DocumentRoot /www/example
ServerName www.example.org
# Other directives here
</VirtualHost>
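
Equivalently, the variable can be set with awk’s -v option:

awk -v searchstring=www.example.org -f vh.awk virtualhost.conf

The difference is subtle: a var=value assignment among the file arguments only takes effect when awk reaches that argument, whereas -v assigns before any input is read, so the value would also be available inside a BEGIN block.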

Analyse log files with Awk

Finally, let’s look at how we can leverage the power of Awk to read through a web server access log and print the number of times each client has accessed the web server. The first field in an access log defines the client IP. We can utilise Awk arrays to count the accesses of each client. We will work with a log file that has 30,000 lines: a typical real-life example.

We will need an Awk file again – as we have seen, this is quite normal. This time we call it count.awk:

BEGIN { print "Log access" }
{ ip[$1]++ }
END {
  for (i in ip)
    print i, " has accessed ", ip[i], " times."
}

The BEGIN block simply prints the header information. The main block adds an element to the array ip for each new value of field 1, the client IP address. In this way, we have an element in the array named after each client IP address used to access the server. The value of the individual element is incremented each time the field is matched. This time it’s the END block that does most of the work, utilising a for loop to iterate through each named element of the array ip and print its value. When you use the command:

awk -f count.awk access.log

you may expect an extract of the output to look similar to that below. Bearing in mind that the data was from a production server, we have modified the first octet of the client IP addresses:

xxx.157.100.28 has accessed 1 times.
xxx.180.86.233 has accessed 10 times.
xxx.241.226.216 has accessed 2 times.
xxx.99.52.100 has accessed 12 times.
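
With 30,000 lines boiled down to a handful of totals, ranking the busiest clients is one more pipe away; the count sits in the fourth whitespace-separated field of our output:

awk -f count.awk access.log | grep accessed | sort -k4,4nr | head

The grep drops the header line and sort orders the report numerically, in reverse, on the count.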

It’s a simple matter to edit the Awk file to display the HTTP access code, which is field 9 of the log. In this way we can see the amount of web access to the server during the period the log covers. The output from my log showed these results:

Log access
The access code: 200 has occurred 23825 times.
The access code: 206 has occurred 48 times.
The access code: 301 has occurred 60 times.
The access code: 302 has occurred 21 times.
The access code: 304 has occurred 2273 times.
The access code: 403 has occurred 133 times.
The access code: 404 has occurred 4382 times.
The access code: 501 has occurred 63 times.
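
For reference, producing that report takes only a small change to count.awk, swapping the array key from field 1 to field 9 and adjusting the wording (a sketch matching the output above):

BEGIN { print "Log access" }
{ code[$9]++ }
END {
  for (c in code)
    print "The access code:", c, "has occurred", code[c], "times."
}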

The 403 errors are forbidden requests, where security was needed and access was refused; 404 errors, as you probably all know, are page not found; 2xx codes indicate success; 3xx codes are normally redirections; and 5xx codes are server-side errors, such as a failing CGI script. Processing 30,000 lines takes seconds with Awk, showing how easily we can start to assimilate the information.

Using sed to clean this unwanted clutter is easily done; as is deleting too much – so test without the -i first and the file will be left untouched. Output will only be shown in the console.
The standard CentOS ntp.conf includes many blank and commented lines, making it difficult to see the wood for the trees. We can fix this using sed.
As you discover how useful Awk is to customise the output of commands to meet your needs, you will create a plethora of tools using it.
It becomes easy to emulate other tools, such as grep, using Awk.
