Awk, sed: Better text processing
Pull critical data from log files with our collection of power tips for text processing.
Assuming that you’ve been reading recent issues of Linux Format, you should be familiar with Awk, since Neil Bothwick has already provided a great introduction to the language [see Tutorials, p74, LXF191]. In this article, we will explore how practical it can be for processing server log data and configuration files.
An introduction to text processing
Before we do that, let’s demonstrate the power of text processing with a quick example using the utility tool grep. You probably already know that you can show defined shell functions using the command declare -f. When you use this command, the output lists the complete function definition, including the name. We have the option of using declare -F to list just the function names, but annoyingly, the output includes declare -f preceding each name. The grep command can filter the output for us. To do this, we simply use:

declare -f | grep '^[a-z_]'
We take the standard output from declare -f and filter it with grep. The regular expression states that we should only display lines that begin with (this is specified by the caret symbol) a lowercase character or an underscore (specified within the square brackets). Code within a function will be indented and, as such, will not start with a letter or underscore. Only lines that include the function name match the filter, so the output is exactly as desired – a list of function names.
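As a quick check of the idea, this sketch (assuming Bash, since declare is a Bash built-in) defines two throwaway functions and filters declare -f down to just the name lines:

```shell
#!/bin/bash
# Throwaway functions so declare -f has something to show.
greet() { echo "hello"; }
_helper() { echo "internal"; }

# Function bodies are indented and braces sit on their own lines,
# so only the lines carrying the function names survive the filter.
declare -f | grep '^[a-z_]'
```

Running this prints just the two function-name lines; nothing from the function bodies gets through.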
Using sed to power your Dockerfiles
Let’s move on to a more complex example. Rather than diving straight in to Awk, we’ll start by using the sed (Stream EDitor) utility to process Dockerfiles.
As Jolyon Brown explained in a previous issue [see Tutorials, p80, LXF191], a Dockerfile can be used to build Docker images. It could start with a base image of Ubuntu and add the SSH server, for example, or start with a CentOS base image and install Apache. However, in both of these cases we will need to edit the configuration of the given service. First, let’s take a look at a Dockerfile that could be used to create an SSH server application container:

FROM ubuntu
RUN apt-get update && apt-get install -y openssh-server
RUN mkdir /var/run/sshd
RUN echo 'root:Password1' | chpasswd
RUN sed -i 's/PermitRootLogin without-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN sed -i 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' /etc/pam.d/sshd
EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]
During the build process of our new custom image we include two RUN lines that execute sed code. Both use the sed command to substitute one text string with another, but the formatting is slightly different.
The first case uses traditional forward slashes to delimit the first string, which it replaces with the second string. The basic syntax that we use to substitute text in a file is:

sed -i 's/String/Replace/' /etc/ssh/sshd_config
Using the option -i allows the file to be edited in place rather than the result being sent to STDOUT. As you can see, we search through the file /etc/ssh/sshd_config and substitute the line that contains PermitRootLogin without-password with PermitRootLogin yes. The original setting is used, by default, to disable password-based login for the root user; in such a case, root may only log in using public-key authentication. We want to be able to log in as root using the password that we set earlier on within the Dockerfile. Using sed in this manner, it’s easy to make the required configuration change.
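If you would like to try the substitution without touching a real sshd_config, this sketch applies the same edit to a scratch copy (the /tmp path is purely illustrative, and GNU sed is assumed for the in-place -i option):

```shell
# Build a scratch file containing the stock setting.
printf 'Port 22\nPermitRootLogin without-password\n' > /tmp/sshd_config.demo

# The same substitution used in the Dockerfile, applied in place.
sed -i 's/PermitRootLogin without-password/PermitRootLogin yes/' /tmp/sshd_config.demo

grep PermitRootLogin /tmp/sshd_config.demo
```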
In the second example, from the same Dockerfile, we see that we can use delimiters other than the forward slash. In this case we use the @ symbol. We've chosen this because the backslashes within the regular expression pattern that makes up the first string make the information easier to read if we use an alternative delimiter. The basic syntax now becomes:

sed -i 's@String@Replace@' /etc/pam.d/sshd
When we take a look at the working example from the Dockerfile, we use a regular expression as the string to be replaced:
session\s*required\s*pam_loginuid.so
The \s matches any whitespace character, and we use the * quantifier to indicate that any number of whitespace characters may appear. This takes care of instances where there are, say, two spaces between each word, or other spacing characters, such as tabs. The regular expression will match either, making the number or type of whitespace unimportant. The replacement string is more easily read, as it’s a standard string with standard spacing. The purpose of the change to the PAM file here is to ensure that we can still successfully connect even if auditing is required. This is a minimal configuration and all the required elements may not be present; setting the module to optional means that the success or failure of the PAM module is not considered.
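You can watch \s* coping with mixed spacing by piping a sample line straight into sed (a sketch relying on GNU sed, which supports the \s extension):

```shell
# Two spaces in one gap and a tab in the other - both are matched by
# \s* and rewritten to the same tidy replacement.
printf 'session  required\tpam_loginuid.so\n' | \
  sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@'
```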
We can see that using sed in this instance has provided a relatively simple mechanism to edit configuration files during the build process of Docker images. Where changes are quite minimal this is preferable to uploading completely new configurations during the build process.
Similarly, we can delete lines from files as well as substituting their contents; it's just a matter of using the command d (for delete) instead of s (for substitute). However, the delete command is normally given an address that selects the lines to work on, whereas previously we worked through the complete file, line by line. The address is specified before the d, enclosed in forward slashes at the start and the end; unlike the string delimiters we used previously with the substitute command, these are forward slashes by default.
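A quick sketch of the delete command on a scratch file (GNU sed assumed, and the module lines are just samples) shows the matched line disappearing while the rest of the file is untouched:

```shell
# Two module lines; only the userdir one matches the /regex/ address.
printf 'LoadModule userdir_module modules/mod_userdir.so\nLoadModule mime_module modules/mod_mime.so\n' > /tmp/demo.conf

# Delete every line matching the address, in place.
sed -i '/LoadModule\s*userdir_module/d' /tmp/demo.conf

cat /tmp/demo.conf
```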
In the following example, we create the Docker image from the CentOS 6 base installation, install the Apache HTTPD server and remove an unneeded module from the web server configuration:

FROM centos:centos6
RUN yum install -y httpd
RUN sed -i '/LoadModule\s*userdir_module/d' /etc/httpd/conf/httpd.conf
RUN echo "Welcome to My Site" > /var/www/html/index.html
EXPOSE 80
ENTRYPOINT ["/usr/sbin/httpd", "-DFOREGROUND"]
Of course, not all of you will be using Docker – at least, not at the moment. (I am sure that given enough time, we can convince you of the benefits.) However, sed can be used in many other contexts.
One way that I often use sed is to supplement -i, for the in-place edit, with an extension enabling a backup prior to that edit. Many configuration files in Linux are splattered with comments and extra blank lines. Although I am not opposed to comments, this can make understanding the configuration a lot more difficult – and, in some cases, can encourage you to duplicate a setting as it isn’t easy to see where it was previously set.
A simple illustration of this point is the file /etc/ntp.conf. This is the time server configuration, and has 53 lines on my CentOS 6 box; however, only 11 lines actually do anything. While this is not a particularly extreme case, it highlights the problem. I would always create a backup of the file first, which then becomes my commented file while the cleaned original file becomes the working configuration:
sed -i.commented '/^#/d;/^$/d' /etc/ntp.conf
Here, sed uses two expressions, separated using the semicolon (;). The first expression deletes lines that start with a # – that is, commented lines. The second expression deletes blank lines or, as represented by the regular expression ^$, lines where the start of the line is immediately followed by the end of the line. When this is run as the root user, the contents of ntp.conf are reduced to 11 lines and the original file is kept. The original file, with all the comments and extra lines intact, is now called /etc/ntp.conf.commented.
Note the use of the extension that immediately follows the -i option. There can be no extra white space between the option and the file extension you wish to add.
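On a scratch file, the backup-suffix form looks like this (the /tmp paths are illustrative, and GNU sed is assumed):

```shell
# A tiny config with a comment and a blank line.
printf '# time servers\n\nserver 0.pool.ntp.org\n' > /tmp/ntp.conf.demo

# Clean in place, keeping the original as /tmp/ntp.conf.demo.commented.
sed -i.commented '/^#/d;/^$/d' /tmp/ntp.conf.demo

cat /tmp/ntp.conf.demo
ls /tmp/ntp.conf.demo.commented
```

The cleaned file holds only the working line, while the .commented backup keeps everything.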
Awesome Awk
If sed is grep’s big brother, you could say that Awk is the daddy of them both. In his earlier article [Tutorials, p74, LXF191], Neil provided an introduction to Awk and its capabilities. Here, we’ll put those capabilities to use. First, we will see how we can use Awk to enhance the output of the lastlog command before moving on to processing XML and then large text files to summarise logs.
To start with, we will need to make sure that we are familiar with lastlog. If we use lastlog without arguments, it will display the last login time of all accounts, including service accounts that have never logged in. The output is a little cluttered, to say the least. Or we can use the command with options to display the last login time for just one user: lastlog -u bob, for example. Alternatively, we could display only user accounts that have not logged in within the last 90 days: lastlog -b 90.
This is great but it still displays accounts that have never logged in. Ideally, we would like a report that printed just the
account name and last login date, as well as excluding those accounts marked as having never logged in.
Initially, we will simply use Awk to filter accounts that have never logged in. This achieves little more than we could do with grep, but it illustrates the way in which Awk can be used to inverse the search:
lastlog | awk '!/Never/ { print }'
We send the output from lastlog directly to Awk. The Awk statement starts with a range. We specify that range to be the inverse of rows that contain the string Never; in other words, we will exclude rows that include the string Never. The Awk body then just prints each row that matches the range, so we will see all accounts that have logged in at least once.
Alternatively, we could extend the range to exclude the root account and to remove the header line Username:
lastlog | awk '!(/Never/ || /^root/ || /^Username/) { print }'
We use the parentheses here to group the two ranges together so they can be negated as one. The two vertical bars (||) mean a logical OR. We don’t process lines that either contain the string Never, or start with root or Username. Even though these multiple exclusions could be written as a grep statement, we have already reached the point where Awk achieves the result more simply.
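You can test the combined range without a real lastlog by piping a few canned rows through Awk (the usernames and dates below are made up, but mimic lastlog's layout):

```shell
# Header, root, a normal user and a never-logged-in account;
# only the normal user should survive the negated range.
printf 'Username         Port     Latest\nroot             tty1     Mon Jan  5\nbob              pts/0    Tue Jan  6\ndaemon           **Never logged in**\n' | \
  awk '!(/Never/ || /^root/ || /^Username/) { print }'
```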
What we have managed so far is okay for a single line of command-line code, but we are going to need to become a little more adventurous with Awk if we want to create a really desirable result. Let’s start by creating an Awk file that reduces the amount of syntax we have to type on the command line and makes reuse of the code easier. This will demonstrate some valuable Awk techniques. The Awk file we shall be working with, lastlog.awk, is as follows:

BEGIN {
    printf "%8s %11s\n","Username","Login date"
    print "===================="
}
!(/Never logged in/ || /^Username/ || /^root/) {
    cnt++
    if (NF == 8)
        printf "%8s %2s %3s %4s\n", $1,$5,$4,$8
    else
        printf "%8s %2s %3s %4s\n", $1,$6,$5,$9
}
END {
    print "===================="
    print "Total Number of Users Processed : ", cnt
}
To run this, we just need to be in the same folder as the Awk file. Here are two examples of the ways in which we could use it:

lastlog | awk -f lastlog.awk
lastlog -b 60 | awk -f lastlog.awk
The first example processes all users; the second example processes just those who haven’t logged in for the last 60 days. You can start to appreciate the power of Awk and its data processing and formatting abilities when you compare the output of lastlog -b 60 with this example.
The Awk file itself contains three sections. The first and last are named rather fittingly: BEGIN and END. The main body section is unnamed. The BEGIN and END sections run just once, whereas the main body runs for each line in the matched range.
The BEGIN section is where we can set variables, such as delimiters, if required – or, as is the case here, print heading information. Using printf rather than just print enables us to format the information as needed.
The END section is used to produce a footer and usually has summary information. Here we print the number of users processed by looking at the value of the cnt variable that’s incremented in the body.
Now comes the main body. Here, we are able to see many of the elements that the Awk language supplies. The main body itself is defined within the braces (the curly brackets). Immediately prior to these braces, we define the range in the way that we discussed earlier. The main body only works on the lines that match the criteria we set in the range.
The first line of the body defines and increments the variable cnt, which we use as our counter within the
END code. On the first iteration the variable will be undefined and as such will effectively have a value of 0, which we increment to set a value of 1. The next matching row will take us to 2, and so on.
We implemented the if (NF == 8) statement to ensure we print the correct fields. Logins from remote clients include nine fields, while those from local consoles include only eight. The number of fields within a row is held in the NF variable. The two printf statements print the required fields, depending on whether we have eight or nine fields in the row.
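To see NF in action, compare a console-style row with a remote-style row (both rows are fabricated, but follow the field layout described above):

```shell
# A console login has 8 fields; a remote login adds a host field for 9.
printf 'bob tty1 Mon Jan 5 10:00:00 +0000 2015\nsue pts/0 host1 Tue Jan 6 09:00:00 +0000 2015\n' | \
  awk '{ print $1, "has", NF, "fields" }'
```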
Using Awk to process XML data
Next, we’ll look at accessing XML data with Awk. Along the way, we will discover that although the default record that we look at with Awk is a single line, we can adjust the RS variable to make a record more than one line.
In this scenario, we are storing Apache web server virtual host information within a single configuration file, but we need to be able to print single, complete virtual host records for any given host. A virtual host definition begins with an opening tag similar to <VirtualHost *:80> and closes with the ending tag </VirtualHost>. For the example to work, we need to ensure that we have a blank line between each new virtual host and the previous host's ending. If this is not the case, we can use sed to insert a blank line after each </VirtualHost>. We will assume that the virtual hosts are all defined in the file virtualhost.conf and that blank lines do not exist after each definition. The following code will edit the file for you, adding the blank lines:
sed -i '/<\/VirtualHost>/G' virtualhost.conf
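The G command appends a blank line after each matching line, as this small pipeline sketch shows:

```shell
# Only the closing-tag line gains a following blank line;
# the opening tag on the next line is left alone.
printf '</VirtualHost>\n<VirtualHost *:80>\n' | sed '/<\/VirtualHost>/G'
```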
The example virtual host file that we will be working with looks like this:

<VirtualHost *:80>
DocumentRoot /www/example
ServerName www.example.org
# Other directives here
</VirtualHost>

<VirtualHost *:80>
DocumentRoot /www/theurbanpenguin
ServerName www.theurbanpenguin.com
# Other directives here
</VirtualHost>

<VirtualHost *:80>
DocumentRoot /www/linuxformat
ServerName www.linuxformat.com
# Other directives here
</VirtualHost>
Now that we have the correctly formatted file, we can use the following Awk file, vh.awk, to enable us to search for named entries:

BEGIN { FS = "</VirtualHost>"; RS = "\n\n" }
$0 ~ searchstring { print }
The BEGIN block defines the field delimiter as the closing virtual host tag, which delimits the entries within each record. A record is normally represented by a line, but we change that to be two consecutive newlines. The main block will print records – now defined as the complete virtual host definition – by comparing each record ($0) against a variable that we populate at runtime (searchstring). The Awk code to run this would be similar to this:
awk -f vh.awk searchstring=www.example.org virtualhost.conf
Note that we supply the value to the variable at runtime. The corresponding result should look similar to this:

<VirtualHost *:80>
DocumentRoot /www/example
ServerName www.example.org
# Other directives here
</VirtualHost>
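Putting the pieces together as a self-contained sketch (GNU Awk is assumed here, since it treats a multi-character RS as a regular expression; the scratch file and hostnames are illustrative):

```shell
# A scratch two-host configuration, with the blank separator lines in place.
cat > /tmp/virtualhost.conf.demo <<'EOF'
<VirtualHost *:80>
DocumentRoot /www/example
ServerName www.example.org
</VirtualHost>

<VirtualHost *:80>
DocumentRoot /www/other
ServerName www.other.org
</VirtualHost>
EOF

# Pull out just the record naming www.other.org.
awk 'BEGIN { FS = "</VirtualHost>"; RS = "\n\n" } $0 ~ searchstring { print }' \
  searchstring=www.other.org /tmp/virtualhost.conf.demo
```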
Analyse log files with Awk
Finally, let’s look at how we can leverage the power of Awk to read through a web server access log and print the number of times each client has accessed the web server. The first field in an access log holds the client IP address. We can utilise Awk arrays to count the accesses of each client. We will work with a log file that has 30,000 lines: a typical real-life example.
We will need an Awk file again – as we have seen, this is quite normal. This time we call it count.awk:

BEGIN { print "Log access" }
{ ip[$1]++ }
END {
    for (i in ip)
        print i, " has accessed ", ip[i], " times."
}
The BEGIN block simply prints the header information. The main block creates an element in the array ip for each client IP address seen in field 1, and increments that element's value each time the address appears. This time it’s the
END block that does most of the work, utilising a for loop to iterate though each named element of the array ip and print its value. When you use the command:
awk -f count.awk access.log

you may expect an extract of the output to look similar to that below. Bearing in mind that the data was from a production server, we have modified the first octet of the client IP addresses:

xxx.157.100.28 has accessed 1 times.
xxx.180.86.233 has accessed 10 times.
xxx.241.226.216 has accessed 2 times.
xxx.99.52.100 has accessed 12 times.
It’s a simple matter to edit the Awk file to display the HTTP access code, which is field 9 of the log. In this way we can see a summary of web access to the server during the period the log covers. The output from my log showed these results:

Log access
The access code: 200 has occurred 23825 times.
The access code: 206 has occurred 48 times.
The access code: 301 has occurred 60 times.
The access code: 302 has occurred 21 times.
The access code: 304 has occurred 2273 times.
The access code: 403 has occurred 133 times.
The access code: 404 has occurred 4382 times.
The access code: 501 has occurred 63 times.
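A variant of the counter that tallies field 9 instead of field 1 can be sketched inline; the three log lines below are fabricated, but follow the combined log layout where the status code lands in field 9:

```shell
# $9 is the status code in the common/combined log format:
# ip - - [date zone] "METHOD path protocol" status size
printf '%s\n' \
  '1.2.3.4 - - [10/Oct/2015:13:55:36 +0000] "GET / HTTP/1.1" 200 99' \
  '1.2.3.4 - - [10/Oct/2015:13:55:37 +0000] "GET /a HTTP/1.1" 404 0' \
  '5.6.7.8 - - [10/Oct/2015:13:55:38 +0000] "GET / HTTP/1.1" 200 99' | \
  awk '{ code[$9]++ } END { for (c in code) print "The access code:", c, "has occurred", code[c], "times." }'
```

Note that for (c in code) visits the array elements in no particular order; pipe the result through sort if you want the codes in sequence.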
The 403 errors are forbidden requests, where security was needed and access was refused; error 404s, as you probably all know, are page not found; 2xx codes indicate success; 3xx codes are normally redirections; and 5xx codes indicate server errors. Processing 30,000 lines takes seconds with Awk, showing how easily we can start to assimilate the information.