Linux Format

Build the wc command

Mihalis Tsoukalos shows you how you can develop a handy system tool for accessing text file information in Python 3 that will make your life easier.

- Mihalis Tsoukalos (@mactsouk) has an M.Sc. in IT from UCL and a B.Sc. in Mathematics, which makes him a DB-admining, software-coding, Unix-using, mathematical machine. You can reach him at www.mtsoukalos.eu.

Mihalis Tsoukalos shows you what you need to know to develop a handy system tool in Python 3 that will make your life easier, as we recreate the wc tool in our own image.

The subject of this tutorial is programming the famous wc command line utility in Python 3. This utility is a relatively simple tool with only three main command line switches, which we’ll also implement here. In other words, when you implement such a famous utility, it’s easy to decide what you want to support, so you don’t have to think a lot about the features your program will have.

One of the oldest Unix command-line utilities, wc, which is short for word count, enables you to quickly find out information about a text file. It counts the lines, words and characters of its input, which is usually one or more plain text files. Before you write in: we know the more recent GNU wc utility has more options than the original wc implementation.

Reading text files

The most important task of the implementation is being able to read a text file. The most convenient way to read the file is line by line, processing each line individually. The following Python 3 code opens a text file and processes it line by line:

    f = open(filename, 'r')
    for line in f:
        line = line.rstrip()
        print(line)

The rstrip() method is called in order to remove the newline character (and any other trailing whitespace) that’s present at the end of each line. So far then, you know how to read a text file line by line, which solves the problem of counting the lines of a text file. The following program, saved as countLines.py, implements this functionality:

    #!/usr/bin/env python3
    import os
    import sys

    filename = str(sys.argv[1])
    nLines = 0
    f = open(filename, 'r')
    for line in f:
        nLines = nLines + 1
    print(nLines)

You can also process a text file word by word using the following technique:

    f = open(filename, 'r')
    for word in f.read().split():
        print(word)

As wc has to count the number of words per line and not process them, you can use the following code instead:

    nWords = len(line.split())

The split() method separates the words of a line and the len() function returns the number of words returned by split(). Last, you can process a text file character by character using this Python 3 code:

    f = open(filename, 'r')
    # Note: split() discards whitespace, so only the characters of words are visited
    for word in f.read().split():
        for ch in word:
            print(ch)

However, in order to count the number of characters a line has, you don’t need to process it character by character, which would be too slow; you just have to get its length using the len() function.
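As a quick illustration of the two building blocks described above, here is a minimal sketch that applies len() and split() to a single line. The sample strings and variable names are made up for this example. The second half shows a point worth keeping in mind when comparing against the real wc: in text mode Python counts characters, which matches wc -m, whereas wc -c counts bytes, and the two only agree for plain ASCII text.

```python
# A minimal sketch; the sample strings below are assumptions for illustration.
line = "The quick  brown fox\n"

n_chars = len(line)                     # every character, newline included
n_chars_stripped = len(line.rstrip())   # without the trailing newline
n_words = len(line.split())             # split() collapses runs of whitespace

print(n_chars, n_chars_stripped, n_words)  # 21 20 4

# Characters and bytes differ as soon as the text is not plain ASCII.
accented = "na\u00efve caf\u00e9\n"
n_unicode_chars = len(accented)               # 11 characters
n_utf8_bytes = len(accented.encode("utf-8"))  # 13 bytes in UTF-8

print(n_unicode_chars, n_utf8_bytes)  # 11 13
```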

Creating wc

By combining the previous code, we can come up with a program that implements the central functionality of wc, which is character, word and line counting. This script will be called count3.py:

    #!/usr/bin/env python3
    import os
    import sys

    if len(sys.argv) >= 2:
        filename = str(sys.argv[1])
    else:
        print('Not enough arguments!')
        sys.exit(0)

    nLines = 0
    nWords = 0
    nChars = 0
    f = open(filename, 'r')
    for line in f:
        nLines = nLines + 1
        nChars = nChars + len(line)
        nWords = nWords + len(line.split())

    print('Lines:', nLines, 'Words:', nWords, 'Chars:', nChars)

Executing count3.py generates the following kind of output:

    $ ./count3.py count3.py
    Lines: 22 Words: 55 Chars: 385
    $ wc count3.py
    22 55 385 count3.py

The second command uses wc to verify that the count3.py script works correctly. Remember: always test your code! By testing your code when learning how to program, you can gain a better understanding of how the code works. If there is an error in your programming (and let’s face it, at some point there will be, especially when you’re starting out), then it’s better to catch it early. The next section will make the code of count3.py even better.
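One way to make this kind of checking repeatable, sketched here under our own assumptions, is to move the counting loop into a function that accepts any iterable of lines. The count_stream() name and the sample text are hypothetical and not part of count3.py. Because io.StringIO behaves like an open text file, the function can be tested without creating a file on disk.

```python
import io


def count_stream(stream):
    """Return (lines, words, chars) for any iterable of text lines."""
    n_lines = n_words = n_chars = 0
    for line in stream:
        n_lines += 1
        n_words += len(line.split())
        n_chars += len(line)
    return n_lines, n_words, n_chars


# A tiny in-memory "file" with a known answer: 2 lines, 5 words, 24 characters.
sample = io.StringIO("one two three\nfour five\n")
result = count_stream(sample)
assert result == (2, 5, 24)
print(result)  # (2, 5, 24)
```

The same function works unchanged with an open file object, so the result can still be compared with the output of wc.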

Reading from standard input

The wc utility can get its input from standard input. Therefore, you will need to learn how to do the same in Python 3. But, when do you need to read from standard input? The following script will read from standard input when there’s no filename given as a command line argument.

The other command line options will still be valid and working if they are present. The two simplest ways to create a pipe and pass the output of the first program to wc are the following:

    $ cat aTextFile | wc
    $ ls | wc

Currently, if you try to execute the next command, you will get an error message:

    $ cat count3.py | ./count3.py
    Not enough arguments!

An improved version of count3.py implements the desired functionality:

    #!/usr/bin/env python3
    import os
    import sys

    if len(sys.argv) >= 2:
        filename = str(sys.argv[1])
    else:
        filename = None

    nLines = 0
    nWords = 0
    nChars = 0
    if filename == None:
        for line in sys.stdin:
            nLines = nLines + 1
            nChars = nChars + len(line)
            nWords = nWords + len(line.split())
    else:
        f = open(filename, 'r')
        for line in f:
            nLines = nLines + 1
            nChars = nChars + len(line)
            nWords = nWords + len(line.split())

    print('Lines:', nLines, 'Words:', nWords, 'Chars:', nChars)

All of the work here is done by setting filename to None, which is a special value in the Python language that means that the filename variable has no value. Next you use sys.stdin to read from standard input as if it were a regular file. Now, you can use count3.py in two new ways (although the old one still works):

    $ cat count3.py | ./count3.py
    Lines: 27 Words: 78 Chars: 536
    $ ./count3.py
    1234
    Lines: 1 Words: 1 Chars: 5
    $ ./count3.py count3.py
    Lines: 27 Words: 78 Chars: 536
    $ cat count3.py count3.py | ./count3.py
    Lines: 54 Words: 156 Chars: 1072

Here we can see that the last command proves that count3.py can even process several files through standard input, because cat concatenates them into a single stream. However, count3.py will not read from standard input if it is given a filename to process, even if that file doesn’t exist:

    $ cat myWC.py | ./count3.py count3.py
    Lines: 27 Words: 78 Chars: 536
    $ wc myWC.py
    37 99 753 myWC.py
    $ cat myWC.py | ./count3.py count3
    Traceback (most recent call last):
      File "./count3.py", line 21, in <module>
        f = open(filename,'r')
    FileNotFoundError: [Errno 2] No such file or directory: 'count3'
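The traceback above is what an unhandled FileNotFoundError looks like. If you would rather print a short message, one possible approach is sketched below; it is not part of the original count3.py, and the open_or_exit() helper and its prog parameter are hypothetical names chosen for this example. It relies on the fact that FileNotFoundError is a subclass of OSError.

```python
import sys


def open_or_exit(filename, prog="count3.py"):
    """Open filename for reading, or print a short error and exit with status 1."""
    try:
        return open(filename, "r")
    except OSError as err:
        # err.strerror holds the system message, e.g. "No such file or directory"
        print("{}: {}: {}".format(prog, filename, err.strerror), file=sys.stderr)
        sys.exit(1)
```

In count3.py you would then write f = open_or_exit(filename) instead of calling open() directly. Exiting with a non-zero status also lets shell scripts detect the failure.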

Command line arguments

The original wc utility supports three main switches: -m for counting characters only, -l for counting lines only and -w for counting words only. As a result, our implementation should support these three switches as well. Dealing with more than two command line options without a module to help is silly.

The next section of Python 3 code, which is saved as comLine.py, shows you how to deal with both command line options and switches with the help of a very useful module called argparse:

    #!/usr/bin/env python3
    import os
    import sys
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("-m", default = False, action="store_true", help="Counting Characters", required=False)
    parser.add_argument("-l", default = False, action="store_true", help="Counting Lines", required=False)
    parser.add_argument("-w", default = False, action="store_true", help="Counting Words", required=False)
    parser.add_argument('filenames', default = None, help="Filenames", nargs='*')
    args = parser.parse_args()

    # With nargs='*', argparse stores an empty list when no filenames are
    # given, so this test is never true; the final version compares with [].
    if args.filenames == None:
        print('No filenames given!')
    else:
        for f in args.filenames:
            print(f)

    if args.m == True:
        print('-m is on!')
    else:
        print('-m is off!')

    if args.l == True:
        print('-l is on!')
    else:
        print('-l is off!')

    if args.w == True:
        print('-w is on!')
    else:
        print('-w is off!')

The parser.add_argument() method adds a new switch, whereas the args variable holds the values of the defined switches. Executing comLine.py generates the following kind of output:

    $ ./comLine.py -l 1
    1
    -m is off!
    -l is on!
    -w is off!
    $ ./comLine.py -l -w
    -m is off!
    -l is on!
    -w is on!

However, the next form will not work because it places a switch between the filenames:

    $ ./comLine.py 1 2 3 -l 12
    usage: comLine.py [-h] [-m] [-l] [-w] [filenames [filenames ...]]
    comLine.py: error: unrecognized arguments: 12

Generally speaking, it’s better to include the switches first and then the filenames. You can also get help:

    $ ./comLine.py -h
    usage: comLine.py [-h] [-m] [-l] [-w] [filenames [filenames ...]]

    positional arguments:
      filenames   Filenames

    optional arguments:
      -h, --help  show this help message and exit
      -m          Counting Characters
      -l          Counting Lines
      -w          Counting Words

As you can see, comLine.py works just fine, so we can continue with the actual Python 3 implementation. You can find more information about argparse at https://docs.python.org/3/library/argparse.html.
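A convenient way to experiment with argparse without retyping commands is to pass parse_args() an explicit list of arguments. The sketch below does that with the same three switches; the build_parser() helper is a hypothetical name introduced for this example. It also tries parse_intermixed_args(), which is available from Python 3.7 and is one possible way to accept switches placed between filenames. The article’s own scripts do not use it.

```python
import argparse


def build_parser():
    """Build a parser with the same switches as comLine.py (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", action="store_true", help="Counting Characters")
    parser.add_argument("-l", action="store_true", help="Counting Lines")
    parser.add_argument("-w", action="store_true", help="Counting Words")
    parser.add_argument("filenames", nargs="*", help="Filenames")
    return parser


parser = build_parser()

# Switches first, then filenames: the form recommended in the text.
args = parser.parse_args(["-l", "-w", "a.txt", "b.txt"])
print(args.l, args.w, args.m, args.filenames)

# No filenames at all: argparse stores an empty list, not None.
none_given = parser.parse_args([])
print(none_given.filenames)

# A switch between filenames: parse_intermixed_args() collects every filename.
mixed = parser.parse_intermixed_args(["1", "2", "3", "-l", "12"])
print(mixed.l, mixed.filenames)
```

The empty-list case is the reason the final script checks args.filenames against [] rather than None.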

The final version

Once you’ve learnt all the previous things we’ve covered, implementing wc in Python 3 should be relatively easy and straightforward. Below, you’ll find the Python 3 code for our version of wc, called myWC.py:

    #!/usr/bin/env python3
    import os
    import sys
    import argparse

    def count(filename):
        nLines = 0
        nWords = 0
        nChars = 0
        if filename == None:
            # No filename given: read everything from standard input
            myText = sys.stdin.read()
            chars = len(myText)
            words = len(myText.split())
            lines = len(myText.split('\n'))
            # split('\n') yields one more piece than there are newlines
            return(lines-1, words, chars)
        else:
            f = open(filename,'r')
            for line in f:
                nLines = nLines + 1
                nChars = nChars + len(line)
                nWords = nWords + len(line.split())
            f.close()
            return(nLines, nWords, nChars)

    def main():
        characters = 0
        words = 0
        lines = 0
        totalC = 0
        totalW = 0
        totalL = 0
        nFiles = 0
        toPrint = ''

        parser = argparse.ArgumentParser()
        parser.add_argument("-m", default = False, action="store_true", help="Counting Characters", required=False)
        parser.add_argument("-l", default = False, action="store_true", help="Counting Lines", required=False)
        parser.add_argument("-w", default = False, action="store_true", help="Counting Words", required=False)
        parser.add_argument('filenames', default = None, help="Filenames", nargs='*')
        args = parser.parse_args()

        if args.filenames == []:
            # No filenames: count standard input
            (lines, words, characters) = count(None)
            if args.l == True:
                toPrint = '{:>10}'.format(lines)
            if args.w == True:
                toPrint = toPrint + '{:>10}'.format(words)
            if args.m == True:
                toPrint = toPrint + '{:>10}'.format(characters)
            # No switches at all: print all three counts
            if args.m == False and args.w == False and args.l == False:
                toPrint = '{:>10}'.format(lines) + '{:>10}'.format(words) + '{:>8}'.format(characters)
            if toPrint != '':
                print(toPrint)
                toPrint = ''
        else:
            for f in args.filenames:
                nFiles = nFiles + 1
                (lines, words, characters) = count(f)
                totalC = totalC + characters
                totalW = totalW + words
                totalL = totalL + lines
                if args.l == True:
                    toPrint = '{:>10}'.format(lines)
                if args.w == True:
                    toPrint = toPrint + '{:>10}'.format(words)
                if args.m == True:
                    toPrint = toPrint + '{:>10}'.format(characters)
                if args.m == False and args.w == False and args.l == False:
                    toPrint = '{:>10}'.format(lines) + '{:>10}'.format(words) + '{:>10}'.format(characters)
                if toPrint != '':
                    toPrint = toPrint + ' ' + '{:15}'.format(f)
                    print(toPrint)
                    toPrint = ''
            # Print totals
            if nFiles > 1:
                print('{:>10}'.format(totalL) + '{:>10}'.format(totalW) + '{:>10}'.format(totalC) + ' ' + '{:15}'.format('total'))

    if __name__ == '__main__':
        main()
    else:
        print("This is a standalone program not a module!")

The code is pretty clear: most of it deals with printing the desired information according to the switches given.

The core functionality of myWC.py, which is counting characters, words and lines, needs less code than you might expect and is implemented easily enough inside the count() function, which returns three values: the number of lines, the number of words and the number of characters. The rest of our homegrown wc utility is handled inside the main() function.
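Since main() repeats the same formatting logic for standard input and for files, one possible tidy-up is to move it into a small helper. This is only a sketch: the format_counts() function and its parameter names are hypothetical and do not appear in myWC.py, and it simply joins the selected columns with the same '{:>10}' width without padding the filename.

```python
def format_counts(lines, words, chars, show_l=False, show_w=False,
                  show_m=False, name=None):
    """Return one output row; with no switches set, show all three counts."""
    if not (show_l or show_w or show_m):
        show_l = show_w = show_m = True

    columns = []
    if show_l:
        columns.append(lines)
    if show_w:
        columns.append(words)
    if show_m:
        columns.append(chars)

    # Right-align every selected count in a 10-character column.
    row = "".join("{:>10}".format(value) for value in columns)
    if name is not None:
        row = row + " " + name
    return row


all_columns = format_counts(22, 55, 385, name="count3.py")
words_only = format_counts(22, 55, 385, show_w=True)
print(all_columns)
print(words_only)
```

With a helper like this, the two branches of main() would differ only in where the counts come from.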

Testing and benchmarking

No program is ready for use until it has been extensively tested. The following tests will be performed in order to make sure that myWC.py works as expected:

    $ ./myWC.py myWC.py
    $ ./myWC.py myWC.py myWC.py
    $ ./myWC.py myWC.py myWC.py | ./myWC.py
    $ cat myWC.py | ./myWC.py -m -l
    $ cat myWC.py myWC.py | ./myWC.py -l
    $ ls | ./myWC.py

Generally speaking, test cases can also be used for learning how to use a new command.

In order to have reliable benchmarks, you will need to process big text files using both wc and myWC.py and find out which performs better.
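A rough way to get a number for the Python side is sketched below. It generates a throwaway text file and times a simplified counting function with time.perf_counter(). The count_file() and make_test_file() names, the file contents and the file size are all assumptions made for this example, and a single run like this is only an indication, not a rigorous benchmark.

```python
import os
import tempfile
import time


def count_file(filename):
    """Simplified stand-in for count(): return (lines, words, chars) of a file."""
    n_lines = n_words = n_chars = 0
    with open(filename, "r") as f:
        for line in f:
            n_lines += 1
            n_words += len(line.split())
            n_chars += len(line)
    return n_lines, n_words, n_chars


def make_test_file(n_lines):
    """Write n_lines of six-word sample text to a temporary file; return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        for i in range(n_lines):
            tmp.write("line {} of some sample text\n".format(i))
    return tmp.name


path = make_test_file(100000)
start = time.perf_counter()
result = count_file(path)
elapsed = time.perf_counter() - start
print(result, "in {:.3f} seconds".format(elapsed))
os.remove(path)  # clean up the temporary file
```

For the other half of the comparison, you can time wc on the same file from the shell, for example by prefixing the command with time.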

This part of the tutorial will teach you how to create a new place where you can put your own Python 3 scripts and make them available from anywhere on your Linux system, without the need to use their full path or put ‘./’ in front of them. As the default shell on most Linux machines is Bash, this section will show you how to change the PATH variable of the Bash shell; if you use a different shell, you will need to make small changes to the commands presented here.

Changing the PATH variable

First, we need to check what the current definition of the PATH variable is:

    $ echo $PATH
    /usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games

So, create a directory named bin inside your home directory and add its full path to the PATH variable:

    $ mkdir ~/bin
    $ export PATH="$HOME/bin:$PATH"
    $ echo $PATH
    /home/mtsouk/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games

Please note that the tilde (~) character is shorthand for your home directory. So, if your username is ‘python’, then your home directory will most likely be /home/python, which will also be the value of ~.

Next, put aScript.py inside the bin directory and use the which command to find it:

    $ mv aScript.py ~/bin/
    $ which aScript.py
    /home/mtsouk/bin/aScript.py
    $ aScript.py

If you want to make the changes to the PATH variable permanent, you should edit ~/.profile or ~/.bashrc. If you do not know how to do this, you should contact your local administrator for help.
