Linux Format

Handle text in Python

“Words! Words everywhere!” cries Mihalis Tsoukalos as he shows you all that you need to know to start using Python for text processing.


Being able to process text automatically can save you time and energy. So let’s learn how to work efficiently with text files by picking up the basics of text processing in Python, including search and replace using regular expressions, converting one date format to another and developing a graphical interface to make life easier.

As you might be aware, two Python versions are currently in use. This tutorial uses the ‘older’ version (Python 2.7.x), but you should have little difficulty following it with version 3: the most visible difference in these scripts is that print is a function there, so print line becomes print(line).

The following Python code, saved as lBl.py, shows how to process a text file line by line, which is the foundation of text processing in Python:

    # filename holds the path of the text file to read
    try:
        f = open(filename, 'r')
    except IOError:
        print "File %s failed to open!" % filename
        raise SystemExit
    for line in f:
        print line.rstrip()
    f.close()
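A variation worth knowing, sketched here rather than taken from lBl.py: the with statement closes the file for you, even if an error occurs while reading. The helper name read_lines() and the throwaway demo file are assumptions made for illustration; print is called with a single argument so that the sketch should run under both Python 2.7 and Python 3.

```python
import os
import tempfile


def read_lines(filename):
    """Return the lines of a text file without trailing whitespace."""
    # 'with' closes the file automatically, even if an exception is raised
    with open(filename, "r") as f:
        return [line.rstrip() for line in f]


# Demo on a temporary file so that the sketch can be run as-is
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("first line\nsecond line  \n")

demo_lines = read_lines(path)
os.remove(path)

for line in demo_lines:
    print(line)
```

Returning a list keeps the example short; for very large files you would probably loop over the file object directly, as lBl.py does.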

The following Python code, saved as lines.py, counts the number of lines in a text file by changing the previous for loop and adding a new variable before it:

    numberOfLines = 0
    for line in f:
        numberOfLines = numberOfLines + 1
    print "Number of Lines: %d" % numberOfLines

A simple example

The following Python code, saved as words.py, reads a text file line by line and counts the total number of words in the entire text file:

    numberOfWords = 0
    for line in f:
        words = len(line.split())
        numberOfWords = numberOfWords + words
    print "Number of Words: %d" % numberOfWords

Once again, you only need to change the body of the for loop. Counting the total number of words might be more difficult than counting the number of lines, but it’s still easy to implement. The trick is being able to separate one word from another. The only thing that needs explaining is the line.split() method, which lets you define the string that separates one word from another; if you pass no arguments, any run of whitespace (spaces, tabs, newlines) acts as the separator.

After you separate the words of each line and put them into a list, you count the elements in the list using the len() function to get the desired information.
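To see what split() does with and without an argument, here is a small sketch; the sample strings are made up for illustration and are not part of words.py.

```python
# Sample line with irregular spacing, a tab and a trailing newline
line = "  Linux   Format\tis a magazine\n"

# No argument: split on any run of whitespace, ignoring leading/trailing space
default_words = line.split()
print(default_words)
print(len(default_words))

# Explicit separator: every comma counts, so empty fields are kept
comma_fields = "a,b,,c".split(",")
print(comma_fields)
```

The difference matters for word counting: with an explicit separator such as a single space, two spaces in a row would produce an empty "word".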

Last, you will learn how to count the number of characters in a text file. This is implemented in a slightly different way because you have to read the file character by character. The relevant Python code, saved as characters.py, uses a while loop instead of a for loop:

    numberOfChars = 0
    while f.read(1):
        numberOfChars = numberOfChars + 1
    print "Number of Characters: %d" % numberOfChars

Although you can find the length of a line using the len() function, here we process the file character by character in order to be as generic as possible, because this also allows you to make changes on a character by character basis.

All three programs have the same skeleton and differ only in their core functionality, as they count different things. The final version, named wcPython.py, counts lines, words and characters by combining the previous three programs. Congratulations, you have just developed a simplified version of the wc Linux command line utility! (See left for characters.py, words.py, lines.py and wcPython.py in action.) As you can also see from the scripts in action, the lBl.py utility implements the basic functionality of the cat utility.

So far you have learned how to process plain text files line by line, word by word and character by character. The following sections will show you how to process text using regular expressions as well as how to perform search and replace operations with the help of the very useful re Python module.
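The listing of wcPython.py is not reproduced in the text, so the following is only a minimal sketch of how the three counters might be combined, not the author's script. It assumes a single pass over the lines and uses len(line) for the character count instead of f.read(1); the function name count_text() and the in-memory sample are made up for illustration.

```python
import io


def count_text(stream):
    """Return (lines, words, characters) for an open text stream."""
    lines = words = chars = 0
    for line in stream:
        lines += 1
        words += len(line.split())
        chars += len(line)  # includes the newline at the end of the line
    return lines, words, chars


# In-memory stand-in for a text file so the sketch runs as-is
sample = io.StringIO(u"Linux Format\nrocks\n")
totals = count_text(sample)
print("%d %d %d" % totals)
```

Note that this counts characters, whereas wc -c counts bytes, so the two can differ for text that contains multi-byte characters.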

The re Python module

Python uses the re module to support regular expressions. When defining a regular expression, various characters have a special meaning, including:

‘.’ matches any single character except a newline.
‘^’ matches the beginning of a line.
‘$’ matches the end of a line.
‘*’ matches 0 or more occurrences of the preceding regular expression.
‘+’ matches at least one occurrence of the preceding regular expression.
‘?’ matches 0 or 1 repetitions of the preceding regular expression.
‘[]’ defines a set of characters you want to match.

There are many more special characters, but these are the most important ones.
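The special characters above can be tried out with a short sketch like the one below. It relies on re.search(), which the tutorial introduces in the following paragraphs; the patterns and sample strings are made up for illustration.

```python
import re

# (pattern, text, whether a match is expected)
checks = [
    (r"c.t", "cat", True),               # . stands for any one character
    (r"c.t", "ct", False),               # ... but there must be one
    (r"^Linux", "Linux Format", True),   # ^ anchors at the beginning
    (r"^Format", "Linux Format", False),
    (r"Format$", "Linux Format", True),  # $ anchors at the end
    (r"ab*c", "ac", True),               # * allows zero b characters
    (r"ab+c", "ac", False),              # + needs at least one b
    (r"colou?r", "color", True),         # ? makes the u optional
    (r"gr[ae]y", "grey", True),          # [] matches one character of the set
]

results = []
for pattern, text, expected in checks:
    found = re.search(pattern, text) is not None
    results.append(found == expected)
    print("%-10s %-14s %s" % (pattern, text, found))
```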

You should use the ‘\’ character to make a special character act like a regular one. So if you wish to search for a ‘.’ in your text, you should write “\.”. The following code shows some simple re examples:

    >>> import re
    >>> text = "12343"
    >>> m = re.search("3", text)
    >>> print m.group(0)
    3
    >>> m = re.search("7", text)
    >>> print m.group(0)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'NoneType' object has no attribute 'group'

The import re command loads the re module. Then you search your text using the re.search() function. There is also the re.match() function, but that checks for a match only at the beginning of the string; re.search() checks for a match anywhere in the string, which is usually what you want. When there’s a match, re.search() returns a match object, and the object’s group() method returns the substring that was matched by the regular expression. As you are using a static regular expression, the match will be exactly what you looked for; in this case the character 3. If there’s no match, re.search() returns None, which is why calling group() on the result of the second search fails with an AttributeError. Later in this tutorial you are going to see what to do when the regular expression you are searching for can be found multiple times in your text.

The following Python code shows how you can match empty lines:

    >>> print re.match(r'^$', 'a')
    None
    >>> print re.match(r'^$', '')
    <_sre.SRE_Match object at 0x10a8faa58>

An empty line is a string in which the beginning ( ^ ) is immediately followed by the end ( $ ), with nothing between those two special characters. Almost all programming languages have a similar way of catching empty lines. You can find more information about the re module at https://docs.python.org/2/library/re.html.
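One way to avoid the AttributeError shown in the shell session is to test the result of re.search() before calling group(). The sketch below illustrates that, along with the difference between re.match() and re.search(); the helper name first_match() is hypothetical.

```python
import re


def first_match(pattern, text):
    """Return the first substring matching pattern, or None if there is none."""
    m = re.search(pattern, text)
    if m is None:
        # No match: there is no match object to call group() on
        return None
    return m.group(0)


print(first_match("3", "12343"))
print(first_match("7", "12343"))

# re.match() only looks at the start of the string, re.search() anywhere
at_start = re.match("3", "12343")
anywhere = re.search("3", "12343")
print(at_start is None)
print(anywhere.start())

# An escaped dot matches a literal '.' only
print(first_match(r"\.", "v2.7"))
```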

The program developed in this section continues from where the previous one left off and shows you how to search a text file for a given string. The crucial section of Python code in basicSearch.py is:

    numberOfLines = 0
    for line in f:
        if re.search("Linux Format", line):
            numberOfLines = numberOfLines + 1
            print line.rstrip()

The general idea is that you search your text file line by line and try to match each line with the string you want to search. If there’s a match, you print the line that contains it and you continue searching the rest of the file until you reach the end of file.
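The same idea can be wrapped in a reusable function. This is a sketch under assumptions rather than the code of basicSearch.py: the name matching_lines() is made up, the pattern is compiled once up front, and a list of strings stands in for the open file.

```python
import re


def matching_lines(lines, pattern):
    """Yield (line_number, line) for every line that contains pattern."""
    regex = re.compile(pattern)  # compile once, reuse for every line
    for number, line in enumerate(lines, 1):
        if regex.search(line):
            yield number, line.rstrip()


# Stand-in for the lines of a text file
sample = ["I read Linux Format\n", "nothing here\n", "Linux Format again\n"]
hits = list(matching_lines(sample, "Linux Format"))

for number, line in hits:
    print("%d: %s" % (number, line))
print("Number of Lines: %d" % len(hits))
```

Because an open file is also an iterable of lines, the function should work unchanged if you pass it a file object instead of a list.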

Searching and replacing text

The re.search() function is enough for this example, as a single occurrence of the desired static string is enough for printing the line that contains it. Note: The re.findall() function finds all occurrences of a pattern as defined by a regular expression and therefore allows you to perform a global search.
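A brief sketch of re.findall() in action follows; the sample strings are made up for illustration.

```python
import re

text = "Linux Format, linux format and Linux Format again"

# Every non-overlapping occurrence is returned as a list of strings
all_hits = re.findall("Linux Format", text)
print(len(all_hits))

# The optional flags argument makes the search case-insensitive
any_case = re.findall("linux format", text, re.IGNORECASE)
print(len(any_case))

# With a real pattern, each list element is the text that matched
numbers = re.findall(r"\d+", "12 apples and 345 pears")
print(numbers)
```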

Now you are going to learn how to replace a string that matches what you are searching for. Once again, the general idea is that you search your text file line by line and try to match each line against the pattern as many times as it occurs. The re.sub() function helps you do global search and replace operations using regular expressions.

The following interaction with the Python shell shows two global search and replace operations:

    >>> text = ""
    >>> out = re.sub("^$", "EMPTY LINE", text)
    >>> print out
    EMPTY LINE
    >>> names = "Mihalis Mike Michael Mikel"
    >>> newNames = re.sub(r"\b(Mike|Michael)\b", "Mihalis", names)
    >>> print newNames
    Mihalis Mihalis Mihalis Mikel

The first operation replaces an empty line with the "EMPTY LINE" string, whereas the second replaces the word Mike or Michael with Mihalis anywhere in a string. The | character means OR. The \b sequence matches the empty string, but only at the beginning or end of a word, which allows you to replace whole words only! The r prefix used when defining the regular expression tells Python to use ‘raw string’ notation, so that backslashes reach the re module unchanged. As you will see, the use of r is quite common. The re.sub() function finds all matches and substitutes all related text.

The sAndR.py script changes the "Linux Format" string into "LINUX Format". The important Python code of the sAndR.py script is the following:

    for line in f:
        if re.search("Linux Format", line):
            newLine = re.sub("Linux Format", "LINUX Format", line)
            print newLine.rstrip()

The code is pretty straightforward and, as usual, it processes the text file line by line. The key point is that a replace is performed only when there is a match, which is the purpose of the if statement. Only the lines that have been changed are displayed on screen. (See bottom of p89 for some additional search and replace operations using the Python shell.)

We’d recommend spending some time experimenting with re and learning how to use it before continuing with the rest of the tutorial. Note that regular expressions are often the root of nasty bugs, so always check your regular expressions in the Python shell before using them in Python scripts.
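As an alternative to searching first and substituting afterwards, the re module also offers re.subn(), which returns both the new string and the number of replacements made. The sketch below is a variation on the approach of sAndR.py, not its actual code; the function name replace_in_lines() and the sample lines are assumptions.

```python
import re


def replace_in_lines(lines, pattern, replacement):
    """Return (changed_lines, total_replacements) for an iterable of lines."""
    changed = []
    total = 0
    for line in lines:
        # subn() returns the new string and how many substitutions were made
        new_line, count = re.subn(pattern, replacement, line)
        if count:
            changed.append(new_line.rstrip())
            total += count
    return changed, total


sample = ["Linux Format and Linux Format\n", "no match here\n"]
changed, total = replace_in_lines(sample, "Linux Format", "LINUX Format")

for line in changed:
    print(line)
print("Replacements: %d" % total)
```

This way the pattern is written only once, which removes the risk of the search and the replace drifting apart when one of them is edited.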

Changing the date format

The Python code presented here, saved as dateFormat.py and based on sAndR.py, reads a text file line by line, searches for a specific date format using a regular expression and changes that date format into something else:

    numberOfLines = 0
    for line in f:
        if re.search(r'(\d{2})/(\d{2})/(\d{4})', line):
            newline = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\2-\1-\3', line)
            numberOfLines = numberOfLines + 1
            print newline.rstrip()

The existing format is MM/DD/YYYY whereas the new format will be DD-MM-YYYY. The \d{4} string means that you are looking for four ( {4} ) digits ( \d ). You can also see here that, with the help of parentheses, you can reference a previous match in the replace part of the re.sub() command: \1 stands for the text matched by the first pair of parentheses, \2 for the second and \3 for the third. Executing dateFormat.py produces the following kind of output:

    $ cat variousDates
    12/13/1960 01/02/2000 Today is 03/04/2016 or is it 04/03/2016 12/21/10
    $ ./dateFormat.py variousDates
    13-12-1960 02-01-2000 Today is 04-03-2016 or is it 03-04-2016
    Number of Lines matched: 2
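Numbered references such as \2-\1-\3 can be hard to read at a glance. The re module also supports named groups, which the sketch below uses for the same MM/DD/YYYY to DD-MM-YYYY conversion. This is an illustrative variation, not the code of dateFormat.py; the names DATE_PATTERN and convert_dates() are made up.

```python
import re

# Same pattern as in dateFormat.py, but with a name for each group
DATE_PATTERN = re.compile(r"(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})")


def convert_dates(line):
    """Rewrite every MM/DD/YYYY date in line as DD-MM-YYYY."""
    # \g<name> refers to a named group in the replacement string
    return DATE_PATTERN.sub(r"\g<day>-\g<month>-\g<year>", line)


print(convert_dates("12/13/1960"))
print(convert_dates("Today is 03/04/2016 or is it 04/03/2016"))
print(convert_dates("12/21/10"))  # two-digit year: left untouched
```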

Creating a GUI

This section will teach you how to develop a GUI in order to make your life easier. The GUI will have a main area where you can type your text and two more areas for defining the two strings that will be used for the search and replace operation. The standard tool for developing a GUI in Python is Tkinter, which is an interface to the Tk GUI toolkit. In order to use Tkinter you will have to include the import Tkinter or from Tkinter import * command in your Python script (in Python 3 the module is named tkinter, in lower case). Both commands import the two most important Tkinter modules, which are called Tkinter and Tkconstants; note that the Tkinter module automatically imports Tkconstants.

The following Python code, saved as simple.py, is a simple example that uses the Tkinter module. Execute it to make sure that everything works as expected with your installation:

    #!/usr/bin/python
    from Tkinter import *

    root = Tk()
    message = Label(root, text="Hello World!")
    message.pack()
    root.mainloop()

The Tk root widget initialises Tkinter: each Tkinter application should have a single root widget that must be created prior to all other widgets. The Label() widget is a child of the root widget and contains the message you want to display. The pack() method makes the Label widget size itself in order to be properly displayed. The widget will not be displayed until you enter the Tkinter event loop with the help of the root.mainloop() method; until then, you will see no output on your screen.

Now that you know the basics of Tkinter, it’s time to create the user interface for the application. In order to add the required elements to your screen, you will have to run the following Python code (emptyGUI.py):

    #!/usr/bin/python
    from Tkinter import *
    from ScrolledText import *

    root = Tk(className="Search and Replace")

    # Two Entry Widgets for search and replace
    search = Entry(root, text="search")
    search.pack()
    replace = Entry(root, text="replace")
    replace.pack()

    # The Text Widget for text input and output
    text = ScrolledText(root, width=50, height=40, borderwidth=1)
    text.pack()
    text.insert('insert', "...")

    # The Go Button
    def callback():
        print "Go button pressed!"

    b = Button(root, text="Go", command=callback)
    b.pack()

    root.mainloop()

The first version of the GUI is just a dummy application: you have two input boxes, the area where you write your text and the ‘Go’ button, but when you press the ‘Go’ button nothing happens apart from a message being printed on the terminal! The next section will implement the functionality of the button.

More about the GUI

It is now time to add the required functionality to the application. This means that the application will read the two boxes as well as the text area and run when the ‘Go’ button is pressed. All the required functionality can be found in the callback() function that is called when you press the ‘Go’ button. The rest of the Python code is the same as in emptyGUI.py, apart from the import re line that callback() needs at the top of the script.

    # The Go Button
    def callback():
        mySearch = search.get() or "null"
        myReplace = replace.get() or "null"
        myText = text.get('1.0', END)
        text.delete('1.0', END)
        # Print new text after search and replace
        text.insert('insert', re.sub(mySearch, myReplace, myText))

To get the text of an Entry() widget, you should use the get() method. This can be seen in the Python code of the callback function for the ‘Go’ button. Similarly, you can get the text of a ScrolledText() widget with the get() method and delete it with the delete() method. Despite the fact that gui.py only supports the searching of static text, the application is fully functional and pretty useful. (Bottom left shows the gui.py script in action.) When you press the ‘Go’ button, the program calls the callback() function and does the actual work for you!
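Because callback() hands the contents of the search box straight to re.sub(), characters such as ‘+’ or ‘.’ typed there are treated as regular expression special characters. If you want the box to be taken literally, one possible approach is to keep the replace logic in a small function of its own and escape the search string with re.escape(). The sketch below is a suggestion rather than part of gui.py; the function name search_and_replace() and its literal parameter are hypothetical.

```python
import re


def search_and_replace(text, search, replace, literal=True):
    """Replace every occurrence of search in text with replace."""
    if literal:
        # re.escape() neutralises special characters in the search string;
        # a function replacement is inserted as-is, so backslashes in
        # 'replace' are not interpreted as group references
        return re.sub(re.escape(search), lambda match: replace, text)
    # Otherwise treat both strings the way re.sub() normally does
    return re.sub(search, replace, text)


print(search_and_replace("1+1=2", "1+1", "two"))
print(search_and_replace("1+1=2", "1+1", "two", literal=False))
print(search_and_replace("a.b axb", ".", "-"))
```

Keeping this logic outside callback() also means it can be tried out in the Python shell without starting the GUI.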

There are many books that can help you learn Python better, including Python Cookbook, 3rd Edition, by David Beazley and Brian K. Jones and Learning Python, 5th Edition, by Mark Lutz. There is also www.diveintopython.net, which is a free Python book for experienced programmers. You can find more information about Tkinter at www.pythonware.com/library and https://docs.python.org/2/library/tkinter.html.

The tkinter application in all its glory. On the left you see the input from the user and on the right you can see what happens when the user presses the ‘Go’ button.
Here are various search and replace operations as performed inside the Python shell, which is the perfect place to experiment with regular expressions.
Mihalis Tsoukalos (@mactsouk) has an M.Sc. in IT from UCL and a B.Sc. in Mathematics, which makes him a DB-admining, software-coding, Unix-using, mathematical machine. You can reach him at www.mtsoukalos.eu.
Here we have all the Python scripts (lBl.py, characters.py, words.py, lines.py and wcPython.py) in action. It also compares their results with the output from the wc command-line utility.
Simple search and replace operations using the re Python module. The more you experiment with regular expressions, the more you’ll understand them and the more useful they will become for you.
