Handle text in Python
“Words! Words everywhere!” cries Mihalis Tsoukalos as he shows you all that you need to know to start using Python for text processing.
Being able to automatically process text can save you time and energy. So let’s learn how to efficiently work with text files by picking up the basics of text processing in Python, including search and replace using regular expressions, converting one date format to another and developing a graphical interface to make life easier.
As you might be aware, two Python versions are currently being used. This tutorial uses the ‘older’ version (Python version 2.7.x) but you will have no difficulty following this tutorial if you are using version 3.
The following Python code, saved as lBl.py, shows how to process a text file line by line in Python, which is the foundation of text processing:

    try:
        f = open(filename, 'r')
    except IOError:
        print "File %s failed to open!" % filename
        raise SystemExit

    for line in f:
        print line.rstrip()
    f.close()

The following Python code, saved as lines.py, counts the number of lines in a text file by changing the previous for loop and adding a new variable before the for loop:

    numberOfLines = 0
    for line in f:
        numberOfLines = numberOfLines + 1
    print "Number of Lines: %d" % numberOfLines

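The same line count can also be packaged as a function. The following is only a minimal sketch, not the lines.py script itself: the names count_lines and write_sample are made up for illustration, it uses the with statement so that the file is closed automatically, and it is written so that it runs under both Python 2.7 and Python 3.

```python
import os
import tempfile


def count_lines(path):
    """Return the number of lines in the text file at `path`."""
    number_of_lines = 0
    # The with statement closes the file even if an error occurs.
    with open(path, 'r') as f:
        for _ in f:
            number_of_lines += 1
    return number_of_lines


def write_sample(text):
    """Hypothetical helper: save `text` to a temporary file, return its path."""
    handle, path = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(handle, 'w') as f:
        f.write(text)
    return path


sample = write_sample("first line\nsecond line\nthird line\n")
print("Number of Lines: %d" % count_lines(sample))
os.remove(sample)
```

Returning the count instead of printing it inside the loop makes the function easy to reuse from other scripts.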
A simple example
The following Python code, saved as words.py, reads a text file line by line and counts the total number of words in the entire text file:

    numberOfWords = 0
    for line in f:
        words = len(line.split())
        numberOfWords = numberOfWords + words
    print "Number of Words: %d" % numberOfWords

Once again, you only need to change the commands of the for loop. Counting the total number of words might be more difficult than counting the number of lines but it’s still easy to implement. The trick here is being able to separate one word from another. The only thing that needs to be explained is the line.split() method, which accepts an optional argument defining the string that separates one word from another; if you pass no arguments, words are separated by any run of whitespace.
After you separate the words of each line and put them into a list, you count the elements in the list using the len() function to get the desired information.
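To make the behaviour of split() more concrete, here is a short illustration. The sample strings are invented for this sketch, and the code runs under both Python 2.7 and Python 3.

```python
line = "Words!   Words\teverywhere!\n"

# No argument: split on any run of whitespace and ignore
# whitespace at the beginning and end of the string.
words = line.split()
print(words)       # ['Words!', 'Words', 'everywhere!']
print(len(words))  # 3

# Explicit separator: split on every occurrence of that string.
fields = "root:x:0:0".split(":")
print(fields)      # ['root', 'x', '0', '0']

# With an explicit separator, consecutive separators
# produce empty strings in the result.
empty_field = "a,,b".split(",")
print(empty_field) # ['a', '', 'b']
```

This is why calling split() without arguments is the convenient choice for counting words: repeated spaces and tabs never produce empty "words".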
Last, you will learn how to count the number of characters in a text file, which is implemented in a slightly different way because you will have to read the text file character by character. The relevant Python code, saved as characters.py, uses a while loop instead of a for loop:

    numberOfChars = 0
    while f.read(1):
        numberOfChars = numberOfChars + 1
    print "Number of Characters: %d" % numberOfChars

Although you can find the length of a line using the len() function, here we process the file character by character in order to be as generic as possible, because this also allows you to make changes on a character by character basis.
All three programs have the same skeleton and differ only in their core functionality, which is logical as they implement different things. The final version, which will be named wcPython.py, counts lines, words and characters by combining the previous three programs. Congratulations, you have just developed a simplified version of the wc Linux command line utility! (See left for characters.py, words.py, lines.py and wcPython.py in action.) As you can also understand from our scripts in action, the lBl.py utility implements the basic functionality of the cat utility.
So far you have learned how to process plain text files line by line, word by word and character by character. The following sections will show you how to process text using regular expressions as well as how to perform search and replace operations with the help of the very useful re Python module.
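The listing of wcPython.py is not shown here, so the following is only a sketch of how the three counters might be combined in a single pass. The function name wc and the sample data are assumptions made for illustration, and the sketch uses len() on each line instead of reading character by character. It runs under both Python 2.7 and Python 3.

```python
def wc(lines):
    """Return (lines, words, characters) for an iterable of text lines."""
    number_of_lines = 0
    number_of_words = 0
    number_of_chars = 0
    for line in lines:
        number_of_lines += 1
        number_of_words += len(line.split())
        number_of_chars += len(line)  # includes the trailing newline
    return number_of_lines, number_of_words, number_of_chars


sample = ["Words! Words everywhere!\n", "\n", "Python for text\n"]
print("Lines: %d, Words: %d, Characters: %d" % wc(sample))
```

Because an open file object is also an iterable of lines, you could pass the f variable from the earlier scripts to wc() directly.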
The re Python module
Python uses the re module to support regular expressions. When defining a regular expression, various characters have a special meaning, including:
‘.’ This matches any single character except a newline.
‘^’ This matches the beginning of a line.
‘$’ This matches the end of a line.
‘*’ Use this to specify that you want to match 0 or more occurrences of a regular expression.
‘+’ Use this to specify that you want at least one occurrence of a regular expression.
‘?’ Use this to let Python know that you want 0 or 1 repetitions of a regular expression.
‘[]’ Defines a set of characters you want to match.
There exist many more special characters but these are the most important ones.
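The special characters listed above can be tried out one at a time. In this small sketch the helper name matches and all the sample patterns are invented for illustration; the code runs under both Python 2.7 and Python 3.

```python
import re


def matches(pattern, text):
    """Return True if `pattern` is found anywhere in `text`."""
    return re.search(pattern, text) is not None


# '.' matches any single character except a newline.
print(matches(r"c.t", "cat"))          # True
print(matches(r"c.t", "c\nt"))         # False
# '^' and '$' anchor the pattern to the beginning and the end.
print(matches(r"^cat$", "a cat"))      # False
# '*' allows zero or more occurrences of the preceding item.
print(matches(r"^ab*$", "a"))          # True
# '+' requires at least one occurrence of the preceding item.
print(matches(r"^ab+$", "a"))          # False
# '?' allows zero or one occurrence of the preceding item.
print(matches(r"^colou?r$", "color"))  # True
# '[]' matches any one character from the set.
print(matches(r"^[ch]at$", "hat"))     # True
```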
You should use the ‘\’ character to make a special character act like a regular one. So if you wish to search for a ‘.’ in your text, you should write “\.”. The following code shows some simple re examples:

    >>> import re
    >>> text = "12343"
    >>> m = re.search("3", text)
    >>> print m.group(0)
    3
    >>> m = re.search("7", text)
    >>> print m.group(0)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'NoneType' object has no attribute 'group'

The import re command is needed for loading the re module. Then, you search your text using the re.search() function. There is also the re.match() function but that checks for a match only at the beginning of the string; the re.search() function we’ve used checks for a match anywhere in the string, which is usually what you want. When there’s a match, the re.search() function returns a match object that describes what was matched. As you are using a static regular expression, the match will be exactly what you looked for; in this case the character 3. If there’s no match, then re.search() returns None, which is why calling group() on the result of the second search fails with an AttributeError. Later in this tutorial you are going to see what to do when the regular expression you are searching for can be found multiple times in your text. The group() method of the match object returns the substring that was matched by the regular expression.
The following Python code shows how you can match empty lines:

    >>> print re.match(r'^$', 'a')
    None
    >>> print re.match(r'^$', '')
    <_sre.SRE_Match object at 0x10a8faa58>

An empty line is matched by a regular expression that begins with ^ and ends with $ without anything else between those two special characters. Almost all programming languages have a similar way for catching empty lines. You can find more information about the re module at https://docs.python.org/2/library/re.html.
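Since re.search() returns None when nothing matches, it is usually worth checking the result before calling group(). The following is a small sketch of that check; the function name first_match is an assumption made for illustration, and the code runs under both Python 2.7 and Python 3.

```python
import re


def first_match(pattern, text):
    """Return the first substring matching `pattern`, or None."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return m.group(0)


print(first_match("3", "12343"))   # 3
print(first_match("7", "12343"))   # None

# re.match() only looks at the beginning of the string.
print(re.match("3", "12343") is None)      # True
print(re.match("1", "12343") is not None)  # True

# A backslash makes '.' behave like a regular character.
print(first_match(r"\.", "3.14"))  # .
```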
The program that will be developed in this section will continue from where the previous one left off and teach you how to search a text file for a given string. The crucial section of Python code in basicSearch.py is:

    numberOfLines = 0
    for line in f:
        if re.search("Linux Format", line):
            numberOfLines = numberOfLines + 1
            print line.rstrip()

The general idea is that you search your text file line by line and try to match each line with the string you want to search. If there’s a match, you print the line that contains it and you continue searching the rest of the file until you reach the end of file.
Searching and replacing text
The re.search() function is enough for this example, as a single occurrence of the desired static string is all you need in order to print the line that contains it. Note: The re.findall() function can find all occurrences of a pattern as defined by a regular expression and therefore allows you to perform a global search.
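As a brief illustration of re.findall(), the sketch below collects every occurrence of a pattern in a string. The sample text is invented for this example, and the code runs under both Python 2.7 and Python 3.

```python
import re

text = "Linux Format, linux format and Linux Format again"

# re.findall() returns a list with every non-overlapping match.
found = re.findall("Linux Format", text)
print(found)       # ['Linux Format', 'Linux Format']
print(len(found))  # 2

# The optional flags argument makes the search case-insensitive.
any_case = re.findall("linux format", text, re.IGNORECASE)
print(len(any_case))  # 3

# It works with any regular expression, not just static strings.
numbers = re.findall(r"\d+", "12 apples and 343 pears")
print(numbers)     # ['12', '343']
```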
Now you are going to learn how to replace a string that is a match with what you are searching for. Once again, the general idea is that you search your text file line by line and try to match each line with the pattern you want to search as many times as you can find it. The re.sub() function helps you do global search and replace operations using regular expressions.
The next Python code shows an interaction with the Python shell where two global search and replace operations take place:

    >>> text = ""
    >>> out = re.sub("^$", "EMPTY LINE", text)
    >>> print out
    EMPTY LINE
    >>> names = "Mihalis Mike Michael Mikel"
    >>> newNames = re.sub(r"\b(Mike|Michael)\b", "Mihalis", names)
    >>> print newNames
    Mihalis Mihalis Mihalis Mikel

The first operation replaces an empty line with the "EMPTY LINE" string whereas the second operation replaces the word Mike or Michael with Mihalis anywhere in a string. The | character means OR. The \b sequence matches the empty string but only at the beginning or end of a word, which allows you to replace whole words only! The r prefix used when defining the regular expression tells Python to use the ‘raw string’ notation, so that backslashes reach the re module unchanged. As you will see, the use of r is quite common. The re.sub() function finds all matches and substitutes all related text.
The sAndR.py script changes the "Linux Format" string into "LINUX Format". The important Python code of the sAndR.py script is the following:

    for line in f:
        if re.search("Linux Format", line):
            newLine = re.sub("Linux Format","LINUX Format", line)
            print newLine.rstrip()

The code is pretty straightforward and you should have no problem understanding it; as usual it processes the text file line by line. The key point here is that a replace is performed only when there is a match, which is the purpose of the if statement. Only the lines that have been changed are displayed on screen. (See bottom of p89 for some additional search and replace operations using the Python shell.)
We’d recommend spending some time experimenting with re and learning how to use it before continuing with the rest of the tutorial. Note that regular expressions are often the root of nasty bugs, so always check your regular expressions in the Python shell before using them in Python scripts.
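The search-then-replace pattern above scans each matching line twice. One possible alternative, shown here purely as a sketch and not as the sAndR.py code, is re.subn(), which returns both the new string and the number of replacements made. The function name replace_in_lines and the sample lines are assumptions made for illustration; the code runs under both Python 2.7 and Python 3.

```python
import re


def replace_in_lines(lines, pattern, replacement):
    """Return (changed_lines, total_replacements) for an iterable of lines."""
    changed = []
    total = 0
    for line in lines:
        # re.subn() returns the new string and the number of substitutions.
        new_line, count = re.subn(pattern, replacement, line)
        if count:
            changed.append(new_line.rstrip())
            total += count
    return changed, total


sample = [
    "I read Linux Format.\n",
    "Nothing to see here.\n",
    "Linux Format and Linux Format\n",
]
changed, total = replace_in_lines(sample, "Linux Format", "LINUX Format")
for line in changed:
    print(line)
print("Replacements: %d" % total)
```

As in sAndR.py, only the lines that were actually changed are kept; the count is a small extra that comes for free.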
Changing the date format
The presented Python code, saved as dateFormat.py and based on sAndR.py, will read a text file line by line, search for a specific date format using a regular expression and change that date format into something else:

    numberOfLines = 0
    for line in f:
        if re.search(r'(\d{2})/(\d{2})/(\d{4})', line):
            newline = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\2-\1-\3', line)
            numberOfLines = numberOfLines + 1
            print newline.rstrip()

The existing format is MM/DD/YYYY whereas the new format will be DD-MM-YYYY. The \d{4} string means that you are looking for four ( {4} ) digits ( \d ). You can also see here that each pair of parentheses defines a group that you can reference as \1, \2 or \3 in the replacement part of the re.sub() command. Executing dateFormat.py produces the following kind of output:

    $ cat variousDates
    12/13/1960 01/02/2000
    Today is 03/04/2016 or is it 04/03/2016
    12/21/10
    $ ./dateFormat.py variousDates
    13-12-1960 02-01-2000
    Today is 04-03-2016 or is it 03-04-2016
    Number of Lines matched: 2

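The date conversion can also be isolated in a small function, which makes it easy to try in the Python shell. This is a sketch under the same assumptions as dateFormat.py (MM/DD/YYYY in, DD-MM-YYYY out); the names DATE and convert_dates are made up for illustration, and the code runs under both Python 2.7 and Python 3.

```python
import re

# Compile the pattern once: month, day and four-digit year.
DATE = re.compile(r'(\d{2})/(\d{2})/(\d{4})')


def convert_dates(line):
    """Rewrite every MM/DD/YYYY date in `line` as DD-MM-YYYY."""
    # \2 is the day, \1 is the month and \3 is the year.
    return DATE.sub(r'\2-\1-\3', line)


print(convert_dates("12/13/1960"))
print(convert_dates("Today is 03/04/2016 or is it 04/03/2016"))
# A two-digit year does not match, so the line is left unchanged.
print(convert_dates("12/21/10"))
```

Note that the pattern only checks the shape of the date; it does not verify that the month and day values are valid.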
Creating a GUI
This section will teach you how to develop a GUI in order to make your life easier. The GUI will have a main area where you can type your text and two more areas for defining the two strings that will be used for the search and replace operation. The standard tool for developing a GUI in Python is Tkinter, which is an interface to the Tk GUI toolkit. In order to use Tkinter you will have to include the import Tkinter or from Tkinter import * command in your Python script. Both commands import the two most important Tkinter modules, which are called Tkinter and Tkconstants; note that the Tkinter module automatically imports Tkconstants.
The following Python code, saved as simple.py, is a simple example that uses the Tkinter module; execute it to make sure that everything works as expected with your installation:

    #!/usr/bin/python
    from Tkinter import *

    root = Tk()
    message = Label(root, text="Hello World!")
    message.pack()
    root.mainloop()

The Tk root widget initialises Tkinter; each Tkinter application should have a single root widget that must be created prior to all other widgets. The Label() widget is a child of the root widget and contains the message you want to display. The pack() method makes the Label widget size itself in order to be properly displayed. The widget will not be displayed until you enter the Tkinter event loop with the help of the root.mainloop() method; until then, you will see no output on your screen.
Now that you know the basics of Tkinter, it’s time to create the user interface for the application. In order to add the required elements on your screen, you will have to run the following Python code (emptyGUI.py):

    #!/usr/bin/python
    from Tkinter import *
    from ScrolledText import *

    root = Tk(className="Search and Replace")

    # Two Entry Widgets for search and replace
    search = Entry(root, text="search")
    search.pack()
    replace = Entry(root, text="replace")
    replace.pack()

    # The Text Widget for text input and output
    text = ScrolledText(root, width=50, height=40, borderwidth=1)
    text.pack()
    text.insert('insert', "...")

    # The Go Button
    def callback():
        print "Go button pressed!"

    b = Button(root, text="Go", command=callback)
    b.pack()
    root.mainloop()

The first version of the GUI is just a dummy application: you have two input boxes, the area where you write your text and the ‘Go’ button, but when you press the ‘Go’ button nothing happens apart from a message being printed in the terminal! The next section will implement the functionality of the button.
More about the GUI
It is now time to add the required functionality to the application. This means that the application will read the two boxes as well as the text area and run when the ‘Go’ button is pressed. All the required functionality can be found in the callback() function that is called when you press the ‘Go’ button. The rest of the Python code is the same as in emptyGUI.py, apart from an import re line at the top of the script, which is needed because callback() uses re.sub().

    # The Go Button
    def callback():
        mySearch = search.get() or "null"
        myReplace = replace.get() or "null"
        myText = text.get('1.0', END)
        text.delete('1.0', END)
        # Print new text after search and replace
        text.insert('insert', re.sub(mySearch, myReplace, myText))

To get the text of an Entry() widget, you should use the get() method. This can be seen in the Python code of the callback function for the ‘Go’ button. Similarly, you can get the text of a ScrolledText() widget with the get() method and delete it with the delete() method. Although gui.py is intended for searching static text (bear in mind that re.sub() will still interpret any regular expression special characters typed into the search box), the application is fully functional and pretty useful. (Bottom left shows the gui.py script in action.) When you press the ‘Go’ button, the program calls the callback() function and does the actual work for you!
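Because callback() hands the contents of the search box straight to re.sub(), characters such as ‘.’ or ‘*’ are treated as regular expression syntax. If you want the search box to behave as plain text, one option is to escape it first. The following is only a sketch: the function search_and_replace and its literal parameter are hypothetical and not part of gui.py. It contains no Tkinter code, so you can try it in the Python shell under both Python 2.7 and Python 3.

```python
import re


def search_and_replace(text, search, replace, literal=True):
    """Hypothetical helper, not part of gui.py: replace `search` in `text`."""
    if literal:
        # re.escape() neutralises special characters such as '.' and '*'.
        pattern = re.escape(search)
        # A function replacement is inserted as-is, so backslashes in
        # `replace` are not treated as group references.
        return re.sub(pattern, lambda match: replace, text)
    return re.sub(search, replace, text)


print(search_and_replace("a.b axb", ".", "-"))                 # a-b axb
print(search_and_replace("a.b axb", ".", "-", literal=False))  # -------
```

Inside callback(), the last line could then pass mySearch, myReplace and myText to such a helper instead of calling re.sub() directly.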
There exist many books that can help you learn Python better, including Python Cookbook, 3rd Edition, by David Beazley and Brian K. Jones and Learning Python, 5th Edition, by Mark Lutz. There is also www.diveintopython.net, which is a free Python book for experienced programmers. You can find more information about Tkinter at www.pythonware.com/library and https://docs.python.org/2/library/tkinter.html.