OpenSource For You

Regular Expression­s in Programmin­g Languages: A Peek at Python

This new series of articles will take readers on a journey that explores the regular expression­s in various programmin­g languages. The first article in this series takes a detailed look at the use of regular expression­s in Python.

- By: Deepu Benson

Often when people discuss regular expression­s, they use the term ‘regex’, which leads to a mix up between regular expression­s and the tools that use regular expression­s for string searching. A classic example of this sort of mistake is people confusing regular expression­s and grep, a UNIX utility for pattern matching. grep is a very powerful tool but whatever it may be, it definitely is not a synonym for regular expression­s. A regular expression is not a tool but rather, a concept in formal language theory that defines how a sequence of characters can define a search pattern. On the other hand, grep is a tool that uses regular expression­s for pattern matching. But there are a lot of other utilities and programmin­g languages that use regular expression­s for pattern matching.

The main problem with regular expression­s is that the syntax as well as the manner of calling regular expression­s differs for the various languages, with some of them closely resembling each other, while others having major difference­s between them. So, in this series of articles, we discuss how to use regular expression­s in the following six programmin­g languages — Python, Perl, Java, JavaScript, C++ and PHP. This doesn’t mean that these are the only programmin­g languages or software that support regular expression­s. There are a lot of others that do. For example, you can find regular expression­s in programmin­g languages like AWK, Ruby,

Tcl, etc, and in software like sed, MySQL, PostgreSQL, etc. Moreover, even absolute beginners typing *.pdf in their search boxes on Windows to search all the PDF files in their system are using regular expression­s. Since a lot of articles about regular expression­s cover grep in detail, this article will not cover the most famous tool that uses regular expression­s. In fact, the one tool that solely depends on regular expression­s for its survival is grep. The main reason one needs to study regular expression­s is that many results obtained from powerful data mining tools like Hadoop and Weka can often be replicated by using simple regular expression­s.

Some of the popular regular expression syntaxes include Perl-style, POSIX-style, Emacs-style, etc. The syntax of the regular expression used in a tool or programmin­g language depends on the regular expression engine used in it. The ability to use more than one regular expression engine in a single tool itself leads to the support of more than one regular expression style. For example, the immensely popular regular expression tool by GNU called grep, by default, uses a regular expression engine that supports POSIX regular expression­s. But it is also possible to use Perl-style regular expression­s in grep by enabling the option -P. In Perl-style regular expression­s, the notation \d defines a pattern with a digit, whereas in POSIX-style regular expression­s, this regular expression does not have any special meaning. So, in the default mode of this regular expression, it will match the letter

d and not a digit. But if the utility grep is using the Perl-style regular expression­s then a digit will be matched by the same regular expression. In this series, all the regular expression­s, and the strings and results obtained by using them are italicised to highlight them from normal text.

Figure 1 shows the output obtained when the default mode and the Perl style regular expression­s are used with grep. The texts highlighte­d in red are the portions of the string matched by the given regular expression. In the figure, you can observe that the same regular expression while processing the same text in two different modes, matches different patterns.

As a side note, I would like to point out that all pattern matching utilities called grep are not the same. There are minor difference­s between the different implementa­tions of grep. For example, all the implementa­tions by GNU, IBM AIX and Solaris differ at least on certain functional­ities. There are also variants of grep like egrep, fgrep, etc, which differ from grep in functional­ity as well as syntax.

Regular expression­s in Python

Python is a general-purpose programmin­g language invented by Guido van Rossum. The two active versions of it are Python 2 and Python 3, with Python 2.7 most likely being the last version of Python 2 and Python 3.6 being the current stable version of Python 3. Since we are concentrat­ing on regular expression­s in Python, we don’t need to worry too much about the general difference­s between these two versions. Since both Python 2 and Python 3 use the same module for handling regular expression­s, there is no real difference between the two. I have executed all the scripts in this article with Python 2.7.12. The Python module that supports regular expression­s is called re. The module re supports Perl-style regular expression­s by using a regular expression engine called PCRE (Perl Compatible Regular Expression­s). There is another module called regex which also supports regular expression­s in Python. Even though this module offers some additional features when compared with the module re, we will use the module re in this tutorial for two reasons. First, regex is a third-party module whereas re is part of the Python standard library. Second, regex has an old and a new version, known respective­ly as version 0 and version 1 with major difference­s between the two. This makes a study of the module regex even more difficult.

The module re

Python regular expression­s simplify the task of pattern matching a lot by specifying a pattern that can match strings. The first thing you must do is import the module re with the command import re. Python does not support a new type for representi­ng regular expression­s; instead, strings are used for representi­ng regular expression­s. For this reason, a regular expression should be compiled into a pattern object, having methods for various operations like searching for patterns, performing string substituti­ons, etc. If you want to search for the word ‘UNIX’, the required regular expression is the word itself, i.e., UNIX. So, this string should be compiled with the function compile( ) of module re. The required command is pat = re.compile (‘UNIX’), where the object pat contains the compiled regular expression pattern object.

Optional flags of the function compile( )

The function compile( ) has a lot of optional flags. Some of the important ones are DOTALL or S, IGNORECASE or I, LOCALE or L, MULTILINE or M, VERBOSE or X, and UNICODE or U. DOTALL changes the behaviour of the special symbol dot (.). With this flag enabled, even the new line character \n will be matched by the special symbol dot (.). IGNORECASE allows case insensitiv­e search. LOCALE will enable a locale-aware search by considerin­g the properties of the system being used. This allows users to perform searches based on the language preference­s of their system. MULTILINE enables separate search on multiple lines in a single string. VERBOSE allows the creation of more readable regular expression­s. UNICODE allows searches dependent on the Unicode character properties database.

Now, consider the regular expression with the flag IGNORECASE enabled, pat = re.c ompile(‘UNIX’, re.IGNORECASE). What are the strings that will be matched by this regular expression? Well, we are performing a case insensitiv­e search on the word ‘UNIX’, so words like ‘UNIX’, ‘Unix’, ‘unix’ and even ‘uNiX’ will be matched by the given regular expression.

A Python script for regular expression processing

Consider the text file named file1.txt shown below to understand how a regular expression based pattern match works in Python.

unix is an operating system

Unix is an Operating System

UNIX IS an OPERATING SYSTEM

Linux is also an Operating System

Consider the Python script run.py, which reads a file name from the keyboard and opens it. The program then carries out

a line by line search on the file for the pattern given by the compiled regular expression object called pat. The object pat describing the regular expression is compiled in the Python shell and not in the script, so that the same Python script run.py can be called to process different regular expression­s without any modificati­on to the script. The Python shell can be invoked by typing the command python in the terminal. The script run.py reads the file name from the keyboard; so different text files can be processed with this Python script.

filename = raw_input('Enter File Name to Process: ') with open(filename) as fn: for line in fn: m = pat.search(line) i fm: print m.group( )

Now it is time to understand how the script run.py works. The name of the file to be processed is read into a variable called the filename. The with statement of Python, introduced in Python 2.5, is used to open and close the required file. The file is then read line by line to find a match for the required regular expression with a for loop. The line of code, m = pat.search(line) searches for the pattern described by the regular expression in the compiled pattern object ‘pat’ in the string stored in the variable ‘line’. It returns a ‘Match’ object if a match is found or a ‘None’ object if a match is not found. This returned object is saved in the object ‘m’ for further processing. The line of code ‘if m:’ checks whether the object ‘m’ contains a ‘Match’ object or a ‘None’ object. If object ‘m’ is ‘None’ then the if conditiona­l fails and no action is taken. But on the other hand if ‘m’ contains a ‘Match’ object, then the matched string is printed on the screen by the line print m.group( ). The method group( ) is defined for the object ‘Match’ and it returns the string matched by the regular expression.

Figure 2 shows the output obtained by the regular expression with and without the compiler flag IGNORECASE enabled. If you observe the figure carefully, you will see that the module re is imported first and then the pattern is compiled in the Python shell with the line of code, pat = re.compile(‘UNIX’). Then the line of code, execfile(run.py) executes the Python script run.py and the output of this case-sensitive search results in the match of a single string UNIX. As mentioned earlier, the function compile( ) has many optional flags. The pattern object pat is recompiled a second time with the line of code, pat = re.compile(‘UNIX’,re.IGNORECASE) executed on the Python shell with the flag IGNORECASE enabled. The script run. py is executed again and this case-insensitiv­e search results in the match of strings unix, Unix and UNIX.

In the script run.py, if you replace the line of code print m.group( ) with the code print line, the whole line in which a match is found will be printed. This Python script is called line.py. For example, for the pattern, pat = re.compile(‘UNIX’) the modified script will print UNIX IS AN OPERATING SYSTEM instead of UNIX.

The method group( ) is not the only method defined for the object match. The other methods defined are start( ), end( ), span( ) and groups( ). The method start( ) returns the starting position of the match and the method end( ) returns the ending position of the match. The method span( ) returns the starting and ending positions as a tuple. For example, if you replace the line of code print m.group( ) in the script run.py with the code print m.span( ) with the pattern pat = re.compile(‘UNIX’) then the tuple (0,4) will be printed. This Python script is called span. py. In order to understand the working of groups( ) method we need to understand the meaning of the special symbols used in Python regular expression­s.

Special symbols in Python regular expression­s

The following characters: . (dot), ^ (caret), $ (dollar), * (asterisk), + (plus), ? (question mark), { (opening curly bracket), } (closing curly bracket), [ (opening square bracket), ] (closing square bracket), \ (backslash), | (pipe), ( (opening parenthesi­s) and ) (closing parenthesi­s) are the special symbols used in Python regular expression­s. They have special meaning and hence using them in a regular expression will not lead to a literal match for these characters.

The most important special symbol is backslash (\) which is used for two purposes. First, backslash can be used to create more meta characters in regular expression­s. For example, \d means any decimal digit, \D means any non-decimal digit, \s means any whitespace character, \S means any non-whitespace character and \n, \t, etc, all have their usual meaning. Second, if a special symbol is prefixed with a backslash, then its special meaning is removed and thereby results in the literal match of that special symbol. For example, \\ matches a \ and \$ matches a $. The backslash creates some problems because it is a special symbol in Python regular expression­s as well as Python strings. So, if you want to search for the pattern \t in a string, you first need to precede \ with another \ for a literal match resulting in the string \\t. But when you are passing this as an argument to re.compile( ) as a string, you have to precede each of these \ with yet another \ because Python strings also consider \ as a special symbol. Thus, the simply insane regular expression

\\\\ t only will result in a match for \t. In order to overcome this problem, Python regular expression­s use the raw string notation which keeps the regular expression­s simple. In raw string notation, every regular expression string is prefixed with an r

so that you don’t need to add backslash multiple times. So the following two regular expression­s: pat = re.compile(‘ \\\\ t’) and pat = re.compile(r’\\t’) will match the same pattern \t.

The symbol * results in the matching of zero or more repetition­s of the preceding regular expression. The regular expression ab* will match all the strings starting with an a and ending with zero or more b’s. The set of all strings matched by the regular expression is {a, ab, abb, abbb, ...}. The symbol + results in the matching of one or more repetition­s of the preceding regular expression. The regular expression ab+ will match all the strings starting with an a and ending with one or more b’s. The set of all strings matched by the regular expression is {ab, abb, abbb, ...}. The difference between the two is that ab* will match the single character string a, whereas ab+ will not match this string. The symbol ? results in the matching of zero or one repetition of the preceding regular expression. The regular expression ‘ab?’ will match the strings a and ab.

The two symbols [ and ] are used to denote a character class. For example, [abc] will match all strings having the letters a, b or c. A hyphen can be used to denote a set of characters. The regular expression [a-z] matches all strings having lower case letters. Inside the square brackets used for specifying the character class, all the special characters will lose their special meaning. [ab*] matches strings containing the characters a, b or *.

The caret symbol ^ has two purposes. First, it checks for a match at the beginning of a string. ^a matches all the strings starting with an a. Second, the caret symbol inside square brackets means negation. ^[^a] matches all the lines that start with a character other than a. So, a line like aaabbb will not be matched whereas a line like bbbaaa will be matched. The symbol $ matches at the end of a string. a$ will result in the matching of all the strings ending with an a.

As explained earlier, the special symbol dot (.) results in the match of any character except the new line character \n, and the DOTALL flag of compile( ) will result in a match of even a new line character. a.c will match strings like aac, abc, acc, a9c, etc. The symbol | is the or operator of a regular expression. black|white will match the strings with the sub-strings, black or white. So, strings like blackboard, whitewash, black & white, etc, will be matched by the regular expression.

The special symbols, opening and closing curly brackets, are used for searching repeating patterns. This is the one notation that has confused many people who use regular expression­s. I would like to analyse why this occurs. Every textbook and article on regular expression­s declares that the regular expression a{m} matches all the patterns with m number of a’s, and rightly so. Now consider the contents of the text file file2.txt.

a aa aaa aaaa aaaaa

Let us have the following pattern: pat = re.compile(‘a{3}’) executed on the Python terminal and then call our script run. py to do the rest. You might expect to see just one line selected, the line aaa. But the output in Figure 3 shows you that the lines, aaa, aaaa and aaaaa are also matched by this pattern because the string aaa is printed thrice. The textbook definition kind of suggests to you that only aaa should be matched but you are getting much more than that selected. Most of the textbooks that deal with regular expression­s fail to explain this anomaly and that is the one point I would like to clarify once and for all, in this article, if nothing else. Theoretica­l Computer Science 101 says that finite automata do not have the ability to count. Regular expression­s and finite automata are different ways of describing the same thing. I can’t explain this any further but you have to believe me on this. Now what is the reason for three lines getting selected instead of the single line aaa? If you look at the two additional lines selected, aaaa and aaaaa, both contain the sub-string, aaa. That again tells us regular expression­s are not counting; instead, they match for patterns and nothing more.

The other possibilit­ies with this notation are a{,m} which searches for patterns with m or less number of a’s; a{m,} which searches for patterns with m or more a’s; and a{m,n} which searches for patterns with m to n number of a’s, where m can be any integer constant. But do remember that just like a{m}, the regular expression­s a{,m} and a{m,n} will also lead to counter intuitive results due to the same reasons mentioned earlier.

The last two special symbols to be explained are the opening and closing parenthesi­s. These are used to indicate the start and end of a group. For example, (abc)+ will match strings like abc, abcabc, abcabcabc, etc. The contents of a group can be retrieved after a match, and can be used to match with the later parts of a string with the \number special sequence. The groups( ) method of the match object left unexplaine­d earlier can also be discussed now. Let us assume we are searching for a pattern where three two-digit numbers are separated by a colon, like 11:22:33, 44:55:66, etc. Then one possible regular expression is (\d\d):(\d\d):(\d\d). The text file file3.txt contains the following text.

11:22:33 aa:bb:cc dd:cc:ee 44:55:66

Now with the command, pat = re.compile(‘(\d\d):(\d\d):(\ d\d)’) and the script run.py executed on the Python shell, we will get the output shown in Figure 4. This time, there are no surprises; the output shown on the screen is as expected. The figure also shows the execution of a modified script modified_run.py with the line of code print m.group( ) in run. py replaced with the line print m.groups( ). From the figure, it is clear that the groups( ) method of the match object returns a tuple with all the selected values, unlike the group( ) method which returns a string.

Functions in module re

We have already discussed the functions compile( ) and search( ) in the module re. There are also other functions like match( ), split( ), findall( ), sub( ), escape( ), purge( ), etc, in the module re. The function match( ) is used for matching at the beginning of a string with the given regular expression pattern. For example, after executing the command, pat = re.compile(‘UNIX’) in the Python shell, the command, pat.match(“OS is UNIX”) will not give a match, whereas the command pat.match(“UNIX is OS”) will give a match. The function split( ) splits a string by the occurrence­s of the specified regular expression pattern. The command, re.split(‘\d’, ‘a1b2c3’) returns the list of elements selected. In this case, the selected list is [‘a’,‘b’, ‘c’] because the separating character is a decimal digit.

The function findall( ) returns all the non-overlappin­g matches of a pattern in the given string, as a list of strings. The command, print re.findall(‘aba’, ‘abababa’) will return the list [‘aba’, ‘aba’] as the result. In this case, only two occurrence­s of the strings, aba, are found because findall( ) searches for non-overlappin­g matches of a string. Regular expression­s are generally used for pattern matching, but Python is a very powerful programmin­g language and this makes even its regular expression­s far more powerful than the ordinary. An instance of the enhanced power of the Python regular expression­s can be found in the function sub( ), which returns the string obtained by replacing the leftmost non-overlappin­g occurrence­s of the given regular expression in a string with the replacemen­t string provided. For example, with the command, pat = re.compile(‘Regex’) executed in the Python shell, the command pat. sub(‘Python’, ‘Regex is excellent’) will return the string, Python is excellent. The function escape( ) is used to escape all the characters in the given pattern, except alphanumer­ic characters in ASCII. The command print re.escape(‘a.b.c’) executed in the Python shell will return a\.b\.c. The function purge( ) clears the regular expression cache.

A few examples in Python

Now that most of the regular expression syntax has been presented to you, let us go through a few examples where Python regular expression­s are called into action. What will be the string matched by the regular expression ^a\.z$ ? The caret symbol ^ makes sure that there should be an a at the beginning of the required pattern. The dollar symbol $ at the end ensures that the matched string should end with an z. The regular expression \. makes sure that there is a literal match for a dot (.) in between characters a and z. So only the lines containing the string a.z will be matched by this regular expression. Now, what does the regular expression a.z mean? Well, this matches any string with a substring containing an a followed by any character other than a new line character and then followed by an z. So, strings like a.z, aaz, abz, azz, etc, will be matched by this regular expression. What is the pattern matched by the regular expression ^(aa).*\(zz)$ ?

This regular expression matches all the strings that start with the sub-string aa and end with the sub-string zz with zero or more characters in between them. So, strings like aazz, aaazzz, aabzz, etc, will be matched by this regular expression.

If you want to test a new regular expression pattern, you should follow these steps — open a terminal and type the command python to invoke the Python shell. Then, execute the command import re on the shell. Now, execute the command, pat = re.compile(‘###’), where you have to replace ### with the regular expression you want to test. Then execute the script run. py with the command execfile(‘run.py’) to view the results. This article has also discussed a number of ways to modify the script run.py. This script, its modified versions and all the text files used for testing in this article can be downloaded from opensource­foru. com/article_source_code/july17rege­x.zip.

This is just the beginning of our journey. The Python regular expression­s discussed in this article are not comprehens­ive but they are more than sufficient for good data scientists to get on with their work. By the end of this series, you will have a good command over regular expression­s. In the next article, we will discuss yet another programmin­g language where regular expression­s perform their miracles. But the best thing is that even if you are interested in just one programmin­g language, say Python, the remaining articles in this series will still interest you because we will discuss a different set of regular expression­s. So, with a little effort, you will be able to convert those regular expression­s in other programmin­g languages to Python regular expression­s. The same applies to enthusiast­s of other programmin­g languages also.

 ??  ??
 ??  ?? Figure 1: Two regular expression styles in grep
Figure 1: Two regular expression styles in grep
 ??  ?? Figure 2: Optional flags in the compile( ) function
Figure 2: Optional flags in the compile( ) function
 ??  ?? Figure 3: Counting not possible with regular expression­s
Figure 3: Counting not possible with regular expression­s
 ??  ?? Figure 4: Regular expression­s and groups( ) method
Figure 4: Regular expression­s and groups( ) method

Newspapers in English

Newspapers from India