Reg­u­lar Ex­pres­sions in Pro­gram­ming Lan­guages: A Peek at Python

This new se­ries of ar­ti­cles will take read­ers on a jour­ney that ex­plores the reg­u­lar ex­pres­sions in var­i­ous pro­gram­ming lan­guages. The first ar­ti­cle in this se­ries takes a de­tailed look at the use of reg­u­lar ex­pres­sions in Python.

By: Deepu Benson (OpenSource For You)

Of­ten when peo­ple dis­cuss reg­u­lar ex­pres­sions, they use the term ‘regex’, which leads to a mix up be­tween reg­u­lar ex­pres­sions and the tools that use reg­u­lar ex­pres­sions for string search­ing. A clas­sic ex­am­ple of this sort of mis­take is peo­ple con­fus­ing reg­u­lar ex­pres­sions and grep, a UNIX util­ity for pat­tern match­ing. grep is a very pow­er­ful tool but what­ever it may be, it def­i­nitely is not a syn­onym for reg­u­lar ex­pres­sions. A reg­u­lar ex­pres­sion is not a tool but rather, a con­cept in for­mal lan­guage the­ory that de­fines how a se­quence of char­ac­ters can de­fine a search pat­tern. On the other hand, grep is a tool that uses reg­u­lar ex­pres­sions for pat­tern match­ing. But there are a lot of other util­i­ties and pro­gram­ming lan­guages that use reg­u­lar ex­pres­sions for pat­tern match­ing.

The main problem with regular expressions is that both the syntax and the manner of calling them differ across languages; some languages closely resemble each other, while others differ significantly. So, in this series of articles, we discuss how to use regular expressions in the following six programming languages: Python, Perl, Java, JavaScript, C++ and PHP. This doesn’t mean that these are the only programming languages or software that support regular expressions. There are a lot of others that do. For example, you can find regular expressions in programming languages like AWK, Ruby, Tcl, etc, and in software like sed, MySQL, PostgreSQL, etc. Even absolute beginners typing *.pdf in a search box on Windows to find all the PDF files on their system are using a close relative of regular expressions. Strictly speaking, *.pdf is a wildcard (glob) pattern, in which * stands for any sequence of characters, rather than a regular expression, but the idea of describing many strings with one pattern is the same.

Since a lot of articles about regular expressions cover grep in detail, this article will not cover that most famous of regular expression tools, even though grep is the one tool that depends solely on regular expressions for its survival. One good reason to study regular expressions is that many results obtained from powerful data mining tools like Hadoop and Weka can often be replicated by using simple regular expressions.

Some of the popular regular expression syntaxes include Perl-style, POSIX-style, Emacs-style, etc. The syntax of the regular expression used in a tool or programming language depends on the regular expression engine used in it. A tool that can use more than one regular expression engine can therefore support more than one regular expression style. For example, the immensely popular regular expression tool by GNU called grep, by default, uses a regular expression engine that supports POSIX regular expressions. But it is also possible to use Perl-style regular expressions in grep by enabling the option -P. In Perl-style regular expressions, the notation \d defines a pattern with a digit, whereas in POSIX-style regular expressions, \d does not have any special meaning. So, in the default mode of grep, it will match the letter d and not a digit. But if grep is using Perl-style regular expressions, then a digit will be matched by the same regular expression. In this series, all the regular expressions, and the strings and results obtained by using them are italicised to distinguish them from normal text.

Fig­ure 1 shows the out­put ob­tained when the de­fault mode and the Perl style reg­u­lar ex­pres­sions are used with grep. The texts high­lighted in red are the por­tions of the string matched by the given reg­u­lar ex­pres­sion. In the fig­ure, you can ob­serve that the same reg­u­lar ex­pres­sion while pro­cess­ing the same text in two dif­fer­ent modes, matches dif­fer­ent pat­terns.

As a side note, I would like to point out that not all pattern matching utilities called grep are the same. There are minor differences between the different implementations of grep. For example, the implementations by GNU, IBM AIX and Solaris differ in at least some of their functionality. There are also variants of grep like egrep, fgrep, etc, which differ from grep in functionality as well as syntax.

Reg­u­lar ex­pres­sions in Python

Python is a general-purpose programming language invented by Guido van Rossum. The two active versions of it are Python 2 and Python 3, with Python 2.7 most likely being the last version of Python 2 and Python 3.6 being the current stable version of Python 3 at the time of writing. Since we are concentrating on regular expressions in Python, we don’t need to worry too much about the general differences between these two versions. Both Python 2 and Python 3 use the same module for handling regular expressions, so the differences that matter here are small (the most visible one is that, in Python 3, patterns written as ordinary strings are Unicode-aware by default). I have executed all the scripts in this article with Python 2.7.12. The Python module that supports regular expressions is called re. The module re supports Perl-style regular expressions. Its syntax is close to that of Perl and of PCRE (Perl Compatible Regular Expressions), but the matching is done by Python’s own regular expression engine rather than by the PCRE library. There is another module called regex which also supports regular expressions in Python. Even though this module offers some additional features when compared with the module re, we will use the module re in this tutorial for two reasons. First, regex is a third-party module whereas re is part of the Python standard library. Second, regex has an old and a new version, known respectively as version 0 and version 1, with major differences between the two. This makes a study of the module regex even more difficult.

The mod­ule re

Python regular expressions simplify the task of pattern matching a lot by specifying a pattern that can match strings. The first thing you must do is import the module re with the command import re. Python does not have special syntax for writing regular expressions; instead, ordinary strings are used for representing them. For this reason, a regular expression should be compiled into a pattern object, which has methods for various operations like searching for patterns, performing string substitutions, etc. If you want to search for the word ‘UNIX’, the required regular expression is the word itself, i.e., UNIX. So, this string should be compiled with the function compile( ) of module re. The required command is pat = re.compile('UNIX'), after which pat holds the compiled pattern object.
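As a minimal sketch of the steps just described, the snippet below imports re, compiles the pattern UNIX and tries it on two sample strings. It is written for Python 3 (the article's own scripts use Python 2.7), the sample strings and the names found and missing are chosen only for illustration, and the method search( ) used here is explained later in the article.

```python
import re

# Compile the literal pattern UNIX into a pattern object.
pat = re.compile('UNIX')

# search() returns a match object on success and None otherwise.
found = pat.search('UNIX is an operating system')
missing = pat.search('Linux is also an operating system')

print(found.group())   # the text that was matched
print(missing)         # None, because 'UNIX' does not occur in the string
```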

Op­tional flags of the func­tion com­pile( )

The function compile( ) accepts a number of optional flags. Some of the important ones are DOTALL or S, IGNORECASE or I, LOCALE or L, MULTILINE or M, VERBOSE or X, and UNICODE or U. DOTALL changes the behaviour of the special symbol dot (.). With this flag enabled, even the new line character \n will be matched by the special symbol dot (.). IGNORECASE allows case insensitive search. LOCALE will enable a locale-aware search by considering the properties of the system being used. This allows users to perform searches based on the language preferences of their system. MULTILINE makes the symbols ^ and $ match at the beginning and end of every line in a multi-line string, rather than only at the beginning and end of the whole string. VERBOSE allows the creation of more readable regular expressions by permitting white space and comments inside the pattern. UNICODE makes sequences like \w, \d and \s depend on the Unicode character properties database.
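The effect of three of these flags can be seen in a short experiment. This is only a sketch in Python 3 syntax; the sample text and the names plain, dotall, starts and time_pat are made up for the illustration, and findall( ) is described later in the article.

```python
import re

text = 'first line\nsecond line'

# Without DOTALL the dot stops at the new line character.
plain = re.compile('first.*').search(text).group()

# With DOTALL the dot also matches \n, so the match runs to the end.
dotall = re.compile('first.*', re.DOTALL).search(text).group()

# MULTILINE lets ^ match at the start of every line, not just the first.
starts = re.compile(r'^\w+', re.MULTILINE).findall(text)

# VERBOSE allows white space and comments inside the pattern itself.
time_pat = re.compile(r'''
    \d\d   # hours
    :      # separator
    \d\d   # minutes
''', re.VERBOSE)

print(repr(plain))
print(repr(dotall))
print(starts)
print(time_pat.search('at 10:45 today').group())
```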

Now, consider the regular expression with the flag IGNORECASE enabled, pat = re.compile('UNIX', re.IGNORECASE). What are the strings that will be matched by this regular expression? Well, we are performing a case insensitive search on the word ‘UNIX’, so words like ‘UNIX’, ‘Unix’, ‘unix’ and even ‘uNiX’ will be matched by the given regular expression.
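A quick way to convince yourself is to run both versions of the pattern over a small word list. The list below and the variable names are assumptions made for this sketch, which uses Python 3 syntax.

```python
import re

words = ['UNIX', 'Unix', 'unix', 'uNiX', 'Linux']

sensitive = re.compile('UNIX')
insensitive = re.compile('UNIX', re.IGNORECASE)

# Keep only the words in which each pattern finds a match.
matched_sensitive = [w for w in words if sensitive.search(w)]
matched_insensitive = [w for w in words if insensitive.search(w)]

print(matched_sensitive)
print(matched_insensitive)
```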

A Python script for reg­u­lar ex­pres­sion pro­cess­ing

Con­sider the text file named file1.txt shown be­low to un­der­stand how a reg­u­lar ex­pres­sion based pat­tern match works in Python.

unix is an op­er­at­ing sys­tem

Unix is an Op­er­at­ing Sys­tem

UNIX IS an OP­ER­AT­ING SYS­TEM

Linux is also an Op­er­at­ing Sys­tem

Consider the Python script run.py, which reads a file name from the keyboard and opens it. The program then carries out a line by line search on the file for the pattern given by the compiled regular expression object called pat. The object pat describing the regular expression is compiled in the Python shell and not in the script, so that the same Python script run.py can be called to process different regular expressions without any modification to the script. The Python shell can be invoked by typing the command python in the terminal. The script run.py reads the file name from the keyboard; so different text files can be processed with this Python script.

filename = raw_input('Enter File Name to Process: ')
with open(filename) as fn:
    for line in fn:
        # pat is the pattern object compiled beforehand in the Python shell
        m = pat.search(line)
        if m:
            print m.group()

Now it is time to understand how the script run.py works. The name of the file to be processed is read into a variable called filename. The with statement of Python, introduced in Python 2.5, is used to open the required file and to close it automatically afterwards. The file is then read line by line with a for loop to find a match for the required regular expression. The line of code m = pat.search(line) searches the string stored in the variable ‘line’ for the pattern held in the compiled pattern object ‘pat’. It returns a ‘Match’ object if a match is found, or ‘None’ if a match is not found. The returned value is saved in ‘m’ for further processing. The line of code ‘if m:’ checks whether ‘m’ holds a ‘Match’ object or ‘None’. If ‘m’ is ‘None’, the if condition fails and no action is taken. On the other hand, if ‘m’ holds a ‘Match’ object, the matched string is printed on the screen by the line print m.group( ). The method group( ) is defined for ‘Match’ objects and it returns the string matched by the regular expression.
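For readers on Python 3, where raw_input( ) and the print statement no longer exist, the same logic can be sketched as a small function. This is not the article's run.py: the function name find_matches and the list file1 (which stands in for the contents of file1.txt so that no file is needed) are hypothetical.

```python
import re

def find_matches(pat, lines):
    """Return the matched text for every line in which pat is found."""
    results = []
    for line in lines:
        m = pat.search(line)
        if m:                      # m is None when nothing matched
            results.append(m.group())
    return results

# The four lines of file1.txt, kept in a list for this sketch.
file1 = [
    'unix is an operating system',
    'Unix is an Operating System',
    'UNIX IS an OPERATING SYSTEM',
    'Linux is also an Operating System',
]

print(find_matches(re.compile('UNIX'), file1))
print(find_matches(re.compile('UNIX', re.IGNORECASE), file1))
```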

Figure 2 shows the output obtained by the regular expression with and without the compiler flag IGNORECASE enabled. If you observe the figure carefully, you will see that the module re is imported first and then the pattern is compiled in the Python shell with the line of code, pat = re.compile('UNIX'). Then the line of code, execfile('run.py') executes the Python script run.py, and this case-sensitive search results in the match of a single string, UNIX. As mentioned earlier, the function compile( ) has many optional flags. The pattern object pat is compiled a second time, with the flag IGNORECASE enabled, by the line of code pat = re.compile('UNIX', re.IGNORECASE) executed on the Python shell. The script run.py is executed again and this case-insensitive search results in the match of the strings unix, Unix and UNIX.

In the script run.py, if you replace the line of code print m.group( ) with the code print line, the whole line in which a match is found will be printed. This modified Python script is called line.py. For example, for the pattern pat = re.compile('UNIX'), the modified script will print UNIX IS an OPERATING SYSTEM instead of UNIX.

The method group( ) is not the only method defined for the match object. The other methods defined are start( ), end( ), span( ) and groups( ). The method start( ) returns the starting position of the match and the method end( ) returns the position just after the last matched character. The method span( ) returns the starting and ending positions as a tuple. For example, if you replace the line of code print m.group( ) in the script run.py with the code print m.span( ), then with the pattern pat = re.compile('UNIX') the tuple (0, 4) will be printed. This Python script is called span.py. In order to understand the working of the groups( ) method, we need to understand the meaning of the special symbols used in Python regular expressions.
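The positions reported by these methods are easy to check by hand. The following Python 3 sketch uses one line from file1.txt and one extra string invented for the illustration.

```python
import re

pat = re.compile('UNIX')

m = pat.search('UNIX IS an OPERATING SYSTEM')
print(m.start(), m.end(), m.span())

# When the match is not at the beginning, the positions shift accordingly.
m2 = pat.search('OS is UNIX')
print(m2.span())
```

Note that end( ) is one past the index of the last matched character, so slicing the original string with the span returns exactly the matched text.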

Spe­cial sym­bols in Python reg­u­lar ex­pres­sions

The fol­low­ing char­ac­ters: . (dot), ^ (caret), $ (dol­lar), * (as­ter­isk), + (plus), ? (ques­tion mark), { (open­ing curly bracket), } (clos­ing curly bracket), [ (open­ing square bracket), ] (clos­ing square bracket), \ (back­slash), | (pipe), ( (open­ing paren­the­sis) and ) (clos­ing paren­the­sis) are the spe­cial sym­bols used in Python reg­u­lar ex­pres­sions. They have spe­cial mean­ing and hence us­ing them in a reg­u­lar ex­pres­sion will not lead to a lit­eral match for these char­ac­ters.

The most important special symbol is backslash (\) which is used for two purposes. First, backslash can be used to create more meta characters in regular expressions. For example, \d means any decimal digit, \D means any character that is not a decimal digit, \s means any whitespace character, \S means any non-whitespace character and \n, \t, etc, all have their usual meaning. Second, if a special symbol is prefixed with a backslash, then its special meaning is removed, which results in the literal match of that special symbol. For example, \\ matches a \ and \$ matches a $. The backslash creates some problems because it is a special symbol in Python regular expressions as well as in Python strings. So, if you want to search for the two characters \t (a backslash followed by the letter t) in a string, you first need to precede \ with another \ for a literal match, resulting in the regular expression \\t. But when you are passing this as an argument to re.compile( ) as a string, you have to precede each of these \ with yet another \ because Python strings also consider \ as a special symbol. Thus, only the simply insane string \\\\t will result in a match for \t. In order to overcome this problem, Python offers the raw string notation, which keeps regular expressions simple. In raw string notation, the regular expression string is prefixed with an r so that you don’t need to add the backslash multiple times. So the following two regular expressions: pat = re.compile('\\\\t') and pat = re.compile(r'\\t') will match the same pattern \t.
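The equivalence of the two spellings can be checked directly. In this Python 3 sketch the sample text is an invented string that contains a backslash followed by the letter t, not a tab character.

```python
import re

# A backslash followed by 't' (and later by 'f'); there is no tab in here.
text = r'path\to\file'

plain = re.compile('\\\\t')   # ordinary string: four backslashes
raw = re.compile(r'\\t')      # raw string: two backslashes

# Both spellings produce exactly the same regular expression.
same_pattern = plain.pattern == raw.pattern
print(same_pattern)

print(raw.search(text).span())

# A real tab character is not matched by this pattern.
print(raw.search('a\tb'))
```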

The sym­bol * re­sults in the match­ing of zero or more rep­e­ti­tions of the pre­ced­ing reg­u­lar ex­pres­sion. The reg­u­lar ex­pres­sion ab* will match all the strings start­ing with an a and ending with zero or more b’s. The set of all strings matched by the reg­u­lar ex­pres­sion is {a, ab, abb, abbb, ...}. The sym­bol + re­sults in the match­ing of one or more rep­e­ti­tions of the pre­ced­ing reg­u­lar ex­pres­sion. The reg­u­lar ex­pres­sion ab+ will match all the strings start­ing with an a and ending with one or more b’s. The set of all strings matched by the reg­u­lar ex­pres­sion is {ab, abb, abbb, ...}. The dif­fer­ence be­tween the two is that ab* will match the sin­gle char­ac­ter string a, whereas ab+ will not match this string. The sym­bol ? re­sults in the match­ing of zero or one rep­e­ti­tion of the pre­ced­ing reg­u­lar ex­pres­sion. The reg­u­lar ex­pres­sion ‘ab?’ will match the strings a and ab.
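The three sets described above can be reproduced with a few lines of Python 3. The helper whole_matches is a hypothetical name; it uses re.fullmatch( ), available from Python 3.4, so that the entire candidate string has to be matched.

```python
import re

candidates = ['a', 'ab', 'abb', 'abbb']

def whole_matches(pattern):
    """Return the candidates that are matched in full by the pattern."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = whole_matches('ab*')       # zero or more b's
plus = whole_matches('ab+')       # one or more b's
optional = whole_matches('ab?')   # zero or one b

print(star)
print(plus)
print(optional)
```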

The two symbols [ and ] are used to denote a character class. For example, [abc] will match any string containing one of the letters a, b or c. A hyphen can be used to denote a range of characters. The regular expression [a-z] matches any string containing a lower case letter. Inside the square brackets used for specifying the character class, most special characters lose their special meaning; the exceptions are the backslash, a caret placed first, a hyphen placed between two characters, and the closing square bracket. For example, [ab*] matches strings containing the characters a, b or *.
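To see which individual characters a class picks out, findall( ) (covered later in the article) is convenient. The sample strings in this Python 3 sketch are invented for the purpose.

```python
import re

# Each call lists the characters of the string that belong to the class.
letters_abc = re.findall('[abc]', 'cabbage')
lower_case = re.findall('[a-z]', 'Unix 7')
with_star = re.findall('[ab*]', 'a*b=c')   # * is an ordinary character here

print(letters_abc)
print(lower_case)
print(with_star)
```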

The caret sym­bol ^ has two pur­poses. First, it checks for a match at the be­gin­ning of a string. ^a matches all the strings start­ing with an a. Sec­ond, the caret sym­bol in­side square brack­ets means nega­tion. ^[^a] matches all the lines that start with a char­ac­ter other than a. So, a line like aaabbb will not be matched whereas a line like bb­baaa will be matched. The sym­bol $ matches at the end of a string. a$ will re­sult in the match­ing of all the strings ending with an a.
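The two example lines from this paragraph can be tested as follows. This is a small Python 3 sketch and the variable names are chosen only for illustration.

```python
import re

starts_with_a = re.compile('^a')
starts_with_other = re.compile('^[^a]')   # caret inside [] negates the class
ends_with_a = re.compile('a$')

for line in ['aaabbb', 'bbbaaa']:
    print(line,
          bool(starts_with_a.search(line)),
          bool(starts_with_other.search(line)),
          bool(ends_with_a.search(line)))
```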

As ex­plained ear­lier, the spe­cial sym­bol dot (.) re­sults in the match of any char­ac­ter ex­cept the new line char­ac­ter \n, and the DOTALL flag of com­pile( ) will re­sult in a match of even a new line char­ac­ter. a.c will match strings like aac, abc, acc, a9c, etc. The sym­bol | is the or op­er­a­tor of a reg­u­lar ex­pres­sion. black|white will match the strings with the sub-strings, black or white. So, strings like black­board, white­wash, black & white, etc, will be matched by the reg­u­lar ex­pres­sion.
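Both symbols can be tried out on small lists of strings. In this Python 3 sketch the extra test strings ac, a\nc and grey are assumptions added to show what is not matched.

```python
import re

dot = re.compile('a.c')
dot_hits = [s for s in ['aac', 'abc', 'acc', 'a9c', 'ac', 'a\nc']
            if dot.search(s)]
# 'ac' has nothing between a and c; 'a\nc' has only a new line there.
print(dot_hits)

# With DOTALL the new line between a and c is accepted as well.
print(bool(re.compile('a.c', re.DOTALL).search('a\nc')))

colour = re.compile('black|white')
colour_hits = [s for s in ['blackboard', 'whitewash', 'black & white', 'grey']
               if colour.search(s)]
print(colour_hits)
```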

The spe­cial sym­bols, open­ing and clos­ing curly brack­ets, are used for search­ing re­peat­ing pat­terns. This is the one no­ta­tion that has con­fused many peo­ple who use reg­u­lar ex­pres­sions. I would like to an­a­lyse why this oc­curs. Ev­ery text­book and ar­ti­cle on reg­u­lar ex­pres­sions de­clares that the reg­u­lar ex­pres­sion a{m} matches all the pat­terns with m num­ber of a’s, and rightly so. Now con­sider the con­tents of the text file file2.txt.

a

aa

aaa

aaaa

aaaaa

Let us have the following pattern: pat = re.compile('a{3}') executed on the Python terminal and then call our script run.py to do the rest. You might expect to see just one line selected, the line aaa. But the output in Figure 3 shows you that the lines aaaa and aaaaa are also matched by this pattern, because the string aaa is printed thrice. The textbook definition kind of suggests to you that only aaa should be matched but you are getting much more than that selected. Most of the textbooks that deal with regular expressions fail to explain this anomaly and that is the one point I would like to clarify once and for all, in this article, if nothing else. Theoretical Computer Science 101 says that finite automata do not have the ability to count without bound, and regular expressions and finite automata are different ways of describing the same thing. I can’t explain the theory any further here, but the practical reason is simple: search( ) does not ask whether the whole line consists of exactly three a’s; it asks whether the pattern occurs anywhere in the line. If you look at the two additional lines selected, aaaa and aaaaa, both contain the sub-string aaa. That again tells us regular expressions are not counting the a’s in a line; instead, they match patterns and nothing more. If you really want only the line aaa, anchor the pattern at both ends, as in ^a{3}$.
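The behaviour is easy to reproduce without the script. In the Python 3 sketch below, the list file2 stands in for the five lines of file2.txt, and the anchored pattern ^a{3}$ is added for comparison; the variable names are assumptions.

```python
import re

file2 = ['a', 'aa', 'aaa', 'aaaa', 'aaaaa']

unanchored = re.compile('a{3}')
anchored = re.compile('^a{3}$')

# search() succeeds wherever the line contains aaa as a sub-string.
found_anywhere = [line for line in file2 if unanchored.search(line)]
# What gets printed is the matched text, which is aaa every time.
printed = [unanchored.search(line).group() for line in found_anywhere]
# Anchoring both ends leaves only the line that is exactly aaa.
found_exact = [line for line in file2 if anchored.search(line)]

print(found_anywhere)
print(printed)
print(found_exact)
```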

The other possibilities with this notation are a{,m} which searches for patterns with m or fewer a’s; a{m,} which searches for patterns with m or more a’s; and a{m,n} which searches for patterns with m to n a’s, where m and n are non-negative integer constants. But do remember that, just like a{m}, the regular expressions a{,m} and a{m,n} will also lead to counter-intuitive results when used with search( ), for the same reason mentioned earlier.
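A short Python 3 sketch shows both the intended meaning of these forms and the counter-intuitive result. The helper lines_fully_matching is a hypothetical name, and re.fullmatch( ) is used so that the whole string must fit the pattern.

```python
import re

file2 = ['a', 'aa', 'aaa', 'aaaa', 'aaaaa']

def lines_fully_matching(pattern):
    """Return the lines that the pattern matches from start to end."""
    return [s for s in file2 if re.fullmatch(pattern, s)]

at_most_two = lines_fully_matching('a{,2}')
at_least_four = lines_fully_matching('a{4,}')
two_to_three = lines_fully_matching('a{2,3}')

print(at_most_two)
print(at_least_four)
print(two_to_three)

# With search(), a{2,3} still succeeds on aaaaa: it simply matches
# the first three a's and ignores the rest of the line.
print(re.search('a{2,3}', 'aaaaa').group())
```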

The last two spe­cial sym­bols to be ex­plained are the open­ing and clos­ing paren­the­sis. These are used to in­di­cate the start and end of a group. For ex­am­ple, (abc)+ will match strings like abc, ab­cabc, ab­cab­cabc, etc. The con­tents of a group can be re­trieved af­ter a match, and can be used to match with the later parts of a string with the \num­ber spe­cial se­quence. The groups( ) method of the match ob­ject left un­ex­plained ear­lier can also be dis­cussed now. Let us as­sume we are search­ing for a pat­tern where three two-digit numbers are sep­a­rated by a colon, like 11:22:33, 44:55:66, etc. Then one pos­si­ble reg­u­lar ex­pres­sion is (\d\d):(\d\d):(\d\d). The text file file3.txt con­tains the fol­low­ing text.

11:22:33

aa:bb:cc

dd:cc:ee

44:55:66

Now with the command pat = re.compile('(\d\d):(\d\d):(\d\d)') and the script run.py executed on the Python shell, we will get the output shown in Figure 4. This time, there are no surprises; the output shown on the screen is as expected. The figure also shows the execution of a modified script modified_run.py, in which the line of code print m.group( ) in run.py is replaced with the line print m.groups( ). From the figure, it is clear that the groups( ) method of the match object returns a tuple with all the captured values, unlike the group( ) method which returns a string.
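The difference between group( ) and groups( ), and the \number special sequence mentioned earlier, can be sketched in Python 3 as follows. The list file3 stands in for file3.txt, the pattern is written as a raw string, and the pattern named repeated, together with its test strings, is an extra example that does not appear in the article's figures.

```python
import re

pat = re.compile(r'(\d\d):(\d\d):(\d\d)')
file3 = ['11:22:33', 'aa:bb:cc', 'dd:cc:ee', '44:55:66']

matches = [m for m in (pat.search(line) for line in file3) if m]

whole = [m.group() for m in matches]    # one string per match
parts = [m.groups() for m in matches]   # one tuple of groups per match

print(whole)
print(parts)

# \1 refers back to whatever the first group matched.
repeated = re.compile(r'(\d\d):\1')
print(repeated.search('11:11:22').group())
print(repeated.search('11:22:33'))
```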

Func­tions in mod­ule re

We have already discussed the functions compile( ) and search( ) in the module re. There are also other functions like match( ), split( ), findall( ), sub( ), escape( ), purge( ), etc, in the module re. The function match( ) is used for matching the given regular expression pattern at the beginning of a string only. For example, after executing the command pat = re.compile('UNIX') in the Python shell, the command pat.match("OS is UNIX") will not give a match, whereas the command pat.match("UNIX is OS") will give a match. The function split( ) splits a string at the occurrences of the specified regular expression pattern. The command re.split('\d', 'a1b2c3') returns the list of pieces left after the separators are removed. In this case, the list is ['a', 'b', 'c', ''], because every decimal digit acts as a separator and nothing follows the final separator 3, which leaves an empty string at the end of the list.
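Both functions can be tried out in a few lines. This sketch uses Python 3 syntax and writes the split pattern as a raw string; the variable names are chosen for illustration.

```python
import re

pat = re.compile('UNIX')

# match() only looks at the beginning of the string ...
at_start = pat.match('UNIX is OS')
not_at_start = pat.match('OS is UNIX')
# ... whereas search() scans the whole string.
anywhere = pat.search('OS is UNIX')

print(at_start is not None, not_at_start is not None, anywhere is not None)

# Every digit is a separator; the trailing 3 leaves an empty last piece.
pieces = re.split(r'\d', 'a1b2c3')
print(pieces)
```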

The function findall( ) returns all the non-overlapping matches of a pattern in the given string, as a list of strings. The command print re.findall('aba', 'abababa') will print the list ['aba', 'aba'] as the result. In this case, only two occurrences of the string aba are found because findall( ) searches for non-overlapping matches. Regular expressions are generally used for pattern matching, but Python is a very powerful programming language and this makes even its regular expressions far more powerful than the ordinary. An instance of this can be found in the function sub( ), which returns the string obtained by replacing the leftmost non-overlapping occurrences of the given regular expression in a string with the replacement string provided. For example, with the command pat = re.compile('Regex') executed in the Python shell, the command pat.sub('Python', 'Regex is excellent') will return the string Python is excellent. The function escape( ) escapes the characters in the given string that would otherwise act as special symbols, so that the string can be matched literally (older versions of Python escape every character except ASCII alphanumeric characters). The command print re.escape('a.b.c') executed in the Python shell will print a\.b\.c. The function purge( ) clears the regular expression cache.
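The four functions from this paragraph can be exercised together. The snippet is a Python 3 sketch, so print is called as a function, and the variable names are made up for the example.

```python
import re

# Only two matches: the middle aba overlaps the first one and is skipped.
occurrences = re.findall('aba', 'abababa')
print(occurrences)

pat = re.compile('Regex')
replaced = pat.sub('Python', 'Regex is excellent')
print(replaced)

# The dots are escaped so that they match literally.
escaped = re.escape('a.b.c')
print(escaped)

# Clear the module's cache of compiled patterns; purge() returns None.
re.purge()
```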

A few ex­am­ples in Python

Now that most of the regular expression syntax has been presented to you, let us go through a few examples where Python regular expressions are called into action. What will be the string matched by the regular expression ^a\.z$ ? The caret symbol ^ makes sure that there is an a at the beginning of the required pattern. The dollar symbol $ at the end ensures that the matched string ends with a z. The regular expression \. makes sure that there is a literal match for a dot (.) in between the characters a and z. So only a line consisting of exactly the string a.z will be matched by this regular expression. Now, what does the regular expression a.z mean? Well, this matches any string with a substring containing an a followed by any character other than a new line character and then followed by a z. So, strings like a.z, aaz, abz, azz, etc, will be matched by this regular expression. What is the pattern matched by the regular expression ^(aa).*(zz)$ ?

This reg­u­lar ex­pres­sion matches all the strings that start with the sub-string aa and end with the sub-string zz with zero or more char­ac­ters in be­tween them. So, strings like aazz, aaazzz, aabzz, etc, will be matched by this reg­u­lar ex­pres­sion.
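The answers to these three questions can be verified with a short Python 3 sketch. The strings xa.zx, aaz and zzaa are extra test cases assumed here to show what is and is not matched, and the helper hits is a hypothetical name.

```python
import re

exact = re.compile(r'^a\.z$')
loose = re.compile('a.z')
framed = re.compile('^(aa).*(zz)$')

def hits(pat, strings):
    """Return the strings in which pat finds a match."""
    return [s for s in strings if pat.search(s)]

samples = ['a.z', 'aaz', 'abz', 'azz', 'xa.zx']
exact_hits = hits(exact, samples)
loose_hits = hits(loose, samples)
framed_hits = hits(framed, ['aazz', 'aaazzz', 'aabzz', 'aaz', 'zzaa'])

print(exact_hits)
print(loose_hits)
print(framed_hits)
```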

If you want to test a new regular expression pattern, you should follow these steps. Open a terminal and type the command python to invoke the Python shell. Then, execute the command import re on the shell. Now, execute the command pat = re.compile('###'), where you have to replace ### with the regular expression you want to test. Then execute the script run.py with the command execfile('run.py') to view the results. This article has also discussed a number of ways to modify the script run.py. This script, its modified versions and all the text files used for testing in this article can be downloaded from opensourceforu.com/article_source_code/july17regex.zip.

This is just the be­gin­ning of our jour­ney. The Python reg­u­lar ex­pres­sions dis­cussed in this ar­ti­cle are not com­pre­hen­sive but they are more than suf­fi­cient for good data sci­en­tists to get on with their work. By the end of this se­ries, you will have a good com­mand over reg­u­lar ex­pres­sions. In the next ar­ti­cle, we will dis­cuss yet an­other pro­gram­ming lan­guage where reg­u­lar ex­pres­sions per­form their mir­a­cles. But the best thing is that even if you are in­ter­ested in just one pro­gram­ming lan­guage, say Python, the re­main­ing ar­ti­cles in this se­ries will still in­ter­est you be­cause we will dis­cuss a dif­fer­ent set of reg­u­lar ex­pres­sions. So, with a lit­tle ef­fort, you will be able to con­vert those reg­u­lar ex­pres­sions in other pro­gram­ming lan­guages to Python reg­u­lar ex­pres­sions. The same ap­plies to en­thu­si­asts of other pro­gram­ming lan­guages also.

Fig­ure 1: Two reg­u­lar ex­pres­sion styles in grep

Fig­ure 2: Op­tional flags in the com­pile( ) func­tion

Fig­ure 3: Count­ing not pos­si­ble with reg­u­lar ex­pres­sions

Fig­ure 4: Reg­u­lar ex­pres­sions and groups( ) method
