Regular Expressions in Programming Languages: Java for You

OpenSource For You

By: Deepu Benson

The author is a free software enthusiast whose area of interest is theoretical computer science. He maintains a technical blog at www.computingforbeginners.blogspot.in and can be reached at deepumb@hotmail.com.

Java is an object-oriented general-purpose programming language. Java applications are first compiled to bytecode, which can then be run on a Java virtual machine (JVM), independent of the underlying computer architecture. According to Wikipedia, “A Java virtual machine is an abstract computing machine that enables a computer to run a Java program.” Don’t be put off by this complicated definition; just think of the JVM as software capable of running Java bytecode. Because the JVM acts as an interpreter for Java bytecode, Java is often called a compiled and interpreted language. The development of Java, initially called Oak, was begun in 1991 by James Gosling, Mike Sheridan and Patrick Naughton. The first public implementation of Java was released as Java 1.0 in 1996 by Sun Microsystems. Sun Microsystems is now owned by Oracle Corporation. Unlike many other programming languages, Java has a mascot, called Duke (shown in Figure 1).

As with the previous articles in this series, I really wanted to begin with a brief discussion of the history of Java, describing its different platforms and versions. But here I am at a loss. The large number of Java platforms and the complicated version numbering scheme followed by Sun Microsystems make such a discussion difficult. For example, explaining terms like Java 2, Java SE, Core Java, JDK, Java EE, etc, in detail might require a series of articles. Such a discussion of the history of Java might be a worthy pursuit for another time, but definitely not for this article. So, all I am going to do is explain a few key points regarding the various Java implementations.

First of all, Java Card, Java ME (Micro Edition), Java SE (Standard Edition) and Java EE (Enterprise Edition) are all different Java platforms that target different classes of devices and application domains. For example, Java SE is customised for general-purpose use on desktop PCs, servers and similar devices. Another important question that requires an answer is, ‘What is the difference between Java SE and Java 2?’ Books with titles like ‘Learn Java 2 in 48 Hours’ or ‘Learn Java SE in Two Days’ can confuse beginners trying to choose between them. In a nutshell, there is no difference between the two. All this confusion arises from the complicated naming convention followed by Sun Microsystems.

The December 1998 release of Java was called Java 2, and the version name J2SE 1.2 was given to JDK 1.2 to distinguish it from the other platforms of Java. Later, J2SE 1.5 (JDK 1.5) was renamed J2SE 5.0 and then Java SE 5, with the maturity of J2SE over the years cited as the reason for the name change. At the time of writing, the latest version of Java is Java SE 9, which was released in September 2017. In the older internal numbering, Java 8 is JDK 1.8, and by the same pattern Java 9 corresponds to JDK 1.9, although from Java 9 onwards the reported version string is simply 9. So, keep in mind that Java SE was formerly known as Java 2 Platform, Standard Edition or J2SE.

The Java Development Kit (JDK) is an implementation of one of the Java platforms (Standard Edition, Enterprise Edition or Micro Edition) in the form of a binary product. The JDK includes the JVM and a few other tools like the compiler (javac), debugger (jdb), applet viewer, etc, which are required for the development of Java applications and applets. At the time of writing, the latest version of the JDK is JDK 9.0.1, released in October 2017. OpenJDK is a free and open source implementation of Java SE. The OpenJDK implementation is licensed under the GNU General Public License (GNU GPL). The Java Class Library (JCL) is a set of dynamically loadable libraries that Java applications can call at run time. The JCL contains a number of packages, and each of them contains a number of classes that provide various functionalities. Some of the packages in the JCL are java.lang, java.io, java.net, java.util, etc.

The ‘Hello World’ program in Java

Other than console-based Java application programs, special classes like the applet, servlet, swing, etc, are used to develop Java programs for a variety of tasks. For example, Java applets are programs that are embedded in other applications, typically in a Web page displayed in a browser. Regular expressions can be used in Java application programs and in programs based on other classes like the applet, swing, servlet, etc, without making any changes. Since there is no difference in the use of regular expressions, all our discussions are based on simple Java application programs. But before exploring Java programs using regular expressions, let us flex our muscles by executing a simple ‘Hello World’ program in Java. The code given below shows the program HelloWorld.java.
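A minimal version of HelloWorld.java, consistent with the description that follows, might look like this. It is a sketch rather than the exact listing from the download archive; only the println line and the main() method are taken from the text.

```java
// HelloWorld.java: a minimal console application (illustrative sketch).
class HelloWorld {
    // main() is the entry point that the JVM identifies and executes.
    public static void main(String[] args) {
        // Prints the message on the terminal, followed by a newline.
        System.out.println("Hello World");
    }
}
```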

To compile the Java source file HelloWorld.java, open a terminal in the directory containing the file and execute the command: javac HelloWorld.java

Now a Java class file called HelloWorld.class, containing the Java bytecode, is created in the directory. The JVM can be invoked to execute this bytecode with the command: java HelloWorld

Note that the java command takes the class name, without the .class extension.

The message ‘Hello World’ is displayed on the terminal. Figure 2 shows the execution and output of the Java program HelloWorld.java. The program contains a special method named main(), the starting point of the program, which is identified and executed by the JVM. Remember that a method in the object-oriented programming paradigm is essentially what a function is in the procedural programming paradigm. The main() method contains the following line of code, which prints the message ‘Hello World’ on the terminal: ‘System.out.println("Hello World");’

The program HelloWorld.java and all the other programs discussed in this article can be downloaded from opensourceforu.com/article_source_code/January18javaforyou.zip.

Figure 2: Hello World program in Java

Regular expressions in Java

Now, coming down to business, let us discuss regular expressions in Java. The first question to be answered is, ‘What flavour of regular expression is being used in Java?’ Well, Java uses a Perl-style flavour that is very close to PCRE (Perl Compatible Regular Expressions). So, most of the regular expressions we developed in the previous articles on Python, Perl and PHP will work in Java with little or no modification, because those languages also use Perl-style regular expressions.

Since we have already covered much of this syntax in the previous articles on Python, Perl and PHP, I am not going to reintroduce it here. But I would like to point out a few minor differences between classic PCRE and the flavour tailor-made for Java. For example, regular expressions in Java lack the embedded comment syntax, (?#comment), available in programming languages like Perl. Another point worth noting concerns quantifiers. Quantifiers allow you to specify how many occurrences of a character (or group) to match against a string. Almost all Perl-style flavours have a greedy quantifier and a reluctant quantifier. In addition to these two, the regular expression syntax of Java has a possessive quantifier, which is not available in every Perl-style flavour.

To differentiate between these three quantifiers, consider the string aaaaaa. The regular expression pattern ‘a+a’ uses a greedy quantifier, which is the default. This pattern matches the whole string aaaaaa: ‘a+’ first consumes all six characters and then gives one back so that the final ‘a’ of the pattern can match, leaving ‘a+’ matching aaaaa. Now consider the reluctant quantifier in ‘a+?a’. This pattern matches only the string aa, since ‘a+?’ matches just the single character a. Now let us see the effect of the possessive quantifier in the pattern ‘a++a’. This pattern does not match at all, because a possessive quantifier behaves like a greedy quantifier, except that it never backtracks. So, ‘a++’ by itself consumes the whole string aaaaaa, and the last character a in the pattern ‘a++a’ has nothing left to match. In short, a possessive quantifier matches greedily and, once it has matched, never gives back a character.
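The three behaviours described above can be checked with a short program. This is an illustrative sketch, not one of the downloadable example files; the class name QuantifierDemo and the helper firstMatch() are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Compares greedy, reluctant and possessive quantifiers on the same input.
class QuantifierDemo {
    // Returns the first substring of input matched by regex, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher mat = Pattern.compile(regex).matcher(input);
        return mat.find() ? mat.group() : null;
    }

    public static void main(String[] args) {
        String str = "aaaaaa";
        // Greedy: a+ takes everything, then gives back one character.
        System.out.println("a+a  -> " + firstMatch("a+a", str));   // aaaaaa
        // Reluctant: a+? takes as little as possible.
        System.out.println("a+?a -> " + firstMatch("a+?a", str));  // aa
        // Possessive: a++ takes everything and never gives anything back.
        System.out.println("a++a -> " + firstMatch("a++a", str));  // null (no match)
    }
}
```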

You can download and test the three example Java files Greedy.java, Reluctant.java and Possessive.java for a better understanding of these concepts. In Java, regular expression processing is enabled with the help of the package java.util.regex. This package was added to the Java Class Library (JCL) in J2SE 1.4 (JDK 1.4). So, if you are going to use regular expressions in Java, make sure that you have JDK 1.4 or later installed on your system. Execute the command ‘java -version’ at the terminal to find the version of Java installed on your system. Later versions of Java have fixed many bugs and added support for features like named capture and Unicode-based regular expression processing. There are also some third-party packages that support regular expression processing in Java, but our discussion strictly covers the classes offered by the package java.util.regex, which is standard and part of the JCL. The package java.util.regex offers two classes, Pattern and Matcher, that are used jointly for regular expression processing. The Pattern class enables us to define a regular expression pattern. The Matcher class helps us match a regular expression pattern against the contents of a string.

Java programs using regular expressions

Let us now execute and analyse a simple Java program using regular expressions. The code given below shows the program Regex1.java.
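A version of Regex1.java pieced together from the lines quoted in the walk-through that follows might look like this. Treat it as a reconstruction under that assumption; the layout and the else branch are inferred from the described output, not copied from the download archive.

```java
import java.util.regex.*;

class Regex1 {
    public static void main(String[] args) {
        // Compile the regular expression into a Pattern object.
        Pattern pat = Pattern.compile("Open Source");
        // Create a Matcher that applies the pattern to the given string.
        Matcher mat = pat.matcher("Magazine Open Source For You");
        // matches() succeeds only if the pattern matches the entire string.
        if (mat.matches()) {
            System.out.println("Match from " + (mat.start() + 1) + " to " + (mat.end()));
        } else {
            System.out.println("No Match Found");
        }
    }
}
```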

Open a terminal in the directory containing the file Regex1.java and execute the following commands to view the output: ‘javac Regex1.java’ and ‘java Regex1’.

You will be surprised to see the message ‘No Match Found’ displayed in the terminal. Let us analyse the code in detail to understand the reason for this output. The first line of code: ‘import java.util.regex.*;’ …imports the classes Pattern and Matcher from the package java.util.regex. The line of code: ‘Pattern pat = Pattern.compile("Open Source");’

…generates the regular expression pattern with the help of the method compile() provided by the Pattern class. The resulting Pattern object is stored in the variable pat. A PatternSyntaxException is thrown if the regular expression syntax is invalid. The line of code: ‘Matcher mat = pat.matcher("Magazine Open Source For You");’

…uses the matcher() method of the Pattern class to generate a Matcher object, because the Matcher class does not have a public constructor. The resulting Matcher object is stored in the variable mat. The line of code: ‘if(mat.matches())’

…uses the method matches() provided by the class Matcher to match the regular expression pattern ‘Open Source’ against the string ‘Magazine Open Source For You’. The method matches() returns true if there is a match and false if there is not. But the important thing to remember is that matches() returns true only if the pattern matches the whole string. In this case, ‘Open Source’ is just a substring of ‘Magazine Open Source For You’, so matches() returns false and the program displays the message ‘No Match Found’ on the terminal.
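The PatternSyntaxException mentioned in the walk-through can be seen with a deliberately broken pattern. The sketch below is illustrative only; the class name SyntaxCheck and the helper isValid() are made up for this example, and the unclosed parenthesis is simply one convenient way of producing an invalid expression.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxCheck {
    // Returns true if regex compiles, false if compile() rejects it.
    static boolean isValid(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() explains the error; getIndex() gives its approximate position.
            System.out.println("Invalid pattern: " + e.getDescription()
                    + " near index " + e.getIndex());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("Open Source"));   // a well-formed pattern
        System.out.println(isValid("(Open Source"));  // unclosed group, rejected by compile()
    }
}
```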

If you replace the line of code: ‘Pattern pat = Pattern.compile("Open Source");’ …with the line of code: ‘Pattern pat = Pattern.compile("Magazine Open Source For You");’

…then you will get a match and the matches() method will return true. The file with this modification, Regex2.java, is also available for download. The line of code: ‘System.out.println("Match from " + (mat.start()+1) + " to " + (mat.end()));’

…uses two methods provided by the Matcher class, start() and end(). The method start() returns the start index of the previous match and the method end() returns the offset after the last character matched. So, the output of the program Regex2.java will be ‘Match from 1 to 28’.

Figure 3 shows the output of Regex1.java and Regex2.java. An important point to remember is that indexing starts at 0, which is why 1 is added to the value returned by the method start(), as in (mat.start()+1). Since the method end() returns the index immediately after the last matched character, which is the same number as the 1-based position of that character, nothing needs to be done there.
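The index arithmetic can be verified with a small variation on Regex2.java. This is a sketch, not the file from the download archive; the helper report() is introduced here only so that the same logic can be run against different patterns.

```java
import java.util.regex.*;

class Regex2 {
    // Builds the message for a whole-string match; positions are reported 1-based.
    static String report(String regex, String input) {
        Matcher mat = Pattern.compile(regex).matcher(input);
        if (mat.matches()) {
            // start() is a 0-based index, so add 1.
            // end() is one past the last 0-based index, which equals the 1-based last position.
            return "Match from " + (mat.start() + 1) + " to " + mat.end();
        }
        return "No Match Found";
    }

    public static void main(String[] args) {
        String str = "Magazine Open Source For You";
        System.out.println(report("Magazine Open Source For You", str)); // Match from 1 to 28
        System.out.println(report("Open Source", str));                  // No Match Found
    }
}
```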

Figure 3: Output of Regex1.java and Regex2.java

The matches() method of the Matcher class is of limited use for this sort of comparison. But many other useful methods are provided by the class Matcher to carry out different types of comparisons. The method find() provided by the class Matcher is useful if you want to find a substring match. Replace the line of code: ‘if(mat.matches())’ …in Regex1.java with the line of code: ‘if(mat.find())’ …to obtain the program Regex3.java. On execution, Regex3.java will display the message ‘Match from 10 to 20’ on the terminal. This is because the substring ‘Open Source’ runs from the 10th character to the 20th character of the string ‘Magazine Open Source For You’. The method find() also returns true if there is a match and false if there is not. The method find() can be called repeatedly to find all the matching substrings present in a string. Consider the program Regex4.java shown below.
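A sketch of what Regex4.java might look like is given here. The two lines quoted in the following paragraph are kept as they are; the pattern "abcd" and the message format are assumptions, chosen because they produce matches starting at the first, fifth and ninth characters.

```java
import java.util.regex.*;

class Regex4 {
    public static void main(String[] args) {
        String str = "abcdabcdabcd";
        // Assumed pattern: the original pattern is not quoted in the text.
        Pattern pat = Pattern.compile("abcd");
        Matcher mat = pat.matcher(str);
        // Each call to find() resumes the search where the previous match ended.
        while (mat.find()) {
            System.out.println("Match from " + (mat.start() + 1) + " to " + mat.end());
        }
    }
}
```

With this assumed pattern, the loop reports three matches: 1 to 4, 5 to 8 and 9 to 12.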

In this case, the method find() searches the whole string and finds matches starting at the first, fifth and ninth characters. The line of code: ‘String str = "abcdabcdabcd";’ …stores the string to be searched, and in the line of code: ‘Matcher mat = pat.matcher(str);’

…this string is passed to the method matcher() for further processing. Figure 4 shows the output of the programs Regex3.java and Regex4.java.

Figure 4: Output of Regex3.java and Regex4.java

Now, what if you want the matched string displayed instead of the index at which a match is obtained? Well, then you have to use the method group() provided by the class Matcher. Consider the program Regex5.java, which, like the earlier programs, begins with the line ‘import java.util.regex.*;’.
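A sketch of Regex5.java, assembled from the pattern, input string and output message quoted in the next paragraphs, could look like this. The counter variable and loop structure are assumptions inferred from the ‘Matched String 1’ wording of the output.

```java
import java.util.regex.*;

class Regex5 {
    public static void main(String[] args) {
        String str = "Sachin Tendulkar Hits a Sixer";
        Pattern pat = Pattern.compile("S.*r");
        Matcher mat = pat.matcher(str);
        int count = 1; // assumed counter, numbering the matches from 1
        while (mat.find()) {
            // group() returns the text matched by the most recent find().
            System.out.println("Matched String " + count + " : " + mat.group());
            count++;
        }
    }
}
```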

On execution, the program Regex5.java displays the message ‘Matched String 1 : Sachin Tendulkar Hits a Sixer’ on the terminal. What is the reason for matching the whole string? The pattern ‘S.*r’ searches for a string starting with S, followed by zero or more occurrences of any character, and finally ending with an r. Since the pattern ‘.*’ results in a greedy match, it stretches as far as the last r, and the whole string is matched.

Now replace the line of code: ‘Pattern pat = Pattern.compile("S.*r");’ …in Regex5.java with the line: ‘Pattern pat = Pattern.compile("S.*?r");’

…to get Regex6.java. What will be the output of Regex6.java? Since this is the last article of this series on regular expressions, I request you to try your best to find the answer before proceeding any further. Figure 5 shows the output of Regex5.java and Regex6.java. But what is the reason for the output shown by Regex6.java? Again, I request you to ponder over the problem for some time and find the answer. If you don’t get the answer, download the file Regex6.java from the link shown earlier; in that file I have given the explanation as a comment.

So, with that example, let us wind up our discussion about regular expressions in Java. Java is a very powerful programming language and the effective use of regular expressions will make it even more powerful. The basic stuff discussed here will definitely kick-start your journey towards the efficient use of regular expressions in Java. And now it is time to say farewell.

Figure 5: Output of Regex5.java and Regex6.java

In this series we have discussed regular expression processing in six different programming languages. Four of these (Python, Perl, PHP and Java) use a regular expression style in the family of PCRE (Perl Compatible Regular Expressions). The other two programming languages we discussed in this series, C++ and JavaScript, use a style known as the ECMAScript regular expression style. The articles in this series were never intended to describe the complexities of intricate regular expressions in detail. Instead, I tried to focus on the different flavours of regular expressions and how they can be used in various programming languages. Any decent textbook on regular expressions will give a language-agnostic discussion of regular expressions, but we were more concerned with the actual execution of regular expressions in programming languages.

Before concluding this series, I would like to go over the important takeaways. First, always remember that there are many different regular expression flavours. The differences between many of them are subtle, yet they can cause havoc if the flavours are used indiscriminately. Second, the style of regular expression used in a programming language depends on the flavour implemented by the language’s regular expression engine. For this reason, a single programming language may support multiple regular expression styles with the help of different regular expression engines and library functions. Third, different languages support regular expressions in different ways. In some languages the support for regular expressions is part of the language core. An example of such a language is Perl. In some other languages regular expressions are supported with the help of library functions. C++ is a programming language in which regular expressions are implemented using library functions. Because of this, not all versions and standards of a programming language necessarily support regular expressions. For example, in C++, the support for regular expressions starts with the C++11 standard.

For the same reason, different versions of a particular programming language might themselves support different regular expression styles. You must be very careful about these important points while developing programs using regular expressions, to avoid dangerous pitfalls.

So, finally, we are at the end of a long journey of learning regular expressions. But an even longer and far more exciting journey of practising and developing regular expressions lies ahead. Good luck!

This is the sixth and final part of a series of articles on regular expressions in programming languages. In this article, we will discuss the use of regular expressions in Java, a very powerful programming language.

Figure 1: Duke – the mascot of Java
