OpenSource For You

Regular Expressions in Programming Languages: Java for You

- By: Deepu Benson. The author is a free software enthusiast whose area of interest is theoretical computer science. He maintains a technical blog at www.computingforbeginners.blogspot.in and can be reached at deepumb@hotmail.com.

Java is an object-oriented, general-purpose programming language. Java applications are first compiled to bytecode, which can then be run on a Java virtual machine (JVM), independent of the underlying computer architecture. According to Wikipedia, “A Java virtual machine is an abstract computing machine that enables a computer to run a Java program.” Don’t be confused by this complicated definition; just imagine that the JVM is software capable of running Java bytecode. The JVM acts as an interpreter for Java bytecode, which is why Java is often called a compiled and interpreted language. The development of Java (initially called Oak) was begun in 1991 by James Gosling, Mike Sheridan and Patrick Naughton. The first public implementation of Java was released as Java 1.0 in 1996 by Sun Microsystems. Currently, Oracle Corporation owns Sun Microsystems. Unlike many other programming languages, Java has a mascot called Duke (shown in Figure 1).

As with previous articles in this series I really wanted to begin with a brief discussion about the history of Java by describing the different platforms and versions of Java. But here I am at a loss.

The availability of a large number of Java platforms and the complicated version numbering scheme followed by Sun Microsystems make such a discussion difficult. For example, in order to explain terms like Java 2, Java SE, Core Java, JDK, Java EE, etc, in detail, a series of articles might be required. Such a discussion about the history of Java might be a worthy pursuit for another time but definitely not for this article. So, all I am going to do is explain a few key points regarding various Java implementations.

First of all, Java Card, Java ME (Micro Edition), Java SE (Standard Edition) and Java EE (Enterprise Edition) are all different Java platforms that target different classes of devices and application domains. For example, Java SE is customised for general-purpose use on desktop PCs, servers and similar devices. Another important question that requires an answer is, ‘What is the difference between Java SE and Java 2?’ Books like ‘Learn Java 2 in 48 Hours’ or ‘Learn Java SE in Two Days’ can confuse beginners a lot while making a choice. In a nutshell, there is no difference between the two. All this confusion arises due to the complicated naming convention followed by Sun Microsystems.

The December 1998 release of Java was called Java 2, and the version name J2SE 1.2 was given to JDK 1.2 to distinguish it from the other platforms of Java. Later, J2SE 1.5 (JDK 1.5) was renamed J2SE 5.0 and then Java SE 5, citing the maturity of J2SE over the years as the reason for this name change. At the time of writing, the latest version of Java is Java SE 9, which was released in September 2017. For releases up to Java 8, the product name and the developer version differ (when you say Java 8, you mean JDK 1.8); from Java 9 onwards the version string drops the leading ‘1.’ and is reported simply as 9. So, keep in mind that Java SE was formerly known as Java 2 Platform, Standard Edition or J2SE.

The Java Development Kit (JDK) is an implementation of one of the Java platforms (Standard Edition, Enterprise Edition or Micro Edition) in the form of a binary product. The JDK includes the JVM and a few other tools like the compiler (javac), debugger (jdb), applet viewer, etc, which are required for the development of Java applications and applets. At the time of writing, the latest version of the JDK is JDK 9.0.1, released in October 2017.

OpenJDK is a free and open source implementation of Java SE. The OpenJDK implementation is licensed under the GNU General Public License (GNU GPL). The Java Class Library (JCL) is a set of dynamically loadable libraries that Java applications can call at run time. JCL contains a number of packages, and each of them contains a number of classes to provide various functionalities. Some of the packages in JCL include java.lang, java.io, java.net, java.util, etc.

The ‘Hello World’ program in Java

Other than console based Java application programs, special classes like the applet, servlet, swing, etc, are used to develop Java programs to complete a variety of tasks. For example, Java applets are programs that are embedded in other applications, typically in a Web page displayed in a browser. Regular expressions can be used in Java application programs and programs based on other classes like the applet, swing, servlet, etc, without making any changes. Since there is no difference in the use of regular expressions, all our discussions are based on simple Java application programs. But before exploring Java programs using regular expressions, let us build our muscles by executing a simple ‘Hello World’ program in Java. The code given below shows the program HelloWorld.java.
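The listing itself is not reproduced in this text, so here is a minimal sketch of what HelloWorld.java might look like, based only on the description in this article. The downloadable file may differ in its details.

```java
// HelloWorld.java: a minimal sketch reconstructed from the article's description.
class HelloWorld {
    // main( ) is the starting point that the JVM identifies and executes.
    public static void main(String[] args) {
        // Prints the message on the terminal.
        System.out.println("Hello World");
    }
}
```

Keeping the class name the same as the file name (HelloWorld) lets the compile and run commands discussed in this section work as written.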

To compile the Java source file HelloWorld.java, open a terminal in the directory containing the file and execute the command: javac HelloWorld.java

Now a Java class file called HelloWorld.class containing the Java bytecode is created in the directory. The JVM can be invoked to execute this class file with the command: java HelloWorld

Note that the java command takes the class name, not the file name, so the .class extension must be left out.

The message ‘Hello World’ is displayed on the terminal. Figure 2 shows the execution and output of the Java program HelloWorld.java. The program contains a special method named main( ), the starting point of this program, which will be identified and executed by the JVM. Remember that a method in an object oriented programming paradigm is nothing but a function in a procedural programming paradigm. The main( ) method contains the following line of code, which prints the message ‘Hello World’ on the terminal: ‘System.out.println(“Hello World”);’

The program HelloWorld.java and all the other programs discussed in this article can be downloaded from opensourceforu.com/article_source_code/ January 18 java for you. zip.

Figure 2: Hello World program in Java

Regular expressions in Java

Now, coming down to business, let us discuss regular expressions in Java. The first question to be answered is, ‘What flavour of regular expression is being used in Java?’ Well, Java uses a flavour modelled on PCRE (Perl Compatible Regular Expressions). So, the regular expressions we have developed in the previous articles describing regular expressions in Python, Perl and PHP will generally work in Java with little or no modification, because Python, Perl and PHP also use this Perl-style flavour of regular expressions. The one thing to watch out for is that a pattern in Java is written as an ordinary string literal, so every backslash in the pattern has to be doubled (for example, ‘\d’ is written as “\\d”).

Since we have already covered much of the syntax of PCRE in the previous articles on Python, Perl and PHP, I am not going to reintroduce it here. But I would like to point out a few minor differences between the classic PCRE and the variant tailor-made for Java. For example, the regular expressions in Java lack the embedded comment syntax, written as (?#comment), available in programming languages like Perl. Another difference is regarding the quantifiers used in regular expressions in Java and other PCRE based programming languages. Quantifiers allow you to specify the number of occurrences of a character to match against a string. Almost all the PCRE flavours have a greedy quantifier and a reluctant quantifier. In addition to these two, the regular expression syntax of Java has a possessive quantifier also.

To differentiate between these three quantifiers, consider the string aaaaaa. The regular expression pattern ‘a+a’ involves a greedy quantifier by default. This pattern will result in a greedy match of the whole string aaaaaa: the pattern ‘a+’ first consumes as many characters as possible and then gives back just one, so that it matches aaaaa and the final a in the pattern matches the last character. Now consider the reluctant quantifier ‘a+?a’. This pattern will only result in a match for the string aa, since the pattern ‘a+?’ will match only the single character string a. Now let us see the effect of the possessive quantifier denoted by the pattern ‘a++a’. This pattern will not result in any match because the possessive quantifier behaves like a greedy quantifier, except that it never backtracks. So, the pattern ‘a++’ itself will possessively match the whole string aaaaaa, and the last character a in the regular expression pattern ‘a++a’ will have nothing left to match. In short, a possessive quantifier matches greedily and, once it has matched, never gives a character back.
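The three cases above can be tried out in a single short program. This is only a sketch and not one of the downloadable example files; the class name QuantifierDemo and the helper firstMatch( ) are names I have made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// QuantifierDemo: hypothetical class comparing the three quantifier types.
class QuantifierDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "aaaaaa";
        // Greedy: a+ takes everything, then gives back one character.
        System.out.println("a+a  : " + firstMatch("a+a", input));   // aaaaaa
        // Reluctant: a+? takes as little as possible.
        System.out.println("a+?a : " + firstMatch("a+?a", input));  // aa
        // Possessive: a++ takes everything and never gives anything back.
        System.out.println("a++a : " + firstMatch("a++a", input));  // null
    }
}
```

The helper returns null when nothing matches, which makes the failure of the possessive pattern easy to see in the output.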

You can download and test the three example Java files Greedy.java, Reluctant.java and Possessive.java for a better understanding of these concepts. In Java, regular expression processing is enabled with the help of the package java.util.regex. This package was included in the Java Class Library (JCL) from J2SE 1.4 (JDK 1.4) onwards. So, if you are going to use regular expressions in Java, make sure that you have JDK 1.4 or later installed on your system. To find the particular version of Java installed on your system, execute the following command at the terminal: java -version

The later versions of Java have fixed many bugs and added support for features like named capture and Unicode based regular expression processing. There are also some third party packages that support regular expression processing in Java, but our discussion strictly covers the classes offered by the package java.util.regex, which is standard and part of the JCL. The package java.util.regex offers two classes called Pattern and Matcher that are used jointly for regular expression processing. The Pattern class enables us to define a regular expression pattern. The Matcher class helps us match a regular expression pattern with the contents of a string.
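As a small illustration of the named capture feature mentioned above, here is a sketch that assumes Java 7 or later, the release in which named groups were added to java.util.regex. The class name NamedGroupDemo, the helper yearOf( ) and the sample text are all made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// NamedGroupDemo: hypothetical class showing a named capturing group (Java 7+).
class NamedGroupDemo {
    // Extracts the four-digit year from the first yyyy-mm date in text, or null if none is found.
    static String yearOf(String text) {
        // (?<year>...) names the group; note the doubled backslashes in the string literal.
        Pattern pat = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");
        Matcher mat = pat.matcher(text);
        return mat.find() ? mat.group("year") : null;
    }

    public static void main(String[] args) {
        System.out.println(yearOf("Issue dated 2018-01"));
    }
}
```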

Java programs using regular expressions

Let us now execute and analyse a simple Java program using regular expressions. The code given below shows the program Regex1.java.
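The listing is not reproduced in this text, so the following is a sketch of Regex1.java pieced together from the individual lines of code quoted in the discussion. The else branch and the exact spacing of the messages are my assumptions.

```java
import java.util.regex.*;

// Regex1.java: a sketch reconstructed from the lines of code quoted in the article.
class Regex1 {
    public static void main(String[] args) {
        // Compile the regular expression pattern.
        Pattern pat = Pattern.compile("Open Source");
        // Create a Matcher that applies the pattern to the input string.
        Matcher mat = pat.matcher("Magazine Open Source For You");
        // matches( ) succeeds only if the pattern matches the whole string.
        if (mat.matches()) {
            System.out.println("Match from " + (mat.start() + 1) + " to " + mat.end());
        } else {
            System.out.println("No Match Found");
        }
    }
}
```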

Open a terminal in the directory containing the file Regex1.java and execute the following commands to view the output: javac Regex1.java and java Regex1

You will be surprised to see the message ‘No Match Found’ displayed in the terminal. Let us analyse the code in detail to understand the reason for this output. The first line of code: ‘import java.util.regex.*;’ …imports the classes Pattern and Matcher from the package java.util.regex. The line of code: ‘Pattern pat = Pattern.compile(“Open Source”);’

…generates the regular expression pattern with the help of the method compile( ) provided by the Pattern class. The Pattern object thus generated is stored in the variable pat. A PatternSyntaxException is thrown if the regular expression syntax is invalid. The line of code: ‘Matcher mat = pat.matcher(“Magazine Open Source For You”);’

…uses the matcher( ) method of the Pattern class to generate a Matcher object, because the Matcher class does not have a public constructor. The Matcher object thus generated is stored in the variable mat. The line of code: ‘if(mat.matches( ))’

…uses the method matches( ) provided by the class Matcher to perform a matching between the regular expression pattern ‘Open Source’ and the string ‘Magazine Open Source For You’. The method matches( ) returns true if there is a match and returns false if there is no match. But the important thing to remember is that the method matches( ) returns true only if the pattern matches the whole string. In this case, the string ‘Open Source’ is just a substring of the string ‘Magazine Open Source For You’ and since there is no match, the method matches( ) returns false, and the if statement displays the message ‘No Match Found’ on the terminal.

If you replace the line of code: ‘Pattern pat = Pattern.compile(“Open Source”);’ …with the line of code: ‘Pattern pat = Pattern.compile(“Magazine Open Source For You”);’

…then you will get a match and the matches( ) method will return true. The file with this modification, Regex2.java, is also available for download. The line of code: ‘System.out.println(“Match from ” + (mat.start( )+1) + “ to ” + (mat.end( )));’

…uses two methods provided by the Matcher class, start( ) and end( ). The method start( ) returns the start index of the previous match and the method end( ) returns the offset after the last character matched. So, the output of the program Regex2.java will be ‘Match from 1 to 28’.

Figure 3 shows the output of Regex1.java and Regex2.java. An important point to remember is that the indexing starts at 0 and that is the reason why 1 is added to the value returned by the method start( ) as (mat.start( )+1). Since the method end( ) returns the index immediately after the last matched character, which is the same number as the position of that character when counting from 1, nothing needs to be done there.
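The index arithmetic described above can be checked with a short sketch. It is not one of the downloadable files; the class name IndexDemo and the extra output lines are my own additions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// IndexDemo: hypothetical class showing how start( ) and end( ) relate to the matched text.
class IndexDemo {
    public static void main(String[] args) {
        String str = "Magazine Open Source For You";
        Matcher mat = Pattern.compile("Magazine Open Source For You").matcher(str);
        if (mat.matches()) {
            int start = mat.start(); // index of the first matched character, counting from 0
            int end = mat.end();     // index just after the last matched character
            System.out.println("Match from " + (start + 1) + " to " + end);
            // end - start is the number of characters matched.
            System.out.println("Length: " + (end - start));
            // substring( ) uses the same convention, so it returns exactly the matched text.
            System.out.println("Text: " + str.substring(start, end));
        }
    }
}
```

Because end( ) points just past the match, the pair (start, end) can be passed straight to substring( ) without any adjustment.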

The matches( ) method of the Matcher class, with this sort of whole-string comparison, is almost useless for searching inside a larger string. But many other useful methods are provided by the class Matcher to carry out different types of comparisons. The method find( ) provided by the class Matcher is useful if you want to find a substring match.

Figure 3: Output of Regex1.java and Regex2.java

Replace the line of code: ‘if(mat.matches( ))’ …in Regex1.java with the line of code: ‘if(mat.find( ))’ …to obtain the program Regex3.java. On execution, Regex3.java will display the message ‘Match from 10 to 20’ on the terminal. This is due to the fact that the substring ‘Open Source’ appears from the 10th character to the 20th character in the string ‘Magazine Open Source For You’. The method find( ) also returns true in case of a match and false if there is no match. The method find( ) can be used repeatedly to find all the matching substrings present in a string. Consider the program Regex4.java shown below.
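The listing is not reproduced in this text, so here is a sketch of how Regex4.java might call find( ) in a loop. The pattern “abcd” is my assumption (the article only tells us the string and the positions of the matches), and the helper findAll( ) is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex4.java: a sketch of repeated find( ) calls; the pattern "abcd" is assumed.
class Regex4 {
    // Returns one "Match from X to Y" line per match, with X counted from 1.
    static List<String> findAll(String regex, String str) {
        Pattern pat = Pattern.compile(regex);
        Matcher mat = pat.matcher(str);
        List<String> lines = new ArrayList<>();
        // Each call to find( ) resumes the search where the previous match ended.
        while (mat.find()) {
            lines.add("Match from " + (mat.start() + 1) + " to " + mat.end());
        }
        return lines;
    }

    public static void main(String[] args) {
        String str = "abcdabcdabcd";
        for (String line : findAll("abcd", str)) {
            System.out.println(line);
        }
    }
}
```

With the assumed pattern, the three lines printed are ‘Match from 1 to 4’, ‘Match from 5 to 8’ and ‘Match from 9 to 12’.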

In this case, the method find( ) will search the whole string and find matches at positions starting at the first, fifth and ninth characters. The line of code: ‘String str = “abcdabcdabcd”;’ …is used to store the string to be searched, and in the line of code: ‘Matcher mat = pat.matcher(str);’

…this string is used by the method matcher( ) for further processing. Figure 4 shows the output of the programs Regex3.java and Regex4.java.

Figure 4: Output of Regex3.java and Regex4.java

Now, what if you want the matched string displayed instead of the index at which a match is obtained? Well, then you have to use the method group( ) provided by the class Matcher. Consider the program Regex5.java shown below.
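The listing is not reproduced in this text, so the following is a sketch of Regex5.java based on the pattern, the string and the output quoted in the discussion. The loop and the counter variable are my assumptions about how the original numbers its matches.

```java
import java.util.regex.*;

// Regex5.java: a sketch reconstructed from the pattern and output quoted in the article.
class Regex5 {
    public static void main(String[] args) {
        Pattern pat = Pattern.compile("S.*r");
        Matcher mat = pat.matcher("Sachin Tendulkar Hits a Sixer");
        int count = 1; // assumed counter used to number the matches
        while (mat.find()) {
            // group( ) returns the text matched by the most recent find( ).
            System.out.println("Matched String " + count + " : " + mat.group());
            count++;
        }
    }
}
```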

On execution, the program Regex5.java displays the message ‘Matched String 1 : Sachin Tendulkar Hits a Sixer’ on the terminal. What is the reason for matching the whole string? The pattern ‘S.*r’ searches for a string starting with S, followed by zero or more occurrences of any character, and finally ending with an r. Since the pattern ‘.*’ results in a greedy match, the whole string is matched.

Now replace the line of code: ‘Pattern pat = Pattern.compile(“S.*r”);’ …in Regex5.java with the line: ‘Pattern pat = Pattern.compile(“S.*?r”);’

…to get Regex6.java. What will be the output of Regex6.java? Since this is the last article of this series on regular expressions, I request you to try your best to find the answer before proceeding any further. Figure 5 shows the output of Regex5.java and Regex6.java. But what is the reason for the output shown by Regex6.java? Again, I request you to ponder over the problem for some time and find out the answer. If you don’t get the answer, download the file Regex6.java from the link shown earlier; in that file I have given the explanation as a comment.

So, with that example, let us wind up our discussion about regular expressions in Java. Java is a very powerful programming language and the effective use of regular expressions will make it even more powerful. The basic stuff discussed here will definitely kick-start your journey towards the efficient use of regular expressions in Java. And now it is time to say farewell.

Figure 5: Output of Regex5.java and Regex6.java

In this series we have discussed regular expression processing in six different programming languages. Four of these (Python, Perl, PHP and Java) use a regular expression style called PCRE (Perl Compatible Regular Expressions). The other two programming languages we discussed in this series, C++ and JavaScript, use a style known as the ECMAScript regular expression style. The articles in this series were never intended to describe the complexities of intricate regular expressions in detail. Instead, I tried to focus on the different flavours of regular expressions and how they can be used in various programming languages. Any decent textbook on regular expressions will give a language-agnostic discussion of regular expressions, but we were more concerned with the actual execution of regular expressions in programming languages.

Before concluding this series, I would like to go over the important takeaways. First, always remember the fact that there are many different regular expression flavours. The differences between many of them are subtle, yet they can cause havoc if used indiscriminately. Second, the style of regular expression used in a programming language depends on the flavour of the regular expression implemented by the language’s regular expression engine. For this reason, a single programming language may support multiple regular expression styles with the help of different regular expression engines and library functions. Third, different languages support regular expressions in different ways. In some languages the support for regular expressions is part of the language core. An example of such a language is Perl. In some other languages the regular expressions are supported with the help of library functions. C++ is a programming language in which regular expressions are implemented using library functions. Due to this, not all versions and standards of a programming language may support the use of regular expressions. For example, in C++, the support for regular expressions starts with the C++11 standard.

For the same reason, the different versions of a particular programming language itself might support different regular expression styles. You must be very careful about these important points while developing programs using regular expressions to avoid dangerous pitfalls.

So, finally, we are at the end of a long journey of learning regular expressions. But an even longer and far more exciting journey of practising and developing regular expressions lies ahead. Good luck!

This is the sixth and final part of a series of articles on regular expressions in programming languages. In this article, we will discuss the use of regular expressions in Java, a very powerful programming language.
Figure 1: Duke – the mascot of Java
