Regular Expressions

In this lesson we look at regular expressions (regex) and how we can use regular expression patterns for matching data. A regular expression is a string containing normal characters as well as metacharacters, which together make a pattern we can use to match data. The metacharacters are used to represent concepts such as positioning, quantity and character types. Searching through data for specific characters or groups of characters is known as pattern matching and is generally done from the left to the right of the input character sequence.

Regular expressions are a large topic, and entire books have been written about them, but here we will just cover the basics of pattern matching to get a feel for how to use them. The table below lists the regular expression metacharacter constructs used in the examples in this lesson; a more complete, but not exhaustive, table with examples is listed at the end of the lesson.

Metacharacter Meaning
Escape/Unescape
\ Used to escape characters that are treated literally within regular expressions or alternatively to unescape special characters
Quantifiers
? Matches preceding item 0 or 1 times
* Matches preceding item 0 or more times
+ Matches preceding item 1 or more times
Character Classes
[xyz] A character set. Matches any of the enclosed characters. You can specify a range of characters by using a hyphen.
[^xyz] A negated character set. Matches anything not enclosed in the brackets. You can specify a range of characters by using a hyphen.
Predefined Character Classes
. Matches any single character except line terminators, unless the DOTALL flag is specified.
\d Find a digit character. Same as the range check [0-9].
\s Find a whitespace character.
\w Find a word character. A word character is a character in ranges a-z, A-Z, 0-9 and also includes the _ (underscore) symbol. Same as the range check [A-Za-z0-9_].

String Searches

Before we look at a working example of using a regular expression we should talk a little about the java.util.regex package and the two main classes it contains. The java.util.regex.Pattern class allows us to create a compiled representation of a regular expression we have passed to the class as a string. We can then use the matcher() method on the resultant pattern to create an instance of the java.util.regex.Matcher class, which can match arbitrary character sequences against the specified regular expression. Once we have compiled a regular expression into a Pattern object we can use multiple matchers against this pattern, as all of the state involved in performing a match resides in the Matcher instance. We can then call methods of the Matcher class to see if we got any matches. There is also a convenience matches() method in the Pattern class that allows us to compile a regular expression, use a matcher and see if it matches in a single statement. Let's look at a simple search to see how it all hangs together:


/*
  Simple regex string search
*/
import java.util.regex.*; // Import all classes from the java.util.regex package

class TestSimpleRegex {
    public static void main(String[] args) {
        boolean b = false;
        Pattern p = Pattern.compile("is"); // Create a regex
        Matcher m = p.matcher("mississippi");  // Our string for matching
        // Part region matching
        b = m.lookingAt();
        System.out.println("Did we get a part region match? " + b);
        // Full region matching
        b = m.matches();
        System.out.println("Did we get a full region match? " + b);
        // Multiple matching
        while (b = m.find()) {  // matching info
            System.out.println("We got a match at position: " + m.start());
        }
        // Convenience all in one method
        b = Pattern.matches("is", "mississippi");
        System.out.println("Did we get a full match? " + b);
        b = Pattern.matches("mississippi", "mississippi"); // Convenience all in one method
        System.out.println("Did we get a full match? " + b);
    }
}

Save, compile and run the TestSimpleRegex test class in directory   c:\_APIContents2 in the usual way.

[Screenshot: run test simple regex]

The above screenshot shows the output of compiling and running the TestSimpleRegex class. First off we compile a regex Pattern object from the string "is". Using this object we then create a Matcher object using the string "mississippi" as the character sequence to be matched. The Matcher object finds matches in a subset of its input called the region, which by default contains all of the matcher's input. There is also a region() method which can be used to modify the region boundaries, which I will leave as an exercise for you to look at. We then perform the three different kinds of match operations on the Matcher object.

The lookingAt() method does a prefix region match, which returns false as the prefix of our input doesn't match the pattern we created. The matches() method does a full region match, which also returns false as our entire character input doesn't match the pattern we created. The find() method scans the region looking for subsequences that match the pattern and finds these at positions 1 and 4 (think of a zero-based index). We print messages to the console showing the results.
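
As a head start on the region() exercise, here is a minimal sketch of how narrowing the region changes the results of lookingAt() and matches(). The class name RegionSketch and its two helper methods are hypothetical names made up for this illustration; only Pattern, Matcher and their region(), matches() and lookingAt() methods come from the java.util.regex package.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionSketch {
    // Hypothetical helper: does regex match the whole region [start, end) of input?
    static boolean matchesRegion(String regex, String input, int start, int end) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.region(start, end); // start index is inclusive, end index is exclusive
        return m.matches();
    }

    // Hypothetical helper: does regex match a prefix of the region that begins at start?
    static boolean prefixMatchesFrom(String regex, String input, int start) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.region(start, input.length());
        return m.lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(matchesRegion("is", "mississippi", 1, 3));  // true, region is "is"
        System.out.println(prefixMatchesFrom("is", "mississippi", 1)); // true, region starts with "is"
        System.out.println(prefixMatchesFrom("is", "mississippi", 0)); // false, region starts with "mi"
    }
}
```

With the default region both operations return false for our input, as we saw above; once the region starts at index 1 the same pattern succeeds.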

Next we use the convenience matches() method of the Pattern class which works the same as the matches() method of the Matcher class and print some more messages to the console. It should be noted that the matches() method of the Pattern class is less efficient than its counterpart in the Matcher class when doing repeated matches as it doesn't allow the compiled pattern to be reused.
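
To illustrate the point about reuse, the sketch below compiles a pattern once and then creates a separate Matcher for each input. The class name ReusePatternSketch, the constant IS and the countMatches() helper are assumptions made for this illustration and are not part of the lesson's own examples.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReusePatternSketch {
    // Compiled once and shared; the Pattern itself holds no per-match state
    static final Pattern IS = Pattern.compile("is");

    // Hypothetical helper: counts how many times the shared pattern occurs in input
    static int countMatches(String input) {
        Matcher m = IS.matcher(input); // each input gets its own Matcher
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] inputs = {"mississippi", "this is it", "none"};
        for (String input : inputs) {
            System.out.println(input + " contains " + countMatches(input) + " match(es)");
        }
    }
}
```

Calling Pattern.matches() inside the loop instead would recompile the regular expression for every input.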

Metacharacter Searches

Ok, we have seen how we can use regex to search strings for a prefix, subsequence or whole match of the character input, but what else does the regex engine bring to the party? We mentioned at the start of the lesson how we can use metacharacters to represent concepts such as positioning, quantity and character types for pattern matching. So in this part of the lesson we will look at a few of the more common metacharacters and how we incorporate them into our regex patterns. In our first example we will look at the \d, \s and \w metacharacters, which search for digits, whitespace characters and word characters (letters, digits and the underscore symbol _) respectively.


/*
  Using regex metacharacter search
*/
import java.util.regex.*; // Import all classes from the java.util.regex package

class TestMetaRegex {
    public static void main(String[] args) {
        String str = "The quick brown fox. 1+1=2";
        String str2 = "---1+1=2---";
        boolean b = false;
        Pattern p = Pattern.compile("\\d"); // Create a regex to look for digits
        Matcher m = p.matcher(str);  // Our string Object for matching
        // Multiple matching
        while (b = m.find()) {  // matching info
            System.out.println("We found a digit at position: " + m.start());
        }
        p = Pattern.compile("\\s"); // Create a regex to look for whitespace
        m = p.matcher(str);  // Our string Object for matching
        // Multiple matching
        while (b = m.find()) {  // matching info
            System.out.println("We found a whitespace at position: " + m.start());
        }
        p = Pattern.compile("\\w"); // Create a regex to look for word characters
        m = p.matcher(str2);  // Our string Object for matching
        // Multiple matching
        while (b = m.find()) {  // matching info
            System.out.println("We found a word character at position: " + m.start());
        }
    }
}

Save, compile and run the TestMetaRegex test class in directory c:\_APIContents2 in the usual way.

[Screenshot: run test meta regex]

The above screenshot shows the output of compiling and running the TestMetaRegex class. First off we compile a regex Pattern object from the string containing the metacharacter \d (look for a single digit). We have to escape the \ symbol using the escape character, which is also the \ symbol. If we don't, the compiler treats \d as a string escape sequence such as \n for a newline, finds it has no escape sequence for \d and reports a compiler error. Using this object we then create a Matcher object using the String object with a reference of str. We then use the find() method to scan the region looking for subsequences that match the pattern and print messages to the console showing the results.

We then compile a regex Pattern object from the string containing the metacharacter \s (look for a single whitespace). The rest is the same as above and we then print messages to the console showing the results. Finally we do the same to search for word characters using the String object with a reference of str2.
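
The double backslash can be confusing at first, so here is a small sketch showing what the regex engine actually receives. The class name EscapeSketch and the regexText() helper are hypothetical names used only for this illustration; pattern() is the Pattern method that returns the regular expression the pattern was compiled from.

```java
import java.util.regex.Pattern;

class EscapeSketch {
    // Hypothetical helper: returns the regex text the engine was given
    static String regexText(String regex) {
        return Pattern.compile(regex).pattern();
    }

    public static void main(String[] args) {
        String regex = "\\d";                            // two backslashes in Java source
        System.out.println(regexText(regex));            // prints \d
        System.out.println(regex.length());              // prints 2, a backslash and a d
        System.out.println(Pattern.matches("\\d", "7")); // true, a digit
        System.out.println(Pattern.matches("\\s", " ")); // true, a whitespace character
        System.out.println(Pattern.matches("\\w", "_")); // true, underscore is a word character
        System.out.println(Pattern.matches("\\w", "-")); // false, hyphen is not a word character
    }
}
```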

In our second example of metacharacter usage we look at the . (dot) predefined character class as well as character sets and negated character sets using [] bracket notation.


/*
  Using regex metacharacter search
*/
import java.util.regex.*; // Import all classes from the java.util.regex package

class TestMetaRegex2 {
    public static void main(String[] args) {
        String str = "Our tree is getting big";
        String str2 = "facetious";
        boolean b = false;
        Pattern p = Pattern.compile("t.e"); // regex to look for t and e with any char in between
        Matcher m = p.matcher(str);  // Our string Object for matching
        // Multiple matching
        while (b = m.find()) {  // matching info
            System.out.println("We found t (any char) e at position: " + m.start());
        }
        p = Pattern.compile("[aeiou]"); // Create a regex to look for vowels
        m = p.matcher(str2);  // Our string Object for matching
        // Multiple matching
        while (b = m.find()) {  // matching info
            System.out.println("We found a vowel at position: " + m.start());
        }
        p = Pattern.compile("[^aeiou]"); // Create a regex to look for non vowels
        m = p.matcher(str2);  // Our string Object for matching
        // Multiple matching
        while (b = m.find()) {  // matching info
            System.out.println("We found a non vowel at position: " + m.start());
        }
    }
}

Save, compile and run the TestMetaRegex2 test class in directory c:\_APIContents2 in the usual way.

[Screenshot: run test meta regex 2]

The above screenshot shows the output of compiling and running the TestMetaRegex2 class. First off we compile a regex Pattern object from the string containing 't' and 'e' and the metacharacter . (any character). Using this object we then create a Matcher object using the String object with a reference of str. We then use the find() method to scan the region looking for a subsequence that matches the pattern and print a message to the console showing the results.

We then compile a regex Pattern object from the string containing a character set looking for vowels and then non vowels. The rest is the same as above except we use the String object with a reference of str2. We also print messages to the console showing the results.
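
The table at the start of the lesson mentions that a hyphen can be used inside brackets to specify a range of characters. A minimal sketch of that, reusing the "facetious" string from the example above, might look like the following. The class name RangeSketch and the positions() helper are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RangeSketch {
    // Hypothetical helper: collects the start index of every match of regex in input
    static List<Integer> positions(String regex, String input) {
        List<Integer> found = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String str = "facetious";
        System.out.println(positions("[a-f]", str));  // [0, 1, 2, 3] for f, a, c and e
        System.out.println(positions("[^a-f]", str)); // [4, 5, 6, 7, 8] for t, i, o, u and s
    }
}
```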

Quantifier Searches

In our final look at regex we discuss quantifiers and the effect they have on our search results. A quantifier is a metacharacter which allows us to specify how many times the preceding item may occur. The metacharacter quantifiers available are ? for zero or one occurrence, * for zero or more occurrences and + for one or more occurrences. The following example shows usage of the quantifiers:


/*
  Using regex quantifiers
*/
import java.util.regex.*; // Import all classes from the java.util.regex package

class TestRegexQquantifiers {
    public static void main(String[] args) {
        String str = "Oh geee, the tree hit my kneee";
        boolean b = false;
        Pattern p = Pattern.compile("ee?"); // look for e and then zero or 1 more e
        Matcher m = p.matcher(str);  // Our string Object for matching
        // Multiple matching
        while (b = m.find()) {  // matching info
            System.out.println("We found e, then zero or one more e at position: " + m.start());
        }
        p = Pattern.compile("ee*"); // look for e and then zero or more e's
        m = p.matcher(str);  // Our string Object for matching
        // Multiple matching
        while (b = m.find()) {  // matching info
            System.out.println("We found e, then zero or more e's at position: " + m.start());
        }
        p = Pattern.compile("ee+"); // look for e and then 1 or more e's
        m = p.matcher(str);  // Our string Object for matching
        // Multiple matching
        while (b = m.find()) {  // matching info
            System.out.println("We found e, then 1 or more e's at position: " + m.start());
        }
    }
}

Save, compile and run the TestRegexQquantifiers test class in directory   c:\_APIContents2 in the usual way.

[Screenshot: run test regex quantifier]

The above screenshot shows the output of compiling and running the TestRegexQquantifiers class. When we use the ? quantifier on our input character sequence it will find all subsequences of the letter 'e' followed by zero or one more 'e'. In other words it will match 'e' and 'ee'. So where we have three 'e's in a row we get two matches. When we use the * quantifier on our input character sequence it will find all subsequences of the letter 'e' followed by zero or more 'e's. So it will match 'e', 'ee' and 'eee'. When we use the + quantifier on our input character sequence it will find all subsequences of the letter 'e' followed by one or more 'e's. In this case it will match 'ee' and 'eee'.
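
The example above prints only the start position of each match. To see the text each quantifier actually consumed, we can collect the result of the Matcher group() method instead, as in this sketch. The class name QuantifierGroupSketch and the matchedText() helper are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierGroupSketch {
    // Hypothetical helper: collects the text of every match of regex in input
    static List<String> matchedText(String regex, String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group()); // group() returns the subsequence just matched
        }
        return found;
    }

    public static void main(String[] args) {
        String str = "Oh geee, the tree hit my kneee";
        System.out.println("ee? " + matchedText("ee?", str)); // ee? [ee, e, e, ee, ee, e]
        System.out.println("ee* " + matchedText("ee*", str)); // ee* [eee, e, ee, eee]
        System.out.println("ee+ " + matchedText("ee+", str)); // ee+ [eee, ee, eee]
    }
}
```

Each quantifier takes as many 'e's as it is allowed to, which is why 'eee' is split into 'ee' and 'e' for ? but matched whole for * and +.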

As a further note you can also use quantifiers in combination, but I'll leave that for you to look at next time you examine the Java API.

Regex Metacharacter Examples

The table below lists more regex metacharacters with examples of how they can be used. Luckily for certification purposes you don't need to memorize the whole table, just the rows with the light blue background. For a full list of regex constructs visit the Oracle online version of documentation for the Java™ 2 Platform Standard Edition 5.0 API Specification and scroll down the top left pane and click on java.util.regex.

Metacharacter Meaning Examples
Escape/Unescape
\ Used to escape characters that are treated literally within regular expressions or alternatively to unescape special characters Literal Content
d matches the character d
\\d matches a digit character

Unescape Special Characters
d+ matches one or more character d
d\\+ matches d+
Quantifiers
? Matches preceding item 0 or 1 times do?
Every dig has its day
Every dog has its day
Shut that doooor
Can you see me
* Matches preceding item 0 or more times do*
Every dig has its day
Every dog has its day
Shut that doooor
Can you see me
+ Matches preceding item 1 or more times do+
Every dig has its day
Every dog has its day
Shut that doooor
Can you see me
Characters
\f Find a formfeed character. \\f
When matched will return the formfeed character

When searched will return the zero-based index position the formfeed character was found in.
\n Find a newline character. \\n
When matched will return the newline character

When searched will return the zero-based index position the newline character was found in.
\r Find a carriage return character. \\r
When matched will return the carriage return character

When searched will return the zero-based index position the carriage return character was found in.
\t Find a tab character. \\t
When matched will return the tab character

When searched will return the zero-based index position the tab character was found in.
\cA Find a control character A.
Where A is the control character in range A-Z you are looking for.
\\cC
Will search for control-C in a string.
Character Classes
[xyz] A character set.
Matches any of the enclosed characters.
You can specify a range of characters by using a hyphen.
[eno]
one two
one twoo
one twooo
one twoooo
[^xyz] A negated character set.
Matches anything not enclosed in the brackets.
You can specify a range of characters by using a hyphen.
[^eno]
one two
one twoo
one twooo
one twoooo
Predefined Character Classes
. Matches any single character except line terminators, unless the DOTALL flag is specified. .t
This Time tonight
this is good
\d Find a digit character.
Same as the range check [0-9].
\\d
Was it 76 or 77
\D Find a non-digit character.
Same as the range check [^0-9].
\\D
Was it 76 or 77
\s Find a whitespace character. Example below words are greyed out and spaces are highlighted in red purely for emphasis
\\s
Beware of the dog
\S Find a non-whitespace character. Example below spaces are grayed out for emphasis
\\S
Beware of the dog 
\w Find a word character.
A word character is a character in ranges a-z, A-Z, 0-9 and also includes the _ (underscore) symbol.
Same as the range check [A-Za-z0-9_].
\\w
76% off_sales. £12 only
\W Find a non-word character.
Same as the range check [^A-Za-z0-9_].
\\W
76% off_sales. £12 only
\xnn Find a character that equates to hexadecimal nn.
Where nn is a two digit hexadecimal number
\\x70
The quick brown fox jumps.
\unnnn Find a character with the hexadecimal value nnnn.
Where nnnn is a four digit hexadecimal number
\\u0065
The quick brown fox jumps.
Boundary matches
^ Matches beginning of input
If the MULTILINE flag (?m) is set will also match after a line break character.
^A
an Armadillo
An Armadillo
$ Matches end of input
If the MULTILINE flag (?m) is set will also match before a line break character.
Z$
BuzZ BuZz
BuzZ BuZZ
\b Find a match at the beginning or end of a word. At Beginning
\\bday
the day today is saturday
daytime and nighttime

At End
day\\b
the day today is saturday
day and nighttime
\B Find a match NOT at the beginning or end of a word. Not At Beginning
\\Bday
the day today is saturday
daytime and nighttime

Not At End
day\\B
the day today is saturday
day and nighttime
Special Constructs
x(?=y) Matches Regexp(x) only if followed by y and(?= five)
one and two and three
one and two and four
one and two and five
x(?!y) Matches Regexp(x) only if NOT followed by y and(?! five)
one two and three
one two and four
one two and five
Occurrences
{n} Matches exactly n occurrences of the preceding item.
Where n is a positive integer
o{4}
one two
one twoo
one twooo
one twoooo
{n,} Matches at least n occurrences of the preceding item.
Where n is a positive integer
o{3,}
one two
one twoo
one twooo
one twoooo
{n,m} Matches at least n and at most m occurrences of the preceding item.
Where n and m are positive integers
o{2,3}
one two
one twoo
one twooo
one twoooo
Logical Operators
x|y Matches x or y three|four
one two
one two three
one two four
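
A few rows of the table can be tried out with the short sketch below, which prints the zero-based start position of each match for a boundary match, an occurrence count, a lookahead and a logical operator. The class name TableExamplesSketch and the positions() helper are hypothetical names for this illustration; the patterns and input strings are taken from the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableExamplesSketch {
    // Hypothetical helper: collects the start index of every match of regex in input
    static List<Integer> positions(String regex, String input) {
        List<Integer> found = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String days = "the day today is saturday";
        System.out.println(positions("\\bday", days));   // [4], day at the beginning of a word
        System.out.println(positions("day\\b", days));   // [4, 10, 22], day at the end of a word
        System.out.println(positions("o{2,3}", "one twooo"));                  // [6]
        System.out.println(positions("and(?= five)", "one and two and five")); // [12]
        System.out.println(positions("three|four", "one two four"));           // [8]
    }
}
```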

Lesson 8 Complete

In this lesson we looked at regular expressions and how we can use regular expression patterns for matching data.

What's Next?

In our final lesson of the API Contents section we look at formatting and tokenizing our data.
