Regular Expressions


Regular Expression Grammar · Grammar Summary · Semantic Details · Matching and Searching · Replacement Text

A regular expression is a sequence of characters that can match one or more target sequences of characters, according to a regular expression grammar. This implementation supports the following regular expression grammars: BRE (basic regular expressions), ERE (extended regular expressions), ECMAScript, grep, egrep, and awk.

This document describes each of these grammars as provided in this implementation. Most of the differences between the grammars are in the regular expression features that are supported. When a feature is not supported by all of the grammars, the text describing that feature lists the grammars that support it. In some cases the grammars differ in the syntax used to describe a feature (for example, BRE and grep require a backslash in front of a left parenthesis that marks the beginning of a group; the others do not). These differences are described as part of the description of the feature.


Regular Expression Grammar

Element

An element can be any of the following:

Examples:

In ECMAScript, BRE, and grep an element can also be:

For example:

In ECMAScript, an element can also be any of the following:

For example:

In awk, an element can also be one of the following:

Repetition

Any element other than a positive assert, a negative assert, or an anchor can be followed by a repetition count. The most general form of repetition count takes the form "{min,max}", or "\{min,max\}" in BRE and grep. An element followed by this form of repetition count matches at least min and no more than max successive occurrences of a sequence that matches the element.

For example:
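The "{min,max}" form can be sketched with std::regex from the C++ standard library <regex> header, assuming its default ECMAScript grammar. The helper name matches_whole is hypothetical and only serves the illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches `pattern`
// under the default ECMAScript grammar.
bool matches_whole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// "a{2,3}" accepts two or three successive 'a' characters and nothing else.
void repetition_count_demo() {
    assert(!matches_whole("a{2,3}", "a"));
    assert(matches_whole("a{2,3}", "aa"));
    assert(matches_whole("a{2,3}", "aaa"));
    assert(!matches_whole("a{2,3}", "aaaa"));
}
```

Calling repetition_count_demo() from main runs the checks. In BRE and grep the same count would be written "a\{2,3\}".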

A repetition count can also take one of the following forms:

Examples:

For all grammars except BRE and grep, a repetition count can also take one of the following forms:

Examples:

Finally, in ECMAScript, all of the preceding forms of repetition count can be followed by the character '?', which designates a non-greedy repetition.

Concatenation

Regular expression elements, with or without repetition counts, can be concatenated to form longer regular expressions. Such an expression matches a target sequence that is a concatenation of sequences matched by the individual elements.

For example:

Alternation

For all regular expression grammars except BRE and grep, a concatenated regular expression can be followed by the character '|' and another concatenated regular expression, which can be followed by another '|' and another concatenated regular expression, and so on. Such an expression matches any target sequence that matches one or more of the concatenated regular expressions. When more than one of the concatenated regular expressions matches the target sequence, ECMAScript chooses the first of the concatenated regular expressions that matches the sequence as the match (first match); the other regular expression grammars choose the one that results in the longest match.

For example:
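The first-match rule for ECMAScript alternation can be observed with a short sketch. It assumes std::regex from <regex>; first_match is a hypothetical helper, not part of the library.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the text of the first match of `pattern`
// in `text` under the given grammar flag, or an empty string if none.
std::string first_match(const std::string& pattern, const std::string& text,
                        std::regex::flag_type grammar = std::regex::ECMAScript) {
    std::smatch m;
    const std::regex re(pattern, grammar);
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}

void alternation_demo() {
    // ECMAScript: the first alternative that matches wins, so "ab" is chosen
    // even though "abcd" would also match at the same position.
    assert(first_match("ab|abcd", "abcd") == "ab");
    // Reordering the alternatives changes the result.
    assert(first_match("abcd|ab", "abcd") == "abcd");
}
```

Passing std::regex::extended as the third argument selects ERE, for which the rule above calls for the longest alternative ("abcd") regardless of order. That result is not asserted here because it depends on how closely a given library follows the rule.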

In grep and egrep, a newline character ('\n') can be used to separate alternations.

Subexpression

A subexpression is a concatenation in BRE and grep, or an alternation in the other regular expression grammars.

Grammar Summary

Elements Used in Different Grammars
Element                         BRE   ERE   ECMA  grep  egrep awk
alternation using '|'                 +     +           +     +
alternation using '\n'                            +     +
anchor                          +     +     +     +     +     +
back reference                  +           +     +
bracket expression              +     +     +     +     +     +
capture group using "()"              +     +           +     +
capture group using "\(\)"      +                 +
control escape sequence                     +
dsw character escape                        +
file format escape                          +                 +
hexadecimal escape sequence                 +
identity escape                 +     +     +     +     +     +
negative assert                             +
negative word boundary assert               +
non-capture group                           +
non-greedy repetition                       +
octal escape sequence                                         +
ordinary character              +     +     +     +     +     +
positive assert                             +
repetition using "{}"                 +     +           +     +
repetition using "\{\}"         +                 +
repetition using '*'            +     +     +     +     +     +
repetition using '?' and '+'          +     +           +     +
unicode escape sequence                     +
wildcard character              +     +     +     +     +     +
word boundary assert                        +


Semantic Details

Anchor

An anchor matches a position in the target string and not a character. A '^' matches the beginning of the target string, and a '$' matches the end of the target string.

Back Reference

A back reference is a backslash followed by a decimal value N. It matches the contents of the Nth capture group. The value of N must not be greater than the number of capture groups that precede the back reference. In BRE and grep the value of N is determined by the decimal digit that follows the backslash. In ECMAScript the value of N is determined by all of the decimal digits that immediately follow the backslash. Thus, in BRE and grep the value of N is never greater than 9, even if the regular expression has more than nine capture groups. In ECMAScript the value of N is unbounded.

Examples:
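A back reference can be tried out with the following sketch, which assumes std::regex with the ECMAScript grammar. The function name repeats_prefix is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// "(a+)b\1": the back reference \1 must repeat exactly the text that
// capture group 1 matched earlier in the same match.
bool repeats_prefix(const std::string& text) {
    static const std::regex re("(a+)b\\1");
    return std::regex_match(text, re);
}

void back_reference_demo() {
    assert(repeats_prefix("aba"));    // group 1 = "a"
    assert(repeats_prefix("aabaa"));  // group 1 = "aa"
    assert(!repeats_prefix("aaba"));  // "aa" before 'b' but only "a" after it
}
```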

Bracket Expression

A bracket expression defines a set of characters and collating elements. If the bracket expression begins with the character '^' the match succeeds if none of the elements in the set matches the current character in the target sequence. Otherwise, the match succeeds if any of the elements in the set matches the current character in the target sequence.

The set of characters can be defined by listing any combination of individual characters, character ranges, character classes, equivalence classes, and collating symbols.

Capture Group

A capture group marks its contents as a single unit in the regular expression grammar and labels the target text that matches its contents. The label associated with each capture group is a number, determined by counting the left parentheses marking capture groups up to and including the left parenthesis marking the current capture group. In this implementation, the maximum number of capture groups is 31.

Examples:
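The numbering rule can be checked with a small sketch, assuming std::regex and std::smatch from <regex>. The helper capture_groups is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the text of label 0 (the whole match) and
// of every capture group, or an empty vector if `text` does not match.
std::vector<std::string> capture_groups(const std::string& pattern,
                                        const std::string& text) {
    std::vector<std::string> groups;
    std::smatch m;
    const std::regex re(pattern);
    if (std::regex_match(text, m, re)) {
        for (std::size_t i = 0; i < m.size(); ++i) {
            groups.push_back(m.str(i));
        }
    }
    return groups;
}

void capture_group_demo() {
    // Left parentheses are counted from the left: label 1 is "(a(b))",
    // label 2 is the nested "(b)", and label 3 is "(c)".
    const std::vector<std::string> g = capture_groups("(a(b))(c)", "abc");
    assert(g.size() == 4);
    assert(g[0] == "abc");
    assert(g[1] == "ab");
    assert(g[2] == "b");
    assert(g[3] == "c");
}
```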

Character Class

A character class in a bracket expression adds all the characters in the named class to the character set defined by the bracket expression. To create a character class, use "[:" followed by the name of the class followed by ":]". Internally, names of character classes are recognized by calling id = traits.lookup_classname. A character ch belongs to such a class if traits.isctype(ch, id) returns true. The default regex_traits template supports the following class names:

Character Range

A character range in a bracket expression adds all the characters in the range to the character set defined by the bracket expression. To create a character range put the character '-' between the first and last characters in the range. This puts all the characters whose numeric value is greater than or equal to the numeric value of the first character and less than or equal to the numeric value of the last character into the set. Note that this set of added characters depends on the platform-specific representation of characters. If the character '-' occurs at the beginning or end of a bracket expression or as the first or last character of a character range it represents itself.

Examples:
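As a sketch of character ranges and of the literal '-' at the end of a bracket expression, assuming std::regex with the ECMAScript grammar and an ASCII-compatible character set (all_in_set is a hypothetical helper):

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if `text` is non-empty and every character of
// it is matched by the bracket expression `bracket_expr`.
bool all_in_set(const std::string& bracket_expr, const std::string& text) {
    return std::regex_match(text, std::regex(bracket_expr + "+"));
}

void character_range_demo() {
    assert(all_in_set("[a-f0-9]", "c0ffee"));
    assert(!all_in_set("[a-f0-9]", "coffee"));    // 'o' lies outside a-f
    assert(all_in_set("[0-9-]", "2013-01-31"));   // trailing '-' is literal
    assert(!all_in_set("[0-9]", "2013-01-31"));   // no '-' in the set
}
```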

When using locale-sensitive ranges, however, the characters in a range are determined by the collation rules for the locale. Characters that collate after the first character in the definition of the range and before the last character in the definition of the range are in the set, as are the two end characters.

Collating Element

A collating element is a multi-character sequence that is treated as a single character. It can contain any characters except '.', '=', or ':'.

Collating Symbol

A collating symbol in a bracket expression adds a collating element to the set defined by the bracket expression. To create a collating symbol, use "[." followed by the collating element followed by ".]".

Control Escape Sequence

A control escape sequence is a backslash followed by the letter 'c' followed by one of the letters 'a' through 'z' or 'A' through 'Z'. It matches the ASCII control character named by that letter.

For example,

DSW Character Escape

A DSW Character Escape is a Short Name for a Character Class
Escape Sequence   Equivalent Named Class   Default Named Class
"\d"              "[[:d:]]"                "[[:digit:]]"
"\D"              "[^[:d:]]"               "[^[:digit:]]"
"\s"              "[[:s:]]"                "[[:space:]]"
"\S"              "[^[:s:]]"               "[^[:space:]]"
"\w"              "[[:w:]]"                "[a-zA-Z0-9_]"*
"\W"              "[^[:w:]]"               "[^a-zA-Z0-9_]"*
*ASCII character set
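The correspondence between the short escapes and the named classes can be checked with a sketch like the following. It assumes std::regex with the ECMAScript grammar and the default "C" locale; count_matches is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the non-overlapping matches of `pattern`
// found by scanning `text` from left to right.
std::size_t count_matches(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    return static_cast<std::size_t>(std::distance(
        std::sregex_iterator(text.begin(), text.end(), re),
        std::sregex_iterator()));
}

void dsw_escape_demo() {
    const std::string text = "room 42, floor 7";
    // "\d" and "[[:digit:]]" select the same characters: '4', '2', '7'.
    assert(count_matches("\\d", text) == 3);
    assert(count_matches("[[:digit:]]", text) == 3);
    // "\s" and "[[:space:]]" select the three blanks.
    assert(count_matches("\\s", text) == 3);
    assert(count_matches("[[:space:]]", text) == 3);
    // "\w+" finds the four words "room", "42", "floor", "7".
    assert(count_matches("\\w+", text) == 4);
}
```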


Equivalence Class

An equivalence class in a bracket expression adds all the characters and collating elements that are equivalent to the collating element in the equivalence class definition to the set defined by the bracket expression. To create an equivalence class, use "[=" followed by a collating element followed by "=]". Internally, two collating elements elt1 and elt2 are equivalent if traits.transform_primary(elt1.begin(), elt1.end()) == traits.transform_primary(elt2.begin(), elt2.end()).

File Format Escape

A file format escape consists of the usual C language character escape sequences, "\\", "\a", "\b", "\f", "\n", "\r", "\t", "\v", with their usual meanings: backslash, alert, backspace, form feed, newline, carriage return, horizontal tab, and vertical tab, respectively. In ECMAScript "\a" is not allowed. ("\\" is allowed, but technically it is an identity escape, not a file format escape.)

Hexadecimal Escape Sequence

A hexadecimal escape sequence is a backslash followed by the letter 'x' followed by two hexadecimal digits (0-9a-fA-F). It matches a character in the target sequence with the value specified by the two digits.

For example,
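A minimal sketch of hexadecimal escapes, assuming std::regex with the ECMAScript grammar and an ASCII-compatible execution character set (so that 0x41 is 'A' and 0x2d is '-'). The helper full_match is hypothetical.

```cpp
#include <cassert>
#include <regex>

// Hypothetical helper: true if the whole of `text` matches `pattern`.
bool full_match(const char* pattern, const char* text) {
    return std::regex_match(text, std::regex(pattern));
}

void hex_escape_demo() {
    // "\x41" names the character with value 0x41, which is 'A' in ASCII.
    assert(full_match("\\x41", "A"));
    assert(!full_match("\\x41", "a"));
    // The escape can be mixed with ordinary characters.
    assert(full_match("\\x41\\x2dZ", "A-Z"));
}
```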

Identity Escape

An identity escape is a backslash followed by a single character. It matches that character. It is needed when the character has a special meaning; using the identity escape removes the special meaning.

For example,
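The effect of an identity escape on the wildcard character can be sketched as follows, assuming std::regex with the ECMAScript grammar. The function name is_dotted_pair is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// "a\.b": the identity escape "\." matches a literal '.', whereas an
// unescaped '.' would act as the wildcard character.
bool is_dotted_pair(const std::string& text) {
    static const std::regex re("a\\.b");
    return std::regex_match(text, re);
}

void identity_escape_demo() {
    assert(is_dotted_pair("a.b"));
    assert(!is_dotted_pair("axb"));
    // For comparison, the unescaped wildcard accepts any middle character.
    assert(std::regex_match("axb", std::regex("a.b")));
}
```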

The set of characters allowed in an identity escape depends on the regular expression grammar.

Individual Character

An individual character in a bracket expression adds that character to the character set defined by the bracket expression. A '^' anywhere other than at the beginning of a bracket expression represents itself.

Examples:

In all the regular expression grammars except ECMAScript, if a ']' is the first character following the opening '[' or the first character following an initial '^', it represents itself.

Examples:

In ECMAScript use '\]' to represent the character ']' in a bracket expression.

Examples:
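The ECMAScript rules for ']' and for a '^' that is not at the beginning of a bracket expression can be sketched like this, assuming std::regex with its default grammar. The function name is hypothetical.

```cpp
#include <cassert>
#include <regex>

// True if `text` consists only of ']' and '^' characters.
bool only_brackets_and_carets(const char* text) {
    // "[\]^]" holds two individual characters: an escaped ']' and a '^'
    // that is not at the beginning and therefore represents itself.
    static const std::regex re("[\\]^]+");
    return std::regex_match(text, re);
}

void bracket_literal_demo() {
    assert(only_brackets_and_carets("]"));
    assert(only_brackets_and_carets("^"));
    assert(only_brackets_and_carets("^]^"));
    assert(!only_brackets_and_carets("a"));
}
```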

Negative Assert

A negative assert matches anything but its contents; it does not consume any characters in the target sequence.

For example,
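A negative assert written as "(?!...)" can be sketched as follows, assuming std::regex with the ECMAScript grammar. The helper match_length is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: length of the first match of `pattern` in `text`,
// or -1 when the search fails.
long match_length(const std::string& pattern, const std::string& text) {
    std::smatch m;
    const std::regex re(pattern);
    return std::regex_search(text, m, re) ? static_cast<long>(m.length(0)) : -1;
}

void negative_assert_demo() {
    // "foo(?!bar)" matches "foo" only when "bar" does not follow it. The
    // assert consumes nothing, so the match is three characters long.
    assert(match_length("foo(?!bar)", "foobaz") == 3);
    assert(match_length("foo(?!bar)", "foo") == 3);
    assert(match_length("foo(?!bar)", "foobar") == -1);
}
```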

Negative Word Boundary Assert

A negative word boundary assert matches if the current position in the target string is not immediately after a word boundary.

Non-capture Group

A non-capture group marks its contents as a single unit in the regular expression grammar, but does not label the target text.

For example,
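The difference between "(...)" and "(?:...)" shows up in the group labels. The sketch below assumes std::regex with the ECMAScript grammar; group_text is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: text labeled `n` when the whole of `text` matches
// `pattern`, or an empty string when it does not match.
std::string group_text(const std::string& pattern, const std::string& text,
                       std::size_t n) {
    std::smatch m;
    const std::regex re(pattern);
    return std::regex_match(text, m, re) ? m.str(n) : std::string();
}

void non_capture_group_demo() {
    // Both groups capture: label 1 is "(ab)", label 2 is "(c)".
    assert(std::regex("(ab)+(c)").mark_count() == 2);
    assert(group_text("(ab)+(c)", "ababc", 1) == "ab");
    // "(?:ab)" still groups for '+', but takes no label, so "(c)" is label 1.
    assert(std::regex("(?:ab)+(c)").mark_count() == 1);
    assert(group_text("(?:ab)+(c)", "ababc", 1) == "c");
}
```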

Non-greedy Repetition

A non-greedy repetition consumes the shortest subsequence of the target sequence that matches the pattern. A greedy repetition consumes the longest.

For example,
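Greedy and non-greedy repetition can be compared with a short sketch, assuming std::regex with the ECMAScript grammar. The helper first_hit is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: text of the first match of `pattern` in `text`,
// or an empty string when the search fails.
std::string first_hit(const std::string& pattern, const std::string& text) {
    std::smatch m;
    const std::regex re(pattern);
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}

void non_greedy_demo() {
    // Greedy ".+" runs to the last '>' in the target sequence.
    assert(first_hit("<.+>", "<a><b>") == "<a><b>");
    // Non-greedy ".+?" stops at the first '>' that completes a match.
    assert(first_hit("<.+?>", "<a><b>") == "<a>");
    // The '?' suffix applies to "{min,max}" counts as well.
    assert(first_hit("a{2,4}", "aaaa") == "aaaa");
    assert(first_hit("a{2,4}?", "aaaa") == "aa");
}
```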

Octal Escape Sequence

An octal escape sequence is a backslash followed by one, two, or three octal digits (0-7). It matches a character in the target sequence with the value specified by those digits. If all the digits are '0', the sequence is invalid.

For example,

Ordinary Character

An ordinary character is any valid character that doesn't have a special meaning in the current grammar.

In ECMAScript the characters that have special meanings are:

    ^  $  \  .  *  +  ?  (  )  [  ]  {  }  |

In BRE and grep the characters that have special meanings are:

    .   [   \

In addition, the following characters have special meanings when used in a particular context:

In ERE, egrep, and awk the following characters have special meanings:

    .   [   \   (   *   +   ?   {   |

In addition, the following characters have special meanings when used in a particular context.

An ordinary character matches the same character in the target sequence. By default this means that the match succeeds if the two characters are represented by the same value. In a case-insensitive match two characters ch0 and ch1 match if traits.translate_nocase(ch0) == traits.translate_nocase(ch1). In a locale-sensitive match two characters ch0 and ch1 match if traits.translate(ch0) == traits.translate(ch1).

Positive Assert

A positive assert matches its contents, but does not consume any characters in the target sequence.

Examples:
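That a positive assert, written "(?=...)", consumes no characters can be seen from a replacement. This sketch assumes std::regex and std::regex_replace with the ECMAScript grammar; replace_all is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: replaces every match of `pattern` in `text`.
std::string replace_all(const std::string& pattern, const std::string& text,
                        const std::string& replacement) {
    return std::regex_replace(text, std::regex(pattern), replacement);
}

void positive_assert_demo() {
    // Only the "foo" that is followed by "bar" matches, and the "bar"
    // itself is not part of the match, so it survives the replacement.
    assert(replace_all("foo(?=bar)", "foobar foobaz", "X") == "Xbar foobaz");
    // Because nothing is consumed, "bar" must still be matched afterwards.
    assert(std::regex_match("foobar", std::regex("foo(?=bar)bar")));
}
```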

Unicode Escape Sequence

A unicode escape sequence is a backslash followed by the letter 'u' followed by four hexadecimal digits (0-9a-fA-F). It matches a character in the target sequence with the value specified by the four digits.

For example,

Wildcard Character

A wildcard character matches any character in the target sequence except a newline.

Word Boundary

A word boundary occurs in the following situations:

Word Boundary Assert

A word boundary assert matches if the current position in the target string is immediately after a word boundary.
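Word boundary asserts ("\b") and negative word boundary asserts ("\B") can be sketched as follows, assuming std::regex with the ECMAScript grammar. The helper contains is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches somewhere in `text`.
bool contains(const std::string& pattern, const std::string& text) {
    return std::regex_search(text, std::regex(pattern));
}

void word_boundary_demo() {
    // "\bcat\b" requires a word boundary on both sides of "cat".
    assert(contains("\\bcat\\b", "a cat sat"));
    assert(contains("\\bcat\\b", "cat"));
    assert(!contains("\\bcat\\b", "concatenate"));
    // "\Bcat\B" requires that neither side is a word boundary.
    assert(contains("\\Bcat\\B", "concatenate"));
    assert(!contains("\\Bcat\\B", "a cat sat"));
}
```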

Matching and Searching

For a regular expression to match a target sequence, the entire regular expression must match the entire target sequence.

For example:

For a regular expression search to succeed there must be a subsequence somewhere in the target sequence that matches the regular expression. The search ordinarily finds the leftmost matching subsequence.

Examples:
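The difference between matching and searching can be sketched with std::regex_match and std::regex_search, assuming the ECMAScript grammar. The SearchResult type and the search helper are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical result type for the helper below.
struct SearchResult {
    bool found;
    std::ptrdiff_t position;  // offset of the match in the target sequence
    std::string text;         // the matched subsequence
};

SearchResult search(const std::string& pattern, const std::string& text) {
    std::smatch m;
    const std::regex re(pattern);
    if (!std::regex_search(text, m, re)) {
        return SearchResult{false, -1, std::string()};
    }
    return SearchResult{true, m.position(0), m.str(0)};
}

void match_and_search_demo() {
    // A match must cover the entire target sequence.
    assert(!std::regex_match("abc", std::regex("bc")));
    assert(std::regex_match("abc", std::regex("abc")));
    // A search succeeds on any matching subsequence and reports the
    // leftmost one: "aaa" at offset 1, not the later "aa".
    const SearchResult r = search("a+", "xaaay aa");
    assert(r.found);
    assert(r.position == 1);
    assert(r.text == "aaa");
}
```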

If there is more than one subsequence that matches at some position in the target sequence there are two ways to choose the matching pattern. First match chooses the subsequence that was found first when matching the regular expression. Longest match chooses the longest subsequence from the ones that match at that point. If there is more than one subsequence with the maximal length, longest match chooses the subsequence that was found first.

For example:

For example, with a partial match:

Replacement Text

Specifying Replacement Text for ECMAScript and sed
ECMAScript format rules             sed format rules   Replacement text
"$&"                                "&"                The character sequence that matched the entire regular expression ([match[0].first, match[0].second))
"$$"                                                   "$"
                                    "\&"               "&"
"$`" (dollar sign, back quote)                         The character sequence that precedes the subsequence that matched the regular expression ([match.prefix().first, match.prefix().second))
"$'" (dollar sign, forward quote)                      The character sequence that follows the subsequence that matched the regular expression ([match.suffix().first, match.suffix().second))
"$n"                                "\n"               The character sequence that matched the nth (0 <= n <= 9) capture group ([match[n].first, match[n].second))
                                    "\\n"              "\n"
"$nn"                                                  The character sequence that matched the nnth (10 <= nn <= 99) capture group ([match[nn].first, match[nn].second))

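The two sets of format rules can be compared with std::regex_replace, assuming that the default format flag selects the ECMAScript rules and that std::regex_constants::format_sed selects the sed rules. The helper names below are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helpers that swap two words using each set of format rules.
std::string swap_ecmascript(const std::string& text) {
    static const std::regex re("(\\w+) (\\w+)");
    return std::regex_replace(text, re, "$2, $1");
}

std::string swap_sed(const std::string& text) {
    static const std::regex re("(\\w+) (\\w+)");
    return std::regex_replace(text, re, "\\2, \\1",
                              std::regex_constants::format_sed);
}

void replacement_text_demo() {
    assert(swap_ecmascript("John Smith") == "Smith, John");
    assert(swap_sed("John Smith") == "Smith, John");
    // ECMAScript rules: "$$" yields '$' and "$&" yields the whole match.
    assert(std::regex_replace(std::string("cost 5"), std::regex("\\d+"),
                              "$$$&") == "cost $5");
    // "$`" and "$'" yield the text before and after the match.
    assert(std::regex_replace(std::string("abc"), std::regex("b"),
                              "[$`|$']") == "a[a|c]c");
    // sed rules: '&' yields the whole match.
    assert(std::regex_replace(std::string("abc"), std::regex("b"), "<&>",
                              std::regex_constants::format_sed) == "a<b>c");
}
```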



Copyright © 1992-2013 by Dinkumware, Ltd. All rights reserved.