Regular Expressions

Regular Expression Grammar · Grammar Summary · Semantic Details · Matching and Searching · Replacement Text

A regular expression is a sequence of characters that can match one or more target sequences of characters, according to a regular expression grammar. This implementation supports the following regular expression grammars:

BRE -- Basic Regular Expressions, defined by the POSIX Standard, Part 1 (ISO/IEC 9945-1:2003)
ERE -- Extended Regular Expressions, also defined by the POSIX Standard, Part 1
ECMAScript -- ECMAScript regular expressions, as defined by the ECMAScript Language Specification (Ecma-262)
awk -- regular expressions as used in the awk utility, defined by the POSIX Standard, Part 3 (ISO/IEC 9945-3:2003)
grep -- regular expressions as used in the grep utility, also defined by the POSIX Standard, Part 3
egrep -- regular expressions as used in the grep utility with the -E option, also defined by the POSIX Standard, Part 3

This document describes each of these grammars as provided in this implementation. Most of the differences between the grammars are in the regular expression features that are supported. When a feature is not supported by all of the grammars, the text describing that feature lists the grammars that support it. In some cases the grammars differ in the syntax used to express a feature (for example, BRE and grep require a backslash in front of a left parenthesis that marks the beginning of a group, and the others do not). In these cases the differences are described as part of the description of the feature.

Regular Expression Grammar

Element

An element can be any of the following:

An ordinary character, which matches the same character in the target sequence
A wildcard character, '.', which matches any character in the target sequence except a newline
A bracket expression, of the form "[expr]", which matches a character or a collation element in the target sequence that is also in the set defined by the expression expr, or of the form "[^expr]", which matches a character or a collation element in the target sequence that is not in the set defined by the expression expr. The expression expr can consist of any combination of any number of each of the following.
An individual character, which adds that character to the set defined by expr.
A character range, of the form "ch1-ch2", which adds all of the characters represented by values in the closed range [ch1, ch2] to the set defined by expr.
A character class, of the form "[:name:]", which adds all of the characters in the named class to the set defined by expr.
An equivalence class, of the form "[=elt=]", which adds the collating elements that are equivalent to elt to the set defined by expr.
A collating symbol, of the form "[.elt.]", which adds the collation element elt to the set defined by expr.
An anchor, either '^' or '$', which matches the beginning or the end of the target sequence, respectively
A capture group, of the form "( subexpression )", or "\( subexpression \)" in BRE and grep, which matches the sequence of characters in the target sequence that is matched by the pattern between the delimiters
An identity escape, of the form "\k", which matches the character k in the target sequence

Examples:

"a" matches the target sequence "a" but none of the target sequences "B", "b", or "c".
"." matches all of the target sequences "a", "B", "b", and "c".
"[b-z]" matches the target sequences "b" and "c" but does not match the target sequence "a" or the target sequence "B".
"[[:lower:]]" matches the target sequences "a", "b", and "c" but does not match the target sequence "B".
"(a)" matches the target sequence "a" and associates capture group 1 with the subsequence "a", but does not match any of the target sequences "B", "b", or "c".

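The element examples above can be checked with a short sketch. It assumes that the grammars described here are reached through the C++ standard library header <regex> (this document does not name the library), and it uses two hypothetical helpers, matches_whole and first_group, built on std::regex_match with the default ECMAScript grammar. Note that a character class such as [:lower:] is written inside a bracket expression.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches the whole of `target`.
// Uses the default (ECMAScript) grammar.
bool matches_whole(const std::string& pattern, const std::string& target) {
    return std::regex_match(target, std::regex(pattern));
}

// Hypothetical helper: text associated with capture group 1,
// or an empty string if `target` does not match.
std::string first_group(const std::string& pattern, const std::string& target) {
    std::smatch m;
    if (std::regex_match(target, m, std::regex(pattern)) && m.size() > 1)
        return m[1].str();
    return "";
}
```

With these helpers, matches_whole("[b-z]", "c") is expected to be true and matches_whole("[b-z]", "B") false, and first_group("(a)", "a") is expected to return "a".
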
In ECMAScript, BRE, and grep an element can also be:

A back reference, of the form "\dd" where dd represents a decimal value N, which matches a sequence of characters in the target sequence that is the same as the sequence of characters matched by the Nth capture group.

For example:

"(a)\1" matches the target sequence "aa" because the first (and only) capture group matches the initial sequence "a" and the \1 then matches the final sequence "a".

In ECMAScript, an element can also be any of the following:

A non-capture group, of the form "(?: subexpression )", which matches the sequence of characters in the target sequence that is matched by the pattern between the delimiters
A limited file format escape, of the form "\f", "\n", "\r", "\t", or "\v"; these match a form feed, newline, carriage return, horizontal tab, and vertical tab, respectively, in the target sequence.
A positive assert, of the form "(?= subexpression )", which matches the sequence of characters in the target sequence that is matched by the pattern between the delimiters, but does not change the match position in the target sequence.
A negative assert, of the form "(?! subexpression )", which matches any sequence of characters in the target sequence that does not match the pattern between the delimiters, and does not change the match position in the target sequence.
A hexadecimal escape sequence, of the form "\xhh", which matches a character in the target sequence whose representation is the value represented by the two hexadecimal digits hh.
A unicode escape sequence, of the form "\uhhhh", which matches a character in the target sequence whose representation is the value represented by the four hexadecimal digits hhhh.
A control escape sequence, of the form "\ck", which matches the control character named by the character k.
A word boundary assert, of the form "\b", which matches if the current position in the target sequence is immediately after a word boundary.
A negative word boundary assert, of the form "\B", which matches if the current position in the target sequence is not immediately after a word boundary.
A dsw character escape, of the form "\d", "\D", "\s", "\S", "\w", "\W", which provides a short name for a character class.

For example:

"(?:a)" matches the target sequence "a", but "(?:a)\1" is invalid, because there is no capture group 1.
"(?=a)a" matches the target sequence "a". The positive assert matches the initial sequence "a" in the target sequence and the final "a" in the regular expression matches the initial sequence "a" in the target sequence.
"(?!a)a" does not match the target sequence "a".
"a\b." matches the target sequence "a~" but does not match the target sequence "ab".
"a\B." matches the target sequence "ab" but does not match the target sequence "a~".

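The ECMAScript-only examples above can be exercised with the following sketch. It assumes the C++ standard library <regex> header stands in for this implementation; ecma_match and is_invalid are hypothetical helper names. An invalid regular expression is assumed to be reported by a std::regex_error exception thrown from the std::regex constructor.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: whole-sequence match using the ECMAScript grammar.
bool ecma_match(const std::string& pattern, const std::string& target) {
    return std::regex_match(target, std::regex(pattern, std::regex::ECMAScript));
}

// Hypothetical helper: true if constructing the regular expression
// throws std::regex_error, that is, if the pattern is rejected.
bool is_invalid(const std::string& pattern) {
    try {
        std::regex re(pattern, std::regex::ECMAScript);
    } catch (const std::regex_error&) {
        return true;
    }
    return false;
}
```

For instance, ecma_match("a\\b.", "a~") is expected to be true because a word boundary lies between 'a' and '~', while is_invalid("(?:a)\\1") is expected to be true because a non-capture group does not define capture group 1.
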
In awk, an element can also be one of the following:

A file format escape, of the form "\\", "\a", "\b", "\f", "\n", "\r", "\t", or "\v"; these match a backslash, alert, backspace, form feed, newline, carriage return, horizontal tab, and vertical tab, respectively, in the target sequence.
An octal escape sequence, of the form "\ooo", which matches a character in the target sequence whose representation is the value represented by the one, two, or three octal digits ooo.

Repetition

Any element other than a positive assert, a negative assert, or an anchor can be followed by a repetition count. The most general form of repetition count takes the form "{min,max}", or "\{min,max\}" in BRE and grep. An element followed by this form of repetition count matches at least min and no more than max successive occurrences of a sequence that matches the element.

For example:

"a{2,3}" matches the target sequence "aa" and the target sequence "aaa", but not the target sequence "a" or the target sequence "aaaa".

A repetition count can also take one of the following forms:

"{min}", or "\{min\}" in BRE and grep, which is equivalent to "{min,min}".
"{min,}", or "\{min,\}" in BRE and grep, which is equivalent to "{min,unbounded}".
"*", which is equivalent to "{0,unbounded}".

Examples:

"a{2}" matches the target sequence "aa" but not the target sequence "a" or the target sequence "aaa".
"a{2,}" matches the target sequence "aa", the target sequence "aaa", and so on, but does not match the target sequence "a".
"a*" matches the target sequence "", the target sequence "a", the target sequence "aa", and so on.

For all grammars except BRE and grep, a repetition count can also take one of the following forms:

"?", which is equivalent to "{0,1}".
"+", which is equivalent to "{1,unbounded}".

Examples:

"a?" matches the target sequence "" and the target sequence "a", but not the target sequence "aa".
"a+" matches the target sequence "a", the target sequence "aa", and so on, but not the target sequence "".

Finally, in ECMAScript, all of the preceding forms of repetition count can be followed by the character '?', which designates a non-greedy repetition.
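
The repetition forms described in this section can be tried with a small sketch. It assumes the C++ standard library <regex> header, where the grammar is selected by a flag such as std::regex::ECMAScript or std::regex::basic (BRE); matches_whole is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: whole-sequence match under the given grammar flag.
bool matches_whole(const std::string& pattern, const std::string& target,
                   std::regex::flag_type grammar = std::regex::ECMAScript) {
    return std::regex_match(target, std::regex(pattern, grammar));
}
```

With the default grammar, matches_whole("a{2,3}", "aaa") is expected to be true and matches_whole("a{2,3}", "aaaa") false. In BRE the same count is spelled with backslashes, as in matches_whole("a\\{2,3\\}", "aaa", std::regex::basic).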

Concatenation

Regular expression elements, with or without repetition counts, can be concatenated to form longer regular expressions. Such an expression matches a target sequence that is a concatenation of sequences matched by the individual elements.

For example:

"a{2,3}b" matches the target sequence "aab" and the target sequence "aaab", but does not match the target sequence "ab" or the target sequence "aaaab".

Alternation

For all regular expression grammars except BRE and grep, a concatenated regular expression can be followed by the character '|' and another concatenated regular expression, which can be followed by another '|' and another concatenated regular expression, and so on. Such an expression matches any target sequence that matches one or more of the concatenated regular expressions. When more than one of the concatenated regular expressions matches the target sequence, ECMAScript chooses the first of the concatenated regular expressions that matches the sequence as the match (first match); the other regular expression grammars choose the one that results in the longest match.

For example:

"ab|cd" matches the target sequence "ab" and the target sequence "cd", but does not match the target sequence "abd" or the target sequence "acd".

In grep and egrep, a newline character ('\n') can be used to separate alternations.

Subexpression

A subexpression is a concatenation in BRE and grep, or an alternation in the other regular expression grammars.

Grammar Summary

Elements Used in Different Grammars
Element	BRE	ERE	ECMA	grep	egrep	awk
alternation using '|'		+	+		+	+
alternation using '\n'				+	+
anchor	+	+	+	+	+	+
back reference	+		+	+
bracket expression	+	+	+	+	+	+
capture group using "()"		+	+		+	+
capture group using "\(\)"	+			+
control escape sequence			+
dsw character escape			+
file format escape			+			+
hexadecimal escape sequence			+
identity escape	+	+	+	+	+	+
negative assert			+
negative word boundary assert			+
non-capture group			+
non-greedy repetition			+
octal escape sequence						+
ordinary character	+	+	+	+	+	+
positive assert			+
repetition using "{}"		+	+		+	+
repetition using "\{\}"	+			+
repetition using '*'	+	+	+	+	+	+
repetition using '?' and '+'		+	+		+	+
unicode escape sequence			+
wildcard character	+	+	+	+	+	+
word boundary assert			+

Semantic Details

Anchor

An anchor matches a position in the target string and not a character. A '^' matches the beginning of the target string, and a '$' matches the end of the target string.

Back Reference

A back reference is a backslash followed by a decimal value N. It matches the contents of the Nth capture group. The value of N must not be greater than the number of capture groups that precede the back reference. In BRE and grep the value of N is determined by the decimal digit that follows the backslash. In ECMAScript the value of N is determined by all of the decimal digits that immediately follow the backslash. Thus, in BRE and grep the value of N is never greater than 9, even if the regular expression has more than nine capture groups. In ECMAScript the value of N is unbounded.

Examples:

"((a+)(b+))(c+)\3" matches the target sequence "aabbbcbbb". The back reference "\3" matches the text in the third capture group, that is, the "(b+)". It does not match the target sequence "aabbbcbb".
"(a)\2" is not valid.
"(b(((((((((a))))))))))\10" has a different meaning in BRE and in ECMAScript. In BRE the back reference is "\1". It matches the contents of the first capture group (i.e. the one beginning with "(b" and ending with the final ")" preceding the back reference), and the final '0' matches the ordinary character '0'. In ECMAScript the back reference is "\10". It matches the tenth capture group (i.e. the innermost one).

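The back reference examples above can be sketched as follows for the ECMAScript grammar. The sketch assumes the C++ standard library <regex> header; ecma_match and is_valid are hypothetical helper names, and an invalid back reference is assumed to surface as a std::regex_error thrown by the std::regex constructor.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: whole-sequence match using the ECMAScript grammar.
bool ecma_match(const std::string& pattern, const std::string& target) {
    return std::regex_match(target, std::regex(pattern, std::regex::ECMAScript));
}

// Hypothetical helper: true if the pattern is accepted by the
// std::regex constructor under the ECMAScript grammar.
bool is_valid(const std::string& pattern) {
    try {
        std::regex re(pattern, std::regex::ECMAScript);
    } catch (const std::regex_error&) {
        return false;
    }
    return true;
}
```

Here ecma_match("((a+)(b+))(c+)\\3", "aabbbcbbb") is expected to be true, and is_valid("(a)\\2") is expected to be false because only one capture group precedes the back reference. In ECMAScript the pattern "(b(((((((((a))))))))))\\10" refers to the tenth (innermost) group, so it is expected to match "baa".
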
Bracket Expression

A bracket expression defines a set of characters and collating elements. If the bracket expression begins with the character '^' the match succeeds if none of the elements in the set matches the current character in the target sequence. Otherwise, the match succeeds if any of the elements in the set matches the current character in the target sequence.

The set of characters can be defined by listing any combination of individual characters, character ranges, character classes, equivalence classes, and collating symbols.

Capture Group

A capture group marks its contents as a single unit in the regular expression grammar and labels the target text that matches its contents. The label associated with each capture group is a number, determined by counting the left parentheses marking capture groups up to and including the left parenthesis marking the current capture group. In this implementation, the maximum number of capture groups is 31.

Examples:

"ab+" matches the target sequence "abb" but not the target sequence "abab".
"(ab)+" does not match the target sequence "abb" but matches the target sequence "abab".
"((a+)(b+))(c+)" matches the target sequence "aabbbc" and associates capture group 1 with the subsequence "aabbb", capture group 2 with the subsequence "aa", capture group 3 with "bbb", and capture group 4 with the subsequence "c".

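Capture group numbering can be inspected with a sketch like the one below. It assumes the C++ standard library <regex> header, where a std::smatch object holds the whole match at index 0 and capture group N at index N; groups is a hypothetical helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the text of every capture group (index 0 is
// the whole match), or an empty vector if `target` does not match.
std::vector<std::string> groups(const std::string& pattern,
                                const std::string& target) {
    std::smatch m;
    std::vector<std::string> out;
    if (std::regex_match(target, m, std::regex(pattern))) {
        for (const auto& sub : m)
            out.push_back(sub.str());
    }
    return out;
}
```

For "((a+)(b+))(c+)" and the target "aabbbc", the returned vector is expected to hold the whole match followed by "aabbb", "aa", "bbb", and "c", in the order of the left parentheses.
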
Character Class

A character class in a bracket expression adds all the characters in the named class to the character set defined by the bracket expression. To create a character class, use "[:" followed by the name of the class followed by ":]". Internally, names of character classes are recognized by calling id = traits.lookup_classname. A character ch belongs to such a class if traits.isctype(ch, id) returns true. The default regex_traits template supports the following class names:

"alnum" -- lowercase letters, uppercase letters, and digits;
"alpha" -- lowercase letters and uppercase letters;
"blank" -- space or tab;
"cntrl" -- the file format escape characters;
"digit" -- digits;
"graph" -- lowercase letters, uppercase letters, digits, and punctuation;
"lower" -- lowercase letters;
"print" -- lowercase letters, uppercase letters, digits, punctuation, and space;
"punct" -- punctuation;
"space" -- space;
"upper" -- uppercase characters;
"xdigit" -- digits, 'a', 'b', 'c', 'd', 'e', 'f', 'A', 'B', 'C', 'D', 'E', 'F';
"d" -- same as digit;
"s" -- same as space;
"w" -- same as alnum.

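The lookup described above can be sketched directly against the traits class. The sketch assumes the C++ standard library's std::regex_traits<char>, whose lookup_classname member takes an iterator range over the class name; in_named_class is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Sketch: ask the default traits class whether `ch` belongs to the
// character class named `name` (for example "digit" or "lower").
bool in_named_class(char ch, const std::string& name) {
    std::regex_traits<char> traits;
    const auto id = traits.lookup_classname(name.begin(), name.end());
    return traits.isctype(ch, id);
}
```

For example, in_named_class('7', "digit") is expected to be true and in_named_class('A', "lower") false. The results depend on the locale imbued in the traits object; the default "C" locale is assumed here.
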
Character Range

A character range in a bracket expression adds all the characters in the range to the character set defined by the bracket expression. To create a character range put the character '-' between the first and last characters in the range. This puts all the characters whose numeric value is greater than or equal to the numeric value of the first character and less than or equal to the numeric value of the last character into the set. Note that this set of added characters depends on the platform-specific representation of characters. If the character '-' occurs at the beginning or end of a bracket expression or as the first or last character of a character range it represents itself.

Examples:

"[0-7]" represents the set of characters { '0', '1', '2', '3', '4', '5', '6', '7' }. It matches the target sequences "0", "1", etc., but not "a".
"[h-k]" represents the set of characters { 'h', 'i', 'j', 'k' } on systems that use the ASCII character encoding; it matches the target sequences "h", "i", etc., but not "\x8A" or "0".
"[h-k]" represents the set of characters { 'h', 'i', '\x8A', '\x8B', '\x8C', '\x8D', '\x8E', '\x8F', '\x90', 'j', 'k' } on systems that use the EBCDIC character encoding ('h' is encoded as 0x88 and 'k' is encoded as 0x92). It matches the target sequences "h", "i", "\x8A", etc., but not "0".
"[-0-24]" represents the set of characters { '-', '0', '1', '2', '4' }.
"[0-2-]" represents the set of characters { '0', '1', '2', '-' }.
"[+--]" on systems that use ASCII represents the set of characters { '+', ',', '-' }.

When using locale-sensitive ranges, however, the characters in a range are determined by the collation rules for the locale. Characters that collate after the first character in the definition of the range and before the last character in the definition of the range are in the set, as are the two end characters.
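
The range examples above, including the placement of '-', can be checked with the following sketch. It assumes the C++ standard library <regex> header, the default ECMAScript grammar, ranges that are not locale-sensitive, and an ASCII-based character encoding; in_set is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the single character `ch` is matched by
// the bracket expression `bracket` (for example "[0-7]").
bool in_set(const std::string& bracket, char ch) {
    const std::string target(1, ch);
    return std::regex_match(target, std::regex(bracket));
}
```

For example, in_set("[-0-24]", '-') and in_set("[-0-24]", '4') are expected to be true, while in_set("[-0-24]", '3') is expected to be false.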

Collating Element

A collating element is a multi-character sequence that is treated as a single character. It can contain any characters except '.', '=', or ':'.

Collating Symbol

A collating symbol in a bracket expression adds a collating element to the set defined by the bracket expression. To create a collating symbol, use "[." followed by the collating element followed by ".]".

Control Escape Sequence

A control escape sequence is a backslash followed by the letter 'c' followed by one of the letters 'a' through 'z' or 'A' through 'Z'. It matches the ASCII control character named by that letter.

For example,

"\ci" matches the target sequence "\x09", because <ctrl-i> has the value 0x09.

DSW Character Escape

A DSW Character Escape is a Short Name for a Character Class
Escape Sequence	Equivalent Named Class	Default Named Class
"\d"	"[[:d:]]"	"[[:digit:]]"
"\D"	"[^[:d:]]"	"[^[:digit:]]"
"\s"	"[[:s:]]"	"[[:space:]]"
"\S"	"[^[:s:]]"	"[^[:space:]]"
"\w"	"[[:w:]]"	"[a-zA-Z0-9_]"*
"\W"	"[^[:w:]]"	"[^a-zA-Z0-9_]"*
*ASCII character set

Equivalence Class

An equivalence class in a bracket expression adds all the characters and collating elements that are equivalent to the collating element in the equivalence class definition to the set defined by the bracket expression. To create an equivalence class, use "[=" followed by a collating element followed by "=]". Internally, two collating elements elt1 and elt2 are equivalent if traits.transform_primary(elt1.begin(), elt1.end()) == traits.transform_primary(elt2.begin(), elt2.end()).

File Format Escape

A file format escape consists of the usual C language character escape sequences, "\\", "\a", "\b", "\f", "\n", "\r", "\t", "\v", with their usual meanings, namely, backslash, alert, backspace, form feed, newline, carriage return, horizontal tab, and vertical tab, respectively. In ECMAScript "\a" is not allowed. ("\\" is allowed, but technically it's an identity escape, not a file format escape).

Hexadecimal Escape Sequence

A hexadecimal escape sequence is a backslash followed by the letter 'x' followed by two hexadecimal digits (0-9a-fA-F). It matches a character in the target sequence with the value specified by the two digits.

For example,

"\x41" matches the target sequence "A" when the ASCII character encoding is used.

Identity Escape

An identity escape is a backslash followed by a single character. It matches that character. It is needed when the character has a special meaning; using the identity escape removes the special meaning.

For example,

"a*" matches the target sequence "aaa" but does not match the target sequence "a*".
"a\*" does not match the target sequence "aaa" but does match the target sequence "a*".

The set of characters allowed in an identity escape depends on the regular expression grammar.

BRE, grep -- { '(', ')', '{', '}', '.', '[', '\', '*', '^', '$' }.
ERE, egrep -- { '(', ')', '{', '}', '.', '[', '\', '*', '^', '$', '+', '?', '|' }.
awk -- ERE plus { '"', '/' }.
ECMAScript -- all characters except those that can be part of an identifier. Roughly speaking, the excluded characters are letters, digits, '$', '_', and unicode escape sequences. For full details see the ECMAScript Language Specification.

Individual Character

An individual character in a bracket expression adds that character to the character set defined by the bracket expression. A '^' anywhere other than at the beginning of a bracket expression represents itself.

Examples:

"[abc]" matches the target sequences "a", "b", and "c" but not the sequence "d".
"[^abc]" matches the target sequence "d", but not "a", "b", or "c".
"[a^bc]" matches the target sequences "a", "b", "c", and "^" but not the sequence "d".

In all the regular expression grammars except ECMAScript if a ']' is the first character following the opening '[' or the first character following an initial '^' it represents itself.

Examples:

"[]a" is invalid, because there is no ']' to end the bracket expression.
"[]abc]" matches the target sequences "a", "b", "c", and "]" but not the sequence "d".
"[^]abc]" matches the target sequence "d", but not "a", "b", "c", or "]".

In ECMAScript use '\]' to represent the character ']' in a bracket expression.

Examples:

"[]a" matches the target sequence "a" because the bracket expression is empty.
"[\]abc]" matches the target sequences "a", "b", "c", and "]" but not the sequence "d".

Negative Assert

A negative assert matches anything but its contents; it does not consume any characters in the target sequence.

For example,

"(?!aa)(a*)" matches the target sequence "a" and associates capture group 1 with the subsequence "a". It does not match the target sequence "aa" or the target sequence "aaa".

Negative Word Boundary Assert

A negative word boundary assert matches if the current position in the target string is not immediately after a word boundary.

Non-capture Group

A non-capture group marks its contents as a single unit in the regular expression grammar, but does not label the target text.

For example,

"(a)(?:b)*(c)" matches the target text "abbc" and associates capture group 1 with the subsequence "a" and capture group 2 with the subsequence "c".

Non-greedy Repetition

A non-greedy repetition consumes the shortest subsequence of the target sequence that matches the pattern. A greedy repetition consumes the longest.

For example,

"(a+)(a*b)" matches the target sequence "aaab". With a non-greedy repetition, written "(a+?)(a*b)", it associates capture group 1 with the subsequence "a" at the beginning of the target sequence and capture group 2 with the subsequence "aab" at the end of the target sequence. With the greedy repetition "(a+)(a*b)" it associates capture group 1 with the subsequence "aaa" and capture group 2 with the subsequence "b".

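The difference between greedy and non-greedy repetition can be observed with a sketch such as the following. It assumes the C++ standard library <regex> header with the ECMAScript grammar, where the non-greedy form is written by appending '?' to the repetition count, as in "(a+?)(a*b)"; two_groups is a hypothetical helper name.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns the text of capture groups 1 and 2 for a
// whole-sequence match, or a pair of empty strings if there is no match.
std::pair<std::string, std::string> two_groups(const std::string& pattern,
                                               const std::string& target) {
    std::smatch m;
    if (std::regex_match(target, m, std::regex(pattern)) && m.size() >= 3)
        return {m[1].str(), m[2].str()};
    return {};
}
```

For the target "aaab", two_groups("(a+?)(a*b)", "aaab") is expected to yield "a" and "aab", while two_groups("(a+)(a*b)", "aaab") is expected to yield "aaa" and "b".
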
Octal Escape Sequence

An octal escape sequence is a backslash followed by one, two, or three octal digits (0-7). It matches a character in the target sequence with the value specified by those digits. If all the digits are '0' the sequence is invalid.

For example,

"\101" matches the target sequence "A" when the ASCII character encoding is used.

Ordinary Character

An ordinary character is any valid character that doesn't have a special meaning in the current grammar.

In ECMAScript the characters that have special meanings are:

    ^  $  \  .  *  +  ?  (  )  [  ]  {  }  |

In BRE and grep the characters that have special meanings are:

    .   [   \

In addition, the following characters have special meanings when used in a particular context:

'*' has a special meaning in all cases except when it is the first character in a regular expression or the first character following an initial '^' in a regular expression, or when it is the first character of a capture group or the first character following an initial '^' in a capture group.
'^' has a special meaning when it is the first character of a regular expression.
'$' has a special meaning when it is the last character of a regular expression.

In ERE, egrep, and awk the following characters have special meanings:

    .   [   \   (   *   +   ?   {   |

In addition, the following characters have special meanings when used in a particular context.

')' has a special meaning when it matches a preceding '('.
'^' has a special meaning when it is the first character of a regular expression.
'$' has a special meaning when it is the last character of a regular expression.

An ordinary character matches the same character in the target sequence. By default this means that the match succeeds if the two characters are represented by the same value. In a case-insensitive match two characters ch0 and ch1 match if traits.translate_nocase(ch0) == traits.translate_nocase(ch1). In a locale-sensitive match two characters ch0 and ch1 match if traits.translate(ch0) == traits.translate(ch1).

Positive Assert

A positive assert matches its contents, but does not consume any characters in the target sequence.

Examples:

"(?=aa)(a*)" matches the target sequence "aaaa" and associates capture group 1 with the subsequence "aaaa".
In contrast, "(aa)(a*)" matches the target sequence "aaaa" and associates capture group 1 with the subsequence "aa" at the beginning of the target sequence and capture group 2 with the subsequence "aa" at the end of the target sequence.
"(?=aa)(a)|(a)" matches the target sequence "a" and associates capture group 1 with an empty sequence (because the positive assert failed) and capture group 2 with the subsequence "a". A search in the target sequence "aa" also succeeds: the positive assert succeeds, so capture group 1 is associated with the initial subsequence "a" and capture group 2 with an empty sequence.

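The effect of an assert on capture groups can be sketched as follows. The sketch assumes the C++ standard library <regex> header with the ECMAScript grammar; groups is a hypothetical helper name, and a capture group that did not take part in the match is reported as an empty string.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: text of every capture group for a whole-sequence
// match (index 0 is the whole match); empty vector if there is no match.
std::vector<std::string> groups(const std::string& pattern,
                                const std::string& target) {
    std::smatch m;
    std::vector<std::string> out;
    if (std::regex_match(target, m, std::regex(pattern))) {
        for (const auto& sub : m)
            out.push_back(sub.str());
    }
    return out;
}
```

Because the assert consumes no characters, groups("(?=aa)(a*)", "aaaa") is expected to report "aaaa" for capture group 1, whereas groups("(aa)(a*)", "aaaa") is expected to report "aa" for each of groups 1 and 2.
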
Unicode Escape Sequence

A unicode escape sequence is a backslash followed by the letter 'u' followed by four hexadecimal digits (0-9a-fA-F). It matches a character in the target sequence with the value specified by the four digits.

For example,

"\u0041" matches the target sequence "A" when the ASCII character encoding is used.

Wildcard Character

A wildcard character matches any character in the target sequence except a newline.

Word Boundary

A word boundary occurs in the following situations:

the current character is at the beginning of the target sequence and the current character is one of the word characters A-Za-z0-9_
the current character position is past the end of the target sequence and the last character in the target sequence is one of the word characters
the current character is one of the word characters and the preceding character is not
the current character is not one of the word characters and the preceding character is.

Word Boundary Assert

A word boundary assert matches if the current position in the target string is immediately after a word boundary.

Matching and Searching

For a regular expression to match a target sequence, the entire regular expression must match the entire target sequence.

For example:

the regular expression "bcd" matches the target sequence "bcd" but does not match the target sequence "abcd" nor the target sequence "bcde".

For a regular expression search to succeed there must be a subsequence somewhere in the target sequence that matches the regular expression. The search ordinarily finds the leftmost matching subsequence.

Examples:

A search for the regular expression "bcd" in the target sequence "bcd" succeeds and matches the entire sequence; the same search in the target sequence "abcd" also succeeds and matches the last three characters; the same search in the target sequence "bcde" also succeeds, and matches the first three characters.
A search for the regular expression "bcd" in the target sequence "bcdbcd" succeeds and matches the first three characters.
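
The distinction between matching and searching can be sketched as follows. The sketch assumes the C++ standard library <regex> header, where std::regex_match requires the entire target sequence to match and std::regex_search looks for a matching subsequence; matches_whole and search_position are hypothetical helper names.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if the entire target sequence matches.
bool matches_whole(const std::string& pattern, const std::string& target) {
    return std::regex_match(target, std::regex(pattern));
}

// Hypothetical helper: offset of the subsequence found by a search,
// or -1 if the search fails.
long search_position(const std::string& pattern, const std::string& target) {
    std::smatch m;
    if (std::regex_search(target, m, std::regex(pattern)))
        return static_cast<long>(m.position(0));
    return -1;
}
```

For example, matches_whole("bcd", "abcd") is expected to be false, while search_position("bcd", "abcd") is expected to return 1 and search_position("bcd", "bcdbcd") to return 0, the leftmost occurrence.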

If there is more than one subsequence that matches at some position in the target sequence there are two ways to choose the matching pattern. First match chooses the subsequence that was found first when matching the regular expression. Longest match chooses the longest subsequence from the ones that match at that point. If there is more than one subsequence with the maximal length, longest match chooses the subsequence that was found first.

For example:

a search for the regular expression "b|bc" in the target sequence "abcd" matches the subsequence "b" with first match, because the left-hand term of the alternation matched that subsequence and there was no need to try the right-hand term of the alternation; the same search matches "bc" with longest match, because "bc" is longer than "b".
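
The two selection rules can be compared with the sketch below. It assumes the C++ standard library <regex> header, that the ECMAScript grammar uses first match, and that the library implements longest match for the POSIX grammars as described above (std::regex::extended is used for ERE); found is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: text of the subsequence found by a search under
// the given grammar flag, or an empty string if the search fails.
std::string found(const std::string& pattern, const std::string& target,
                  std::regex::flag_type grammar) {
    std::smatch m;
    if (std::regex_search(target, m, std::regex(pattern, grammar)))
        return m.str(0);
    return "";
}
```

Under these assumptions, found("b|bc", "abcd", std::regex::ECMAScript) is expected to return "b" and found("b|bc", "abcd", std::regex::extended) is expected to return "bc".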

For example, with a partial match:

"ab" matches the target sequence "a" but not "ac".

Replacement Text

Specifying Replacement Text for ECMAScript and sed
ECMAScript format rules	sed format rules	Replacement text
"$&"	"&"	The character sequence that matched the entire regular expression (`[match[0].first, match[0].second)`)
"$$"		"$"
	"\&"	"&"
"$`" (dollar sign followed by back quote)		The character sequence that precedes the subsequence that matched the regular expression (`[match.prefix().first, match.prefix().second)`)
"$'" (dollar sign followed by single quote)		The character sequence that follows the subsequence that matched the regular expression (`[match.suffix().first, match.suffix().second)`)
"$n"	"\n"	The character sequence that matched the `n`th (`0 <= n <= 9`) capture group (`[match[n].first, match[n].second)`)
	"\\n"	"\n"
"$nn"		The character sequence that matched the `nn`th (`10 <= nn <= 99`) capture group (`[match[nn].first, match[nn].second)`)

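The format rules in the table can be tried with a sketch like the following. It assumes the C++ standard library <regex> header, where std::regex_replace applies the ECMAScript format rules by default and the sed format rules when std::regex_constants::format_sed is passed; replace_ecma and replace_sed are hypothetical helper names.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replace each match using the ECMAScript format rules.
std::string replace_ecma(const std::string& target, const std::string& pattern,
                         const std::string& fmt) {
    return std::regex_replace(target, std::regex(pattern), fmt);
}

// Hypothetical helper: replace each match using the sed format rules.
// The pattern itself still uses the default ECMAScript grammar.
std::string replace_sed(const std::string& target, const std::string& pattern,
                        const std::string& fmt) {
    return std::regex_replace(target, std::regex(pattern), fmt,
                              std::regex_constants::format_sed);
}
```

For example, replace_ecma("abc", "b", "[$&]") and replace_sed("abc", "b", "[&]") are both expected to produce "a[b]c", and replace_ecma("abc", "(b)", "<$1>") is expected to produce "a<b>c".
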