Regular Expression Grammar
· Grammar Summary
· Semantic Details
· Matching and Searching
· Replacement Text
A regular expression is a sequence
of characters that can match one or more target sequences of characters,
according to a regular expression grammar. This
implementation supports
the following regular expression grammars:
- BRE -- Basic Regular Expressions,
defined by the POSIX Standard, Part 1
(ISO/IEC 9945-1:2003)
- ERE -- Extended Regular Expressions,
also defined by the POSIX Standard, Part 1
- ECMAScript --
ECMAScript regular expressions, as defined by the
ECMAScript Language Specification (Ecma-262)
- awk -- regular expressions
as used in the awk utility,
defined by the POSIX Standard, Part 3
(ISO/IEC 9945-3:2003)
- grep -- regular expressions
as used in the grep utility,
also defined by the POSIX Standard, Part 3
- egrep -- regular expressions
as used in the grep utility with the -E option,
also defined by the POSIX Standard, Part 3
This document describes each of these grammars as provided in
this implementation. Most
of the differences between the grammars are in the regular expression
features that are supported. When features are not supported by all of
the grammars the text describing those features lists the grammars that
support them. In some cases the differences between the grammars are in the
syntax used to describe a feature (for example, BRE and grep
require a backslash in front of a left parenthesis that marks the beginning
of a group and the others do not). In these cases the differences are
described as part of the description of the feature.
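As a rough illustration of how the choice of grammar changes the meaning of a pattern, the sketch below assumes that the grammars are selected through the syntax flags of `std::regex` (`std::regex::ECMAScript`, `std::regex::basic`, `std::regex::extended`, and so on). The helper name `matches_with` is made up for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern`, compiled with the given grammar
// flag, matches the whole of `target`.
bool matches_with(const std::string& pattern, const std::string& target,
                  std::regex::flag_type grammar) {
    std::regex re(pattern, grammar);
    return std::regex_match(target, re);
}
```

With this helper, `matches_with("a+", "aaa", std::regex::ECMAScript)` is expected to succeed, while under `std::regex::basic` the '+' is an ordinary character, so the same pattern matches only the literal text "a+".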
An element can be any of the following:
- An ordinary character,
which matches the same character in the target sequence
- A wildcard character, '.',
which matches any character in the target sequence except a newline
- A bracket expression, of the form
"[expr]",
which matches a character or a
collation element in the target
sequence that is also in the set defined by the expression expr,
or of the form "[^expr]", which matches
a character or a collation element in the target sequence that is not
in the set defined by the expression expr. The expression expr
can consist of any combination of any number of each of the following.
- An individual character, which
adds that character to the set defined by expr.
- A character range, of the form
"ch1-ch2",
which adds all of the characters represented by values in the closed
range [ch1, ch2] to the set defined by expr.
- A character class, of the form
"[:name:]",
which adds all of the characters in the named class to the set defined by
expr.
- An equivalence class, of the form
"[=elt=]",
which adds the collating elements that are equivalent to elt to the set
defined by expr.
- A collating symbol, of the form
"[.elt.]",
which adds the collation element elt to the set defined by
expr.
- An anchor, either '^' or '$',
which matches the beginning or the end of the target sequence, respectively
- A capture group, of the form
"( subexpression )", or
"\( subexpression \)" in
BRE and grep, which matches the sequence of characters in the
target sequence that is matched by the pattern between the delimiters
- An identity escape, of the form
"\k", which matches the character k in
the target sequence
Examples:
- "a" matches the target sequence "a" but none
of the target sequences "B", "b", or "c".
- "." matches all of the target sequences "a",
"B", "b", and "c".
- "[b-z]" matches the target sequences "b"
and "c" but does not match the target sequence "a"
or the target sequence "B".
- "[[:lower:]]" matches the target sequences "a",
"b", and "c" but does not match the target
sequence "B".
- "(a)" matches the target sequence "a"
and associates capture group 1 with the subsequence "a", but does not
match any of the target sequences "B", "b",
or "c".
In ECMAScript, BRE, and grep an element can also be:
- a back reference,
of the form "\dd" where dd represents a
decimal value N, which matches a sequence of characters in the target
sequence that is the same as the sequence of characters matched by the Nth
capture group.
For example:
- "(a)\1" matches the target sequence "aa" because the first
(and only) capture group matches the initial sequence "a" and
the \1 then matches the final sequence "a".
In ECMAScript, an element can also be any of the following:
- A non-capture group, of the form
"(?: subexpression )", which
matches the sequence of characters in the target sequence that is matched
by the pattern between the delimiters
- a limited file format escape,
of the form
"\f",
"\n",
"\r",
"\t", or
"\v"; these match a form feed, newline, carriage return,
horizontal tab, and vertical tab, respectively, in the target sequence.
- A positive assert,
of the form "(?= subexpression )",
which matches the sequence of characters in the target sequence that is
matched by the pattern between the delimiters, but does not change the
match position in the target sequence.
- A negative assert,
of the form "(?! subexpression )",
which matches any sequence of characters in the target sequence that
does not match the pattern between the delimiters,
and does not change the match position in the target sequence.
- A hexadecimal escape sequence,
of the form "\xhh", which matches a character in the
target sequence whose representation is the value represented by the two
hexadecimal digits hh.
- A unicode escape sequence,
of the form "\uhhhh", which matches a character in
the target sequence whose representation is the value represented by the
four hexadecimal digits hhhh.
- A control escape sequence,
of the form "\ck", which matches the control
character named by the character k.
- A word boundary assert,
of the form "\b", which matches if the current position
in the target sequence is immediately after a
word boundary.
- A negative word boundary assert,
of the form "\B", which matches if the current position in the
target sequence is not immediately after a
word boundary.
- A dsw character escape, of
the form
"\d",
"\D",
"\s",
"\S",
"\w",
"\W",
which provides a short name for a character class.
For example:
- "(?:a)" matches the target sequence "a", but
"(?:a)\1" is invalid, because there is no capture group 1.
- "(?=a)a" matches the target sequence "a". The
positive assert matches the initial sequence "a" in the target
sequence and the final "a" in the regular expression matches the
initial sequence "a" in the target sequence.
- "(?!a)a" does not match the target sequence
"a".
- "a\b." matches the target sequence "a~"
but does not match the target sequence "ab".
- "a\B." matches the target sequence "ab"
but does not match the target sequence "a~".
In awk, an element can also be one of the following:
- A file format escape, of the form
"\\",
"\a",
"\b",
"\f",
"\n",
"\r",
"\t", or
"\v"; these match a backslash, alert, backspace, form feed,
newline, carriage return, horizontal tab, and vertical tab, respectively,
in the target sequence.
- An octal escape sequence,
of the form "\ooo", which matches a character in
the target sequence whose representation is the value represented by the
one, two, or three octal digits ooo.
Any element other than a positive assert,
a negative assert, or an
anchor can be followed by a
repetition count.
The most general form of repetition count takes the form
"{min,max}",
or "\{min,max\}" in BRE and grep.
An element followed by this form of repetition count matches at least min
and no more than max successive occurrences of a sequence that matches
the element.
For example:
- "a{2,3}" matches the target sequence "aa" and
the target sequence "aaa", but not the target sequence "a"
or the target sequence "aaaa".
A repetition count can also take one of the following forms:
- "{min}", or "\{min\}"
in BRE and grep, which is equivalent to
"{min,min}".
- "{min,}", or "\{min,\}"
in BRE and grep, which is equivalent to
"{min,unbounded}".
- "*", which is equivalent to "{0,unbounded}".
Examples:
- "a{2}" matches the target sequence "aa" but
not the target sequence "a" or the target sequence "aaa".
- "a{2,}" matches the target sequence "aa", the
target sequence "aaa", and so on, but does not match the target
sequence "a".
- "a*" matches the target sequence "", the target
sequence "a", the target sequence "aa", and so on.
For all grammars except BRE and grep, a repetition count can
also take one of the following forms:
- "?", which is equivalent to "{0,1}".
- "+", which is equivalent to "{1,unbounded}".
Examples:
- "a?" matches the target sequence "" and the target
sequence "a", but not the target sequence "aa".
- "a+" matches the target sequence "a", the target
sequence "aa", and so on, but not the target sequence
"".
Finally, in ECMAScript, all of the preceding forms of repetition
count can be followed by the character '?', which designates a
non-greedy repetition.
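The repetition forms above can be compared by checking which run lengths of 'a' each one accepts. This is a minimal sketch using the ECMAScript grammar through `std::regex`; the function name `accepted_lengths` is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every n in [0, max_len] for which a target
// sequence of n 'a' characters is matched in full by `pattern`.
std::vector<int> accepted_lengths(const std::string& pattern, int max_len) {
    std::regex re(pattern);  // ECMAScript grammar by default
    std::vector<int> lengths;
    for (int n = 0; n <= max_len; ++n) {
        if (std::regex_match(std::string(n, 'a'), re)) {
            lengths.push_back(n);
        }
    }
    return lengths;
}
```

For instance, `accepted_lengths("a{2,3}", 5)` should contain only 2 and 3, and `accepted_lengths("a?", 3)` only 0 and 1.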
Regular expression elements, with or without
repetition counts, can be concatenated to
form longer regular expressions. Such an expression matches a target
sequence that is a concatenation of sequences matched by the individual
elements.
For example:
- "a{2,3}b" matches the target sequence "aab" and
the target sequence "aaab", but does not match the target sequence
"ab" or the target sequence "aaaab".
For all regular expression grammars except BRE and grep, a
concatenated regular expression can be followed by the character '|'
and another concatenated regular expression, which can be followed by
another '|' and another concatenated regular expression, and so on. Such an
expression matches any target sequence that matches one or more of the
concatenated regular expressions. When more than one of the concatenated regular
expressions matches the target sequence, ECMAScript chooses the first of
the concatenated regular expressions that matches the sequence as the
match (first match); the other regular
expression grammars choose the one that results in the
longest match.
For example:
- "ab|cd" matches the target sequence "ab" and
the target sequence "cd", but does not match the target sequence
"abd" or the target sequence "acd".
In grep and egrep, a newline character ('\n') can
be used to separate alternations.
A subexpression is a
concatenation
in BRE and grep, or an
alternation in
the other regular expression grammars.
Anchor
An anchor matches a position in the target
string and not a character. A '^' matches the beginning of the
target string, and a '$' matches the end of the target string.
Back Reference
A back reference is a backslash followed
by a decimal value N. It matches the contents of the
Nth capture group.
The value of N must not be greater than the number of capture
groups that precede the back reference. In BRE and grep the value
of N is determined by the decimal digit that follows the backslash. In
ECMAScript the value of N is determined by all of the decimal digits
that immediately follow the backslash. Thus, in BRE and grep the
value of N is never greater than 9, even if the regular expression has more
than nine capture groups.
In ECMAScript the value of N is unbounded.
Examples:
- "((a+)(b+))(c+)\3" matches the target sequence
"aabbbcbbb". The back reference "\3" matches the
text in the third capture group, that is, the "(b+)". It
does not match the target sequence
"aabbbcbb".
- "(a)\2" is not valid.
- "(b(((((((((a))))))))))\10" has a different meaning
in BRE and in ECMAScript. In BRE the back reference
is "\1". It matches the contents of the first capture group (i.e.
the one beginning with "(b" and ending with the final ")"
preceding the back reference), and
the final '0' matches the ordinary character '0'. In ECMAScript the
back reference is "\10". It matches the tenth capture group (i.e.
the innermost one).
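The first two examples can be checked with a short sketch. It assumes the ECMAScript grammar and assumes that an invalid back reference is reported by throwing `std::regex_error` when the expression is constructed. Both function names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Matches "((a+)(b+))(c+)\3" against the whole target sequence.
bool backref_matches(const std::string& target) {
    static const std::regex re("((a+)(b+))(c+)\\3");
    return std::regex_match(target, re);
}

// True if constructing the pattern throws std::regex_error.
bool is_invalid_pattern(const std::string& pattern) {
    try {
        std::regex re(pattern);
        (void)re;  // only the construction matters here
        return false;
    } catch (const std::regex_error&) {
        return true;
    }
}
```

Here `backref_matches("aabbbcbbb")` should succeed and `backref_matches("aabbbcbb")` should fail, while `is_invalid_pattern("(a)\\2")` should report the pattern as invalid.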
Bracket Expression
A bracket expression defines a set
of characters and collating elements.
If the bracket expression begins with the character '^' the match
succeeds if none of the elements in the set matches the current character in
the target sequence. Otherwise, the match succeeds if any of the elements
in the set matches the current character in the target sequence.
The set of characters can be defined by listing any combination of
individual characters,
character ranges,
character classes,
equivalence classes, and
collating symbols.
Capture Group
A capture group marks its contents
as a single unit in the regular expression grammar and labels the target
text that matches its contents. The label associated with each capture group
is a number, determined by counting the left parentheses marking capture
groups up to and including the left parenthesis marking the current
capture group. In this implementation, the maximum number of capture
groups is 31.
Examples:
- "ab+" matches the target sequence "abb" but
not the target sequence "abab".
- "(ab)+" does not match the target sequence "abb"
but matches the target sequence "abab".
- "((a+)(b+))(c+)" matches the target sequence
"aabbbc" and associates capture group 1 with the subsequence
"aabbb", capture group 2 with the subsequence "aa",
capture group 3 with "bbb", and capture group 4 with the
subsequence "c".
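The numbering of capture groups can be inspected through the match results. This is a sketch under the assumption that the ECMAScript grammar is used via `std::regex` and `std::smatch`; `capture_groups` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the text of capture groups 1..N when
// `pattern` matches the whole of `target`, or an empty vector otherwise.
std::vector<std::string> capture_groups(const std::string& pattern,
                                        const std::string& target) {
    std::vector<std::string> groups;
    std::smatch m;
    if (std::regex_match(target, m, std::regex(pattern))) {
        for (std::size_t i = 1; i < m.size(); ++i) {
            groups.push_back(m[i].str());
        }
    }
    return groups;
}
```

Calling `capture_groups("((a+)(b+))(c+)", "aabbbc")` should yield "aabbb", "aa", "bbb", and "c", in that order, which follows the order of the left parentheses.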
Character Class
A character class in a bracket
expression adds all the characters in the named class to the character
set defined by the bracket expression. To create a character class, use
"[:" followed by the name of the class followed by ":]".
Internally, names of character classes are recognized by calling
id = traits.lookup_classname.
A character ch belongs to such a class if
traits.isctype(ch, id) returns true. The default
regex_traits template supports the following class names:
- "alnum" -- lowercase letters, uppercase letters,
and digits;
- "alpha" -- lowercase letters and uppercase letters;
- "blank" -- space or tab;
- "cntrl" -- the
file format escape characters;
- "digit" -- digits;
- "graph" -- lowercase letters, uppercase letters,
digits, and punctuation;
- "lower" -- lowercase letters;
- "print" -- lowercase letters, uppercase letters,
digits, punctuation, and space;
- "punct" -- punctuation;
- "space" -- space;
- "upper" -- uppercase characters;
- "xdigit" -- digits, 'a', 'b', 'c', 'd', 'e', 'f',
'A', 'B', 'C', 'D', 'E', 'F';
- "d" -- same as digit;
- "s" -- same as space;
- "w" -- same as alnum, plus '_'.
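The lookup described above can be reproduced directly with the default traits class. This is a sketch assuming `std::regex_traits<char>` and the default "C" locale; `in_class` is a hypothetical name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if `ch` belongs to the named character class
// according to the default regex_traits.
bool in_class(char ch, const std::string& class_name) {
    std::regex_traits<char> traits;
    // lookup_classname turns the name into a class identifier...
    auto id = traits.lookup_classname(class_name.begin(), class_name.end());
    // ...and isctype tests a character against that identifier.
    return traits.isctype(ch, id);
}
```

For example, `in_class('a', "lower")` should be true and `in_class('B', "lower")` false.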
Character Range
A character range in a bracket
expression adds all the characters in the range to the character set
defined by the bracket expression. To create a character range put the
character '-' between the first and last characters in the range. This
puts all the characters whose numeric value is greater than or equal to
the numeric value of the first character and less than or equal to the
numeric value of the last character into the set. Note that this set of
added characters depends on the platform-specific representation of characters.
If the character '-' occurs at the beginning or end of a bracket expression
or as the first or last character of a character range it represents itself.
Examples:
- "[0-7]" represents the set of characters { '0', '1',
'2', '3', '4', '5', '6', '7' }. It matches the target sequences
"0", "1", etc., but not "a".
- "[h-k]" represents the set of characters { 'h', 'i',
'j', 'k' } on systems that use the ASCII character encoding; it matches
the target sequences "h", "i", etc., but not
"\x8A" or "0".
- "[h-k]" represents the set of characters { 'h', 'i',
'\x8A', '\x8B', '\x8C', '\x8D', '\x8E', '\x8F', '\x90', 'j', 'k' } on
systems that use the EBCDIC character encoding ('h' is encoded as 0x88
and 'k' is encoded as 0x92). It matches the target sequences
"h", "i", "\x8A", etc., but not "0".
- "[-0-24]" represents the set of characters { '-', '0',
'1', '2', '4' }.
- "[0-2-]" represents the set of characters { '0', '1',
'2', '-' }.
- "[+--]" on systems that use ASCII represents the set of
characters { '+', ',', '-' }.
When using
locale-sensitive ranges, however,
the characters in a range are determined by the collation rules for the
locale. Characters that collate after the first character in the definition
of the range and before the last character in the definition of the range
are in the set, as are the two end characters.
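The numeric-value ranges described above can be tried out by filtering candidate characters through a bracket expression. The sketch assumes the ECMAScript grammar, an ASCII execution character set, and no locale-sensitive collation; `keep_matching` is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the characters of `candidates` that are
// matched by the bracket expression `bracket`.
std::string keep_matching(const std::string& bracket,
                          const std::string& candidates) {
    std::regex re(bracket);
    std::string kept;
    for (char ch : candidates) {
        if (std::regex_match(std::string(1, ch), re)) {
            kept += ch;
        }
    }
    return kept;
}
```

On an ASCII system, `keep_matching("[h-k]", "ghijkl")` should keep "hijk", and `keep_matching("[^0-7]", "0789a")` should keep "89a".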
Collating Element
A collating element is a
multi-character sequence that is treated as a single character.
It can contain any characters except '.', '=', or ':'.
Collating Symbol
A collating symbol in a bracket
expression adds a collating element
to the set defined by the bracket expression. To create a collating symbol,
use "[." followed by the collating element followed by
".]".
Control Escape Sequence
A control escape sequence is
a backslash followed by the letter 'c' followed by one of the letters 'a'
through 'z' or 'A' through 'Z'. It matches the ASCII control character named
by that letter.
For example,
- "\ci" matches the target sequence "\x09",
because <ctrl-i> has the value 0x09.
A DSW Character Escape is a Short Name for a Character Class
| Escape Sequence | Equivalent Named Class | Default Named Class |
| "\d" | "[[:d:]]" | "[[:digit:]]" |
| "\D" | "[^[:d:]]" | "[^[:digit:]]" |
| "\s" | "[[:s:]]" | "[[:space:]]" |
| "\S" | "[^[:s:]]" | "[^[:space:]]" |
| "\w" | "[[:w:]]" | "[a-zA-Z0-9_]"* |
| "\W" | "[^[:w:]]" | "[^a-zA-Z0-9_]"* |
| *ASCII character set |
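One way to see the dsw escapes at work is to count how many characters of a sample text each one matches. This is a sketch assuming the ECMAScript grammar and the default traits; `count_matches` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: number of non-overlapping matches of `pattern`
// found in `text`.
std::ptrdiff_t count_matches(const std::string& pattern,
                             const std::string& text) {
    std::regex re(pattern);
    return std::distance(std::sregex_iterator(text.begin(), text.end(), re),
                         std::sregex_iterator());
}
```

For the text "a1 b2_", `count_matches("\\d", ...)` should give 2 (the same as "[[:digit:]]"), `"\\s"` should give 1, and `"\\w"` should give 5, since the underscore counts as a word character.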
Equivalence Class
An equivalence class in a
bracket expression adds all the characters and
collating elements that are
equivalent to the collating element in the equivalence class definition
to the set defined by the bracket expression. To create an equivalence class,
use "[=" followed by a collating element followed by "=]".
Internally, two collating elements elt1 and elt2
are equivalent if
traits.transform_primary(elt1.begin(), elt1.end()) ==
traits.transform_primary(elt2.begin(), elt2.end()).
File Format Escape
A file format escape consists
of the usual C language character escape sequences, "\\",
"\a", "\b", "\f", "\n", "\r",
"\t", "\v", with their usual meanings, namely, backslash,
alert, backspace, form feed, newline, carriage return, horizontal tab, and
vertical tab, respectively. In ECMAScript "\a" is not allowed.
("\\" is allowed, but technically
it's an identity escape, not a file format escape).
Hexadecimal Escape Sequence
A hexadecimal escape sequence
is a backslash followed by the letter 'x' followed by two hexadecimal digits
(0-9a-fA-F). It matches a character in the target sequence with the value
specified by the two digits.
For example,
- "\x41" matches the target sequence "A" when
the ASCII character encoding is used.
Identity Escape
An identity escape is a backslash
followed by a single character. It matches that character. It is needed
when the character has a special meaning; using the identity escape removes
the special meaning.
For example,
- "a*" matches the target sequence "aaa" but
does not match the target sequence "a*"
- "a\*" does not match the target sequence
"aaa" but does match the target sequence "a*"
The set of characters allowed in an identity escape depends on
the regular expression grammar.
- BRE, grep -- { '(', ')', '{', '}', '.', '[', '\', '*',
'^', '$' }.
- ERE, egrep -- { '(', ')', '{', '}', '.', '[', '\', '*', '^',
'$', '+', '?', '|' }.
- awk -- ERE plus { '"', '/' }.
- ECMAScript -- all characters except those that can be part of an
identifier. Roughly speaking, this is letters, digits, '$', '_', and
unicode escape sequences. For full details see the
ECMAScript Language Specification.
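A common use of identity escapes is to turn arbitrary text into a pattern that matches it literally. The sketch below is one possible approach for the ECMAScript grammar only: it prefixes a backslash to each of the characters ^ $ \ . * + ? ( ) [ ] { } |. The function name `escape_ecmascript` is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns `literal` with an identity escape in front
// of every character that is special in the ECMAScript grammar.
std::string escape_ecmascript(const std::string& literal) {
    static const std::string special = "^$\\.*+?()[]{}|";
    std::string escaped;
    for (char ch : literal) {
        if (special.find(ch) != std::string::npos) {
            escaped += '\\';
        }
        escaped += ch;
    }
    return escaped;
}
```

With this helper, `std::regex(escape_ecmascript("a*"))` is expected to match the target sequence "a*" but not "aaa".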
Individual Character
An individual character in a
bracket expression adds that character to the character set defined by the
bracket expression. A '^' anywhere other than at the beginning of a bracket
expression represents itself.
Examples:
- "[abc]" matches the target sequences "a",
"b", and "c" but not the sequence "d".
- "[^abc]" matches the target sequence "d",
but not "a", "b", or "c".
- "[a^bc]" matches the target sequences "a",
"b", "c", and "^" but not the sequence
"d".
In all the regular expression grammars except ECMAScript if a ']'
is the first character following the opening '[' or the first character
following an initial '^' it represents itself.
Examples:
- "[]a" is invalid, because there is no ']' to end the
bracket expression.
- "[]abc]" matches the target sequences "a",
"b", "c", and "]" but not the sequence
"d".
- "[^]abc]" matches the target sequence "d",
but not "a", "b", "c", or "]".
In ECMAScript use '\]' to represent the character ']' in a bracket
expression.
Examples:
- "[]a" matches the target sequence "a" because
the bracket expression is empty.
- "[\]abc]" matches the target sequences "a",
"b", "c", and "]" but not the sequence
"d".
Negative Assert
A negative assert matches
anything but its contents; it does not consume any characters in the
target sequence.
For example,
- "(?!aa)(a*)" matches the target sequence
"a" and associates capture group 1 with the subsequence
"a". It does not match the target sequence "aa" or the
target sequence "aaa".
Negative Word Boundary Assert
A negative word boundary assert
matches if the current position in the target string is not immediately after a
word boundary.
Non-capture Group
A non-capture group
marks its contents as a single unit in the regular expression grammar, but
does not label the target text.
For example,
- "(a)(?:b)*(c)" matches the target text "abbc" and
associates capture group 1 with the subsequence "a" and capture
group 2 with the subsequence "c".
Non-greedy Repetition
A non-greedy repetition consumes
the shortest subsequence of the target sequence that matches the pattern.
A greedy repetition consumes the longest.
For example,
- "(a+)(a*b)" matches the target sequence "aaab".
Because the repetition "a+" is greedy, it associates
capture group 1 with the subsequence "aaa" and
capture group 2 with the subsequence "b". The non-greedy form
"(a+?)(a*b)" also matches the target sequence "aaab", but associates
capture group 1 with the subsequence "a" at the
beginning of the target sequence and capture group 2 with the subsequence
"aab" at the end of the target sequence.
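The difference between the two kinds of repetition shows up in the capture groups. In the sketch below the non-greedy form is written by appending '?' to the repetition count, as in "a+?". It assumes the ECMAScript grammar, and `split_groups` is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns capture groups 1 and 2 when `pattern`
// matches the whole of `target`, or two empty strings otherwise.
std::pair<std::string, std::string> split_groups(const std::string& pattern,
                                                 const std::string& target) {
    std::smatch m;
    if (!std::regex_match(target, m, std::regex(pattern)) || m.size() < 3) {
        return {};
    }
    return {m[1].str(), m[2].str()};
}
```

For the target sequence "aaab", the greedy pattern "(a+)(a*b)" should yield ("aaa", "b") and the non-greedy pattern "(a+?)(a*b)" should yield ("a", "aab").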
Octal Escape Sequence
An octal escape sequence
is a backslash followed by one, two, or three octal digits (0-7).
It matches a character in the target sequence with the value specified by those
digits. If all the digits are '0' the sequence is invalid.
For example,
- "\101" matches the target sequence "A" when
the ASCII character encoding is used.
Ordinary Character
An ordinary character is any
valid character that doesn't have a special meaning in the current grammar.
In ECMAScript the characters that have special meanings are:
^ $ \ . * + ? ( ) [ ] { } |
In BRE and grep the characters that have special meanings are:
. [ \
In addition, the following characters have special meanings when used in
a particular context:
- '*' has a special meaning in all cases except when it
is the first character in a regular expression or the first character following
an initial '^' in a regular expression, or when it is the first
character of a capture group or the first character following an initial
'^' in a capture group.
- '^' has a special meaning when it is the first character of a
regular expression.
- '$' has a special meaning when it is the last character of a
regular expression.
In ERE, egrep, and awk the following characters have
special meanings:
. [ \ ( * + ? { |
In addition, the following characters have special meanings when used in
a particular context.
- ')' has a special meaning when it matches a preceding '('.
- '^' has a special meaning when it is the first character of a
regular expression.
- '$' has a special meaning when it is the last character of a
regular expression.
An ordinary character matches the same character in the target sequence.
By default this means that the match succeeds if the two characters are
represented by the same value. In a
case-insensitive match two
characters ch0 and ch1 match
if traits.translate_nocase(ch0) == traits.translate_nocase(ch1).
In a locale-sensitive match
two characters ch0 and ch1 match if
traits.translate(ch0) == traits.translate(ch1).
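A case-insensitive match can be sketched as follows, assuming it is requested with the `std::regex::icase` flag and that the default `std::regex_traits<char>` is in use. Both function names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Whole-sequence match that ignores case (ECMAScript grammar).
bool matches_ignoring_case(const std::string& pattern,
                           const std::string& target) {
    std::regex re(pattern, std::regex::ECMAScript | std::regex::icase);
    return std::regex_match(target, re);
}

// The character comparison that underlies a case-insensitive match.
bool same_ignoring_case(char ch0, char ch1) {
    std::regex_traits<char> traits;
    return traits.translate_nocase(ch0) == traits.translate_nocase(ch1);
}
```

Here `matches_ignoring_case("abc", "aBc")` should succeed, whereas the same pattern compiled without `icase` does not match "aBc".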
Positive Assert
A positive assert matches its
contents, but does not consume any characters in the target sequence.
Examples:
- "(?=aa)(a*)" matches the target sequence "aaaa"
and associates capture group 1 with the subsequence "aaaa".
- In contrast, "(aa)(a*)" matches the target sequence
"aaaa" and associates capture group 1 with the subsequence
"aa" at the beginning of the target sequence and capture group 2
with the subsequence "aa" at the end of the target sequence.
- "(?=aa)(a)|(a)" matches the target sequence "a"
and associates capture group 1 with an empty sequence (because the positive
assert failed) and capture group 2 with the subsequence "a".
A search for it in the target sequence "aa" succeeds at the first character
and associates capture group 1 with the subsequence "a" (because the positive
assert succeeded) and capture group 2 with an empty sequence.
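Because asserts consume no characters, their effect is easiest to see in what the following capture group receives. The sketch covers both positive and negative asserts, assuming the ECMAScript grammar; `match_group1` is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: if `pattern` matches the whole of `target`, stores
// the text of capture group 1 in `group1` and returns true.
bool match_group1(const std::string& pattern, const std::string& target,
                  std::string& group1) {
    std::smatch m;
    if (!std::regex_match(target, m, std::regex(pattern))) {
        return false;
    }
    group1 = m[1].str();  // empty if group 1 did not participate
    return true;
}
```

With "(?=aa)(a*)" and the target sequence "aaaa", group 1 should receive all four characters, since the assert itself consumed none. With "(?!aa)(a*)", the target sequence "a" should match and "aa" should not.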
Unicode Escape Sequence
A unicode escape sequence
is a backslash followed by the letter 'u' followed by four
hexadecimal digits (0-9a-fA-F). It matches a character in the target sequence
with the value specified by the four digits.
For example,
- "\u0041" matches the target sequence "A" when
the ASCII character encoding is used.
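Unicode escapes, like the two-digit hexadecimal escapes, can be checked against a known character. The sketch assumes the ECMAScript grammar, narrow `char` targets, and an ASCII encoding; `matches_letter_A` is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches the one-character
// target sequence "A".
bool matches_letter_A(const std::string& pattern) {
    return std::regex_match(std::string("A"), std::regex(pattern));
}
```

On an ASCII system both `matches_letter_A("\\u0041")` and `matches_letter_A("\\x41")` should be true, while `matches_letter_A("\\x61")` (lowercase 'a') should be false.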
Wildcard Character
A wildcard character matches any
character in the target sequence except a newline.
Word Boundary
A word boundary occurs in the following
situations:
- the current character is at the beginning of the target sequence and
the current character is one of the word characters A-Za-z0-9_
- the current character position is past the end of the target sequence
and the last character in the target sequence is one of the word characters
- the current character is one of the word characters and the
preceding character is not
- the current character is not one of the word characters and the
preceding character is.
Word Boundary Assert
A word boundary assert
matches if the current position in the target string is immediately after a
word boundary.
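A typical use of the word boundary assert is to find a word only where it stands alone. This is a sketch assuming the ECMAScript grammar; `whole_word_count` is a hypothetical helper, and it assumes that `word` consists solely of word characters so that no escaping is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts occurrences of `word` in `text` that have a
// word boundary on both sides.
std::ptrdiff_t whole_word_count(const std::string& word,
                                const std::string& text) {
    // Assumes `word` holds only A-Za-z0-9_ characters.
    std::regex re("\\b" + word + "\\b");
    return std::distance(std::sregex_iterator(text.begin(), text.end(), re),
                         std::sregex_iterator());
}
```

In the text "cat concat cat_ cat" only the first and last "cat" qualify: the one inside "concat" is preceded by a word character, and the one in "cat_" is followed by one.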
For a regular expression to match a target
sequence, the entire regular expression must match the entire target
sequence.
For example:
- the regular expression "bcd" matches the target sequence
"bcd" but does not match the target sequence "abcd" nor
the target sequence "bcde".
For a regular expression search to succeed
there must be a subsequence somewhere in the target sequence that matches
the regular expression. The search ordinarily finds the leftmost matching
subsequence.
Examples:
- A search for the regular expression "bcd" in the target
sequence "bcd" succeeds and matches the entire sequence; the same
search in the target sequence "abcd" also succeeds and matches the
last three characters; the same search in the target sequence "bcde"
also succeeds, and matches the first three characters.
- A search for the regular expression "bcd" in the target
sequence "bcdbcd" succeeds and matches the first three characters.
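The distinction between matching and searching corresponds, in the sketch below, to `std::regex_match` and `std::regex_search`. That these are the entry points in use is an assumption, and `search_offset` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the offset of the leftmost subsequence of
// `target` matched by `pattern`, or -1 if the search fails.
std::ptrdiff_t search_offset(const std::string& pattern,
                             const std::string& target) {
    std::smatch m;
    if (!std::regex_search(target, m, std::regex(pattern))) {
        return -1;
    }
    return m.position(0);
}
```

For the pattern "bcd", `search_offset` should return 1 for "abcd" and 0 for both "bcde" and "bcdbcd", whereas `std::regex_match` accepts only the target sequence "bcd" itself.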
If there is more than one subsequence that matches at some position in the
target sequence there are two ways to choose the matching pattern.
First match chooses the subsequence that was
found first when matching the regular expression.
Longest match chooses the longest
subsequence from the ones that match at that point. If there is more than one
subsequence with the maximal length, longest match chooses the subsequence that
was found first.
For example:
- a search for the regular expression "b|bc" in the target
sequence "abcd" matches the subsequence "b" with first
match, because the left-hand term of the alternation matched that subsequence
and there was no need to try the right-hand term of the alternation; the same
search matches "bc" with longest match, because "bc" is
longer than "b".
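The two selection rules can be compared by running the same search under two grammars. The sketch assumes that `std::regex::ECMAScript` applies first match and that `std::regex::extended` applies longest match, as described above. `first_hit` is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the subsequence found by searching `target`
// for `pattern` under the given grammar, or an empty string on failure.
std::string first_hit(const std::string& pattern, const std::string& target,
                      std::regex::flag_type grammar) {
    std::smatch m;
    if (!std::regex_search(target, m, std::regex(pattern, grammar))) {
        return std::string();
    }
    return m.str(0);
}
```

Searching "abcd" for "b|bc" should give "b" with the ECMAScript grammar and "bc" with the ERE grammar.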
For example, with a partial match:
- "ab" matches the target sequence "a" but not
"ac".
Specifying Replacement Text for ECMAScript and sed
| ECMAScript format rules | sed format rules | Replacement text |
| "$&" | "&" | The character sequence that matched the entire regular expression ([match[0].first, match[0].second)) |
| "$$" | | "$" |
| | "\&" | "&" |
| "$`" (dollar sign followed by back quote) | | The character sequence that precedes the subsequence that matched the regular expression ([match.prefix().first, match.prefix().second)) |
| "$'" (dollar sign followed by forward quote) | | The character sequence that follows the subsequence that matched the regular expression ([match.suffix().first, match.suffix().second)) |
| "$n" | "\n" | The character sequence that matched the nth (0 <= n <= 9) capture group ([match[n].first, match[n].second)) |
| | "\\n" | "\n" |
| "$nn" | | The character sequence that matched the nnth (10 <= nn <= 99) capture group ([match[nn].first, match[nn].second)) |
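The two sets of format rules can be tried side by side. The sketch assumes that replacement is done with `std::regex_replace`, that the ECMAScript rules are the default, and that the sed rules are selected with `std::regex_constants::format_sed`. The helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Replaces every match of `pattern` in `text` using ECMAScript format rules.
std::string replace_ecma(const std::string& text, const std::string& pattern,
                         const std::string& fmt) {
    return std::regex_replace(text, std::regex(pattern), fmt);
}

// Same replacement, but the format string follows sed format rules.
std::string replace_sed(const std::string& text, const std::string& pattern,
                        const std::string& fmt) {
    return std::regex_replace(text, std::regex(pattern), fmt,
                              std::regex_constants::format_sed);
}
```

For example, `replace_ecma("john smith", "(\\w+) (\\w+)", "$2 $1")` and `replace_sed("john smith", "(\\w+) (\\w+)", "\\2 \\1")` should both produce "smith john".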
Copyright © 1992-2013
by Dinkumware, Ltd. All rights reserved.