Regular expressions are a widely used method of specifying
patterns of text to search for. Special metacharacters allow you to
specify, for instance, that the string you are looking for occurs at the
beginning or end of a line, or contains n recurrences of a certain
character.
Any single character matches itself, unless it is a metacharacter with
a special meaning as described below. A series of characters matches that
series of characters in the target string, so the pattern "bluh" would
match "bluh" in the target string.
You can cause characters that normally function as metacharacters or
escape sequences to be interpreted literally by escaping them, that is, by
preceding them with a backslash "\". For instance, the metacharacter
"^" matches the beginning of a string, but "\^" matches the
character "^", "\\" matches "\", and so on.
Characters may be specified using an escape sequence syntax much like
that used in C and Perl: "\n" matches a newline, "\t" a tab,
etc. More generally, \xnn, where nn is a string of hexadecimal digits,
matches the character whose ASCII value is nn. If you need a wide (Unicode)
character code, you can use "\x{nnnn}", where nnnn is one or more
hexadecimal digits.
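The escaping rules above can be tried out with a short sketch. It uses
Python's re module purely for illustration, which is an assumption: the
engine described here may differ in details. In particular, Python requires
exactly two hexadecimal digits after \x and has no \x{nnnn} form, so \uNNNN
stands in for it below. All variable names are hypothetical.

```python
import re

# "^" is a metacharacter (start of string); "\^" matches a literal caret.
anchored = re.search(r"^a", "a^b")          # matches the "a" at index 0
literal_caret = re.search(r"\^", "a^b")     # matches the "^" at index 1

# "\\" in the pattern matches one backslash in the target string.
literal_backslash = re.search(r"\\", "C:\\temp")

# Escape sequences: "\t" is a tab, "\x41" is the character with code 0x41.
tab_match = re.search(r"\t", "col1\tcol2")
hex_match = re.search(r"\x41", "xxAxx")

# Assumption: "\u20ac" is Python's spelling of a wide character code;
# the syntax described in the text would be "\x{20ac}".
wide_match = re.search(r"\u20ac", "price: 5\u20ac")

print(literal_caret.start(), literal_backslash.start(), hex_match.group())
```

Writing the patterns as raw strings (the r prefix) keeps Python's own string
escaping from consuming the backslashes before the regex engine sees them.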
You can specify a character class by enclosing a list of characters in
square brackets [], which will match any one character from the list.
If the first character after the "[" is "^", the class
matches any character not in the list.
Within a list, the "-" character is used to specify a range, so
that a-z represents all characters between "a" and "z",
inclusive. If you want "-" itself to be a member of a class, put it at the
start or end of the list, or escape it with a backslash. If you want "]"
itself to be a member, place it at the start of the list or escape it with
a backslash.
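A minimal sketch of the character class rules above, again assuming
Python's re module as a stand-in engine. The sample strings and variable
names are made up for illustration.

```python
import re

# A class matches any one character from the list.
vowels = re.findall(r"[aeiou]", "regular")

# "^" as the first character negates the class.
non_digits = re.findall(r"[^0-9]", "a1b2")

# "-" between two characters denotes an inclusive range.
lowercase = re.findall(r"[a-z]", "aZb-c")

# "-" at the end of the list is a literal member of the class.
lowercase_or_hyphen = re.findall(r"[a-z-]", "aZb-c")

# "]" at the start of the list, or escaped, is a literal member.
bracket_first = re.findall(r"[]a]", "a]b")
bracket_escaped = re.findall(r"[\]a]", "a]b")

print(vowels, non_digits, lowercase, lowercase_or_hyphen, bracket_first)
```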
Metacharacters are special characters which are the essence of Regular
Expressions. There are different types of metacharacters, as described
below.
A word boundary (\b) is a spot between two characters that has a \w on
one side of it and a \W on the other side of it (in either order),
counting the imaginary characters off the beginning and end of the string
as matching a \W.
Any item of a regular expression may be followed by another type of
metacharacter, the iterator. Using these metacharacters you can specify
the number of occurrences of the preceding character, metacharacter or
subexpression.
Digits in curly brackets of the form {n,m} specify the minimum
number of times to match the item, n, and the maximum, m. The form {n} is
equivalent to {n,n} and matches exactly n times. The form {n,} matches n
or more times. There is no limit to the size of n or m, but large numbers
will use more memory and slow down regular expression execution.
If a curly bracket occurs in any other context, it is treated as a
regular character.
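The word boundary and iterator rules above can be sketched as follows.
Python's re module is assumed here for illustration only, and the test
strings and variable names are hypothetical.

```python
import re

# \b sits between a \w and a \W character; string edges count as \W.
whole_words = re.findall(r"\bcat\b", "cat concat cat's")

# {n} matches exactly n times.
exactly_three = re.fullmatch(r"a{3}", "aaa")
not_three = re.fullmatch(r"a{3}", "aa")

# {n,m} matches at least n and at most m times.
two_to_four = re.findall(r"a{2,4}", "a aa aaaaa")

# {n,} matches n or more times.
long_numbers = re.findall(r"[0-9]{2,}", "7 42 2024")

# A curly bracket that does not form an iterator is an ordinary character.
literal_brace = re.search(r"a{x}", "a{x}")

print(whole_words, two_to_four, long_numbers, literal_brace.group())
```

In the {2,4} case the run of five "a" characters yields one match of four;
the single "a" left over is too short to match again.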
You can specify a series of alternatives for a pattern using "|"
to separate them, so that fee|fie|foe will match any of "fee",
"fie", or "foe" in the target string (as would f(e|i|o)e). The
first alternative includes everything from the last pattern delimiter
("(", "[", or the beginning of the pattern) up to the first
"|", and the last alternative contains everything from the last
"|" to the next pattern delimiter. For this reason, it's common
practice to enclose alternatives in parentheses to minimize confusion
about where they start and end.
Alternatives are tried from left to right, so the first alternative
for which the entire expression matches is the one that is chosen.
This means that alternatives are not necessarily greedy. For example, when
matching foo|foot against "barefoot", only the "foo" part will
match, as that is the first alternative tried, and it successfully matches
the target string. (This might not seem important, but it is important
when you are capturing matched text using parentheses.)
Also remember that "|" is interpreted as a literal within square
brackets, so if you write [fee|fie|foe] you're really only matching [feio|].
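The left-to-right behaviour of alternatives described above is easy to
check with a small sketch. It assumes Python's re module, whose alternation
follows the same first-match rule; the variable names are hypothetical.

```python
import re

text = "fee fie foe fum"

# Plain alternation and the parenthesized form find the same words.
plain = re.findall(r"fee|fie|foe", text)
grouped = [m.group(0) for m in re.finditer(r"f(e|i|o)e", text)]

# The first alternative that lets the whole expression match wins,
# even when a later alternative would match more text.
first_wins = re.search(r"foo|foot", "barefoot").group(0)
longer_first = re.search(r"foot|foo", "barefoot").group(0)

# Inside square brackets "|" is just another member of the class.
class_members = re.findall(r"[fee|fie|foe]", "f|x")

print(plain, grouped, first_wins, longer_first, class_members)
```

Listing the longer alternative first, as in foot|foo, is one way to get
the longer match when both alternatives could succeed.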
The bracketing construct ( ... ) may also be used for defining
subexpressions of a regular expression. Subexpressions are numbered in the
left-to-right order of their opening parentheses. The first subexpression
has the number 1.
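A short sketch of the numbering rule above, assuming Python's re module
(where group 0 refers to the whole match, a convention not stated in the
text above). The pattern and input are made up for illustration.

```python
import re

# Opening parentheses, left to right:
#   1: ((...)-(...))   the whole range
#   2: ([0-9]+)        first number
#   3: ([0-9]+)        second number
#   4: ([a-z]+)        the unit
m = re.match(r"(([0-9]+)-([0-9]+)) ([a-z]+)", "10-20 units")

whole = m.group(0)    # entire match
outer = m.group(1)    # opened first, so numbered 1 even though it nests 2 and 3
low = m.group(2)
high = m.group(3)
unit = m.group(4)

print(outer, low, high, unit)
```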