regex
Regular Expressions
Basic Expressions
The simplest regular expressions are alphanumeric strings. For example, /pattern/ matches the substring "pattern".
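As a minimal sketch of the idea, the sed command below prints only the lines that contain the literal substring. The function name find_pattern and the sample lines are made up for illustration.

```shell
# Print only the lines that contain the substring "pattern".
# -n suppresses sed's default output; the p command prints matching lines.
find_pattern() {
  sed -n '/pattern/p'
}

# Only the first and third lines contain "pattern".
printf '%s\n' 'a pattern here' 'no match' 'patterns too' | find_pattern
```

Note that the match is a substring match, so "patterns too" is printed as well.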
Wild Cards and Quantifiers
The character . matches any single character, so for example, the pattern /.ed/ matches all of "sed", "bed", and "1ed".
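The wildcard can be tried out with a short sed sketch like the one below. The helper name match_dot_ed is hypothetical, and the fourth input line is added only to show that . must consume exactly one character.

```shell
# Print the lines that contain any single character followed by "ed".
match_dot_ed() {
  sed -n '/.ed/p'
}

# "sed", "bed" and "1ed" are printed; a bare "ed" is not,
# because there is no character in front of the "e" for . to match.
printf '%s\n' sed bed 1ed ed | match_dot_ed
```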
Quantifiers match the preceding expression multiple times. There are several quantifier expressions:
- * matches the preceding expression 0 or more times
- \? matches the preceding expression 0 or 1 times (a GNU extension to basic regular expressions, where an unescaped ? is a literal question mark)
- \{n\} matches the preceding expression n times, where n is a number
- \{m,n\} matches the preceding expression m to n times, where m and n are numbers
Combine wild cards with quantifiers to match arbitrary strings, e.g., /.*ed/ matches all of "sed", "ed", and "foo bar baz ed".
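One way to see what a quantifier actually consumed is to wrap the match in angle brackets using sed's substitute command, where & in the replacement stands for the matched text. This is only an illustrative sketch: the helper name mark and the sample strings are assumptions, and the helper breaks if the pattern itself contains a slash.

```shell
# Wrap the first match of the basic regular expression $1 in <...>.
mark() {
  sed "s/$1/<&>/"
}

printf '%s\n' 'ct'             | mark 'ca*t'       # <ct>      (zero a's)
printf '%s\n' 'caaat'          | mark 'ca*t'       # <caaat>   (three a's)
printf '%s\n' 'caaaat'         | mark 'a\{2\}'     # c<aa>aat  (exactly two)
printf '%s\n' 'caaaat'         | mark 'a\{2,3\}'   # c<aaa>at  (as many as allowed)
printf '%s\n' 'foo bar baz ed' | mark '.*ed'       # <foo bar baz ed>
```

The last two lines also show that quantifiers are greedy: they take as many repetitions as the rest of the pattern permits.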
Grouping
You can cause a regular expression to be treated as a single unit for quantifiers
by surrounding it with \(...\). For example, /\(foo\)\{3\}/ matches "foofoofoo".
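The difference grouping makes can be sketched by comparing the grouped pattern with an ungrouped one, where the quantifier applies only to the final o. The helper mark (hypothetical, defined here for illustration) wraps the first match in angle brackets and leaves non-matching lines unchanged.

```shell
# Wrap the first match of the basic regular expression $1 in <...>.
mark() {
  sed "s/$1/<&>/"
}

# Grouped: the whole string "foo" is repeated three times.
printf '%s\n' 'foofoofoo' | mark '\(foo\)\{3\}'   # <foofoofoo>

# Ungrouped: \{3\} applies only to the last "o", so the pattern
# means "fo" followed by "ooo" and does not match "foofoofoo".
printf '%s\n' 'foofoofoo' | mark 'foo\{3\}'       # foofoofoo (unchanged)
printf '%s\n' 'foooo'     | mark 'foo\{3\}'       # <foooo>
```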
Character Sets
Sometimes . is too powerful a wildcard, and what you really want is to match
some characters but not others. Character sets let you do that. A character
set is composed of square brackets around the characters you want to match.
For example, [AEIOU] matches any uppercase vowel. You can also invert the
character set with a ^ after the [, e.g., [^AEIOU] matches anything BUT an
uppercase vowel. For convenience, you may also list character ranges, e.g.,
[A-Z] or [0-9].
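A quick way to see which characters a set covers is to replace every match with an underscore. The sketch below assumes a helper named mask (not part of sed). It sets LC_ALL=C because ranges such as [A-Z] can behave differently in other locales.

```shell
# Replace every character matched by the set in $1 with "_".
# LC_ALL=C makes ranges like [A-Z] follow plain ASCII order.
mask() {
  LC_ALL=C sed "s/$1/_/g"
}

printf '%s\n' 'HELLO World' | mask '[AEIOU]'    # H_LL_ World
printf '%s\n' 'HELLO'       | mask '[^AEIOU]'   # _E__O
printf '%s\n' 'Hello World' | mask '[A-Z]'      # _ello _orld
printf '%s\n' 'room 101'    | mask '[0-9]'      # room ___
```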
Beginning and End of a String
You can match the start and end of a string with ^ and $, respectively. As
we will see with sed, regular expressions are often evaluated line
by line over a file, and in these instances ^ and $ refer to the start and
end of the line.
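Since sed works line by line, the anchors can be illustrated by filtering a few sample lines. The helper name lines_matching and the inputs are assumptions for this sketch.

```shell
# Print the input lines that match the basic regular expression $1.
lines_matching() {
  sed -n "/$1/p"
}

# Lines that start with "foo": prints "foo bar" and "foo".
printf '%s\n' 'foo bar' 'bar foo' 'foo' | lines_matching '^foo'

# Lines that end with "foo": prints "bar foo" and "foo".
printf '%s\n' 'foo bar' 'bar foo' 'foo' | lines_matching 'foo$'

# Lines that consist of exactly "foo": prints "foo".
printf '%s\n' 'foo bar' 'bar foo' 'foo' | lines_matching '^foo$'
```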
Alternation
You can match one of several alternatives with \|. For example, /foo\|bar/ matches
both foo and bar. If you only want part of the regular expression to have
alternatives, you can use \(...\). For example, /\(Li\|U\)nix/ matches both Linux
and Unix.
Programs other than sed
The above guide describes POSIX Basic Regular Expressions, which are used by sed.
Many regular expression tools, such as egrep and Python's re module,
use a syntax based on "extended" regular expressions.
In this style of regular expression, there is no backslash before |, (, ), {, or },
so one must write /(Li|U)nix/ instead of /\(Li\|U\)nix/. Instead, putting a backslash before
a nonalphanumeric character always matches that character only.
Although there are standards for regular expressions, in practice there is a lot of minor variation in the features and syntax that regular expression libraries support, so consult the documentation for the tool you are using.