Regular expressions

Regular expressions provide a powerful tool for separating and filtering data. This topic is very complex and, in the beginning, quite challenging.

Regular expressions let you describe text patterns rather than exact strings. They are built from literals and metacharacters.

Literals are all characters that have no special meaning in a regular expression and are searched for exactly as written. A sequence of literals is the simplest form of search pattern. Letters and digits, for example, are literals.

A metacharacter, in contrast, is a symbol with a special meaning in a regular expression (separator, operator or wildcard character).

Regular expressions are essentially search patterns that are applied to strings and determine whether the pattern matches a string (match) or not (mismatch). For example, the search pattern ay matches the string way because way contains the string ay. It does not match the string deep.
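The match and mismatch described above can be tried out in a few lines. This is a minimal sketch that assumes Python's re module as the engine; the helper name matches is hypothetical, and other engines may differ in details.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern is found anywhere in text."""
    # re.search scans the whole string for the first place the pattern fits.
    return re.search(pattern, text) is not None

print(matches("ay", "way"))   # True: "way" contains "ay"
print(matches("ay", "deep"))  # False: "deep" does not contain "ay"
```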

If a search pattern is applied to a set of strings, you will receive two subsets, namely the set of all strings for which the pattern matches and the set for which the pattern does not match. Usually you will be interested in one of the two subsets.

Find all names that begin with an A.
Find all characters that are not to the right of a comment character.
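Both tasks can be sketched as follows, assuming Python's re module. The sample names, the function names and the choice of # as the comment character are made up for illustration.

```python
import re

def partition(pattern: str, strings: list) -> tuple:
    """Split strings into (matching, non_matching) for the given pattern."""
    compiled = re.compile(pattern)
    matching = [s for s in strings if compiled.search(s)]
    non_matching = [s for s in strings if not compiled.search(s)]
    return matching, non_matching

def before_comment(line: str) -> str:
    """Return the characters that are not to the right of the first '#'."""
    # [^#]* takes as many non-'#' characters as possible from the line start.
    return re.match(r"[^#]*", line).group(0)

names = ["Anna", "Bob", "Alice", "Carla"]  # hypothetical sample data
with_a, without_a = partition(r"^A", names)
print(with_a)     # ['Anna', 'Alice']
print(without_a)  # ['Bob', 'Carla']
print(repr(before_comment("x = 1  # set x")))  # 'x = 1  '
```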

The characters used are certainly familiar to you if you have been working with a computer for any length of time. The following will show you which characters are used and which meaning or function is assigned to them.

You can also combine the special characters in regular expressions. For example, you can search for any string by combining the dot wildcard character with the asterisk repeat character:

.*

This expression is usually used as part of a larger expression, such as

b.*ing

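This expression matches all strings in which a b is followed, at any later position, by ing. To match only strings that start with b and end with ing, anchor the pattern as ^b.*ing$ (see the caret and string character below).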
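How much of the string the pattern has to cover depends on how it is applied. The following sketch assumes Python's re module, where search finds the pattern anywhere in the string; the word list is invented for illustration.

```python
import re

words = ["bring", "boring", "rubbing", "king", "big"]  # hypothetical sample data

# Unanchored: a "b" followed somewhere later by "ing" is enough.
unanchored = re.compile(r"b.*ing")
found_anywhere = [w for w in words if unanchored.search(w)]

# Anchored with ^ and $: the string must start with "b" and end with "ing".
anchored = re.compile(r"^b.*ing$")
found_whole = [w for w in words if anchored.search(w)]

print(found_anywhere)  # ['bring', 'boring', 'rubbing']
print(found_whole)     # ['bring', 'boring']
```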

Regular expressions are usually implemented using a finite automaton. Computer science provides methods for automatically generating an automaton that determines whether or not a string matches a given search pattern. Regular expression engines work in the same way.

When a regular expression is first used, an automaton is generated internally (the pattern is compiled) and then applied. When the same search pattern is used at a later time this automaton could possibly be used again, which is significantly faster.
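Many libraries let you perform the compile step explicitly and keep the result. A minimal sketch, assuming Python's re module (the sample values are hypothetical):

```python
import re

# Compile once; the compiled object can be applied as often as needed.
digits_only = re.compile(r"^[0-9]+$")

values = ["12345", "12a45", "007"]  # hypothetical sample data
results = [digits_only.search(v) is not None for v in values]
print(results)  # [True, False, True]
```

Python's re module also keeps an internal cache of recently compiled patterns, so the explicit compile mainly helps when the same pattern is applied very often.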

Some search patterns and conditions are too complex to be formulated with regular expressions and automata.

A typical example is something that requires counting:

Find all words that contain the same number of b's as a's.

Also anything that requires preconditions:

Find all occurrences of the word print, but only those that are not inside quotation marks and not part of a comment.

In these cases, you need more powerful concepts and tools: a context-free or context-sensitive grammar and a suitable parser. In CROSSCAP, you can also build complex conditions by connecting multiple filters in series.
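The counting example is easy to solve in ordinary program code, even though it cannot be expressed as a classical regular expression. A minimal Python sketch (the function name and the word list are hypothetical):

```python
def same_number_of_a_and_b(word: str) -> bool:
    """Return True if the word contains as many b's as a's."""
    # Counting needs memory of unbounded size, which a finite automaton lacks.
    return word.count("a") == word.count("b")

words = ["abba", "banana", "cab", "xyz"]  # hypothetical sample data
balanced = [w for w in words if same_number_of_a_and_b(w)]
print(balanced)  # ['abba', 'cab', 'xyz']
```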

In practice, you use regular expressions to decide if a string meets certain formal criteria:

Accept the value only if it contains exclusively digits.

or to cut specific substrings out of strings:

Return the text between start and end from the given string.
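Both uses can be sketched as follows, assuming Python's re module. The function names is_all_digits and text_between, and the bracket markers in the usage lines, are hypothetical.

```python
import re
from typing import Optional

def is_all_digits(value: str) -> bool:
    """Accept the value only if it consists exclusively of digits."""
    return re.fullmatch(r"[0-9]+", value) is not None

def text_between(text: str, start: str, end: str) -> Optional[str]:
    """Return the text between start and end, or None if they do not occur."""
    # re.escape makes sure the markers are treated as literals.
    # (.*) is greedy: with several end markers it extends to the last one.
    pattern = re.escape(start) + r"(.*)" + re.escape(end)
    found = re.search(pattern, text)
    return found.group(1) if found else None

print(is_all_digits("2024"))                # True
print(is_all_digits("20a4"))                # False
print(text_between("value=[42]", "[", "]"))  # 42
```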

Components of regular expressions

Summary

Character	Name	Meaning
\	Backslash	Escape character
^	Caret or hat character	Position in line: beginning of line
$	Dollar sign or string character	Position in line: end of line
.	Dot	Wildcard: any character
|	Pipe character	Branch/Alternation
{ }	Braces (curly brackets)	Quantifiers
?	Question mark	Quantifier: zero or one occurrence of the preceding expression
*	Asterisk	Quantifier: zero or more occurrences of the preceding expression
+	Plus sign	Quantifier: one or more occurrences of the preceding expression
[ ]	Square brackets	Enclose a character set
-	Dash	Range marker

[x-y]	Function	Range: all characters in the given range
\x	Function	Escape: literal use of the metacharacter x
\<xyz	Function	Position in word: beginning of word
xyz\>	Function	Position in word: end of word
[Class]	Function	Character class: any character in the class
[^Class]	Function	Inverse class: all characters that are not given in the class

Search patterns in regular expressions are composed of ordinary characters (literals) and characters with special meanings (metacharacters). Literals in a search expression stand for the corresponding characters in a string.

The search expression hello matches all strings that contain this exact string somewhere.

Metacharacters indicate wildcards for one or more other characters, for beginnings or ends of lines, or for other specific functions. They are what make regular expressions powerful and meaningful.

In regular expressions, there are character sets that are contained within square brackets [ and ]. There are rules for metacharacters that differ depending on whether you are working inside or outside of a character set. Outside of a character set there are the following special rules:

Backslash

The backslash is used as an escape character with various applications. Placed before a metacharacter (in other words, a character that is not alphanumeric), it removes that character's special meaning so that you can use it literally in the search pattern.

The search pattern

\*

matches an asterisk exactly.

Placed before certain alphabetic characters, the backslash gives them a special meaning, similar to the C programming language.

\r

means carriage return (ASCII 13)

\n

means line feed (ASCII 10) and so on.

The strings \d, \D, \w, \W, \s and \S describe certain predefined character sets (see below).

The string \0xYY denotes the character with the hexadecimal character code YY.

\0x0A

indicates a character with the character code ASCII 10 (line feed).

The string

\XXX

denotes the character with the octal character code XXX.

\012

therefore indicates the character with the character code ASCII 10 (line feed).

This syntax is not recommended as it overlaps with another syntax for back references (see below).
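The backslash rules above can be checked with a short sketch. It assumes Python's re module, which is not necessarily the engine described here: Python writes a hexadecimal character code as \xYY rather than \0xYY, so that line is an adaptation.

```python
import re

# An escaped asterisk is a literal asterisk, not a quantifier.
has_asterisk = re.search(r"\*", "2 * 3") is not None   # True
no_asterisk = re.search(r"\*", "2 + 3") is not None    # False

# \r and \n stand for carriage return (ASCII 13) and line feed (ASCII 10).
has_crlf = re.search(r"\r\n", "line one\r\nline two") is not None  # True

# Character codes: hexadecimal (Python notation \xYY) and octal (\012).
hex_line_feed = re.fullmatch(r"\x0A", "\n") is not None    # True
octal_line_feed = re.fullmatch(r"\012", "\n") is not None  # True

# re.escape adds the backslashes needed to use a string literally.
escaped = re.escape("1*2")
print(escaped)  # 1\*2
```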

Caret or hat character

The caret ^ indicates the beginning of a string.

The search pattern ^hello matches all strings that start with the word hello... but not strings that contain ...hello... somewhere in the middle.

String character

Correspondingly, the dollar sign (string character) $ indicates the end of a string.

The search pattern

World!$

matches strings that end with the text World!

The search pattern

^$

matches strings that only have a start and an end, in other words, empty strings.
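The three anchored patterns can be tried out as follows. This is a sketch that assumes Python's re module; the test strings are invented.

```python
import re

starts_with_hello = re.compile(r"^hello")
ends_with_world = re.compile(r"World!$")
empty_string = re.compile(r"^$")

print(starts_with_hello.search("hello there") is not None)  # True
print(starts_with_hello.search("say hello") is not None)    # False
print(ends_with_world.search("Hello World!") is not None)   # True
print(ends_with_world.search("World! Hello") is not None)   # False
print(empty_string.search("") is not None)                  # True
print(empty_string.search("x") is not None)                 # False
```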

Dot

The dot matches exactly one arbitrary character, except the newline character (\n).

The search pattern

...

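therefore matches all strings that contain at least three consecutive characters other than the newline character. To match only strings of exactly three characters, anchor the pattern as ^...$.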
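Whether a longer string also matches depends on whether the pattern has to cover the whole string. The following sketch assumes Python's re module, where search looks for the pattern anywhere and fullmatch requires it to cover the entire string.

```python
import re

three = re.compile(r"...")

# Searching anywhere: any run of three non-newline characters is enough.
print(three.search("ab") is not None)    # False: only two characters
print(three.search("abcd") is not None)  # True: "abc" is found inside

# Covering the whole string: exactly three characters are required.
print(three.fullmatch("abc") is not None)   # True
print(three.fullmatch("abcd") is not None)  # False

# The dot does not match the newline character.
print(three.fullmatch("a\nc") is not None)  # False
```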

Pipe character

The pipe character | indicates a branch or alternation.

This means that the search pattern (She|He) matches either the string She or the string He.

Parentheses () enclose a subexpression or a part of an expression. They can be used to mark a part of a search pattern so it can be used later or to group partial patterns.

The previous point provided an example of grouping:

(He|She) defines a pattern that matches either He or She.
In comparison, he(he|sh) defines a pattern that matches either hehe or hesh, because the alternation applies only to the part in parentheses.
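The effect of grouping on an alternation can be illustrated with a small sketch, assuming Python's re module. The patterns with the word said are invented for this illustration and do not come from the text above.

```python
import re

# With parentheses, the alternation covers only "He" or "She".
grouped = re.compile(r"(He|She) said")
# Without parentheses, the alternatives are "He" and "She said".
ungrouped = re.compile(r"He|She said")

print(grouped.fullmatch("He said") is not None)     # True
print(grouped.fullmatch("She said") is not None)    # True
print(ungrouped.fullmatch("He said") is not None)   # False
print(ungrouped.fullmatch("He") is not None)        # True
print(ungrouped.fullmatch("She said") is not None)  # True
```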

Back references are the marked parts of a search pattern that can be extracted and edited after applying the pattern.

The pattern

"(.)"

matches exactly one arbitrary character enclosed in quotation marks. After the search pattern has been applied, you can query which character was matched. This is used frequently to cut substrings with a specific structure out of a larger string. Within a regular expression, opening parentheses are numbered consecutively from left to right. Back references can be used within the same expression to refer to the text matched by a parenthesized part.

\1

refers to the text matched by the first pair of parentheses,

\2

to the text matched by the second pair, and so on. There is more information regarding this in the following points.
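Both uses of marked parts, querying them after the match and referring to them inside the pattern, can be sketched as follows. The sketch assumes Python's re module, where the marked text is read with group(n); the sample strings are invented.

```python
import re

# Querying a marked part after the match: one character in quotation marks.
quoted = re.compile(r'"(.)"')
found = quoted.search('key = "x"')
print(found.group(1))  # x

# Referring to a marked part inside the pattern: \1 repeats the first group,
# so (.)\1 matches any character that occurs twice in a row.
doubled = re.compile(r"(.)\1")
print(doubled.search("letter").group(0))  # tt
print(doubled.search("abc"))              # None
```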

Quantifiers

{}

Quantifiers in curly brackets specify how many occurrences of the preceding expression are to be matched.

Question mark

The question mark can be used as a shorthand for

{0,1}

It indicates that there is zero or one repetition of the preceding expression.

Asterisk

The asterisk means

{0,}

in other words, zero or more repetitions of the preceding expression.

Plus sign

The plus sign indicates

{1,}

in other words, at least one repetition of the preceding expression.
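The equivalence between the three shorthand characters and their curly bracket forms can be checked with a small sketch, assuming Python's re module. The helper accepted and the sample strings are hypothetical.

```python
import re

def accepted(pattern: str, samples: list) -> list:
    """Return the samples that the pattern matches completely."""
    return [s for s in samples if re.fullmatch(pattern, s)]

samples = ["ac", "abc", "abbbc"]  # hypothetical sample data

print(accepted(r"ab?c", samples))  # ['ac', 'abc']
print(accepted(r"ab*c", samples))  # ['ac', 'abc', 'abbbc']
print(accepted(r"ab+c", samples))  # ['abc', 'abbbc']

# Each shorthand accepts the same strings as its curly bracket form.
print(accepted(r"ab?c", samples) == accepted(r"ab{0,1}c", samples))  # True
print(accepted(r"ab*c", samples) == accepted(r"ab{0,}c", samples))   # True
print(accepted(r"ab+c", samples) == accepted(r"ab{1,}c", samples))   # True
```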

Square brackets

[ ]

Square brackets introduce a character set. A character set always stands for exactly one character, namely any one of the characters included in the set. Thus, the regular expression

[0123456789]

matches exactly one digit.

Within the square brackets of a character set, the following characters have a special meaning:

The backslash \ is an escape character. A character that follows a backslash loses its special meaning and is treated as a literal character.
The caret ^ as the first character in a character set (and only as the first character) negates the character set. The character set

[^0123456789]

indicates exactly one character that is not a digit.

Dash

The dash - defines a range. The character set

[0-9]

is a shorter notation than

[0123456789]

for the same set.
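The character set rules above (enumeration, range, negation and escaping inside the brackets) can be tried out with the following sketch, which assumes Python's re module and uses invented sample text.

```python
import re

text = "a1b22"  # hypothetical sample data

# A range and the equivalent enumeration find the same characters.
digits_range = re.findall(r"[0-9]", text)
digits_listed = re.findall(r"[0123456789]", text)
print(digits_range)  # ['1', '2', '2']

# A caret as the first character negates the set.
non_digits = re.findall(r"[^0-9]", text)
print(non_digits)  # ['a', 'b']

# Escaped with a backslash, caret and dash are literal members of the set.
literals = re.findall(r"[\^\-]", "a^b-c")
print(literals)  # ['^', '-']
```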

Special character sets

If you use a search expression with the preg* functions, enclose the search expression in delimiters; additional options can be specified after the closing delimiter. Usually a forward slash / or an equals sign = is used as the delimiter.

After some practice, you can use these components to create search expressions that quickly and reliably perform the necessary search functions or find and replace functions.

Examples of frequently used expressions

Expression	Meaning
\d	a decimal digit
\D	the negation of \d
\w	a word character
\W	the negation of \w
\s	a whitespace character
\S	the negation of \s
{n,m}	repeats the previous expression between n and m times
{n,}	at least n repeats
{,m}	no more than m repeats (not accepted by every engine; {0,m} is the portable form)
{n}	indicates exactly n repeats
a{3,5}	all strings that contain three to five consecutive a characters
(bla){3}	all strings that contain the character sequence blablabla
^a(.{3,})b$	all strings that start with an a, end with a b and contain at least three characters in between. Which (and how many) characters these were can be queried after using the pattern, since this section of the pattern is marked with parentheses.
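Several rows of the table can be checked with a short sketch. It assumes Python's re module rather than the preg* functions, so no delimiters are written around the patterns; all sample strings are invented.

```python
import re

# Predefined character sets.
numbers = re.findall(r"\d+", "Order 66, item 7")
words = re.findall(r"\w+", "hi there_1!")
parts = re.split(r"\s+", "one  two\tthree")
print(numbers)  # ['66', '7']
print(words)    # ['hi', 'there_1']
print(parts)    # ['one', 'two', 'three']

# Quantifiers in curly brackets take as many repeats as allowed.
run_of_a = re.search(r"a{3,5}", "aaaaaaa").group(0)
print(run_of_a)  # aaaaa

has_bla = re.search(r"(bla){3}", "blablabla!") is not None
print(has_bla)  # True

# The marked part between a and b can be queried after the match.
inner = re.search(r"^a(.{3,})b$", "a123b")
print(inner.group(1))                   # 123
print(re.search(r"^a(.{3,})b$", "a12b"))  # None: only two characters between
```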