Crosscap User Manual
Regular expressions

Regular expressions provide a powerful tool for separating and filtering data. This topic is very complex and, in the beginning, quite challenging.

Regular expressions let you specify text patterns rather than exact strings. They are built from literals and metacharacters.

Literals are all characters that have no special meaning in a regular expression and are searched for as they are. A sequence of literals is the simplest form of search pattern. Letters and digits, for example, are literals.

A metacharacter, in contrast, is a symbol with a special meaning in a regular expression (separator, operator or wildcard character).

Regular expressions are essentially search patterns that are applied to strings to determine whether they match a string or not (mismatch). For example, the search pattern ay matches the string way because way contains ay. It does not match the string deep.
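As a minimal sketch of the match/mismatch distinction, the following uses Python's standard re module. This is an assumption made for illustration; the engine built into CROSSCAP may differ in details. re.search reports a match when the pattern occurs anywhere in the string.

```python
import re

# The pattern "ay" is contained in "way" but not in "deep".
matches_way = re.search(r"ay", "way") is not None
matches_deep = re.search(r"ay", "deep") is not None

print(matches_way)   # True
print(matches_deep)  # False
```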

If a search pattern is applied to a set of strings, you will receive two subsets, namely the set of all strings for which the pattern matches and the set for which the pattern does not match. Usually you will be interested in one of the two subsets.
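The split into two subsets can be sketched as follows. The helper name partition and the sample strings are hypothetical, and Python's re module stands in for whatever engine is actually used.

```python
import re

def partition(pattern, strings):
    """Split strings into (matching, non_matching) lists."""
    compiled = re.compile(pattern)
    matching = [s for s in strings if compiled.search(s)]
    non_matching = [s for s in strings if not compiled.search(s)]
    return matching, non_matching

matching, non_matching = partition(r"ay", ["way", "deep", "today", "tree"])
print(matching)      # ['way', 'today']
print(non_matching)  # ['deep', 'tree']
```

A filter typically keeps one of the two lists and discards the other.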

The characters used are certainly familiar to you if you have been working with a computer for any length of time. The following will show you which characters are used and which meaning or function is assigned to them.

You can also combine the special characters in regular expressions. To search for any string, for example, combine the dot wildcard character with the asterisk repeat character:

.*

This expression usually is used as a part of a larger expression, such as in

b.*ing

This expression matches any text that starts with b and ends with ing, for example bring or being.
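To try this pattern on a few sample words, here is a short sketch in Python's re module (an assumption for illustration, not the CROSSCAP engine). Note that without anchors the pattern also matches inside a longer string.

```python
import re

pattern = re.compile(r"b.*ing")

samples = ["bring", "being", "bing", "ring", "bin", "a big thing"]
results = {text: bool(pattern.search(text)) for text in samples}

for text, matched in results.items():
    print(text, matched)
# bring True
# being True
# bing True
# ring False
# bin False
# a big thing True
```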

Regular expressions are usually implemented using a finite automaton. Computer science provides methods for automatically generating an automaton that decides whether a string matches a given search pattern. Regular expression engines work in this way.

When a regular expression is first used, an automaton is generated internally (the pattern is compiled) and then applied. If the same search pattern is used again later, this automaton can be reused, which is significantly faster.

Some search patterns and conditions are too complex to be formulated with regular expressions and automata.
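The compile-once, apply-many-times idea can be sketched with Python's re.compile. This is only an analogy: Python's engine works by backtracking rather than as a pure finite automaton, and how CROSSCAP caches compiled patterns is not described here.

```python
import re

# Compile the pattern once; the compiled object can be applied to many strings.
ing_pattern = re.compile(r"b.*ing")

words = ["bring", "ring", "being"]
hits = [w for w in words if ing_pattern.search(w)]
print(hits)  # ['bring', 'being']
```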

A typical example is something that requires counting:


Also anything that requires preconditions:

In this case, you need more powerful concepts and tools, such as a context-free or context-sensitive grammar and a suitable parser. In CROSSCAP, you can also create complex conditions by connecting multiple filters in series.
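One commonly cited condition that requires counting is checking whether parentheses are balanced. This example is not taken from the manual; it is a hypothetical sketch of the kind of check that needs a counter instead of a plain regular expression.

```python
def is_balanced(text):
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0

print(is_balanced("(a(b)c)"))  # True
print(is_balanced("(a(b)c"))   # False
```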

In practice, you use regular expressions to decide if a string meets certain formal criteria:

or to cut specific substrings out of strings:
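Both uses can be sketched in Python's re module. The criterion (exactly five digits), the sample text and the function names are hypothetical and chosen only for illustration.

```python
import re

def is_five_digits(text):
    """Formal criterion: the string consists of exactly five digits."""
    return re.fullmatch(r"[0-9]{5}", text) is not None

def first_number(text):
    """Extraction: cut the first run of digits out of a string."""
    found = re.search(r"[0-9]+", text)
    return found.group(0) if found else None

print(is_five_digits("12345"))            # True
print(is_five_digits("1234a"))            # False
print(first_number("Invoice 4711 paid"))  # 4711
```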

Components of regular expressions


Summary

Character   Name                              Meaning
\           Backslash                         Escape character
^           Caret or hat character            Position in line: beginning of line
$           Dollar sign or string character   Position in line: end of line
.           Dot                               Wildcard: any character
|           Pipe character                    Branch/alternation
{ }         Braces (curly brackets)           Quantifiers
?           Question mark                     Zero or one repetition of the preceding expression
*           Asterisk                          Repeat: zero or more repetitions of the preceding character or character class
+           Plus sign                         At least one repetition of the preceding expression
[ ]         Square brackets                   Initiates a character set
-           Dash                              Range marker
[x-y]       Function                          Range: all characters in the given range
\x          Function                          Escape: literal use of the metacharacter x
\<xyz       Function                          Position in word: beginning of word
xyz\>       Function                          Position in word: end of word
[Class]     Function                          Character class: any character in the class
[^Class]    Function                          Inverse class: all characters that are not in the class



Search patterns in regular expressions consist of ordinary characters (literals) and characters with special meanings (metacharacters). Literals in search expressions stand for the corresponding characters in a string.


Metacharacters indicate wildcards for one or more other characters, for beginnings or ends of lines, or for other specific functions. They are what make regular expressions powerful and meaningful.

In regular expressions, there are character sets that are contained within square brackets [ and ]. There are rules for metacharacters that differ depending on whether you are working inside or outside of a character set. Outside of a character set there are the following special rules:

Backslash

\

is used as an escape character with several applications. It removes the special meaning from metacharacters (in other words, characters that are not alphanumeric) so that you can insert them literally into the search pattern.

The search pattern

\*

matches an asterisk exactly.

Before alphabetic characters, the backslash has a special meaning, similar to the C programming language.

\r

means carriage return (ASCII 13)


\n

means line feed (ASCII 10) and so on.
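The escapes described so far can be tried out in Python's re module, which is assumed here to behave like the manual's engine for these particular sequences.

```python
import re

# Escaped metacharacter: \* matches a literal asterisk.
has_star = re.search(r"\*", "2 * 3") is not None

# \r and \n stand for carriage return (13) and line feed (10).
text = "line one\r\nline two"
cr = re.search(r"\r", text)
lf = re.search(r"\n", text)

print(has_star)          # True
print(ord(cr.group(0)))  # 13
print(ord(lf.group(0)))  # 10
```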

The strings \d, \D, \w, \W, \s and \S describe certain predefined character sets (see below).

The string \0xYY denotes the character with the hexadecimal character code YY.

\0x0A

indicates a character with the character code ASCII 10 (line feed).


The string

\XXX

denotes the character with the octal character code XXX.

\012

therefore indicates the character with the character code ASCII 10 (line feed).

This syntax is not recommended as it overlaps with another syntax for back references (see below).
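For comparison, here is how the same character codes are written in Python's re module. Note the difference in spelling: Python uses \x0A for the hexadecimal form, not the \0xYY notation described in this manual, so this sketch illustrates the concept and not CROSSCAP's syntax.

```python
import re

text = "first\nsecond"

# Hexadecimal escape in Python's spelling: \x0A is character code 10.
hex_hit = re.search(r"\x0A", text)
# Octal escape with a leading zero: \012 is also character code 10.
oct_hit = re.search(r"\012", text)

print(hex_hit.start(), oct_hit.start())  # 5 5
```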


Caret or hat character

^

indicates the beginning of a string.

The search pattern ^hello matches all strings that start with the word hello, but not strings that contain hello somewhere in the middle.


String character

$

Correspondingly, the string character $ indicates the end of a string.

World!$

matches strings that end with the text World!

^$

matches strings that only have a start and an end, in other words, empty strings.
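The anchor examples above can be checked with a short sketch in Python's re module (assumed here for illustration; the sample strings are made up).

```python
import re

starts_hello = bool(re.search(r"^hello", "hello world"))   # True
middle_hello = bool(re.search(r"^hello", "say hello"))     # False
ends_world = bool(re.search(r"World!$", "Hello World!"))   # True
not_at_end = bool(re.search(r"World!$", "World! Hello"))   # False
empty_ok = bool(re.search(r"^$", ""))                      # True
non_empty = bool(re.search(r"^$", "x"))                    # False

print(starts_hello, middle_hello, ends_world, not_at_end, empty_ok, non_empty)
```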

Dot

.

The dot matches exactly one arbitrary character except the newline character (\n).

The search pattern

...

therefore matches any sequence of exactly three characters. To match only strings that are exactly three characters long, anchor it as ^...$.
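A small sketch in Python's re module (an assumption for illustration) shows the behaviour of the dot. re.fullmatch requires the whole string to match the pattern, while re.search looks for the pattern anywhere in the string.

```python
import re

exactly_three = bool(re.fullmatch(r"...", "abc"))   # True
too_short = bool(re.fullmatch(r"...", "ab"))        # False
with_newline = bool(re.fullmatch(r"...", "a\nc"))   # False: the dot does not match \n
inside_longer = bool(re.search(r"...", "abcd"))     # True: three characters found inside

print(exactly_three, too_short, with_newline, inside_longer)
```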

Pipe character

|

The pipe character | indicates a branch or alternation.

This means that the search pattern (She|He) matches either the string She or the string He.

Parentheses () enclose a subexpression or a part of an expression. They can be used to group partial patterns or to mark a part of a search pattern so that it can be used later.

The previous point provided an example of grouping: (She|He)
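Alternation can be sketched as follows in Python's re module. The trailing word said and the sample sentences are hypothetical additions that make the effect of the alternation visible.

```python
import re

pattern = re.compile(r"(She|He) said")

she = bool(pattern.search("She said yes"))    # True
he = bool(pattern.search("He said no"))       # True
they = bool(pattern.search("They said no"))   # False

print(she, he, they)
```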

Back references refer to the marked parts of a search pattern, which can be extracted and processed after the pattern has been applied.

The pattern

"(.)"

matches exactly one arbitrary character enclosed in quotation marks. After the search pattern has been applied, you can query which character actually matched. This is frequently used to cut substrings with a specific structure out of a larger string. Within a regular expression, opening parentheses are numbered consecutively from left to right. Back references can be used within the same expression to refer to the text that is marked by parentheses.

\1

indicates the text that is enclosed by the first pair of parentheses,

\2

the text in the second pair of parentheses, and so on. There is more information regarding this in the following points.
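Marked groups and back references can be sketched in Python's re module (an assumption; the sample strings are made up). group(1) returns the text matched by the first pair of parentheses, and \1 inside the pattern refers back to that same text.

```python
import re

# Extract the single character between double quotation marks.
found = re.search(r'"(.)"', 'key "x" pressed')
print(found.group(1))  # x

# (.)\1 matches any character that is immediately repeated.
doubled = re.search(r"(.)\1", "letter")
print(doubled.group(0))  # tt
```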

Quantifiers

{}

Quantifiers in curly brackets specify how many occurrences of the preceding expression are to be matched.

Question mark

?

The question mark can be used as a shorthand for

{0,1}

It indicates that there is zero or one repetition of the preceding expression.

Asterisk

*

The asterisk means

{0,}

in other words, zero or more repetitions of the preceding expression.

Plus sign

+

The plus sign indicates

{1,}

in other words, at least one repetition of the preceding expression.
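The equivalence between the shorthand quantifiers and their curly-bracket forms can be checked with a short sketch in Python's re module. The helper whole and the sample patterns are hypothetical.

```python
import re

def whole(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# ? is shorthand for {0,1}
print(whole(r"colou?r", "color"), whole(r"colou{0,1}r", "colour"))  # True True
# * is shorthand for {0,}
print(whole(r"ab*", "a"), whole(r"ab{0,}", "abbb"))                 # True True
# + is shorthand for {1,}
print(whole(r"ab+", "a"), whole(r"ab{1,}", "ab"))                   # False True
```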

Square brackets

[ ]

Square brackets introduce a character set. A character set always stands for exactly one character, namely any one of the characters included in the set. Thus, the regular expression

[0123456789]

matches exactly one digit.

Within the square brackets of a character set, the following characters have a special meaning:

Caret

^

Placed directly after the opening bracket, the caret negates the set.

[^0123456789]

indicates exactly one character that is not a digit.

Dash

-

The dash - defines a range. The character set

[0-9]

is a shorter notation than

[0123456789]

for the same set.
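The three forms of character set shown above can be compared with a brief sketch in Python's re module (assumed for illustration). re.findall returns every non-overlapping match in the sample string.

```python
import re

sample = "a1b22"

explicit = re.findall(r"[0123456789]", sample)
ranged = re.findall(r"[0-9]", sample)
negated = re.findall(r"[^0-9]", sample)

print(explicit)  # ['1', '2', '2']
print(ranged)    # ['1', '2', '2']
print(negated)   # ['a', 'b']
```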

Special character sets

If you use a search expression with preg* functions, enclose the search expression in delimiters, after which additional options can be specified. Usually a forward slash / or an equals sign = is used as the delimiter.

After some practice, you can use these components to create search expressions that quickly and reliably perform the necessary search functions or find and replace functions.

Examples of frequently used expressions

Expression     Meaning
\d             a decimal digit
\D             the negation of \d
\w             a word character
\W             the negation of \w
\s             a whitespace character
\S             the negation of \s
{n,m}          repeats the preceding expression between n and m times
{n,}           at least n repetitions
{,m}           no more than m repetitions
{n}            exactly n repetitions
a{3,5}         all strings that contain between three and five consecutive a characters
(bla){3}       all strings that contain the character sequence blablabla
^a(.{3,})b$    all strings that start with an a, end with a b and contain at least three characters in between. Which (and how many) characters these were can be queried after using the pattern, since this section of the pattern is marked with parentheses.
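A few of the expressions from the table can be tried out with this sketch in Python's re module. The sample strings are hypothetical, and the CROSSCAP engine may differ in details.

```python
import re

digits = re.findall(r"\d", "a1 b2")
has_a_run = bool(re.search(r"a{3,5}", "baaad"))
three_bla = bool(re.search(r"(bla){3}", "blablabla"))
two_bla = bool(re.search(r"(bla){3}", "blabla"))

# The parentheses mark the part between a and b so it can be queried.
found = re.search(r"^a(.{3,})b$", "a12345b")
too_short = re.search(r"^a(.{3,})b$", "a12b")

print(digits)                               # ['1', '2']
print(has_a_run, three_bla, two_bla)        # True True False
print(found.group(1), len(found.group(1)))  # 12345 5
print(too_short)                            # None
```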