Regular Expressions

Whether you need to scrape email addresses from a set of HTML documents or scour massive logfiles for needle-in-a-haystack errors, regular expressions provide a powerful tool for searching text.

Because they are composed primarily of sequences of tokens with somewhat non-obvious meanings, regular expressions (or regexes) can be daunting to decipher. Climbing the moderate learning curve needed to craft an accurate regex is well worth the effort, and it has the potential to save hours of manual text searching.

This post aims to provide a clear walkthrough of the most commonly used components of regex syntax. While there are numerous cheatsheet-style references available online, having a holistic understanding of what a regex can do will make it much easier to know where to start when approaching a search problem.

Basic Matching

Regular expressions are composed of a series of tokens that define a particular pattern. That pattern can then be compared against a string to determine whether the string as a whole matches it, or whether parts of the string do. In many languages, such as JavaScript, Perl, and Ruby, a regex literal is written between forward slashes. Within these slashes, you will find a range of tokens, the most basic being literal characters.

/efg/

abcdefghijk

In the above example, the pattern defined between the slashes registers a match in the sample string. The substring “efg” is explicitly sought by the regex and is thus returned as a match.
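To make the literal match concrete, here is a minimal sketch using Python's built-in re module. The choice of Python is an assumption (the post does not name a language), and Python passes the pattern as a string rather than writing it between slashes.

```python
import re

# Pattern and sample string are taken from the example above.
match = re.search(r"efg", "abcdefghijk")

print(match.group())  # efg
print(match.span())   # (4, 7)
```

The span shows where the match sits inside the sample string: it starts at index 4 and ends just before index 7.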

To create more flexible searches, brackets allow you to define multiple possible matches. If you were searching a directory for any person named either “Ted” or “Ned”, you could use a single regex with brackets denoting that both T and N should be accepted.

/[TN]ed/

Barbara Terry Ned Joel Ted Eddy
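The bracket example can be tried out with a short Python sketch (assuming the built-in re module; the variable names are only illustrative):

```python
import re

names = "Barbara Terry Ned Joel Ted Eddy"

# [TN] accepts exactly one character, which must be either T or N.
found = re.findall(r"[TN]ed", names)

print(found)  # ['Ned', 'Ted']
```

Note that “Terry” and “Eddy” are skipped: the first has no “ed” after the T, and the second starts with neither T nor N.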

Other, less familiar tokens, often starting with a backslash, can be used to craft more sophisticated searches. If you wanted to find a serial number by matching any digit, you wouldn’t need to write out every digit from 0 to 9. Instead, you can use the token \d, which matches any single digit.

/\d/

abc 123 def 456 ghi
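As a quick illustration in Python (an assumed language choice, using the built-in re module), each digit in the sample string is reported as its own match because \d stands for a single digit:

```python
import re

text = "abc 123 def 456 ghi"

# \d matches one digit at a time, so six separate matches are found.
digits = re.findall(r"\d", text)

print(digits)  # ['1', '2', '3', '4', '5', '6']
```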

A number of other useful similar tokens are listed below:

Capture Groups

/(goose)/

duck duck goosegoose duck duck goose

Capture groups, also referred to as subexpressions, allow you to divide a regular expression into syntactic units. This becomes useful when you need to modify or refer to a particular part of your search expression. This allows you, for example, to match only string segments containing multiple repetitions of a capture group.

/(goose){2}/

duck duck goosegoose duck duck goose
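The difference between the two goose patterns can be seen in a small Python sketch (assuming the built-in re module; finditer is used so that the full matched text is collected rather than just the group contents):

```python
import re

text = "duck duck goosegoose duck duck goose"

# A single capture group matches every occurrence of "goose".
single = [m.group(0) for m in re.finditer(r"(goose)", text)]

# {2} requires the group to appear twice in a row.
double = [m.group(0) for m in re.finditer(r"(goose){2}", text)]

print(single)  # ['goose', 'goose', 'goose']
print(double)  # ['goosegoose']
```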

Capture groups also allow you to perform backreferences. A backreference is a token in the pattern that matches the same text that was matched by the capture group it refers to.

For a very simple example, in the regex /(goose)(duck)\1/ the \1 can be read as (goose) because it points to the first capture group. A \2 would refer to (duck) as it refers to the second capture. This pattern can be used to reference back any number of groups.

For a more interesting example, consider what happens when we perform a backreference to one of the general matching tokens listed in the previous section.

/(\w)b\1/

yaba daba doo yabo dabo doo

Here, the capture group matches any word character, while the b matches just the literal character b. The backreference then matches the result of the initial capture group. Essentially, this matches any occurrence of the letter b with its immediately preceding and following characters, if those characters are the same.
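Here is a rough Python sketch of the same backreference, assuming the built-in re module. Only the segments where the characters on either side of the b are identical are returned:

```python
import re

text = "yaba daba doo yabo dabo doo"

# (\w) captures one word character; \1 must then repeat that same character.
matches = [m.group(0) for m in re.finditer(r"(\w)b\1", text)]

print(matches)  # ['aba', 'aba']
```

The “abo” in “yabo” and “dabo” is not matched, because “o” differs from the captured “a”.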

If you wish to define a syntactic unit but exclude it from these backreferences, simply use a non-capturing group. The syntax is /(?:goose)/.

If you are working with a complex regex with many capture groups, it may become cumbersome to refer to them by number. In these situations, you can use the following syntax to explicitly name and refer to capture groups:

/(?<myGroup>goose)\k<myGroup>/
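Named-group syntax varies between regex engines. The sketch below assumes Python's built-in re module, which spells a named group as (?P<name>...) and a named backreference as (?P=name) rather than the \k<name> form shown above. It also shows that a non-capturing group creates no group to refer back to.

```python
import re

text = "duck duck goosegoose duck duck goose"

# Python's spelling of a named group followed by a named backreference.
named = re.search(r"(?P<myGroup>goose)(?P=myGroup)", text)
print(named.group(0))          # goosegoose
print(named.group("myGroup"))  # goose

# A non-capturing group still acts as a unit for {2}, but captures nothing.
non_capturing = re.search(r"(?:goose){2}", text)
print(non_capturing.group(0))  # goosegoose
print(non_capturing.groups())  # ()
```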

Repetition

Regular expressions offer a number of tools for matching repeated string segments. They are outlined below.

Note, the repetitions must be immediately adjacent to each other, without any non-matching characters between them.
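The adjacency requirement can be checked with a short Python sketch (assuming the built-in re module; the two sample strings are made up for illustration). The last pattern uses the standard ? quantifier, meaning "zero or one", to allow an optional space inside the repeated unit.

```python
import re

adjacent = "goosegoose"
separated = "goose goose"

# {2} needs two back-to-back repetitions of the group.
adjacent_ok = bool(re.search(r"(goose){2}", adjacent))
separated_ok = bool(re.search(r"(goose){2}", separated))

# Making the space part of the repeated unit lets the separated form match.
with_space_ok = bool(re.search(r"(goose ?){2}", separated))

print(adjacent_ok)    # True
print(separated_ok)   # False
print(with_space_ok)  # True
```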

Boundaries

Boundary markers can be used to ensure that matches are adjacent to certain elements. For example, you can ensure that you only match sequences that occur at the very end of a line.

/(goose)$/

duck duck goose duck duck goose

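Note that this differs from the \z anchor, which matches the end of the entire string rather than just the end of a line. (In some engines, such as JavaScript and Python, $ matches at the end of each line only when multiline mode is enabled.)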

Similarly, the ^ anchor can be placed at the beginning of an expression to ensure that a match occurs only at the beginning of a line.
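How line anchors behave depends on the engine. The following sketch assumes Python's built-in re module, where $ and ^ match at every line break only when the re.MULTILINE flag is set. It also assumes the sample text is spread over two lines, which is a guess about the original example.

```python
import re

# Assumed two-line version of the sample text.
text = "duck duck goose\nduck duck goose"

# Without re.MULTILINE, $ only matches at the end of the whole string.
end_default = re.findall(r"(goose)$", text)

# With re.MULTILINE, $ and ^ also match at each line break.
end_multiline = re.findall(r"(goose)$", text, re.MULTILINE)
start_multiline = re.findall(r"^duck", text, re.MULTILINE)

print(len(end_default))      # 1
print(len(end_multiline))    # 2
print(len(start_multiline))  # 2
```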

Within a line, individual words can be isolated using the \b word boundary anchor. This is useful, for instance, if you are attempting to match a word that can also occur inside other, longer words.

/\b(she)\b/

shed she sheep she shepherd
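A brief Python sketch (assuming the built-in re module) shows what the word boundaries change. Without them, “she” is also found inside “shed”, “sheep”, and “shepherd”.

```python
import re

text = "shed she sheep she shepherd"

# With \b on both sides, only the standalone word is matched.
whole_words = [m.span() for m in re.finditer(r"\b(she)\b", text)]

# Without boundaries, every occurrence of the letters counts.
anywhere = re.findall(r"she", text)

print(whole_words)    # [(5, 8), (15, 18)]
print(len(anywhere))  # 5
```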

Advanced Boundaries

More sophisticated boundary expressions can be used if you need to ensure that a match is, or is not, preceded or followed by a particular expression. This is best explained through an example.

/(?<!b)(123)/

a123 b123 c123 d123

Here, the expression (?<!b) is a negative lookbehind, which ensures that the expression that follows is not immediately preceded by a “b”. The opposite expression, a positive lookbehind, looks like (?<=b) and would ensure that only “123” strings that are immediately preceded by a “b” register as matches.

In the same vein, (?=b) is a positive lookahead expression, while (?!b) provides the opposite, negative lookahead functionality. Both are used to check the characters immediately following an expression.
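All four lookaround forms can be compared side by side in a small Python sketch, assuming the built-in re module. The first sample string is the one used above. The second, lookahead_sample, is a hypothetical string added here so that the lookaheads have something to inspect after each “123”.

```python
import re

sample = "a123 b123 c123 d123"
lookahead_sample = "123a 123b 123c"  # hypothetical sample for the lookaheads

def starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

negative_lookbehind = starts(r"(?<!b)(123)", sample)  # not preceded by b
positive_lookbehind = starts(r"(?<=b)(123)", sample)  # preceded by b
positive_lookahead = starts(r"(123)(?=b)", lookahead_sample)  # followed by b
negative_lookahead = starts(r"(123)(?!b)", lookahead_sample)  # not followed by b

print(negative_lookbehind)  # [1, 11, 16]
print(positive_lookbehind)  # [6]
print(positive_lookahead)   # [5]
print(negative_lookahead)   # [0, 10]
```

In every case the match itself is only “123”. The lookaround checks the neighbouring character without including it in the result.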

Escape Characters

You may sometimes need to search for literal occurrences of special regex tokens. In those cases, simply prefix the character with a backslash. While . will match any single character (in most engines, anything except a line break by default), \. will match only an actual period character.
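The effect of escaping the period can be sketched in Python, assuming the built-in re module and a made-up sample string. Python also provides re.escape, which adds the backslashes for you when a literal string needs to be used inside a pattern.

```python
import re

text = "v1.2 v1x2"  # hypothetical sample string

# An unescaped . accepts any character between the 1 and the 2.
unescaped = re.findall(r"1.2", text)

# An escaped \. accepts only a literal period.
escaped = re.findall(r"1\.2", text)

print(unescaped)  # ['1.2', '1x2']
print(escaped)    # ['1.2']

# re.escape builds the escaped form of a literal string.
print(re.escape("1.2"))  # 1\.2
```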

Further Resources

Advanced regular expressions can at first look like an indecipherable soup of symbols, but fortunately there are a plethora of tools and guides available online that can act as a handy reference. Here are a handful of links that I have personally found helpful.

Online Regex Tester – There are many similar testers freely available that you can find with a quick search, but I like this one in particular because it provides broader support for multiple programming languages. Most testers implement only JavaScript’s regex support, which is missing some key features.

Regex tutorial — Medium – Jonny Fox’s cheatsheet by example is a phenomenal deep dive into many of the key concepts of regular expressions.

Regular Expression Reference – A great streamlined overview of regex syntax, useful for quickly refreshing yourself on a particular type of matcher.

Regex cheatsheet – This is a handy one to bookmark. Clean and minimal with useful information on variances across language implementations.