Lecture 8 (RegEx)
Resources
- R4DS chapter 14
- P4DA chapter 7.4
- RegEx Cheat Sheet
You can find examples and motivation in the resources.
Summary
In this lecture, we are going to learn how to use RegEx to pattern match strings. We will see how powerful it is to use RegEx to bulk search for patterns. It is industry standard and can be used in all of the tools we have used so far, i.e Git, Python/R and SQL.
RegEx
Regular Expressions (RegEx) is a powerful tool used in computer science and programming for pattern matching within strings. It provides a concise and flexible means of searching, matching, and manipulating text based on patterns.
A regular expression is a sequence of characters (a type of query) that defines a search pattern. These patterns can include a variety of elements such as literal characters, metacharacters (special characters with specific meanings), and quantifiers (to specify the number of occurrences). Regex is commonly used in tasks like text searching, validation, and text manipulation.
Before diving in to the syntax of RegEx, let's look at a simple example.
Consider a scenario where you want to extract email addresses from a text, for instance the following text.
Please contact support@example.com for assistance. For general inquiries, you can email info@company.com.
We can use the following RegEx to extract the emails in this text by matching them to a specific format.
\b[\w._%+-]+@[\w.-]+\.[A-Za-z]{2,4}\b
\b[\w._%+-]+@[\w.-]+\.[A-Za-z]{2,4}\b
The expression above might look daunting but it will make sense when you get thet gist of RegEx. Lets break down this expression.
- \b: Word boundary to ensure that the match is a whole word and not part of a larger sequence. For instance,
\bcat\b
will match the word cat but not scattered. - [\w._%+-]+: Matches the username part of the email address, allowing alphanumeric characters, dots, underscores, percent signs, plus signs, and hyphens. Here
\w
is a metacharacter which is short forA-Za-z0-9
.+
outside of the brackets matches 1 or more of the proceeding character. - @: Matches the at symbol.
- [\w.-]+: Matches the domain name, allowing alphanumeric characters, dots, and hyphens.
- .: Matches the dot before the top-level domain.
- [A-Z|a-z]{2,4}: Matches the top-level domain (eg. .com) with at least two and at most 4 characters .
- \b: Word boundary to complete the match.
To summarise The following image illustrates what we have done.
Image from: https://kottke.org/21/07/a-history-of-regular-expressions-and-artificial-intelligence
Even if it does not make sense yet that is fine.
Basic Syntax
Some important definitions in regex are the following.
Literals
Characters in a regex pattern that match themselves. For example, the regex abc
will match the string "abc" in the input.
Metacharacters
Special characters with a specific meaning in regex. Some common metacharacters include:
.
(dot): Matches any single character except a newline.^
: Anchors the regex at the start of the string.$
: Anchors the regex at the end of the string.*
: Matches 0 or more occurrences of the preceding character or group.+
: Matches 1 or more occurrences of the preceding character or group.?
: Matches 0 or 1 occurrence of the preceding character or group.|
: Acts like a logical OR, allowing alternatives. For example,a|b
matches either "a" or "b".()
: Groups characters together. For example,(abc)+
matches one or more occurrences of "abc".
Character Classes
[ ]
: Defines a character class. For example,[aeiou]
matches any vowel.[^ ]
: Negates a character class. For example,[^0-9]
matches any non-digit character.
Quantifiers
Control the number of occurrences of a character or group.
{n}
: Matches exactly n occurrences.{n,}
: Matches n or more occurrences.{n,m}
: Matches between n and m occurrences.
Escape sequences
Use a backslash \
to escape a metacharacter, allowing it to be treated as a literal character. For example, \.
matches a literal period.
Predefined character classes
\d
: Matches any digit (equivalent to[0-9]
).\D
: Matches any non-digit.\w
: Matches any word character (alphanumeric + underscore).\W
: Matches any non-word character.\s
: Matches any whitespace character.\S
: Matches any non-whitespace character.
Anchors
Specify the position in the string where a match must occur.
\b
: Word boundary.\B
: Non-word boundary.^
: Start of a line.$
: End of a line.
Modifiers
i
: Case-insensitive matching.g
: Global matching (find all matches, not just the first).
Wildcard
.*
is a common pattern to match any character (except newline) zero or more times.
I strongly believe that you learn regex by examples. So let's look a typical example of regex.
Example: Extracting Email Addresses from a List
Suppose you have a list of email addresses:
john.doe@example.com
jane.smith@gmail.com
alice.jones@example.com
bob.miller@yahoo.com
john.doe@example.com
jane.smith@gmail.com
alice.jones@example.com
bob.miller@yahoo.com
Now, let's say you want to extract all the email addresses from the domain example.com. You can use the following regex:
\b[A-Za-z0-9._%+-]+@example\.com\b
\b[A-Za-z0-9._%+-]+@example\.com\b
Explanation:
\b: Word boundary to ensure that we match the entire domain, not just a part of it. [A-Za-z0-9._%+-]+: Matches the username part of the email address, allowing letters, numbers, dots, underscores, percent signs, plus signs, and hyphens. @example.com: Matches the domain part, specifically example.com.
This pattern will match the following strings.
['john.doe@example.com', 'alice.jones@example.com']
['john.doe@example.com', 'alice.jones@example.com']
This is a simple example of regex that is meant to illustrate the power of it.
Exercises
Here are some excersises, try them out on your own!
Simple Email Validation
john.doe@example.com
jane.smith@gmail.com
alice.jones123@yahoo.com
invalid.email@domain
john.doe@example.com
jane.smith@gmail.com
alice.jones123@yahoo.com
invalid.email@domain
Output should be:
Valid Email Addresses:
- john.doe@example.com
- jane.smith@gmail.com
- alice.jones123@yahoo.com
Invalid Email Addresses:
- invalid.email@domain
Valid Email Addresses:
- john.doe@example.com
- jane.smith@gmail.com
- alice.jones123@yahoo.com
Invalid Email Addresses:
- invalid.email@domain
Extracting Phone Numbers
Phone numbers:
123-456-7890,
(555) 987-6543,
9876543210, 555-1234
Invalid:
12-345-6789,
555-98765,
abcdefgh
Phone numbers:
123-456-7890,
(555) 987-6543,
9876543210, 555-1234
Invalid:
12-345-6789,
555-98765,
abcdefgh
Ouput should be:
Valid Phone Numbers:
- 123-456-7890
- (555) 987-6543
- 9876543210
- 555-1234
Invalid Phone Numbers:
- 12-345-6789
- 555-98765
- abcdefgh
Valid Phone Numbers:
- 123-456-7890
- (555) 987-6543
- 9876543210
- 555-1234
Invalid Phone Numbers:
- 12-345-6789
- 555-98765
- abcdefgh
Extracting HTML Tags and Attributes
<p class="intro">This is a <strong>sample</strong> paragraph.</p>
<p>No class here.</p>
<div id="container" class="main">
<h1>Title</h1>
<p>Content</p>
</div>
<p class="intro">This is a <strong>sample</strong> paragraph.</p>
<p>No class here.</p>
<div id="container" class="main">
<h1>Title</h1>
<p>Content</p>
</div>
Output should be:
HTML Tags and Attributes:
- <p class="intro">
- <strong>
Attributes in <div>:
- id="container"
- class="main"
HTML Tags and Attributes:
- <p class="intro">
- <strong>
Attributes in <div>:
- id="container"
- class="main"
You can execute RegEx using Python or R. Look at the resources for how to do this!