Jam 03 - Exercise 1 - Regular Expression Fundamentals

---
title: "Jam 03 - Exercise 1 - Regular Expression Fundamentals"
tags:
  - 3 🧪 in testing
  - 4 🥳 done
  - jam03
  - regex
  - pattern matching
---

{%hackmd dJZ5TulxSDKme-3fSY4Lbw %}

# Exercise 1: Regular Expression Fundamentals

## Overview - Exercise 1

In this exercise, you'll learn the basics of regular expressions through hands-on practice with pattern matching. You'll start with simple patterns and gradually work up to more complex validation scenarios.

## Regular Expression Basics

Regular expressions (regex) are powerful tools for pattern matching and text processing. Think of them as a search-and-match language that helps you find specific patterns in text. Let's break down the key components:

:::info
🔧 **Helpful Tips for All Patterns**

- Test your patterns incrementally - start simple and add complexity
- Use regex101.com's explanation panel to understand each part
- Remember to escape special characters like parentheses
- Consider edge cases in your test strings
:::

### Basic Symbols

These are the fundamental building blocks of regex patterns:

- `.` - The wildcard character (it's a period). Matches any single character except newline.
  ```text
  "h.t" matches "hat", "hot", "hit", but not "heat"
  ```
- `^` - Marks the start of a line. Useful when you want to ensure something appears at the beginning.
  ```text
  "^The" matches "The cat" but not "Feed the cat"
  ```
- `$` - Marks the end of a line. Useful when you want to ensure something appears at the end.
  ```text
  "cat$" matches "The cat" but not "The cats"
  ```

### Character Sets

Character sets let you match one character from a specific set of characters:

- `[]` - Defines a character set. Match ANY ONE character listed inside.
  ```text
  "[aeiou]" matches any single vowel
  "gr[ae]y" matches both "gray" and "grey"
  ```
- `[0-9]` - Matches any single digit. The hyphen creates a range.
  ```text
  "[0-9]" matches "0", "1", ..., "9"
  "[0-9][0-9]" matches "00", "01", ..., "99"
  ```
- `[a-z]` - Matches any single lowercase letter
- `[A-Z]` - Matches any single uppercase letter
  ```text
  "[A-Z][a-z]" matches "At", "In", and the "Th" in "The" (one capital followed by one lowercase letter)
  ```
- `[^abc]` - The ^ INSIDE brackets means "not". Matches any character EXCEPT those listed.
  ```text
  "[^0-9]" matches any non-digit character
  ```

### Quantifiers

Quantifiers let you specify how many times something should match:

- `+` - One or more occurrences. Must appear at least once.
  ```text
  "ca+t" matches "cat", "caat", "caaat", but not "ct"
  ```
- `*` - Zero or more occurrences. Optional but can repeat.
  ```text
  "ca*t" matches "ct", "cat", "caat", "caaat"
  ```
- `?` - Zero or one occurrence. Makes something optional.
  ```text
  "colou?r" matches both "color" and "colour"
  ```
- `{n}` - Exactly n occurrences
- `{n,m}` - Between n and m occurrences
- `{n,}` - n or more occurrences
  ```text
  "[0-9]{3}" matches exactly 3 digits like "123"
  "[0-9]{2,4}" matches "12", "123", or "1234"
  "[0-9]{2,}" matches 2 or more digits
  ```

### Special Characters and Escaping

Some characters have special meanings and need to be escaped with `\` to match literally:

- `\` - The escape character (backslash). Used to tell regex "treat the next character as a literal character, not a special one"
  ```text
  "1+1" uses + as a special character (one or more)
  "1\+1" uses + as a literal plus sign
  ```
- `\d` - Matches any digit (same as [0-9])
- `\w` - Matches any word character (letter, digit, or underscore)
- `\s` - Matches any whitespace (space, tab, newline)
  ```text
  "\d\d\d" matches three digits (same as "[0-9]{3}")
  "\w+" matches entire words
  "a\sb" matches "a b" (a space between a and b)
  ```

To match special characters literally, escape them:

```text
"\." matches a literal dot
"\+" matches a literal plus sign
"\(" matches a literal opening parenthesis
```

:::info
🔧 **Building Complex Patterns**

Complex patterns are built by combining these elements. For example, to validate a hex color code (like #FF9900):

- `#` - Literal hash symbol
- `[0-9a-fA-F]` - A single hex digit (0-9 or a-f or A-F combined in one character class)
- `{6}` - Exactly 6 occurrences

This builds into: `#[0-9a-fA-F]{6}`

Which matches:

- "#FF9900" (uppercase hex)
- "#ff9900" (lowercase hex)
- "#123abc" (mixed numbers and letters)

But not (when the whole string must match, for example with `^` and `$` anchors):

- "FF9900" (missing #)
- "#FF990" (too short)
- "#FF990G" (G is not a hex digit)
- "#FF99000" (too long)
:::

Now that you understand the basics, let's practice with some real-world examples!

### Common Patterns

- USA ZIP Code: `^\d{5}(-\d{4})?$`
- URL: `^https?://[\w-]+(\.[\w-]+)+(/[\w-]*)*$`
- Date (YYYY-MM-DD): `^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$`

## Required Steps - Pattern Matching

Let's start by learning how to use regex101.com effectively:

1. Visit [regex101.com](https://regex101.com)
2. In the "FLAVOR" section of the left sidebar, select "Java 8"
3. The page has the following important sections:
   - Left: Where you select your regex flavor
   - Top Middle: Where you write your regex pattern
   - Bottom Middle: Where you paste your test strings
   - Top Right: Explanation of your regex pattern
   - Bottom Right: Quick Reference for regex syntax
4. Complete the following exercises in regex101.com (with answers put into answers.txt):

### Phone Numbers

Phone numbers are a common target for regex validation. They often follow specific formats and can be tricky to validate correctly.
Let's start by examining some phone numbers:

```text
Test Strings to Consider:
(123) 456-7890    <- (Valid) US format with area code
123-456-7890      <- (Invalid) Missing parentheses
(1234) 456-7890   <- (Invalid) Too many digits in area code
(123)456-7890     <- (Invalid) Missing space
(abc) def-ghij    <- (Invalid) Letters instead of numbers
```

Copy ALL of these test strings and paste them into the "Test String" section of regex101.com. As you write your pattern in the "Regular Expression" field, you'll see which strings match (highlighted in blue) and which don't.

Add your own additional test cases following this format. Try to think of edge cases that might break your patterns!

#### (1) Breaking down the components, you'll need to figure out

- How to match literal parentheses
- How to match exactly 3 digits
- How to match a space
- How to match a hyphen
- How to ensure the pattern matches the entire string

>[!Important] **Pattern Challenge 1: Phone Numbers**
>Create a regex pattern that matches phone numbers in the format: (XXX) XXX-XXXX
>
>- Should match: (123) 456-7890
>- Should not match: 123-456-7890 or (123)456-7890
>
>Add your pattern to `answers.txt` with the label "Q1.1: Phone Number Pattern:"

### Email Usernames

Email addresses have two parts: the username before the @ and the domain after. Extract (or capture) just the username part from Bucknell email addresses. You can assume the email address will always be at the start of the line:

```text
Test Strings to Consider:
romano@bucknell.edu      <- Should capture "romano"
jdoe2@bucknell.edu       <- Should capture "jdoe2"
a.b.c@bucknell.edu       <- Should capture "a.b.c"
123@bucknell.edu         <- Should capture "123"
@bucknell.edu            <- Invalid: empty username
.dot@bucknell.edu        <- Invalid: starts with dot
not.bucknell@gmail.com   <- Invalid: wrong domain
```

Copy all of these strings into regex101.com as well!
#### (2) Breaking down the components, you'll need to figure out

- How to match letters and numbers
- How to allow dots (but not at the start)
- How to capture just the username part using parentheses ()
- How to ensure it's a bucknell.edu email

>[!Important] **Pattern Challenge 2: Email Usernames**
>Create a regex pattern that matches Bucknell email addresses and captures the username:
>
>- Username can contain letters, numbers, and dots
>- Username cannot start with a dot
>- Must be a bucknell.edu email address
>- Use capturing groups (parentheses) to extract just the username
>
>Test your pattern against:
>
>- Should capture "romano" from "<romano@bucknell.edu>"
>- Should capture "jdoe2" from "<jdoe2@bucknell.edu>"
>- Should not match: <.dot@bucknell.edu>, <not.bucknell@gmail.com>
>
>Add your pattern to `answers.txt` with the label "Q1.2: Email Username Pattern:"

### Log Entry Time

Time formats in log files need to be precise and follow the 24-hour format. In our log files, the time is not always at the very start of a line: it can be preceded by other text such as a log level or a date, so your pattern should find the time wherever it appears. Here are some examples of time strings you might encounter:

```text
15:23:45   <- Valid 24-hour time
24:00:00   <- Invalid hour
23:59:59   <- Valid (maximum time)
00:00:00   <- Valid (midnight)
1:2:3      <- Invalid format (needs leading zeros)
12:60:00   <- Invalid minutes
12:59:60   <- Invalid seconds
[INFO] 15:23:45 Server started   <- Should still match "15:23:45"
```

Copy all of these strings into regex101.com one more time!
#### (3) Breaking down the components, you'll need to figure out

- How to limit hours to 00-23
- How to limit minutes to 00-59
- How to limit seconds to 00-59
- How to match the colons between components
- How to ensure proper formatting with leading zeros

>[!Important] **Pattern Challenge 3: Log Entry Time**
>Create a regex pattern that matches the time portion from our log entries (HH:MM:SS format)
>
>- Should match: "15:23:45" from "[2024-02-13 15:23:45] ERROR: Database connection failed"
>- Should handle 00-23 for hours, 00-59 for minutes and seconds
>
>Add your pattern to `answers.txt` with the label "Q1.3: Time Pattern:"

> 🔍 **Checkpoint**: Before moving on, verify that:
>
> - Your answers.txt contains all three regex patterns
> - Each pattern has test cases documented
> - You've tested each pattern on regex101.com
> - You understand why each pattern matches or doesn't match your test cases

## Required Steps - Java Implementation

Now that we understand how to create and test regex patterns, let's implement them in Java. Java provides several ways to work with regular expressions:

1. The quick way using `String.matches()`
2. The reusable way using `Pattern` and `Matcher`

### Quick Pattern Matching

The simplest way to check if a string matches a pattern is using the `String.matches()` method:

```java
String phoneNumber = "(123) 456-7890";
// \\( and \\) are the Java-string spellings of the regex escapes \( and \)
boolean isValid = phoneNumber.matches("\\([0-9]{3}\\) [0-9]{3}-[0-9]{4}");
```

Note: In Java strings, we need to escape the backslash itself, so `\(` becomes `\\(` in the string literal.

And if you ever need to match a literal backslash? You need four backslashes `\\\\` - yes, really! Two to create a literal backslash in the Java string, and two more to escape that backslash in the regex pattern. This leads to the classic programmer joke: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems... and four backslashes."
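
To see the escaping rules from the note above in action, here is a minimal sketch you can run on its own. The class name `EscapeDemo` and the helper `fullMatch` are made up for this illustration (they are not part of the starter code); the only library calls used are `Pattern.matches` and `Pattern.quote` from `java.util.regex`.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Hypothetical helper for this sketch: true if the WHOLE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Regex \+ (a literal plus) is written "\\+" in Java source.
        System.out.println(fullMatch("1\\+1", "1+1"));   // true
        System.out.println(fullMatch("1\\+1", "11"));    // false

        // Unescaped, + is a quantifier: one or more '1', followed by '1'.
        System.out.println(fullMatch("1+1", "11"));      // true
        System.out.println(fullMatch("1+1", "1+1"));     // false

        // Regex \\ (a literal backslash) is written "\\\\" in Java source.
        // The input "a\\b" is the three characters: a, backslash, b.
        System.out.println(fullMatch("a\\\\b", "a\\b")); // true

        // Pattern.quote escapes an entire literal string for you.
        System.out.println(fullMatch(Pattern.quote("(1+1)"), "(1+1)")); // true
    }
}
```

If counting backslashes gets confusing, `Pattern.quote` is a handy way to match a piece of text literally, but for this exercise you should still practice writing the escapes by hand.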
### Pattern and Matcher Classes

For more complex operations or when you'll reuse the same pattern multiple times, use the `Pattern` and `Matcher` classes:

```java
/*
 * Example of using Pattern and Matcher classes
 */
import java.util.regex.Pattern;
import java.util.regex.Matcher;

// Compile the pattern once (more efficient for multiple uses)
Pattern wordPattern = Pattern.compile("cat|dog"); // Matches either "cat" or "dog"

// Test different strings
String[] words = {
    "cat",
    "dog",
    "bird",   // should not match
    "catdog"  // should not match
};

for (String word : words) {
    Matcher matcher = wordPattern.matcher(word);
    if (matcher.matches()) {
        System.out.println("Found animal: " + word);
    }
}
```

### Required Steps - Java Implementation of Pattern Matching

Create a new class in the jam03 package called `PatternMatcher.java` (starter code will be provided below). For now, leave all your pattern matching code in the main method. Later exercises will teach us how to refactor this into a better design.

1. Replace `YOUR_PATTERN_HERE` with your patterns from answers.txt
2. Run the program and verify your patterns work as expected
3. Add at least 2 more test cases for each pattern in `testPattern()`:
   - One that should match (think about edge cases that are still valid)
   - One that should not match (think about tricky cases that might slip through)

:::warning
🚧 **Known Issue**: As of the writing of this document, checkstyle is incorrectly reporting indentation problems with switch statements. These issues are being addressed and should be fixed shortly. For now, you can ignore any checkstyle warnings about indentation issues for code inside of switch statements.
:::

(Note: The copy button should work for THIS code block. No promises on other code blocks.)

```java
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class PatternMatcher {
    public static void main(String[] args) {
        // From the regex101.com exercise, copy your patterns into the strings below
        String phonePattern = "YOUR_PATTERN_HERE";
        String emailPattern = "YOUR_PATTERN_HERE";
        String timePattern = "YOUR_PATTERN_HERE";

        // Compile your patterns
        Pattern compiledPhonePattern = Pattern.compile(phonePattern);
        Pattern compiledEmailPattern = Pattern.compile(emailPattern);
        Pattern compiledTimePattern = Pattern.compile(timePattern);

        // Run tests for each pattern type
        testPattern("phone numbers", compiledPhonePattern);
        testPattern("email addresses", compiledEmailPattern);
        testPattern("times", compiledTimePattern);
    }

    // Test helper method (provided)
    static void testPattern(String testType, Pattern pattern) {
        System.out.println("\nTesting " + testType + ":");

        // Get the appropriate test cases for this pattern type
        String[][] tests = switch (testType) {
            case "phone numbers" -> new String[][] {
                {"(123) 456-7890", "valid"},
                {"123-456-7890", "invalid"},
                {"(123)456-7890", "invalid"}
            };
            case "email addresses" -> new String[][] {
                {"abc123@bucknell.edu", "valid", "abc123"},
                {".dot@bucknell.edu", "invalid", null},
                {"not.bu@gmail.com", "invalid", null}
            };
            case "times" -> new String[][] {
                {"15:23:45", "15:23:45"},
                {"24:00:00", null},
                {"[INFO] 15:23:45 Server started", "15:23:45"}
            };
            default -> throw new IllegalArgumentException("Unknown test type: " + testType);
        };

        // Test each case
        for (String[] test : tests) {
            String input = test[0];
            String expected = test[1];
            Matcher matcher = pattern.matcher(input);

            // Handle different pattern types
            switch (testType) {
                case "phone numbers":
                case "email addresses":
                    boolean matches = matcher.matches();
                    String actual = matches ? "valid" : "invalid";

                    // Special handling for email usernames
                    if (matches && test.length > 2 && test[2] != null) {
                        String expectedUsername = test[2];
                        String actualUsername = matcher.group(1);
                        String emoji = (actual.equals(expected) && actualUsername.equals(expectedUsername)) ? "✅" : "❌";
                        System.out.printf("%s %s -> %s (username: %s, expected username: %s)%n",
                            emoji, input, actual, actualUsername, expectedUsername);
                    } else {
                        String emoji = actual.equals(expected) ? "✅" : "❌";
                        System.out.printf("%s %s -> %s (expected: %s)%n",
                            emoji, input, actual, expected);
                    }
                    break;

                case "times":
                    boolean found = matcher.find();
                    if (found) {
                        String actualTime = matcher.group();
                        String emoji = (actualTime.equals(expected)) ? "✅" : "❌";
                        System.out.printf("%s %s -> found time: %s (expected: %s)%n",
                            emoji, input, actualTime, expected != null ? expected : "no match");
                    } else {
                        String emoji = (expected == null) ? "✅" : "❌";
                        System.out.printf("%s %s -> no time found (expected: %s)%n",
                            emoji, input, expected != null ? expected : "no match");
                    }
                    break;
            }
        }
    }
}
```

> 🔍 **Checkpoint**: Before moving on to Exercise 2, verify that:
>
> - Your patterns work the same in Java as they did in regex101.com
> - Your test cases cover both valid and invalid inputs
> - Your program runs without errors and produces the expected output

## Save Your Work - Exercise 1

Verify what files are uncommitted:

```bash
git status
```

Stage your changes:

```bash
git add src/main/java/jam03/PatternMatcher.java
git add src/main/java/jam03/answers.txt
```

Commit your work:

```bash
git commit -m "jam03: Implement regex patterns and basic tests"
```

Your working directory should now be clean.