Working with Regular Expressions in C#

I’ve been working on a program that needs to parse an HTML file for form data. When I was deciding what method to use, a few popped right into my mind.

The first was a character-by-character search through the string, parsing the data and flagging sections that fit the signature of what was being searched for.

The second would be automating that by using the built-in string functions to split up the string and drill down until the needed data was extracted.

The third, and the one I chose, was regular expressions. This in my mind is the most “poetic” method of the three, and it would allow me to write a robust and reliable function.

While I’ve used regular expressions a lot throughout the years, I never seem to remember enough to construct a decent statement. I had recently bought a pocket reference (link below), so I used that to get a statement constructed. It had a total of about 6 pages for C#, but I pretty much got what I needed from it. Anything else I just searched the Internet for.

Included below is most of the code to extract an unlimited number of forms from an HTML document:

// Create a new instance that will be sent back as a reference parameter
returnHtmlFormData = new List<HtmlForm>();
// Take the initial response text and process it for FORM tags, this can handle an "unlimited" number of them
// Regular expression to extract each form tag; in the group collection, [0] is the whole form and [1] is the action attribute
Regex formExtractor = new Regex(@"<form\b[^>]*action=""?(.*?)[""\s].*?>.*?</form>", RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);
// Regular expression to extract all of the input tags (we want the name and the value of each)
// Note: inside a character class [] the | is a literal pipe, not "or", so it is left out here
Regex inputTagExtractor = new Regex(@"<input\b[^>]*name=""?(.*?)[""\s].*?value=""?(.*?)[""\s].*?/?>", RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);

List<HtmlForm> is a collection of a class that I created. It basically just stores:
1. The action page, which we would need if we want to use this data later to send a POST.
2. All of the fields that were in the form.
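
The class itself isn't shown in this post, so here is a minimal sketch of what such a storage class might look like. Only the ActionPage property and the addInputField method come from the code in this post; the InputFields list, its type, and the HtmlFormDemo wrapper are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reconstruction of the HtmlForm storage class. Only ActionPage
// and addInputField appear in the post's code; everything else is assumed.
public class HtmlForm
{
    // The page the form posts to (the form tag's action attribute)
    public string ActionPage { get; set; } = "";

    // Assumed storage: name/value pairs in the order they were found
    public List<KeyValuePair<string, string>> InputFields { get; } =
        new List<KeyValuePair<string, string>>();

    // Method name kept exactly as it is called in the post
    public void addInputField(string name, string value)
    {
        InputFields.Add(new KeyValuePair<string, string>(name, value));
    }
}

public static class HtmlFormDemo
{
    public static void Main()
    {
        var form = new HtmlForm { ActionPage = "/login" };
        form.addInputField("user", "alice");
        form.addInputField("token", "abc123");

        // Prints "/login: 2 fields"
        Console.WriteLine($"{form.ActionPage}: {form.InputFields.Count} fields");
    }
}
```

A list of pairs (rather than a dictionary) keeps the fields in document order and tolerates duplicate names, which HTML forms allow.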

The two Regex instances do all of the work here. The first searches the text for form tags. It takes into account quite a few possibilities, like extra spaces or other attributes in the starting tag. Also, notice the three regex options. Ignoring case is obvious. Compiled turns the pattern into IL code, which improves matching speed at the cost of a slower startup. Singleline changes the meaning of the dot so that it matches every character, including the line feed (\n), which lets .*? run across the line breaks inside a multi-line form.

The second searches within the form tag (as shown later on) for all input fields. It uses parentheses to define capture groups, and whatever each group matches is saved in a GroupCollection that we will look at later. It also takes into account things like whether attributes have quotes around their values.
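
To see what the Singleline option changes in practice, here is a small sketch that runs the same pattern with and without it. The pattern is a cut-down form matcher, and the SinglelineDemo class and FindsForm helper are names made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Returns true when the pattern finds a complete form element in html
    public static bool FindsForm(string html, RegexOptions options)
    {
        return Regex.IsMatch(html, @"<form\b.*?</form>", options);
    }

    public static void Main()
    {
        // The form spans two lines, so a line feed sits between the tags
        string html = "<form action=\"/go\">\n</form>";

        // Without Singleline, '.' stops at the line feed and the match fails
        Console.WriteLine(FindsForm(html, RegexOptions.IgnoreCase)); // False

        // With Singleline, '.' also matches the line feed
        Console.WriteLine(FindsForm(html, RegexOptions.IgnoreCase | RegexOptions.Singleline)); // True
    }
}
```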
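
As a rough illustration of how the parentheses map to group numbers, the sketch below uses a simplified pattern (quoted attributes only, name before value) rather than the exact regex from this post. GroupDemo and FirstField are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Simplified pattern for illustration: quoted attributes only, name before value
    private static readonly Regex InputTag = new Regex(
        @"<input\b[^>]*\bname=""([^""]*)""[^>]*\bvalue=""([^""]*)""[^>]*>",
        RegexOptions.IgnoreCase);

    // Returns "name=value" for the first input tag found, or "" if there is none
    public static string FirstField(string html)
    {
        Match m = InputTag.Match(html);
        if (!m.Success) return "";

        // Groups[0] is the whole tag; Groups[1] and Groups[2] are the two () captures
        return m.Groups[1].Value + "=" + m.Groups[2].Value;
    }

    public static void Main()
    {
        // Prints "token=abc123"
        Console.WriteLine(FirstField("<input type=\"hidden\" name=\"token\" value=\"abc123\" />"));
    }
}
```

Groups are numbered by the position of their opening parenthesis, left to right, which is why the name ends up in [1] and the value in [2].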

//attempt to extract out all forms in the passed string 
MatchCollection formList = formExtractor.Matches(initialResponseText);

The line above runs the first regular expression against the response text and returns every match in a MatchCollection object.


// For each form tag that is found, process it
foreach (Match formMatch in formList)
{
    // Create a new element in the list data structure so we can fill it with form data
    returnHtmlFormData.Add(new HtmlForm());
    
    // Get the index of the element we just added, so we can fill it with data
    int activeListElement = returnHtmlFormData.Count - 1;
    
    // Extract the regex groups from the result, so we can continue processing
    // Each () in the regex statement becomes an entry in this collection
    // The first element, [0], is the whole match though
    GroupCollection formTagMatchValues = formMatch.Groups;
    
    // Assign the action page value we extracted from the current form element to our data structure
    returnHtmlFormData[activeListElement].ActionPage = formTagMatchValues[1].ToString();
    
    // Attempt to extract all of the names/values for the input tags
    MatchCollection inputTagMatches = inputTagExtractor.Matches(formTagMatchValues[0].ToString());
    
    // Loop through the results (multiple input tags should be returned)
    foreach (Match inputMatch in inputTagMatches)
    {
        GroupCollection inputMatchValues = inputMatch.Groups;
        
        // Save the input field data to our data structure
        returnHtmlFormData[activeListElement].addInputField(inputMatchValues[1].ToString(), inputMatchValues[2].ToString());
    }
}


As you can see above, I use my List collection returnHtmlFormData to hold instances of my special form storage class. I’ll let my code comments above speak for themselves, but basically you start from a MatchCollection and loop through it as single items of class Match; each Match can then be processed further by reading its GroupCollection for the actual data we wanted extracted. It’s quite an ingenious construct of classes, but it took me a while to figure out…
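
If you want to try the MatchCollection, Match, GroupCollection chain on its own, here is a self-contained sketch. It uses simplified stand-in patterns (quoted attributes only) and returns plain strings instead of my storage class; FormWalkthrough, Describe, and the sample HTML are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FormWalkthrough
{
    // Simplified stand-in patterns (quoted attributes only), not the post's exact regexes
    private static readonly Regex FormTag = new Regex(
        @"<form\b[^>]*\baction=""([^""]*)""[^>]*>.*?</form>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    private static readonly Regex InputTag = new Regex(
        @"<input\b[^>]*\bname=""([^""]*)""[^>]*\bvalue=""([^""]*)""[^>]*>",
        RegexOptions.IgnoreCase);

    // Returns one line per form: "action: name=value, name=value"
    public static List<string> Describe(string html)
    {
        var lines = new List<string>();

        // MatchCollection -> one Match per form
        foreach (Match formMatch in FormTag.Matches(html))
        {
            var fields = new List<string>();

            // Groups[0] is the whole form element, so search inside it for inputs
            foreach (Match inputMatch in InputTag.Matches(formMatch.Groups[0].Value))
            {
                // Match -> GroupCollection holding the captured name and value
                GroupCollection g = inputMatch.Groups;
                fields.Add(g[1].Value + "=" + g[2].Value);
            }

            lines.Add(formMatch.Groups[1].Value + ": " + string.Join(", ", fields));
        }
        return lines;
    }

    public static void Main()
    {
        string html =
            "<form action=\"/login\">\n" +
            "  <input name=\"user\" value=\"alice\">\n" +
            "  <input name=\"token\" value=\"abc123\" />\n" +
            "</form>\n" +
            "<form action=\"/search\"><input name=\"q\" value=\"regex\"></form>";

        // Prints:
        // /login: user=alice, token=abc123
        // /search: q=regex
        foreach (string line in Describe(html))
            Console.WriteLine(line);
    }
}
```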

Amazon.com Book Link:
Regular Expression Pocket Reference: Regular Expressions for Perl, Ruby, PHP, Python, C, Java and .NET

