Posts Tagged ‘c#’

Working with Regular Expressions in C#

Saturday, August 23rd, 2008

I’ve been working on a program that needs to parse an HTML file for form data. When I was deciding what method to use, a few popped right into my mind.

The first was a character-by-character search through the string: parsing through the data and flagging sections that fit the signature of what was being searched for.

The second was to automate that with the built-in string functions, splitting up the string and drilling down until the needed data was extracted.

The third, and the one I chose, was regular expressions. To my mind this is the most “poetic” method of the three, and it would let me write a robust and reliable function.

While I’ve used regular expressions a lot over the years, I NEVER seem to remember enough to construct a decent statement. I had recently bought a pocket reference (link below), so I used that to get a statement constructed. It has only about 6 pages on C#, but I pretty much got what I needed from it. Anything else I just searched the Internet for.

Included below is most of the code to extract any number of forms from an HTML document:

//create a new instance that will be sent back as a reference parameter
//there may be multiple forms, so we have to use a data structure
returnHtmlFormData = new List<HtmlForm>();

//take the initial response text and process it for FORM tags, this can handle an "unlimited" number of them

//a regular expression to extract each form tag [0] as well as its action attribute [1] in the group collection
//note: inside [] the | character is a literal, so the class is just a quote or white-space
Regex formExtractor = new Regex(@"<form\b[^>]*action=""?(.*?)[""\s].*?>.*?</form>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);

//a regular expression to extract all of the input tags (we want the name and the value of each)
Regex inputTagExtractor = new Regex(@"<input\b[^>]*name=""?(.*?)[""\s].*?value=""?(.*?)[""\s].*?/?>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);

List<HtmlForm> is a collection of a class that I created. It basically just stores:
1. The action page, which we would need if we want to use this data later to send a POST.
2. All of the fields that were in the form.
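
As a rough sketch (the real class isn’t shown in this post, so everything here beyond the ActionPage and addInputField names used in the code further down is an assumption), HtmlForm could look something like this:

```csharp
using System.Collections.Generic;

// Hypothetical sketch of the HtmlForm storage class; the original is not shown in this post.
public class HtmlForm
{
    // The page the form submits to (the action attribute of the form tag).
    public string ActionPage { get; set; }

    // Input field names mapped to their values.
    private readonly Dictionary<string, string> inputFields = new Dictionary<string, string>();

    // Read access to the collected fields (assumed; only addInputField appears in the post).
    public IDictionary<string, string> InputFields
    {
        get { return inputFields; }
    }

    // Stores one input field; a repeated name overwrites the earlier value.
    public void addInputField(string name, string value)
    {
        inputFields[name] = value;
    }
}
```

A dictionary keeps the sketch short, but it drops duplicate field names; a list of name/value pairs would be the safer choice if the form can repeat a name.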

The two Regex instances do all of the work here. The first searches the text for form tags. It takes into account quite a few possibilities, like spaces or other attributes in the starting tag. Also, notice the three regex options. Ignoring case is obvious; Compiled improves execution time for a regex that is reused a lot, at the cost of a slower startup; and the Singleline option makes the dot (.) match every character, including the line feed, so a single match can span multiple lines.
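
To see what the Singleline option actually changes, here is a small sketch. The class name, the sample HTML, and the simplified pattern are all made up for illustration; the only point is that the dot stops at a line feed unless Singleline is set.

```csharp
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Returns true when the pattern matches a form whose opening and closing tags are on different lines.
    public static bool MatchesAcrossLines(RegexOptions options)
    {
        // Hypothetical sample input with a line feed between each tag.
        string html = "<form action=\"a.aspx\">\n<input name=\"q\" value=\"1\">\n</form>";

        // Without Singleline, ".*?" cannot step over the "\n" characters.
        return Regex.IsMatch(html, "<form\\b.*?</form>", options);
    }
}
```

Calling MatchesAcrossLines(RegexOptions.None) should come back false, while MatchesAcrossLines(RegexOptions.Singleline) should come back true.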

The second searches within the form tag (as shown later on) for all input fields. It uses parentheses to capture certain pieces of the data into groups, which are returned in a GroupCollection that we will look at later. It also takes into account things like whether attributes have quotes around their values or not.

//attempt to extract out all forms in the passed string
MatchCollection formList = formExtractor.Matches(initialResponseText);

The line above takes some text data and runs the first regular expression against it. It returns every match it finds in a MatchCollection object.

//for each form tag that is found, process it
foreach (Match formMatch in formList)
{
    //create a new element in the list data structure so we can fill it with form data
    returnHtmlFormData.Add(new HtmlForm());

    //get a temporary copy of the current element of the list we want to be filling with data
    int activeListElement = returnHtmlFormData.Count - 1;

    //extract the regex variables from the result, so we can continue processing
    //anywhere where you see () in the regex statement, will be a variable in here
    //the first element, [0], will be the whole result though
    GroupCollection formTagMatchValues = formMatch.Groups;

    //assign the action page value we extracted from the current form element to our data structure
    returnHtmlFormData[activeListElement].ActionPage = formTagMatchValues[1].ToString();

    //attempt to extract all of the names/values for the input tags
    MatchCollection inputTagMatches = inputTagExtractor.Matches(formTagMatchValues[0].ToString());

    //loop through the results (multiple input tags should be returned)
    foreach (Match inputMatch in inputTagMatches)
    {
        GroupCollection inputMatchValues = inputMatch.Groups;

        //save the input field data to our data structure
        returnHtmlFormData[activeListElement].addInputField(inputMatchValues[1].ToString(), inputMatchValues[2].ToString());
    }
}

As you can see above, I use my List collection returnHtmlFormData to hold a list of instances of my special form storage class. I’ll let my code comments above speak for themselves, but basically you start from a MatchCollection and loop through it as single items of class Match; each Match can then be processed further by reading its GroupCollection for the actual data we wanted extracted. It’s quite an ingenious construct of classes, but it took me a while to figure out…
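
If you want to play with the MatchCollection, Match, and GroupCollection chain on its own, here is a stripped-down sketch. The class name and the pattern are assumptions for illustration: this pattern is stricter than the one above and only handles double-quoted name and value attributes, in that order.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns "name=value" for every input tag with a quoted name followed by a quoted value.
    public static List<string> ExtractPairs(string html)
    {
        Regex inputTag = new Regex(
            @"<input\b[^>]*?\bname=""([^""]*)""[^>]*?\bvalue=""([^""]*)""[^>]*>",
            RegexOptions.IgnoreCase);

        List<string> pairs = new List<string>();

        // Matches() returns a MatchCollection; each item in it is a Match.
        foreach (Match match in inputTag.Matches(html))
        {
            // Groups[0] is the whole tag; Groups[1] and Groups[2] are the two () captures.
            GroupCollection groups = match.Groups;
            pairs.Add(groups[1].Value + "=" + groups[2].Value);
        }

        return pairs;
    }
}
```

Group.Value gives the same string as the ToString() calls used above; it just reads a little more clearly.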

Amazon.com Book Link:
Regular Expression Pocket Reference: Regular Expressions for Perl, Ruby, PHP, Python, C, Java and .NET

Learning About .NET Web Access Classes

Thursday, May 15th, 2008

I mentioned wanting to start a business in my previous post. Well, it looks like I might have my chance. While it isn’t exactly what I was thinking of in my previous post, I foresee many opportunities to flex my programming muscle in this endeavor. Plus, I will be starting up with a good friend, so there is a good chance our motivation will actually produce some results.

A piece of software I am starting to research/design/create will be an internal application we will use to automatically extract some government provided public data. The current issue is time. With the amount of data and the requirement of using their web interface, it is not worth the time needed to do it by hand. The only other option would be to buy the data from the government, but that’s just an unnecessary cost seeing as there is a free option available.

So I will be writing a program in C# to automatically access a few websites and download the needed data in chunks. I have never dealt with .NET’s web access objects, so I started looking at what they have to offer today… which looks like a lot.

As a simple test to see how reading a web page works in .NET, I modified an example from here:

http://msdn.microsoft.com/en-us/library/456dfw4f.aspx

My revised code is below.
Take note of the two Windows Forms controls it uses (webBrowser1 and textBox1).

//requires: using System; using System.IO; using System.Net;
//due to the finally statement, these variables need to be created outside the try block
WebRequest request = null;
WebResponse response = null;
Stream dataStream = null;
StreamReader reader = null;
try
{
    // Create a request for the URL.
    request = WebRequest.Create("http://page_to_access");

    // If required by the server, set the credentials.
    request.Credentials = CredentialCache.DefaultCredentials;

    // Get the response.
    response = request.GetResponse();

    // Display the status.
    textBox1.Text = ((HttpWebResponse)response).StatusDescription;

    // Get the stream containing content returned by the server.
    dataStream = response.GetResponseStream();

    // Open the stream using a StreamReader for easy access.
    reader = new StreamReader(dataStream);

    // Read the content.
    string responseFromServer = reader.ReadToEnd();

    // Display the content.
    webBrowser1.DocumentText = responseFromServer;
}
catch (Exception error)
{
    Console.Write(error.ToString());
}
finally
{
    // Clean up the streams and the response.
    if (reader != null)
    {
        reader.Close();
    }
    if (response != null)
    {
        response.Close();
    }
}

All it does is request a web page (a plain GET request, so any data has to go in the URL) and then read the response stream into a string. After that it takes the string and loads it into the standard WebBrowser control.
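
The try/finally cleanup can also be written with using blocks, which close the reader and the response automatically, even when an exception is thrown. This is only a sketch: the class and method names are made up, and it leaves out the credentials and the two form controls. Also note that newer versions of .NET mark WebRequest as obsolete in favor of HttpClient, so treat this as a match for the code above rather than a recommendation.

```csharp
using System.IO;
using System.Net;
using System.Text;

public static class PageReader
{
    // Reads a whole stream into a string; disposing the reader also closes the stream.
    public static string ReadAll(Stream stream)
    {
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Issues a GET request for the URL and returns the body as a string.
    public static string Download(string url)
    {
        WebRequest request = WebRequest.Create(url);

        // The response is closed at the end of the using block, like the finally block does.
        using (WebResponse response = request.GetResponse())
        {
            return ReadAll(response.GetResponseStream());
        }
    }
}
```

Splitting out ReadAll also makes the stream handling easy to try against a MemoryStream without touching the network.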

Next up I will setup/research:
- Searching the results from a web request.
- POSTing data to a page as opposed to using the GET query string
- Downloading files that are available from a webpage

Should be interesting!

C# .NET Programming Tip: Oracle Connection Revised

Saturday, March 1st, 2008

Take note that Microsoft will deprecate System.Data.OracleClient in .NET 4.0. This method should still work, but the provider will be marked as deprecated…

Now that I have had more time to work with Oracle, I found a better way to connect than the one described in my previous post. With this new method you can connect to multiple instances in one program by getting rid of that tnsnames file.

*Remember that a project reference to System.Data.OracleClient must be added:

OracleConnection dbConnection;

string connectionString = "Server=(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = *******)(PORT = 1521))(CONNECT_DATA = (SID = *****)));Persist Security Info=true;User Id=*****;Password=*****;";

dbConnection = new OracleConnection(connectionString);

It’s really that simple. The key here is that the connection string must use the “Server” attribute instead of the “Data Source” one. I came across that while looking at the specification on Microsoft’s MSDN documentation site.

It’s basically a tnsnames entry with a few more attributes like user id and password. Just remember that:
host is the name or IP address of your server, sid is the name of your database (the *** part of ***.world), and of course the user id and password are your login credentials.
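
If you connect to several instances, it may help to build the string from its parts. The helper below is a hypothetical sketch (the class and method names are mine) that just fills in the template shown above; it does no escaping, so it assumes none of the values contain a semicolon or parentheses.

```csharp
public static class OracleConnectionStrings
{
    // Fills in the tnsnames-style template from this post; values are inserted as-is.
    public static string Build(string host, int port, string sid, string userId, string password)
    {
        string descriptor = string.Format(
            "(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = {0})(PORT = {1}))(CONNECT_DATA = (SID = {2})))",
            host, port, sid);

        return string.Format(
            "Server={0};Persist Security Info=true;User Id={1};Password={2};",
            descriptor, userId, password);
    }
}
```

The result can be passed straight to new OracleConnection(...), once per instance you need to reach.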

There is another helpful step that I will be posting about. It relates to ClickOnce and the 4 DLL files that you need to connect to Oracle.

C# .NET Programming Tip: Types

Thursday, February 21st, 2008

Figuring out a variable’s type has become more important now that an object can be referenced through a variable of any of its parent classes (all the way up to “Object”). That’s nice because it allows one generalized function to work with many types, either performing an action on a common attribute or first figuring out what the object is and then performing the action.

That is one of the cases where figuring out a variable’s type is important.

.NET has a built-in method called GetType(), which returns the runtime type of the object a variable refers to.

//loop through the controls in the panel and figure out which of the checkboxes are checked
foreach (Control panelControl in newCheckListQuestionPanel.Controls)
{
   //only continue if this control is exactly a CheckBox (comparing Type objects avoids the string compare)
   if (panelControl.GetType() == typeof(CheckBox))
   {
       if (((CheckBox)panelControl).Checked)
       {
           //do something here if the control is checked
       }
   }
}
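
One thing to keep in mind: comparing GetType() results only matches the exact type, while the is operator also accepts derived classes. The classes below are made up purely to show the difference.

```csharp
// Hypothetical class hierarchy used only for this illustration.
public class Animal { }
public class Dog : Animal { }
public class Puppy : Dog { }

public static class TypeChecks
{
    // True only when the runtime type is exactly Dog.
    public static bool IsExactlyDog(object value)
    {
        return value != null && value.GetType() == typeof(Dog);
    }

    // True for Dog and for anything derived from Dog; false for null.
    public static bool IsDogOrDerived(object value)
    {
        return value is Dog;
    }
}
```

With these definitions a Puppy fails IsExactlyDog but passes IsDogOrDerived, so if a custom control inherits from CheckBox, the is check is probably the one you want.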




 

 