Posts Tagged ‘html’

Working with Regular Expressions in C#

Saturday, August 23rd, 2008

I’ve been working on a program that needs to parse an HTML file for form data. When I was deciding which method to use, a few came to mind right away.

The first was a character-by-character search through the string: parsing through the data and flagging sections that fit the signature of what was being searched for.

The second would be automating that by using the built-in string functions to split up the string and drill down until the needed data was extracted.
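For comparison, the second approach could look something like the sketch below. It is written in JavaScript rather than C#, and extractAction is a made-up helper name. It only copes with a double-quoted action attribute, which hints at how quickly this approach gets fiddly.

```javascript
// Hypothetical helper: pull the action attribute out of the first <form> tag
// using only built-in string functions (no regular expressions).
function extractAction(html) {
  const lower = html.toLowerCase();            // search case-insensitively
  const tagStart = lower.indexOf("<form");
  if (tagStart === -1) return null;            // no form tag at all

  const tagEnd = lower.indexOf(">", tagStart);
  if (tagEnd === -1) return null;              // malformed tag

  const key = 'action="';
  const attrStart = lower.indexOf(key, tagStart);
  if (attrStart === -1 || attrStart > tagEnd) return null; // no action in this tag

  const valueStart = attrStart + key.length;
  const valueEnd = html.indexOf('"', valueStart);
  if (valueEnd === -1) return null;            // closing quote is missing

  // slice from the original text so the value keeps its original case
  return html.substring(valueStart, valueEnd);
}

const sample = '<FORM method="post" action="Login.aspx"><input name="user" value="bob"></FORM>';
console.log(extractAction(sample)); // Login.aspx
```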

The third, and the one I chose, was regular expressions. To my mind this is the most “poetic” method of the three, and it would let me build a robust and reliable function.

While I’ve used regular expressions a lot over the years, I NEVER seem to remember enough to construct a decent statement. I had recently bought a pocket reference (link below), so I used that to get a statement constructed. It has only about 6 pages on C#, but I pretty much got what I needed from it. Anything else I just searched the Internet for.

Included below is most of the code to extract an unlimited number of forms from an HTML document:

//create a new instance that will be sent back as a reference parameter
//there may be multiple forms, so we have to use a data structure
returnHtmlFormData = new List<HtmlForm>();

//take the initial response text and process it for FORM tags, this can handle an "unlimited" number of them

//a regular expression to extract each form tag ([0] in the group collection) as well as its action attribute ([1])
//note: the captured value ends at the first quote or white-space character
Regex formExtractor = new Regex(@"<form\b[^>]*action=""?(.*?)[""\s].*?>.*?</form>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);

//a regular expression to extract all of the input tags (we want the name and the value of each)
//note: this expects the name attribute to come before the value attribute
Regex inputTagExtractor = new Regex(@"<input\b[^>]*name=""?(.*?)[""\s].*?value=""?(.*?)[""\s].*?/?>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);

List<HtmlForm> is a collection of a class that I created. It basically just stores:
1. The action page, which we would need if we want to use this data later to send a POST.
2. All of the fields that were in the form.
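The class itself isn’t shown in this post. As a rough idea of its shape, here is a JavaScript sketch (the original is C#). The names actionPage and addInputField mirror the ones used in the loop further down. The inputFields list and the toPostBody helper are assumptions added for illustration.

```javascript
// Hypothetical stand-in for the HtmlForm class described above.
class HtmlForm {
  constructor() {
    this.actionPage = "";   // where the form would be submitted
    this.inputFields = [];  // list of { name, value } pairs, in document order
  }

  // Record one input tag's name and value.
  addInputField(name, value) {
    this.inputFields.push({ name, value });
  }

  // Assumed helper: encode the fields as an
  // application/x-www-form-urlencoded body for a later POST.
  toPostBody() {
    const params = new URLSearchParams();
    for (const field of this.inputFields) {
      params.append(field.name, field.value);
    }
    return params.toString();
  }
}

const form = new HtmlForm();
form.actionPage = "login.aspx";
form.addInputField("user", "bob");
form.addInputField("note", "hello world");
console.log(form.toPostBody()); // user=bob&note=hello+world
```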

The two Regex instances do all of the work here. The first searches the text for form tags. It takes into account quite a few possibilities, such as extra spaces or other attributes in the starting tag. Also, notice the three regex options. Ignoring case is obvious. Compiled trades a slower start-up for faster matching, which pays off when the same expression is run many times. The Singleline option makes the “.” character match every character, including the line feed, so a single match can span multiple lines.
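The effect of the single-line option is easiest to see on a small input. The sketch below uses JavaScript, where the s flag plays the same role as RegexOptions.Singleline and the i flag corresponds to RegexOptions.IgnoreCase. The sample markup is made up.

```javascript
// Made-up sample: a form tag whose body spans several lines.
const html = '<FORM action="save.php">\n  <input name="q" value="1">\n</FORM>';

// Without the s flag, "." stops at line breaks, so the pattern cannot
// stretch from the opening tag to the closing tag.
const withoutSingleline = /<form\b.*?<\/form>/i;

// With the s flag, "." also matches "\n", so the whole form is matched.
const withSingleline = /<form\b.*?<\/form>/is;

console.log(withoutSingleline.test(html)); // false
console.log(withSingleline.test(html));    // true
```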

The second searches within the form tag (as shown later on) for all input fields. It uses parentheses to capture certain pieces of the data into groups, which end up in a GroupCollection that we will look at later. It also takes into account whether or not attributes have quotes around their values.
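To make the grouping concrete, here is a small sketch in JavaScript (not C#) using a simplified pattern of my own rather than the exact one above. Index 0 of the result is the whole match, and each pair of parentheses adds one more entry, which is the same layout a .NET GroupCollection uses.

```javascript
// Simplified, assumed pattern: capture the name and value attributes of one
// input tag. The "? makes the opening quote optional, and each value is read
// up to the next quote, white-space character, or ">".
const inputPattern = /<input\b[^>]*?\bname="?([^"\s>]*)[^>]*?\bvalue="?([^"\s>]*)/i;

const quoted = '<input type="text" name="user" value="bob" />';
const unquoted = "<input type=text name=user value=bob>";

for (const tag of [quoted, unquoted]) {
  const match = inputPattern.exec(tag);
  // match[0] is the full matched text, match[1] and match[2] are the groups
  console.log(match[1], match[2]); // user bob
}
```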

//attempt to extract out all forms in the passed string
MatchCollection formList = formExtractor.Matches(initialResponseText);

The line above runs the first regular expression against the response text. It returns every match in a MatchCollection object.

//for each form tag that is found, process it
foreach (Match formMatch in formList)
{
    //create a new element in the list data structure so we can fill it with form data
    returnHtmlFormData.Add(new HtmlForm());

    //remember the index of the list element we just added, so we can fill it with data
    int activeListElement = returnHtmlFormData.Count - 1;

    //extract the regex variables from the result, so we can continue processing
    //anywhere where you see () in the regex statement, will be a variable in here
    //the first element, [0], will be the whole result though
    GroupCollection formTagMatchValues = formMatch.Groups;

    //assign the action page value we extracted from the current form element to our data structure
    returnHtmlFormData[activeListElement].ActionPage = formTagMatchValues[1].ToString();

    //attempt to extract all of the names/values for the input tags
    MatchCollection inputTagMatches = inputTagExtractor.Matches(formTagMatchValues[0].ToString());

    //loop through the results (multiple input tags should be returned)
    foreach (Match inputMatch in inputTagMatches)
    {
        GroupCollection inputMatchValues = inputMatch.Groups;

        //save the input field data to our data structure
        returnHtmlFormData[activeListElement].addInputField(inputMatchValues[1].ToString(), inputMatchValues[2].ToString());
    }
}

As you can see above, I use my List collection returnHtmlFormData to hold instances of my special form storage class. I’ll let my code comments above speak for themselves, but basically you start from a MatchCollection and loop through it as single items of class Match; each Match can then be processed further by reading its GroupCollection for the actual data we wanted extracted. It’s quite an ingenious construct of classes, but it took me a while to figure out…
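The same MatchCollection, Match, GroupCollection walk has a close analogue in JavaScript, which may help if the C# class names are unfamiliar. This is a sketch, not a port of the code above: extractForms and the two patterns are simplified assumptions, and matchAll stands in for Regex.Matches.

```javascript
// Assumed, simplified patterns (matchAll requires the g flag).
const formPattern = /<form\b[^>]*?\baction="?([^"\s>]*)[^>]*>.*?<\/form>/gis;
const inputPattern = /<input\b[^>]*?\bname="?([^"\s>]*)[^>]*?\bvalue="?([^"\s>]*)[^>]*>/gi;

// Hypothetical helper: returns one { actionPage, fields } object per form.
function extractForms(html) {
  const forms = [];
  for (const formMatch of html.matchAll(formPattern)) {
    // formMatch[0] is the whole form, formMatch[1] is the action value
    const form = { actionPage: formMatch[1], fields: [] };
    for (const inputMatch of formMatch[0].matchAll(inputPattern)) {
      form.fields.push({ name: inputMatch[1], value: inputMatch[2] });
    }
    forms.push(form);
  }
  return forms;
}

const page =
  '<form action="a.php"><input name="x" value="1"><input name="y" value="2"></form>\n' +
  '<form action="b.php">\n<input name="z" value="3" />\n</form>';

console.log(JSON.stringify(extractForms(page), null, 1));
```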

Amazon.com Book Link:
Regular Expression Pocket Reference: Regular Expressions for Perl, Ruby, PHP, Python, C, Java and .NET

Flash video in a website

Friday, December 15th, 2006

While researching how to display video in web pages, I came upon a good [free] solution.

It uses:
Free and open source flash based video player called FlowPlayer.
Their website: http://flowplayer.sourceforge.net/

To encode the videos into the flv video format I used a free tool called Riva FLV Encoder.
Their website: http://rivavx.de/?encoder

Why a Flash player as opposed to QuickTime/Windows Media Player/RealPlayer? A lot more people have Adobe Flash installed, and it’s just a generally more foolproof method of serving videos on the web (e.g. the extremely popular youtube.com is Flash based).

Here is an example of the html required:

<object type="application/x-shockwave-flash" data="videos/FlowPlayer.swf"
  width="320" height="263" id="FlowPlayer">
  <param name="allowScriptAccess" value="sameDomain" />
  <param name="movie" value="videos/FlowPlayer.swf" />
  <param name="quality" value="high" />
  <param name="scale" value="noScale" />
  <param name="wmode" value="transparent" />
  <param name="flashvars" value="config={videoFile: 'glider.flv'}" />
</object>

Where glider.flv is the video file that was encoded with the Riva encoder. The example also stores all of the files in a subdirectory called “videos”.

No, but there is more…much more thanks to my new friend ajax.

Saturday, December 9th, 2006

As I mentioned in the previous post, I made a static website in html, css, javascript, and json. It works pretty well and that’s great. The problem is that my sister is the one who should be adding/editing content, because it is her site. So the question is, how can I make an easily updatable site that is on a server without any server-side coding functionality?

I had pretty much figured out how I wanted it to work even before I finished coding the first version of the site. Sure, I could just write a client application that spits out html/css code and uploads it to the server, but no, that isn’t cool enough.

I wanted to break up the site data and formatting. That way the client application would only need to create json files and upload them along with any new content.

That means the json data needs to be removed from the html files, and the html page now needs to fetch the data whenever the person using the site needs it. What does this mean? It means that I now have a site that is made up of: index.htm, a css file, and a javascript file. This is pretty much the whole website (not including the json files)! How can this be possible? Thank our new friend ajax (Asynchronous JavaScript and XML). It allows javascript code to fetch data from the server (in my case text files containing json data) when the user performs some type of action.

Let’s get on to a few examples, shall we!

The whole index page:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
 <link rel="stylesheet" type="text/css" href="sapphirearts.css" />
 <script src="sapphirearts.js" type="text/javascript"></script>
</head>

<body onload="getJsonFromServer('links.json', 'init');">
<div id="divcontainer">
 <div id="divbanner"></div>

 <div id="divlinks"></div>

 <div id="divbody">
   <div id="divbodycontent"></div>
 </div>

 <div id="divfooter"></div>
</div>
</body>
</html>

What happens when this page is loaded on a client computer? The javascript function getJsonFromServer() is called. This function downloads a block of json data that defines the main links for the site. Pretty cool, eh? When you click one of these main links, it calls getJsonFromServer() again, but with different parameters, and that ends up filling the content area of the page.
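The post doesn’t show what links.json contains or how the links get built, so the following is only a guess at the idea. The data shape (title and file properties), the 'content' state name, and the buildLinksHtml helper are all assumptions for illustration.

```javascript
// Assumed shape of the parsed links.json data: one entry per main link.
const linksData = [
  { title: "Home", file: "home.json" },
  { title: "Gallery", file: "gallery.json" }
];

// Hypothetical helper: turn the link data into markup for the divlinks element.
// Each link calls getJsonFromServer() again with its own file and a new state.
// Titles and file names are inserted as-is, so they are assumed to be trusted.
function buildLinksHtml(links) {
  return links
    .map(function (link) {
      return '<a href="#" onclick="getJsonFromServer(\'' + link.file +
        "', 'content'); return false;\">" + link.title + "</a>";
    })
    .join(" | ");
}

console.log(buildLinksHtml(linksData));
```

In the page itself, the returned string would be assigned to document.getElementById("divlinks").innerHTML once the json data has arrived.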


function getJsonFromServer(filename, newstate)
{
 xmlHttp = getXmlHttpObject();

 if (xmlHttp != null){
   currentState = newstate; // save the state so we can process it later
   xmlHttp.onreadystatechange = stateChanged;
   xmlHttp.open("GET", filename, true);
   xmlHttp.send(null);
 }
}

Which relies on the function:

function getXmlHttpObject(){
 var objXMLHttp = null;

 if(window.XMLHttpRequest){
   objXMLHttp = new XMLHttpRequest();
 } else if(window.ActiveXObject){
   objXMLHttp = new ActiveXObject("Microsoft.XMLHTTP");
 }

 return objXMLHttp;
}

The getXmlHttpObject() function has to create the request object differently depending on which browser the user is running. It’s just one of the multitude of problems in getting cross-browser compatibility.
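The stateChanged callback that getJsonFromServer() registers isn’t shown in this post. A handler along the following lines would fit the description. The handleResponse helper, the handler table, and the use of JSON.parse are my assumptions, not the site’s actual code.

```javascript
// Hypothetical core of a stateChanged handler, written as a plain function
// so it can be exercised without a browser.
function handleResponse(request, state, handlers) {
  if (request.readyState !== 4) return false;  // 4 means the request has finished
  if (request.status !== 200) return false;    // only process successful responses

  const data = JSON.parse(request.responseText); // the json file arrives as plain text
  const handler = handlers[state];
  if (typeof handler !== "function") return false; // unknown state, nothing to do
  handler(data);
  return true;
}

// Stand-in for a finished XMLHttpRequest, to show the flow.
const fakeRequest = { readyState: 4, status: 200, responseText: '[{"title":"Home"}]' };
const seen = [];
handleResponse(fakeRequest, "init", {
  init: function (links) { seen.push(links[0].title); }
});
console.log(seen); // [ 'Home' ]
```

In the page, stateChanged() would then be a one-liner that passes xmlHttp, currentState, and a table of per-state functions to this helper.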





 

 