Thursday, May 19, 2005

Good Use for Regex

Ever want to extract data from someone else's web form?
If you can view the HTML source of the page and find consistent code that delimits the form data, you can use .NET regular expressions to capture that data in groups. Then you can loop through the regular expression matches and extract the data from the groups.
Here is the code that separated the form titles from the form data in my case:

</TH><TD valign="top" class="formbody"><table border=0
cellpadding=0 cellspacing=0><tr><td class="formbody">

And here is the regular expression pattern I used (note that the concatenation " & Chr(34) & " is VB-specific and simply inserts a double quote into the string; the rest is .NET-universal). Note that I'm using a good deal of literal text. Undoubtedly this could be pared down, but I had some trouble figuring out how to do it, so for expediency this is it:

Dim stPat As String = "([a-zA-Z ]+?)(?::</TH><TD valign=" & Chr(34) & "top" & Chr(34) & " class=" & Chr(34) & "formbody" & Chr(34) & "><table border=0 cellpadding=0 cellspacing=0><tr><td class=" & Chr(34) & "formbody" & Chr(34) & ">)(.*?)</td"
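For what it's worth, VB also lets you write a double quote inside a string literal by doubling it (""), which avoids the Chr(34) concatenation. Below is a minimal sketch of the same pattern built that way. The module name, the delimiter variable, and the sample HTML are made up for illustration, and Regex.Escape is used here as one way to keep the literal delimiter from being read as regex syntax; it is not what the original pattern did.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PatternSketch
    Sub Main()
        ' The delimiter from the page source, with "" standing for a literal double quote.
        Dim delimiter As String = "</TH><TD valign=""top"" class=""formbody""><table border=0 cellpadding=0 cellspacing=0><tr><td class=""formbody"">"

        ' Hypothetical sample input: two title/data rows using that delimiter.
        Dim strSubj As String = _
            "<TH>Title:" & delimiter & "Regex Workshop</td></tr>" & _
            "<TH>Begin:" & delimiter & "5/19/2005</td></tr>"

        ' Group 1 = form title, group 2 = form data.
        ' Regex.Escape makes the delimiter match character for character.
        Dim stPat As String = "([a-zA-Z ]+?):" & Regex.Escape(delimiter) & "(.*?)</td"

        For Each midge As Match In Regex.Matches(strSubj, stPat)
            Console.WriteLine(midge.Groups(1).Value & " = " & midge.Groups(2).Value)
        Next
        ' Prints:
        ' Title = Regex Workshop
        ' Begin = 5/19/2005
    End Sub
End Module
```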
And finally the code to handle the results:

Imports System.Text.RegularExpressions

'pass the regex pattern to the constructor so reggie is an actual Regex instance
Dim reggie As New Regex(stPat)
Dim midge As Match
Dim sAry As String

'pass the html code, and this line executes the regex find
For Each midge In reggie.Matches(strSubj)
    'based on the value of group 1 (the form title)
    sAry = midge.Groups(1).Value
    Select Case sAry
        Case "Title"
            'you can assign the value of group two (the form data) to the appropriate place
            sTitle = midge.Groups(2).Value
        Case "Begin"
            dBegin = midge.Groups(2).Value
        Case "End"
            dEnd = midge.Groups(2).Value
        '.......
    End Select
Next
The handling of the strings could undoubtedly be refactored too. But hey, we all start somewhere.
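One possible refactoring, sketched here as an assumption rather than as what I actually used: collect every title/data pair into a Dictionary keyed by the title, so a new form field doesn't need a new Case branch. The module name, the ExtractFields function, and the simplified sample input and pattern in Main are all hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module FormFieldsSketch
    ' Returns every title/data pair found in strSubj, keyed by the form title.
    Function ExtractFields(ByVal strSubj As String, ByVal stPat As String) As Dictionary(Of String, String)
        Dim fields As New Dictionary(Of String, String)
        For Each midge As Match In Regex.Matches(strSubj, stPat)
            ' Group 1 is the title, group 2 is the data.
            ' Trim drops stray spaces the title group may have picked up.
            fields(midge.Groups(1).Value.Trim()) = midge.Groups(2).Value
        Next
        Return fields
    End Function

    Sub Main()
        ' Simplified, made-up input and pattern, just to exercise the function.
        Dim strSubj As String = "Title:<td>Regex Workshop</td>Begin:<td>5/19/2005</td>"
        Dim stPat As String = "([a-zA-Z ]+?):<td>(.*?)</td"

        Dim fields As Dictionary(Of String, String) = ExtractFields(strSubj, stPat)
        Console.WriteLine(fields("Title"))   ' Regex Workshop
        Console.WriteLine(fields("Begin"))   ' 5/19/2005
    End Sub
End Module
```

If a title shows up more than once on the page, the last value wins with this approach, which may or may not be what you want.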
