HTML parsing

Parsing HTML files into some kind of .NET object structure can be cumbersome. The trouble is that HTML files usually are not valid XHTML and therefore do not comply with the XML standard, so a strict XML parser will reject them. There is a great toolkit available that accomplishes this task. It is called HtmlAgilityPack.

It gives you an API very similar to XmlDocument’s API, but for HTML files. This makes it straightforward to search for specific tags, modify the document, and write the changed document to some output.

HtmlAgilityPack provides the following important objects:
– HtmlDocument
– HtmlNode

The easiest way to perform searches is to run XPath queries against the HtmlDocument object (through its DocumentNode property). For example, the XPath expression //div[@class='home'] retrieves all <div class="home"> elements from an HTML page.
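
A minimal sketch of such a query could look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the inline HTML string and the class name Program are made up for illustration. Note that @class='home' only matches elements whose class attribute is exactly "home".

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Illustrative input: not valid XHTML (the <br> is never closed).
        const string html =
            "<html><body>" +
            "<div class=\"home\">First<br></div>" +
            "<div class=\"other\">Skipped</div>" +
            "<div class=\"home\">Second</div>" +
            "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class='home']");
        if (nodes == null)
        {
            Console.WriteLine("No matching elements found.");
            return;
        }

        foreach (HtmlNode node in nodes)
        {
            // Each HtmlNode exposes its markup (OuterHtml) and its text content (InnerText).
            Console.WriteLine(node.InnerText.Trim());
        }
    }
}
```

To read from disk instead of a string, doc.Load(path) can replace the LoadHtml call.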

For modifying the HTML file you have two options:

  1. Change the HtmlDocument in your code and persist the modified object. This gets messy if you have plenty of changes and results in code that is hard to debug and maintain.
  2. Use an XSLT stylesheet to transform the HtmlDocument. This is the clean approach. It also has the advantage that you do not have to recompile your code when you change the mapping.
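
The first option can be sketched roughly as follows. This is only an illustration under assumptions: the HTML string, the new class value "start", and the replacement text are invented, and the result is written to a StringWriter rather than to a file. The XSLT route is not covered by this sketch.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Illustrative input document.
        doc.LoadHtml("<html><body><div class=\"home\">Old text</div></body></html>");

        // SelectSingleNode returns null when no element matches the query.
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@class='home']");
        if (div != null)
        {
            // Change an attribute and replace the element's content in place.
            div.SetAttributeValue("class", "start");
            div.InnerHtml = "New text";
        }

        // Persist the modified document; Save also accepts a file name or a Stream.
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            Console.WriteLine(writer.ToString());
        }
    }
}
```

Every additional change adds another query-and-mutate step like the one above, which is why this approach becomes hard to maintain as the number of changes grows.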
