HTML Filters/Purifiers, The Need, and Introducing htmLawed

(by Santosh K. Patnaik) Web-based applications like blogs, content management systems (CMSs), forums, newsfeed aggregators, and wikis that utilize user-submitted text are widely deployed today. Often the applications permit HTML code in the text; after all, the input is used for display in web-pages.

HTML specifications are not just about the HTML elements and attributes. For instance, there are rules regarding characters and character entities (like '&amp;' to represent the ampersand '&'). Plain text in the input that has no obvious HTML markup is thus still technically HTML.

Users either type the HTML code directly in the text, or the code is put in indirectly through BBCode (in which surrogates like '[url=...]' are used to represent HTML code like '<a href=...>'), WYSIWYG (What You See Is What You Get) editors like TinyMCE, etc.

While both BBCode and the browser-based WYSIWYG systems are capable of generating largely correct HTML markup free of typographical or syntactical errors, they may have only a limited ability to restrict the markup (e.g., to disallow certain attributes for the HTML elements), and they generally do not encompass all of HTML (e.g., to deal with the 'form' HTML element).
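
To make the idea of BBCode surrogates concrete, here is a minimal Python sketch of a converter that handles just two tags. It is an illustration only, not how any particular BBCode parser works: the function name bbcode_to_html, the two regular expressions, and the decision to accept only http and https links are all assumptions made for this example.

```python
import html
import re

# Hypothetical, tiny subset of BBCode; real parsers support many more tags.
BOLD_TAG = re.compile(r"\[b\](.*?)\[/b\]", re.IGNORECASE | re.DOTALL)
# Assumption for this sketch: only http and https links are converted.
URL_TAG = re.compile(
    r"\[url=(https?://[^\]\s]+)\](.*?)\[/url\]", re.IGNORECASE | re.DOTALL
)


def bbcode_to_html(text: str) -> str:
    """Convert the supported BBCode surrogates to HTML."""
    # Escape first, so that any raw HTML typed by the user is shown as text
    # and cannot break out of the generated attributes.
    result = html.escape(text, quote=True)
    result = BOLD_TAG.sub(r"<b>\1</b>", result)
    result = URL_TAG.sub(r'<a href="\1">\2</a>', result)
    return result


print(bbcode_to_html("[url=http://example.com/]site[/url] <script>"))
```

Because the input is escaped before any surrogate is expanded, the only markup in the output is what the converter itself generates. That also shows the limitation noted above: the converter can produce nothing beyond the few elements it knows about.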

The presence of HTML markup in the input text poses certain problems.

The HTML code may not be in compliance with the right HTML standard; e.g., input meant for a web-page using the XHTML 1.0 Strict DTD may incorrectly be using the deprecated 'u' HTML element to show underlined text.

A submitter may inadvertently have mistyped HTML code. For instance, a closing tag may have been forgotten, or HTML elements may not be properly nested. This too can make web-pages non-compliant with standards. Poor standards compliance can break the display of a web-page or render the intended use of a tag useless.
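
As an illustration of how such mistakes can be repaired, the following Python sketch closes forgotten tags and drops stray closing tags using the standard library's html.parser module. It is a minimal sketch under stated assumptions, not the method of any particular filter: the names TagBalancer and balance are hypothetical, attributes are discarded for brevity, and the list of void elements is abbreviated.

```python
from html import escape
from html.parser import HTMLParser

# Abbreviated list of elements that never take a closing tag.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class TagBalancer(HTMLParser):
    """Hypothetical helper that re-emits markup with balanced tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the repaired markup
        self.stack = []  # names of the elements currently open

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}>")  # attributes are dropped in this sketch
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close every element opened inside this one, then the element itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        # The parser decodes character references, so re-escape the text.
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()  # flush any text the parser is still buffering
        while self.stack:  # close whatever the submitter left open
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)


def balance(fragment: str) -> str:
    parser = TagBalancer()
    parser.feed(fragment)
    return parser.result()


print(balance("<b>bold <i>both</b> text"))
```

Here the improperly nested 'i' element is closed before its parent 'b' element, so the output is properly nested even though the input was not.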

A second issue with HTML markup is that of security. HTML code meant for cross-site scripting (XSS) attacks may have been put in by someone with malicious intent. Similarly, HTML code may be used to spam web-pages with links.

HTML-invalid characters like the null character as well as invalid character entities in the input can crash browsers or prevent a web-page from being displayed.
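
A filter can deal with such characters before it looks at any markup. The Python sketch below removes the control characters that XML 1.0 forbids and drops numeric character references that point to code points outside the XML 1.0 character range. The function names are hypothetical, and the choice to delete the offending items (rather than, say, replace them) is an assumption made for this example.

```python
import re

# C0 control characters other than tab, line feed and carriage return.
ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# Numeric character references, hexadecimal (&#x41;) or decimal (&#65;).
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def is_legal_code_point(cp: int) -> bool:
    """True if cp falls within the character range allowed by XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def clean_characters(text: str) -> str:
    """Remove illegal control characters and illegal numeric references."""
    text = ILLEGAL_CHARS.sub("", text)

    def check_ref(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep the reference untouched when it is legal, else delete it.
        return match.group(0) if is_legal_code_point(cp) else ""

    return NUMERIC_REF.sub(check_ref, text)


print(clean_characters("a\x00b&#0;c&#65;d"))
```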

HTML code, even if valid, can still mess up the design and layout of web-pages that use the input text by, for example, presenting text in disruptive sizes or styles.

Content of a web-page can be in use outside the web-site and rendered using clients that are not browsers. Web-sites that aggregate newsfeed items from other web-sites, and newsfeed reader applications exemplify such instances. Improper HTML markup in such scenarios includes not just illegal HTML markup, but also disruptive markup and markup that is not compliant with standards for content-types like XML.

It is thus important to check user-submitted text for security, and for compliance with standards and administrative policy. This is true in general for any case in which text from external sources is being used (e.g., a newsfeed aggregator displaying newsfeed items collected from others), and also applies to instances in which HTML markup is generated indirectly by BBCode parsers, WYSIWYG editors, etc.

Stand-alone applications like 'HTML Tidy' and script-based code are available for this purpose. Such utilities are effectively input text filters that process, purify and sanitize the text. They take care of illegal characters and character entities, and of illegal or disallowed HTML elements and attributes by removing them or transforming them to plain text or allowed markup. They also balance the tags used to represent the elements, ensure that the elements are properly nested, etc. Some HTML filters are also able to check attribute values for correctness and can even modify them (e.g., to obfuscate email addresses as an anti-spam measure).
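
The handling of disallowed elements and attributes can be sketched in a few lines of Python with the standard library's html.parser module. This is an illustration under stated assumptions, not the implementation of any of the tools named in this article: the ALLOWED policy, the class AllowlistFilter and the function filter_html are hypothetical. Disallowed tags are removed while their text content is kept as plain text. The sketch does not validate attribute values such as 'href', nor does it balance tags.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy: element name -> attributes permitted on that element.
ALLOWED = {"a": {"href", "title"}, "b": set(), "i": set(), "p": set()}


class AllowlistFilter(HTMLParser):
    """Re-emit only the elements and attributes named in ALLOWED."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # disallowed element: drop the tag, keep its text
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in ALLOWED[tag]
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is re-escaped so that it can never be read as markup.
        self.out.append(escape(data, quote=False))


def filter_html(fragment: str) -> str:
    parser = AllowlistFilter()
    parser.feed(fragment)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.out)


print(filter_html('<p onclick="x()">Hi <a href="/x" style="y">link</a></p>'))
```

An allow-list is used here instead of a list of forbidden items, so anything the policy does not mention is rejected by default.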

The various filtering scripts available today have different capabilities and customizabilities. It is also possible to use two different filters in tandem to have the desired filtering effect. In general, filters with more capabilities require more time and resources (CPU cycles and memory) for processing input text.

Some good HTML filters/purifiers available in various scripting languages are: for Perl, HTML Scrubber; for PHP, htmLawed and HTMLPurifier; for Python, HTML5lib; and, for Ruby, HTML5lib.

The htmLawed PHP script, in a single, ~45-kb file, is fast, with low memory consumption, and offers a high degree of configurability. Besides covering all aspects of HTML markup as described in the current HTML/XHTML standards, it can also deal with common but non-standard elements and attributes like 'embed'. Its additional capabilities include URL protocol checks, anti-email and anti-link spam measures, relative/absolute URL conversions, and transformation of deprecated elements and attributes.
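
To give a rough idea of what a URL protocol check and an anti-email-spam measure can look like, here is a small Python sketch. It does not reproduce htmLawed's code, API or configuration options. The function names, the set of permitted schemes, and the choice to obfuscate addresses with numeric character references are all assumptions made for this example. The URL is assumed to have already had its character references decoded.

```python
import re

# Hypothetical policy for this sketch.
ALLOWED_SCHEMES = {"http", "https", "mailto"}
# A scheme is a letter followed by letters, digits, '+', '.' or '-'.
SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")


def is_allowed_url(url: str) -> bool:
    """Accept relative URLs and URLs whose scheme is in ALLOWED_SCHEMES."""
    # Remove spaces and control characters, which can be used to disguise
    # a scheme (for example a tab inside 'javascript').
    cleaned = "".join(ch for ch in url if ch > " ")
    match = SCHEME.match(cleaned)
    if match is None:
        return True  # no scheme: treated as a relative URL
    return match.group(1).lower() in ALLOWED_SCHEMES


def obfuscate_email(address: str) -> str:
    """Encode every character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in address)


print(is_allowed_url("JaVa\tScRiPt:alert(1)"))
print(obfuscate_email("a@b"))
```

A browser displays the obfuscated address normally, while a harvester that searches the page source for a literal '@' does not find one.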

To learn more about using htmLawed in your applications, please visit the htmLawed web-site:

Tags: HTML, htmLawed. Category: PHP Classes. Posted: August 29th 2008.

