The presence of HTML markup in the input text poses certain problems.
The HTML code may not comply with the intended HTML standard; e.g., input meant for a web-page using the XHTML 1.0 Strict DTD may incorrectly use the deprecated 'u' HTML element, which the Strict DTD does not permit, to show underlined text.
A submitter may inadvertently have mistyped HTML code, for instance by forgetting a closing tag or by nesting HTML elements improperly. This too can make web-pages non-compliant with the standard. Poor standards compliance can break the display of a web-page, or it can defeat the purpose for which a tag was used.
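As an illustration of the kind of mistakes described above, the following minimal Python sketch reports unclosed and improperly nested tags in an HTML fragment. It uses only the standard library's html.parser module; the names TagBalanceChecker and find_markup_problems, and the short list of void elements, are assumptions made for this example. It is stricter than the HTML standards in places (it treats every omitted closing tag as a problem) and is not a substitute for a real validator.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, for illustration).
VOID_ELEMENTS = {"br", "hr", "img", "input", "meta", "link"}


class TagBalanceChecker(HTMLParser):
    """Collects unclosed and mis-nested tags found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of elements that are currently open
        self.problems = []   # human-readable problem descriptions

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if tag not in self.open_tags:
            self.problems.append(f"closing tag </{tag}> has no opening tag")
            return
        # Anything opened after <tag> and still open is improperly nested.
        while self.open_tags[-1] != tag:
            inner = self.open_tags.pop()
            self.problems.append(f"<{inner}> not closed before </{tag}>")
        self.open_tags.pop()


def find_markup_problems(fragment):
    """Return a list of tag-balance problems; an empty list means none found."""
    checker = TagBalanceChecker()
    checker.feed(fragment)
    checker.close()
    problems = list(checker.problems)
    for tag in reversed(checker.open_tags):
        problems.append(f"<{tag}> is never closed")
    return problems
```

For example, find_markup_problems("<b><i>text</b></i>") reports both the unclosed 'i' element and the stray closing tag that follows it.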
A second issue with HTML markup is that of security. HTML code meant for cross-site scripting (XSS) attacks may have been put in by someone with malicious intent. Similarly, HTML code may be used to spam web-pages with links.
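The bluntest defense against such markup is to allow no HTML at all and to escape every special character, so that the browser displays submitted tags as text instead of interpreting them. The sketch below shows this with Python's standard html.escape function; the function name neutralize_markup is an assumption made for this example. This approach is only suitable when submitters are not meant to use any markup.

```python
from html import escape


def neutralize_markup(text):
    """Escape &, <, > and quotes so that no submitted markup is interpreted."""
    return escape(text, quote=True)


# The script element below is shown to the reader as text, not executed.
safe = neutralize_markup("<script>alert(1)</script>")
```

When some markup must be allowed, escaping everything is too coarse, and a filter that distinguishes permitted from disallowed elements is needed instead.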
HTML-invalid characters like the null character as well as invalid character entities in the input can crash browsers or prevent a web-page from being displayed.
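A minimal Python sketch of this kind of clean-up is shown below. It removes control characters (including the null character) other than tab, line feed and carriage return, and drops numeric character entities that refer to code points that are not valid in HTML text. The function names and the exact set of code points treated as invalid are assumptions made for this example; a production filter would follow the rules of the specific HTML or XML standard in use, and would also handle named entities.

```python
import re

# Control characters other than tab (0x09), line feed (0x0A) and
# carriage return (0x0D), plus DEL (0x7F).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

# Numeric character entities such as &#65; or &#x41;
_NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def _is_valid_code_point(cp):
    if cp in (0x09, 0x0A, 0x0D):
        return True
    if cp < 0x20 or 0x7F <= cp <= 0x9F:   # control characters
        return False
    if 0xD800 <= cp <= 0xDFFF:            # surrogates
        return False
    return cp <= 0x10FFFF                 # upper limit of Unicode


def _check_entity(match):
    hex_digits, dec_digits = match.groups()
    digits, base = (hex_digits, 16) if hex_digits else (dec_digits, 10)
    digits = digits.lstrip("0") or "0"
    if len(digits) > 7:
        # More digits than any Unicode code point needs in either base.
        return ""
    return match.group(0) if _is_valid_code_point(int(digits, base)) else ""


def strip_invalid_characters(text):
    """Remove invalid control characters and invalid numeric entities."""
    text = _CONTROL_CHARS.sub("", text)
    return _NUMERIC_ENTITY.sub(_check_entity, text)
```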
HTML code, even if valid, can still mess up the design and layout of web-pages that use the input text by, for example, presenting text in disruptive sizes or styles.
Content of a web-page can also be used outside the web-site and rendered by clients that are not browsers. Web-sites that aggregate newsfeed items from other web-sites, and newsfeed reader applications, are examples. In such scenarios, improper HTML markup includes not just illegal HTML markup but also disruptive markup and markup that does not comply with the standards for content-types like XML.
It is thus important to check user-submitted text for security, and for compliance with standards and with administrative policies. This is true in general for any case in which text from external sources is used (e.g., a newsfeed aggregator displaying newsfeed items collected from others), and it also applies when HTML markup is generated indirectly by BBCode parsers, WYSIWYG editors, etc.
Stand-alone applications like 'HTML Tidy' and script-based code are available for this purpose. Such utilities are effectively input text filters that process, purify and sanitize the text. They take care of illegal characters and character entities, and of illegal or disallowed HTML elements and attributes, by removing them or transforming them to plain text or allowed markup. They also balance the tags used to represent the elements, ensure that the elements are properly nested, etc. Some HTML filters are also able to check attribute values for correctness and can even modify them (e.g., to obfuscate email addresses as an anti-spam measure).
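To make the idea concrete, here is a small Python sketch of a whitelist-based filter built on the standard html.parser module. It is not how HTML Tidy or any of the filters named in this article work internally; the ALLOWED table, the class name WhitelistFilter and the function filter_html are assumptions made for this example. The sketch drops disallowed elements while keeping their text, removes 'script' and 'style' elements together with their content, discards attributes that are not whitelisted, and closes any tags left open. It does not check attribute values, so a real filter would additionally have to validate URLs in attributes such as 'href'.

```python
from html import escape
from html.parser import HTMLParser

# Allowed elements and, for each, the attributes that may be kept.
ALLOWED = {
    "a": {"href", "title"},
    "b": set(), "i": set(), "em": set(), "strong": set(),
    "p": set(), "br": set(),
}
VOID = {"br"}                          # allowed elements without closing tags
DROP_WITH_CONTENT = {"script", "style"}


class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # pieces of the filtered output
        self.stack = []       # allowed elements that are currently open
        self.skipping = None  # name of a dropped element being skipped

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skipping = tag
            return
        if tag not in ALLOWED:
            return  # drop the tag itself but keep the text inside it
        kept = "".join(
            f' {name}="{escape(value or "")}"'
            for name, value in attrs
            if name in ALLOWED[tag]
        )
        if tag in VOID:
            self.out.append(f"<{tag}{kept} />")
        else:
            self.out.append(f"<{tag}{kept}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag == self.skipping:
            self.skipping = None
            return
        if tag not in self.stack:
            return  # stray or disallowed closing tag
        # Close inner elements first so that the output is properly nested.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.skipping is None:
            self.out.append(escape(data, quote=False))


def filter_html(fragment):
    """Return the fragment with only whitelisted, balanced markup."""
    parser = WhitelistFilter()
    parser.feed(fragment)
    parser.close()
    while parser.stack:  # balance any tags that were never closed
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)
```

With this sketch, an input such as '<p onclick="x()">Hi <b>there<script>alert(1)</script></p>' loses its 'onclick' attribute and its 'script' element, and the unclosed 'b' element is closed before the paragraph ends.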
The various filtering scripts available today differ in their capabilities and in how customizable they are. It is also possible to use two different filters in tandem to get the desired filtering effect. In general, filters with more capabilities require more time and resources (CPU cycles and memory) for processing input text.
Some good HTML filters/purifiers available in various scripting languages are: for Perl, HTML::Scrubber; for PHP, htmLawed and HTMLPurifier; for Python, html5lib; and, for Ruby, html5lib.
The htmLawed PHP script, in a single, ~45 KB file, is fast, with low memory consumption, and offers a high degree of configurability. Besides covering all aspects of HTML markup as described in the current HTML/XHTML standards, it can also deal with common but non-standard elements and attributes like 'embed'. Its additional capabilities include URL protocol checks, anti-email-spam and anti-link-spam measures, relative/absolute URL conversions, and transformation of deprecated elements and attributes.
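The idea behind a URL protocol check can be sketched in a few lines of Python. This is only an illustration of the general technique, not htmLawed's implementation or its configuration syntax; the set ALLOWED_SCHEMES and the function is_allowed_url are assumptions made for this example. The check accepts relative URLs and URLs whose scheme is on a whitelist, and rejects everything else, including 'javascript:' URLs disguised with mixed case or embedded whitespace.

```python
from urllib.parse import urlsplit

# Schemes (protocols) permitted in attribute values such as href or src.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def is_allowed_url(url):
    """Return True for relative URLs and for URLs with a whitelisted scheme."""
    # Remove spaces and control characters before looking at the scheme,
    # since they can be used to disguise it. This is deliberately
    # conservative and may reject some odd but harmless URLs.
    cleaned = "".join(ch for ch in url if ch > " " and ch != "\x7f")
    scheme = urlsplit(cleaned).scheme.lower()
    return scheme == "" or scheme in ALLOWED_SCHEMES
```

A filter would typically call such a check on every URL-bearing attribute and remove or neutralize the attribute when the check fails.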
To learn more about using htmLawed in your applications, please visit the htmLawed web-site: http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed