By Ying Zhang
This document describes how to safely display formatted output derived from user input. We will discuss the dangers of displaying unfiltered output and then provide a safe way to display formatted output. To follow along, download the attachment and extract it into your web documents directory.

Dangers of Unfiltered Output

If you simply take the user's input and display it as is, you may break your web page. For example, someone can maliciously embed JavaScript in their comment, like this:
This is my comment.
<script language="javascript">
alert('Do something bad here!');
</script>
Even if the user has no bad intentions, they may accidentally enter HTML that breaks your site layout. For example, if you display the user's input in a table and it includes a stray </table> tag, your page will appear broken.

Displaying Plain Text Only

The easiest solution is to display only plain text in the comment. The htmlspecialchars() function converts all the special characters into HTML entities. For example, <b> becomes &lt;b&gt;, which the browser shows as text instead of treating it as an HTML tag. This guarantees that no HTML markup in the comment can produce unwanted output.
This is an acceptable solution if your guests don't mind entering only plain text, but it would be much better to give them some formatting abilities.
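As a quick illustration of the conversion described above, here is a minimal sketch. The sample comment and the variable names are made up for this example, and passing ENT_QUOTES and 'UTF-8' explicitly is an assumption about how you want quotes and encoding handled.

```php
<?php
// Hypothetical comment as it might arrive from a form.
$comment = "This is <b>my</b> comment. <script>alert('bad')</script>";

// Convert the HTML special characters into entities so the browser
// shows the markup as text instead of interpreting it.
// ENT_QUOTES asks for both double and single quotes to be converted.
$safe = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
```

After the call, $safe no longer contains any literal < or > characters, so neither the <b> tag nor the <script> block can take effect when the comment is displayed.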

Formatting with Custom Markup Tags

You can provide your own special markup tags for the user to use. For example, you can allow them to use [b]...[/b] for bold and [i]...[/i] for italics. Those would be simple string replace operations:
$output = str_replace("[b]", "<b>", $output);
$output = str_replace("[i]", "<i>", $output);
To get a little fancier, we will allow users to add links as well. For example, the user will be allowed to enter [link="url"]...[/link], which we will turn into a proper <a href="">...</a> element. We can't use a simple string replace here; instead, we need to use regular expressions:
$output = ereg_replace('\[link="([[:graph:]]+)"\]', '<a href="\\1">', $output);
If you are unfamiliar with the ereg_replace() function, this statement means:
  • look for all occurrences of [link="..."] and replace it with <a href="...">
[[:graph:]] means any non-whitespace character. See the manual for more details about regular expressions.
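Note that ereg_replace() was deprecated in PHP 5.3 and removed in PHP 7.0. On current PHP versions the same replacement can be written with preg_replace(). The following is a rough equivalent, not the article's original code: it assumes \S (any non-whitespace character) is an acceptable stand-in for [[:graph:]], and it uses a lazy quantifier so that two links with no whitespace between them are matched separately. The sample input string is made up.

```php
<?php
// Sample input using the [link="url"] markup from the article.
$output = 'Visit [link="http://www.phpbuilder.com"]PHPBuilder[/link] today.';

// \S+? matches a run of non-whitespace characters, as few as possible,
// so that two links on one line are not swallowed by a single match.
$output = preg_replace('/\[link="(\S+?)"\]/', '<a href="$1">', $output);
$output = str_replace('[/link]', '</a>', $output);

echo $output, "\n";
// prints: Visit <a href="http://www.phpbuilder.com">PHPBuilder</a> today.
```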
The format_output() function in outputlib.php provides these markups and a few others as well. The general algorithm is to:
  1. call htmlspecialchars() on the output text to convert all special characters to HTML entities
  2. systematically do string replacements or regexp replacements on our custom markup tags
<?php

function format_output($output) {
/****************************************************************************
 * Takes a raw string ($output) and formats it for output using a special
 * stripped down markup that is similar to HTML
 ****************************************************************************/

    $output = htmlspecialchars(stripslashes($output));

    /* new paragraph */
    $output = str_replace('[p]', '<p>', $output);

    /* bold */
    $output = str_replace('[b]', '<b>', $output);
    $output = str_replace('[/b]', '</b>', $output);

    /* italics */
    $output = str_replace('[i]', '<i>', $output);
    $output = str_replace('[/i]', '</i>', $output);

    /* preformatted */
    $output = str_replace('[pre]', '<pre>', $output);
    $output = str_replace('[/pre]', '</pre>', $output);

    /* indented blocks (blockquote) */
    $output = str_replace('[indent]', '<blockquote>', $output);
    $output = str_replace('[/indent]', '</blockquote>', $output);

    /* anchors */
    $output = ereg_replace('\[anchor=&quot;([[:graph:]]+)&quot;\]', '<a name="\\1"></a>', $output);

    /* links, note we try to prevent javascript in links */
    $output = str_replace('[link=&quot;javascript', '[link=&quot; javascript', $output);
    $output = ereg_replace('\[link=&quot;([[:graph:]]+)&quot;\]', '<a href="\\1">', $output);
    $output = str_replace('[/link]', '</a>', $output);

    return nl2br($output);
}

?>

Some notes:

  • Remember to do string replacements after you call htmlspecialchars() and not before. Otherwise, all your hard work in turning your custom markups into HTML markups will be lost when you call htmlspecialchars().

  • Remember to search for the HTML entity in your replacements. For example, instead of looking for " (double quote) you would look for &quot;, since that is what it was translated to. See the manual for the other translations that occur.

  • The nl2br() function converts line breaks into <br> tags. Again, make sure this is called after htmlspecialchars(), not before.

  • When converting [link=""] into <a href="">, you must be sure to prevent people from inserting JavaScript. A simple way to do that is to change [link="javascript into [link=" javascript. That way it won't match the pattern for links and will just be displayed as is.
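The string replacement described in the last note only catches the exact lowercase spelling "javascript". One possible stricter approach, sketched below, is to accept a link only when its URL starts with a prefix you explicitly allow. This is an illustration and not part of outputlib.php: the function name convert_links(), the list of allowed prefixes (http://, https:// and #), and the use of preg_replace_callback() are all assumptions made for this sketch.

```php
<?php
// Hypothetical helper, not part of outputlib.php: turns [link="..."] markup
// into anchors only when the URL starts with an allowed prefix.
function convert_links($output) {
    // The text has already been through htmlspecialchars(), so the
    // double quotes in the markup appear as &quot;.
    $pattern = '/\[link=&quot;(\S+?)&quot;\](.*?)\[\/link\]/s';

    return preg_replace_callback($pattern, function ($m) {
        // Accept http(s) URLs and same-page anchors; reject everything else,
        // including javascript: in any mix of upper and lower case.
        if (preg_match('/^(https?:\/\/|#)/i', $m[1])) {
            return '<a href="' . $m[1] . '">' . $m[2] . '</a>';
        }
        return $m[0]; // leave the markup untouched so it is displayed as text
    }, $output);
}

$raw = '[link="http://www.phpbuilder.com"]ok[/link] [link="JavaScript:alert(1)"]bad[/link]';
$escaped = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
$result = convert_links($escaped);

echo $result, "\n";
// prints: <a href="http://www.phpbuilder.com">ok</a> [link=&quot;JavaScript:alert(1)&quot;]bad[/link]
```

The rejected link is left exactly as the user typed it (in escaped form), which matches the behaviour described in the note: markup that does not qualify as a link is simply displayed as text.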

outputlib.php

Load up the test.php script to see the format_output() function in action. Start by entering this in the textbox:
Regular HTML markup is not available, instead we will use special markup:

- this is [b]bold[/b]
- this is [i]italics[/i]
- this is [link="http://www.phpbuilder.com"]a link[/link]
- this is [anchor="test"]an anchor, and a [link="#test"]link[/link] to the anchor

[p]This is a paragraph break
[pre]This is preformatted text[/pre]
[indent]This is indented text[/indent]
This concludes our demonstration.
Currently only a small number of markups are available; you are free to add more as you see fit.
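If you do add more markups, one way to keep them manageable is to hold the simple tag pairs in a lookup table and apply them in a single call. The sketch below is only a suggestion: the [u] and [code] tags, the $markups table and the apply_simple_markups() function are hypothetical names that do not appear in outputlib.php.

```php
<?php
// Hypothetical table of simple markups; [u] and [code] are examples of
// additions and are not part of the original outputlib.php.
$markups = array(
    '[b]'     => '<b>',
    '[/b]'    => '</b>',
    '[u]'     => '<u>',
    '[/u]'    => '</u>',
    '[code]'  => '<code>',
    '[/code]' => '</code>',
);

function apply_simple_markups($text, $markups) {
    // Escape first, then swap the custom tags for real HTML.
    $text = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    return str_replace(array_keys($markups), array_values($markups), $text);
}

$formatted = apply_simple_markups('[u]underlined[/u] and <i>not italic</i>', $markups);

echo $formatted, "\n";
// prints: <u>underlined</u> and &lt;i&gt;not italic&lt;/i&gt;
```

Adding a new simple tag then means adding two entries to the table, while tags that take a parameter (such as links and anchors) still need their own regular expression.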

Conclusion

This article discussed the dangers of displaying unfiltered user input, and provided a solution for displaying formatted user input with custom markup tags. This can be applied anywhere you want to accept user input, for example:
  • guestbooks
  • user comments
  • system bulletins
  • etc.
Enjoy!
--Ying