This document describes how to safely display formatted output that comes
from user input. We will discuss the dangers of displaying unfiltered output
and then provide a safe means of displaying formatted output. Download the attachment and extract it into
your web documents directory.
If you simply took the user's input and displayed it as is, you could break
your webpage. For example, someone could maliciously embed JavaScript
in their comment like:
This is my comment.
<script language="javascript">alert('Do something bad here!')</script>
Even if the user had no bad intentions, they may accidentally enter some
HTML that breaks your site layout. For example, if you displayed the
user's input in a table and it included a stray </table> tag, your page
would appear broken.
The easiest solution would be to display only plain text in the comment.
Using the htmlspecialchars() function, you convert all the special
characters into HTML entities. For example, <b> would become &lt;b&gt;,
turning it into text instead of an HTML tag. This guarantees that
there is no HTML markup in the comment that could produce unwanted output.
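To make the escaping step concrete, here is a minimal sketch. The sample comment text and the variable names are made up for illustration; passing ENT_QUOTES and an explicit charset is an assumption on our part, not something the attachment requires.

```php
<?php
// Hypothetical comment text containing markup a visitor might submit.
$comment = "This is my comment. <script>alert('Do something bad here!')</script> <b>bold</b>";

// htmlspecialchars() turns <, >, & and quotes into HTML entities.
// ENT_QUOTES asks it to escape single quotes as well as double quotes.
$safe = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

// The browser now shows the tags as literal text instead of running them.
echo $safe, "\n";
```

Viewing the page source after this call should show &lt;script&gt; and &lt;b&gt; rather than live tags.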
This is an okay solution if your guests don't mind entering only
plain text, but it would be a lot better if you gave them some formatting
abilities.
You can provide your own special markup tags for the user to use.
For example, you can allow them to use
[b]...[/b] for bolding and
[i]...[/i]
for italics. Those would be simple string replace operations:
$output = str_replace("[b]", "<b>", $output);
$output = str_replace("[i]", "<i>", $output);
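The two lines above only handle the opening tags. As a rough sketch of the full round trip, the helper below (simple_markup() is a hypothetical name, not part of outputlib.php) escapes the text first and then swaps both opening and closing tags in one str_replace() call.

```php
<?php
// Hypothetical helper for illustration; not part of outputlib.php.
function simple_markup($text) {
    // Escape first so that raw HTML typed by the user is shown as text.
    $text = htmlspecialchars($text);

    // str_replace() accepts parallel arrays: each search string is
    // replaced by the replacement at the same position.
    $search  = array('[b]', '[/b]', '[i]', '[/i]');
    $replace = array('<b>', '</b>', '<i>', '</i>');

    return str_replace($search, $replace, $text);
}

echo simple_markup('[b]bold[/b] and [i]italic[/i], but <u>not this</u>'), "\n";
// prints: <b>bold</b> and <i>italic</i>, but &lt;u&gt;not this&lt;/u&gt;
```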
To get a little fancier, we will allow users to add links as well.
For example, the user will be allowed to enter
[link="url"]...[/link],
which we will turn into a proper
<a href="">...</a> tag.
We can't use a simple string replace here; instead we need to use regular
expressions:
$output = preg_replace('/\[link="([[:graph:]]+)"\]/', '<a href="$1">', $output);
Older code used ereg_replace() for this, but the ereg functions were
removed in PHP 7.0; preg_replace() accepts the same [[:graph:]] class.
If you are unfamiliar with the preg_replace() function, this statement
means:
- look for all occurrences of [link="..."] and replace each with <a href="...">
[[:graph:]] means any visible character, that is, anything other than
whitespace and control characters. See the manual
for more details about regular expressions.
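Here is a small sketch that runs the link pattern against a sample string. It uses preg_replace(), the PCRE-based function available in current PHP, with the same [[:graph:]] character class; the sample text is invented for the demonstration.

```php
<?php
$output = 'Visit [link="http://www.phpbuilder.com"]PHPBuilder[/link] today.';

// [[:graph:]]+ captures one or more visible characters as group 1;
// $1 in the replacement refers to that captured URL.
$output = preg_replace('/\[link="([[:graph:]]+)"\]/', '<a href="$1">', $output);

// The closing tag has no argument, so a plain string replace is enough.
$output = str_replace('[/link]', '</a>', $output);

echo $output, "\n";
// prints: Visit <a href="http://www.phpbuilder.com">PHPBuilder</a> today.
```

Note that this sketch matches a literal double quote because the text has not been passed through htmlspecialchars() yet.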
The format_output() function in outputlib.php provides
these markups and a few others as well. The general algorithm
is to:
- call htmlspecialchars() on the output text to convert all special
  characters to HTML entities
- systematically do string replacements or regexp replacements on our custom
  markup tags
<?php
/**
 * Takes a raw string ($output) and formats it for output using a special
 * stripped-down markup that is similar to HTML.
 */
function format_output($output) {
    /* stripslashes() undoes magic-quotes escaping; drop it if your input
       is not slash-escaped. Entities are converted before any replacement. */
    $output = htmlspecialchars(stripslashes($output));

    /* new paragraph */
    $output = str_replace('[p]', '<p>', $output);

    /* bold */
    $output = str_replace('[b]', '<b>', $output);
    $output = str_replace('[/b]', '</b>', $output);

    /* italics */
    $output = str_replace('[i]', '<i>', $output);
    $output = str_replace('[/i]', '</i>', $output);

    /* preformatted */
    $output = str_replace('[pre]', '<pre>', $output);
    $output = str_replace('[/pre]', '</pre>', $output);

    /* indented blocks (blockquote) */
    $output = str_replace('[indent]', '<blockquote>', $output);
    $output = str_replace('[/indent]', '</blockquote>', $output);

    /* anchors (preg_replace() stands in for ereg_replace(), removed in PHP 7.0) */
    $output = preg_replace('/\[anchor=&quot;([[:graph:]]+)&quot;\]/', '<a name="$1"></a>', $output);

    /* links, note we try to prevent javascript in links (case-insensitively) */
    $output = str_ireplace('[link=&quot;javascript', '[link=&quot; javascript', $output);
    $output = preg_replace('/\[link=&quot;([[:graph:]]+)&quot;\]/', '<a href="$1">', $output);
    $output = str_replace('[/link]', '</a>', $output);

    return nl2br($output);
}
?>
Some notes:
- Remember to do string replacements after you call htmlspecialchars()
  and not before; otherwise all your hard work in turning your custom markup
  into HTML markup will be lost when you call htmlspecialchars().
- Remember to search for the HTML entity in your replacements. For example,
  instead of looking for " (double quote) you would look for &quot;,
  since that is what it got translated to. See the manual
  for the other translations that occur.
- The nl2br() function inserts <br> tags at line breaks;
  again, make sure this is called after htmlspecialchars(), not before.
- When converting [link=""] into <a href="">, you must
  be sure to prevent people from inserting JavaScript. A simple way to
  do that is to change [link="javascript into [link=" javascript;
  that way it won't match the pattern for links and it will just be displayed
  as is.
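The string replacement described in the last note blocks one specific prefix. A stricter variation, which is not what outputlib.php does, is to convert a link only when its URL scheme is on an allow-list. The sketch below assumes a hypothetical helper convert_links() and an assumed list of acceptable schemes; it expects text that has already been through htmlspecialchars().

```php
<?php
// Hypothetical helper, not part of outputlib.php: converts [link="..."]
// only when the URL uses an allowed scheme or is a same-page #fragment.
function convert_links($escaped) {
    $allowed = array('http', 'https', 'mailto'); // assumed allow-list

    return preg_replace_callback(
        // +? is lazy, so the URL stops at the first closing &quot;]
        '/\[link=&quot;([[:graph:]]+?)&quot;\]/',
        function ($m) use ($allowed) {
            $url = $m[1];

            // Same-page anchors such as #test carry no scheme.
            if ($url[0] === '#') {
                return '<a href="' . $url . '">';
            }

            // parse_url() yields null or false when no scheme is found.
            $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
            if (in_array($scheme, $allowed, true)) {
                return '<a href="' . $url . '">';
            }

            // Anything else (javascript:, data:, relative paths) stays as text.
            return $m[0];
        },
        $escaped
    );
}

echo convert_links(htmlspecialchars('[link="JavaScript:alert(1)"]')), "\n";
// prints: [link=&quot;JavaScript:alert(1)&quot;]
```

Because the scheme is lowercased before the comparison, mixed-case spellings are rejected along with any scheme that is not listed. Relative URLs are left unconverted in this sketch; whether to allow them is a design choice for your site.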
Load up the test.php script to see the format_output()
function in action. Start by entering this in the textbox:
Regular HTML markup is not available, instead we will use special markup:
- this is [b]bold[/b]
- this is [i]italics[/i]
- this is [link="http://www.phpbuilder.com"]a link[/link]
- this is [anchor="test"]an anchor, and a [link="#test"]link[/link] to the anchor
[p]This is a paragraph break
[pre]This is preformatted text[/pre]
[indent]This is indented text[/indent]
This concludes our demonstration.
Currently only a small number of markups are available; you are free to add
more as you see fit.
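One way to make adding markups easier is to keep the simple tags in a lookup table, so a new tag is a one-line change. This is only a sketch: the [u] and [code] tags, the $simple_tags table, and apply_simple_tags() are illustrative names that do not exist in outputlib.php.

```php
<?php
// Hypothetical tag table; [u] and [code] are illustrative additions.
$simple_tags = array(
    '[b]'    => '<b>',    '[/b]'    => '</b>',
    '[i]'    => '<i>',    '[/i]'    => '</i>',
    '[u]'    => '<u>',    '[/u]'    => '</u>',
    '[code]' => '<code>', '[/code]' => '</code>',
);

// Hypothetical helper: escape first, then translate every tag in one pass.
function apply_simple_tags($text, array $tags) {
    // strtr() with an array replaces each key with its value and does not
    // rescan text it has already replaced.
    return strtr(htmlspecialchars($text), $tags);
}

echo apply_simple_tags('[u]underlined[/u] and [code]$x = 1;[/code]', $simple_tags), "\n";
// prints: <u>underlined</u> and <code>$x = 1;</code>
```

Tags that take an argument, such as links and anchors, still need a regular expression and do not fit this table.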
Conclusion
This article discussed the dangers of displaying unfiltered user input,
and provided a solution for displaying formatted user input with custom
markup tags. This can be applied anywhere you want to accept user input,
for example:
- guestbooks
- user comments
- system bulletins
- etc.
Enjoy!
--Ying