PHP – Securing Your Web Application: Escape Output

Escaping is a technique that preserves data as it enters another context. PHP is frequently used as a bridge between disparate data sources, and when you send data to a remote source, it’s your responsibility to prepare it properly so that it’s not misinterpreted.

For example, O'Reilly is represented as O\'Reilly when used in an SQL query sent to a MySQL database. The backslash before the single quote preserves the single quote in the context of the SQL query. The single quote is part of the data, not part of the query, and the escaping guarantees this interpretation.

The two predominant remote sources to which PHP applications send data are HTTP clients (web browsers) that interpret HTML, JavaScript, and other client-side technologies, and databases that interpret SQL. For the former, PHP provides htmlentities():

$html = array();
$html['username'] = htmlentities($clean['username'], ENT_QUOTES, 'UTF-8');

echo "<p>Welcome back, {$html['username']}.</p>";

This example demonstrates the use of another naming convention. The $html array is similar to the $clean array, except that its purpose is to hold data that is safe to be used in the context of HTML.
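
To see what the escaping changes, here is a minimal sketch that passes a hostile value through htmlentities(). The contents of $clean are made up for illustration; in a real application the value would come from your own input filtering.

```php
<?php
// Hypothetical input; assume it has passed through your own filtering step.
$clean = array(
	'username' => '<script>alert("x")</script>',
);

$html = array();
$html['username'] = htmlentities($clean['username'], ENT_QUOTES, 'UTF-8');

// The markup characters are now entities, so the browser shows them as text
// instead of running the script.
echo "<p>Welcome back, {$html['username']}.</p>\n";
```

The angle brackets and double quotes reach the browser as &lt;, &gt;, and &quot;, so the value is displayed rather than interpreted as markup.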

URLs are sometimes embedded in HTML as links:

<a href="http://host/script.php?var={$value}">Click Here</a>

In this particular example, $value exists within nested contexts: it’s within the query string of a URL that is embedded in HTML as a link. If $value is known to be purely alphabetic, it’s safe to use in both contexts. However, when $value cannot be guaranteed to be safe in these contexts, it must be escaped twice, first for the URL and then for HTML:

$url = array(
	'value' => urlencode($value),
);

$link = "http://host/script.php?var={$url['value']}";
$html = array(
	'link' => htmlentities($link, ENT_QUOTES, 'UTF-8'),
);

echo "<a href=\"{$html['link']}\">Click Here</a>";

This ensures that the link is safe to be used in the context of HTML, and when it is used as a URL (such as when the user clicks the link), the URL encoding ensures that the value of the var parameter is preserved.
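
The two layers are easier to see with a value that is special in both contexts. The following is a sketch only: the sample value and the extra page parameter are assumptions added for illustration.

```php
<?php
// Hypothetical value containing characters that are special in both contexts.
$value = 'Tom & "Jerry"';

// First layer: make the value safe inside a query string.
$url = array(
	'value' => urlencode($value),
);

// The second parameter (page) is an assumption, added to show an ampersand
// that belongs to the URL itself rather than to the data.
$link = "http://host/script.php?var={$url['value']}&page=2";

// Second layer: make the whole URL safe inside an HTML attribute.
$html = array(
	'link' => htmlentities($link, ENT_QUOTES, 'UTF-8'),
);

echo "<a href=\"{$html['link']}\">Click Here</a>\n";
```

The ampersand inside the data becomes %26 in the first step, while the ampersand that separates the parameters becomes &amp; in the second, so neither layer can be confused with the other.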

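For most databases, there is a native escaping function specific to the database. For example, the MySQLi extension provides mysqli_real_escape_string(), which takes the open connection as its first argument so that it can take the connection’s character set into account:

// $db is assumed to be an open connection returned by mysqli_connect().
$mysql = array(
	'username' => mysqli_real_escape_string($db, $clean['username']),
);

$sql = "SELECT * FROM profile
	WHERE username = '{$mysql['username']}'";

$result = mysqli_query($db, $sql);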

An even safer alternative is to use a database abstraction library that handles the escaping for you. The following illustrates this concept with PEAR::DB:

$sql = "INSERT INTO users (last_name) VALUES (?)";

$db->query($sql, array($clean['last_name']));

Although this is not a complete example, it highlights the use of a placeholder (the question mark) in the SQL query. PEAR::DB properly quotes and escapes the data according to the requirements of your database.
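
The same placeholder idea can be sketched with PDO, which is a different library from PEAR::DB and is used here only because it ships with PHP. This example assumes the pdo_sqlite driver is installed; the table and the sample value are made up for illustration.

```php
<?php
// Assumes the pdo_sqlite driver is available; an in-memory database keeps
// the sketch self-contained.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE users (last_name TEXT)');

// Hypothetical filtered input containing a single quote.
$clean = array('last_name' => "O'Reilly");

// The placeholder keeps the data separate from the SQL text.
$stmt = $db->prepare('INSERT INTO users (last_name) VALUES (?)');
$stmt->execute(array($clean['last_name']));

// Read the value back to confirm that the quote survived intact.
$stored = $db->query('SELECT last_name FROM users')->fetchColumn();
echo $stored, "\n";
```

Because the value is never spliced into the query string, there is no quoting step for you to get wrong.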

A more complete output-escaping solution would include context-aware escaping for HTML elements, HTML attributes, JavaScript, CSS, and URL content, and would do so in a Unicode-safe manner. The following sample class escapes output in a variety of contexts, based on the content-escaping rules defined by the Open Web Application Security Project (OWASP).

class Encoder
{
	const ENCODE_STYLE_HTML = 0;
	const ENCODE_STYLE_JAVASCRIPT = 1;
	const ENCODE_STYLE_CSS = 2;
	const ENCODE_STYLE_URL = 3;
	const ENCODE_STYLE_URL_SPECIAL = 4;
	// Unreserved characters as defined by RFC 3986.
	private static $URL_UNRESERVED_CHARS =
	'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~';

	public function encodeForHTML($value)
	{
		$value = str_replace('&', '&amp;', $value);
		$value = str_replace('<', '&lt;', $value);
		$value = str_replace('>', '&gt;', $value);
		$value = str_replace('"', '&quot;', $value);
		$value = str_replace('\'', '&#x27;', $value); // &apos; is not recommended
		$value = str_replace('/', '&#x2F;', $value); // a forward slash can help end an HTML entity
		return $value;
	}

	public function encodeForHTMLAttribute($value)
	{
		return $this->_encodeString($value);
	}

	public function encodeForJavascript($value)
	{
		return $this->_encodeString($value, self::ENCODE_STYLE_JAVASCRIPT);
	}

	public function encodeForURL($value)
	{
		return $this->_encodeString($value, self::ENCODE_STYLE_URL_SPECIAL);
	}

	public function encodeForCSS($value)
	{
		return $this->_encodeString($value, self::ENCODE_STYLE_CSS);
	}

	/**
	* Encodes any special characters in the path portion of the URL. Does not
	* modify the forward slash used to denote directories. If your directory
	* names contain slashes (rare), use the plain urlencode on each directory
	* component and then join them together with a forward slash.
	*
	* Based on http://en.wikipedia.org/wiki/Percent-encoding and
	* http://tools.ietf.org/html/rfc3986
	*/

	public function encodeURLPath($value)
	{
		$length = mb_strlen($value);
		if ($length == 0) {
			return $value;
		}

		$output = '';
		for ($i = 0; $i < $length; $i++) {
			$char = mb_substr($value, $i, 1);
			if ($char == '/') {
				// Slashes are allowed in paths.
				$output .= $char;
			}
			else if (mb_strpos(self::$URL_UNRESERVED_CHARS, $char) === false) {
				// It's not in the unreserved list so it needs to be encoded.
				$output .= $this->_encodeCharacter($char, self::ENCODE_STYLE_URL);
			}
			else {
				// It's in the unreserved list so let it through.
				$output .= $char;
			}

		}
		return $output;
	}

	private function _encodeString($value, $style = self::ENCODE_STYLE_HTML)
	{
		if (mb_strlen($value) == 0) {
			return $value;
		}

		$characters = preg_split('/(?<!^)(?!$)/u', $value);
		$output = '';
		foreach ($characters as $c) {
			$output .= $this->_encodeCharacter($c, $style);
		}
		return $output;
	}

	private function _encodeCharacter($c, $style = self::ENCODE_STYLE_HTML)
	{
		if (ctype_alnum($c)) {
			return $c;
		}

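		if (($style === self::ENCODE_STYLE_URL_SPECIAL) && ($c == '/' || $c == ':')) {
			return $c;
		}

		if ($style === self::ENCODE_STYLE_URL || $style === self::ENCODE_STYLE_URL_SPECIAL) {
			// Percent-encode each UTF-8 byte; a %HHHH sequence is not valid in a URL.
			return rawurlencode($c);
		}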
		$charCode = $this->_unicodeOrdinal($c);
		$prefixes = array(
			self::ENCODE_STYLE_HTML => array('&#x', '&#x'),
			self::ENCODE_STYLE_JAVASCRIPT => array('\\x', '\\u'),
			self::ENCODE_STYLE_CSS => array('\\', '\\'),
			self::ENCODE_STYLE_URL => array('%', '%'),
			self::ENCODE_STYLE_URL_SPECIAL => array('%', '%'),
		);

		$suffixes = array(
			self::ENCODE_STYLE_HTML => ';',
			self::ENCODE_STYLE_JAVASCRIPT => '',
			self::ENCODE_STYLE_CSS => ' ', // a trailing space terminates a CSS hex escape
			self::ENCODE_STYLE_URL => '',
			self::ENCODE_STYLE_URL_SPECIAL => '',
		);

		// Code points below 256 use the short form (for example, \xHH in JavaScript).
		if ($charCode < 256) {
			$prefix = $prefixes[$style][0];
			$suffix = $suffixes[$style];
			return $prefix . str_pad(strtoupper(dechex($charCode)), 2, '0', STR_PAD_LEFT) . $suffix;
		}

		// Everything else uses the long form (for example, \uHHHH in JavaScript).
		$prefix = $prefixes[$style][1];
		$suffix = $suffixes[$style];
		return $prefix . str_pad(strtoupper(dechex($charCode)), 4, '0', STR_PAD_LEFT) . $suffix;
	}

	private function _unicodeOrdinal($u)
	{
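		// UCS-2 covers only the Basic Multilingual Plane, so characters beyond
		// U+FFFF cannot be represented here.
		$c = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');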
		$c1 = ord(substr($c, 0, 1));
		$c2 = ord(substr($c, 1, 1));
		return $c2 * 256 + $c1;
	}

}
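
The core idea of the class can be condensed into a short standalone sketch for the HTML attribute context. The function name encode_html_attribute() is hypothetical and not part of the class above; the sketch assumes valid UTF-8 input and the mbstring extension (mb_ord() requires PHP 7.2 or later).

```php
<?php
// Hypothetical standalone helper for illustration; it is not part of the
// Encoder class and assumes valid UTF-8 input.
function encode_html_attribute(string $value): string
{
	$output = '';
	// Split the UTF-8 string into individual characters.
	foreach (preg_split('//u', $value, -1, PREG_SPLIT_NO_EMPTY) as $char) {
		if (preg_match('/\A[A-Za-z0-9]\z/', $char)) {
			// ASCII letters and digits pass through unchanged.
			$output .= $char;
		} else {
			// Everything else becomes a hexadecimal character reference.
			$output .= sprintf('&#x%02X;', mb_ord($char, 'UTF-8'));
		}
	}
	return $output;
}

echo encode_html_attribute('a b"c'), "\n";
```

Encoding everything except letters and digits is deliberately aggressive: a space or a quote can end an attribute value, so neither is allowed through unescaped.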

Thanks to Kevin Tatroe, Peter MacIntyre, and Rasmus Lerdorf. Special thanks to O’Reilly.