Re: [PHP-I18N] UTF-8 string validity detection
From: Cestmir Hybl (cestmir <email protected>)
Date: 05/13/03

Thanks a lot!

Meanwhile, I was playing with a PCRE solution like this:

function utf8IsValidString($AStr)
{
  $ptrASCII = '[\x00-\x7F]';
  $ptr2Octet = '[\xC2-\xDF][\x80-\xBF]';
  $ptr3Octet = '[\xE0-\xEF][\x80-\xBF]{2}';
  $ptr4Octet = '[\xF0-\xF4][\x80-\xBF]{3}';
  $ptr5Octet = '[\xF8-\xFB][\x80-\xBF]{4}';
  $ptr6Octet = '[\xFC-\xFD][\x80-\xBF]{5}';

  return preg_match("/^($ptrASCII|$ptr2Octet|$ptr3Octet|$ptr4Octet|$ptr5Octet|$ptr6Octet)*$/s", $AStr);
}

but it tends to segfault on longer input (~10kB of text).
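
The crash is likely tied to how PCRE handles a repeated group such as (...)* on long subjects, where stack use can grow with the input length. One way to avoid the regex engine altogether is a plain byte loop. The following is only a minimal sketch: the function name utf8IsValidStringLoop is hypothetical, and it assumes the stricter RFC 3629 rules (at most 4 octets, no overlong forms, no surrogates) rather than the 5- and 6-octet forms allowed by the patterns above.

```php
<?php
// Hypothetical helper: validates UTF-8 with a byte loop instead of PCRE.
// Assumption: RFC 3629 rules (max 4 octets, no overlongs, no surrogates).
function utf8IsValidStringLoop($str)
{
    $len = strlen($str);
    $i = 0;
    while ($i < $len) {
        $b = ord($str[$i]);
        if ($b <= 0x7F) {                       // single-octet ASCII
            $i++;
            continue;
        }
        if ($b >= 0xC2 && $b <= 0xDF) {         // 2-octet lead byte
            $need = 1;
            $min = 0x80;
            $max = 0xBF;
        } elseif ($b >= 0xE0 && $b <= 0xEF) {   // 3-octet lead byte
            $need = 2;
            $min = ($b === 0xE0) ? 0xA0 : 0x80; // E0 80..9F would be overlong
            $max = ($b === 0xED) ? 0x9F : 0xBF; // ED A0..BF would be a surrogate
        } elseif ($b >= 0xF0 && $b <= 0xF4) {   // 4-octet lead byte
            $need = 3;
            $min = ($b === 0xF0) ? 0x90 : 0x80; // F0 80..8F would be overlong
            $max = ($b === 0xF4) ? 0x8F : 0xBF; // F4 90.. exceeds U+10FFFF
        } else {
            return false;                       // stray continuation or invalid lead
        }
        if ($i + $need >= $len) {
            return false;                       // sequence truncated at end of input
        }
        // The first continuation byte has a lead-dependent range.
        $c = ord($str[$i + 1]);
        if ($c < $min || $c > $max) {
            return false;
        }
        // Remaining continuation bytes must be 80..BF.
        for ($k = 2; $k <= $need; $k++) {
            $c = ord($str[$i + $k]);
            if ($c < 0x80 || $c > 0xBF) {
                return false;
            }
        }
        $i += $need + 1;
    }
    return true;
}
```

Since the loop keeps no per-character state beyond a few integers, memory use does not depend on the length of the input.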

I've run a couple of tests and your solution seems to work fine, though
there's no specification of how exactly mb_convert_encoding() behaves on
invalid input or how that behaviour may change in the future. Stability of
the UTF-8 <-> UCS-4 round trip seems to be guaranteed by RFC 2279.
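
Because the round trip depends on unspecified behaviour, it may be worth wrapping it so that a dedicated check is used where one exists. Later mbstring releases provide mb_check_encoding(); the sketch below probes for it and otherwise falls back to the round-trip comparison. The wrapper name verify_utf8_portable is hypothetical, and the fallback assumes that invalid input does not survive the conversion unchanged, which is exactly the unspecified part.

```php
<?php
// Hypothetical wrapper (the name is illustrative, not from this thread).
function verify_utf8_portable($str)
{
    // mb_check_encoding() is not present in every mbstring build,
    // so probe for it before calling it.
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($str, 'UTF-8');
    }
    // Fallback: the UTF-8 -> UTF-32 -> UTF-8 round trip.
    // Assumption: invalid input is altered somewhere along the way.
    $roundTrip = mb_convert_encoding(
        mb_convert_encoding($str, 'UTF-32', 'UTF-8'),
        'UTF-8',
        'UTF-32'
    );
    return $str === $roundTrip;
}
```

Testing such a wrapper against a few known-bad byte sequences after each PHP upgrade would show whether the behaviour has shifted.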

CH

> As of the current mbstring implementation, there's no particular function
> to verify if a given string is encoded in valid utf-8. Instead, it'd be
> worth trying the following workaround:
>
> <?php
> function verify_utf8($str) {
>     if ($str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32")) {
>         return true;
>     }
>     return false;
> }

-- 
PHP Internationalization Mailing List (http://www.php.net/)