Date: 05/13/03
- Next message: Cestmir Hybl: "Re: [PHP-I18N] UTF-8 string validity detection"
- Previous message: Moriyoshi Koizumi: "Re: [PHP-I18N] Problem with specific kanji"
- In reply to: Cestmir Hybl: "[PHP-I18N] UTF-8 string validity detection"
- Next in thread: Cestmir Hybl: "Re: [PHP-I18N] UTF-8 string validity detection"
- Reply: Cestmir Hybl: "Re: [PHP-I18N] UTF-8 string validity detection"
Hi,
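As of the current mbstring implementation, there is no dedicated function
for verifying that a given string is valid UTF-8. As a workaround, you could
convert the string to UTF-32 and back and compare the result with the
original; bytes that are not valid UTF-8 should not survive the round trip
unchanged: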
<?php
// Returns true if $str survives a UTF-8 -> UTF-32 -> UTF-8 round trip unchanged.
function verify_utf8($str) {
    $round_trip = mb_convert_encoding(
        mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32");
    return $str === $round_trip;
}

$str = "some UTF-8 encoded string";
var_dump(verify_utf8($str));
?>
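
To illustrate the case from the question below, where a valid string is followed by an invalid sequence, here is a minimal sketch of the same round-trip idea with concrete byte strings. The function name round_trip_is_utf8() and the sample bytes are assumptions made up for this example, and it requires the mbstring extension to be loaded.

```php
<?php
// Hypothetical helper for illustration: the same round-trip comparison
// as the workaround above, under a different name.
function round_trip_is_utf8($str) {
    $utf32 = mb_convert_encoding($str, "UTF-32", "UTF-8");
    return $str === mb_convert_encoding($utf32, "UTF-8", "UTF-32");
}

// Sample data (assumed for this sketch).
$valid   = "caf\xC3\xA9";   // "cafe" with e-acute, a valid two-byte sequence
$invalid = "\xC3\x28";      // lead byte followed by a non-continuation byte
$mixed   = $valid . " " . $invalid;

var_dump(round_trip_is_utf8($valid));   // expected: bool(true)
var_dump(round_trip_is_utf8($invalid)); // expected: bool(false)
var_dump(round_trip_is_utf8($mixed));   // expected: bool(false)
```

Because the whole string goes through the conversion, an invalid sequence at the end should be caught as well, which is the case where mb_detect_encoding() reported 'UTF-8'.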
Moriyoshi
"Cestmir Hybl" <cestmir <email protected>> wrote:
> Hi.
>
> Is there a way to check whether a string is valid UTF-8?
>
> We have a problem processing external data, where almost all records
> contain valid UTF-8 strings but roughly 1 in 10000 does not, and those
> hang the process.
> Records are transformed and then stored in an RDBMS (PostgreSQL), which
> returns an error ("invalid UTF sequence"). This RDBMS error cannot be used
> as the source of information about UTF-8 validity, because validity has to
> be known before the DB record is inserted.
>
> I've tried this:
>
> $str = 'some INVALID UTF-8 seq. here';
> var_dump(mb_detect_encoding($str, 'UTF-8')); // dumps false
>
> $str = 'some valid UTF-8 seq. here';
> var_dump(mb_detect_encoding($str, 'UTF-8')); // dumps 'UTF-8'
>
> $str = 'some valid UTF-8 seq. here' . 'some INVALID UTF-8 seq. here';
> var_dump(mb_detect_encoding($str, 'UTF-8')); // dumps 'UTF-8'
>
> It seems that mb_detect_encoding() doesn't scan the whole string to detect
> the encoding (which is quite reasonable when the purpose is just to guess
> the encoding from a given set).
> But I can't find any other method, or even a workaround, to test the
> validity of a UTF-8 sequence with mbstring support in PHP.
>
> (Well, I could use a cheap DB query like 'select
> upper(some-utf-sequence)', which will raise an error if called with an
> invalid sequence, but a client-side solution would be much better.)
>
> Cestmir Hybl
>
>
>
> --
> PHP Internationalization Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
>

