[PHP-I18N] UTF-8 string validity detection
From: Cestmir Hybl (cestmir <email protected>)
Date: 05/12/03

Hi.

Is there a way to check whether a string is valid UTF-8?

We have a problem processing external data: almost all of the records
contain valid UTF-8 strings, but roughly 1 in 10000 does not, and that one
brings the process to a halt. Records are transformed and then stored in an
RDBMS (PostgreSQL), which rejects the bad ones with an error ("invalid UTF
sequence"). We cannot rely on this RDBMS error to detect invalid UTF-8,
because validity has to be known before the DB record is inserted.

I've tried this:

$str = 'some INVALID UTF-8 seq. here';
var_dump(mb_detect_encoding($str, 'UTF-8')); // dumps false

$str = 'some valid UTF-8 seq. here';
var_dump(mb_detect_encoding($str, 'UTF-8')); // dumps 'UTF-8'

$str = 'some valid UTF-8 seq. here' . 'some INVALID UTF-8 seq. here';
var_dump(mb_detect_encoding($str, 'UTF-8')); // dumps 'UTF-8'

It seems that mb_detect_encoding() doesn't scan the whole string to detect
the encoding (which is quite reasonable if its purpose is just to guess the
encoding from a given set of candidates).
But I can't find any other method, not even a workaround, to test the validity
of a UTF-8 sequence with MBSTRING support in PHP.
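
For illustration only, here is a minimal sketch of a whole-string check. It
assumes a later PHP release in which mb_check_encoding() is available, and
otherwise falls back to PCRE's /u modifier, which refuses subjects that are
not valid UTF-8. The function name is_valid_utf8() is hypothetical, and what
this check accepts may differ in detail from what PostgreSQL accepts.

```php
<?php
// Hypothetical helper, not part of PHP: returns true only if the WHOLE
// string is valid UTF-8, unlike the mb_detect_encoding() calls above.
function is_valid_utf8($str)
{
    if (function_exists('mb_check_encoding')) {
        // Assumption: mbstring is loaded and provides mb_check_encoding().
        return mb_check_encoding($str, 'UTF-8');
    }
    // Fallback: with the /u modifier PCRE validates the subject first,
    // so an invalid string never yields a match count of 1.
    return preg_match('//u', $str) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9"));            // valid two-byte sequence
var_dump(is_valid_utf8("caf\xC3\xA9" . "\xC3("));  // valid prefix, broken tail
```

The second call is the case that mb_detect_encoding() missed: a valid prefix
followed by an invalid sequence.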

(Well, I could use a cheap DB query like 'select
upper(some-utf-sequence)', which will raise an error if called with an
invalid sequence, but a client-side solution would be much better.)
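
A client-side check that needs neither the database nor MBSTRING could look
roughly like the sketch below. It walks the bytes and follows the
well-formed ranges of RFC 3629 (no overlong forms, no surrogates, nothing
above U+10FFFF). The function name utf8_first_invalid_offset() is
hypothetical, and the rules PostgreSQL applies may differ in detail.

```php
<?php
// Hypothetical helper: returns the byte offset of the first invalid UTF-8
// sequence in $s, or -1 if the whole string is well formed.
function utf8_first_invalid_offset($s)
{
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        if ($b < 0x80) {          // plain ASCII byte
            $i++;
            continue;
        }
        // $n = number of continuation bytes; $lo..$hi = range allowed for
        // the first continuation byte (narrower for some lead bytes).
        if ($b >= 0xC2 && $b <= 0xDF) {
            $n = 1; $lo = 0x80; $hi = 0xBF;
        } elseif ($b == 0xE0) {
            $n = 2; $lo = 0xA0; $hi = 0xBF;   // excludes overlong forms
        } elseif (($b >= 0xE1 && $b <= 0xEC) || $b == 0xEE || $b == 0xEF) {
            $n = 2; $lo = 0x80; $hi = 0xBF;
        } elseif ($b == 0xED) {
            $n = 2; $lo = 0x80; $hi = 0x9F;   // excludes surrogates
        } elseif ($b == 0xF0) {
            $n = 3; $lo = 0x90; $hi = 0xBF;   // excludes overlong forms
        } elseif ($b >= 0xF1 && $b <= 0xF3) {
            $n = 3; $lo = 0x80; $hi = 0xBF;
        } elseif ($b == 0xF4) {
            $n = 3; $lo = 0x80; $hi = 0x8F;   // nothing above U+10FFFF
        } else {
            return $i;            // 0x80-0xC1 and 0xF5-0xFF never start a sequence
        }
        if ($i + $n >= $len) {
            return $i;            // sequence is cut off at the end of the string
        }
        for ($k = 1; $k <= $n; $k++) {
            $c = ord($s[$i + $k]);
            if ($c < $lo || $c > $hi) {
                return $i;
            }
            $lo = 0x80; $hi = 0xBF;   // remaining bytes use the full range
        }
        $i += $n + 1;
    }
    return -1;
}

var_dump(utf8_first_invalid_offset("valid \xC4\x8C text"));  // well formed
var_dump(utf8_first_invalid_offset("ok\xFF"));               // stray 0xFF byte
```

Returning the offset rather than a plain true/false makes it possible to log
where a rejected record goes wrong before it ever reaches the DB.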

Cestmir Hybl

-- 
PHP Internationalization Mailing List (http://www.php.net/)