Index: phpdoc/en/functions/mbstring.xml
diff -u phpdoc/en/functions/mbstring.xml:1.2 phpdoc/en/functions/mbstring.xml:1.3
--- phpdoc/en/functions/mbstring.xml:1.2 Sun Jun 24 11:27:21 2001
+++ phpdoc/en/functions/mbstring.xml Thu Jun 28 23:20:28 2001
@@ -1,117 +1,305 @@
Multi-Byte String Functions
- Multi-Byte String
+
+ Multi-Byte String
+
&warn.experimental;
Introduction
- This module is EXPERIMENTAL. Function name/API is subject to be
- changed. Current conversion filter supports Japanese only.
+ This module is EXPERIMENTAL. Function name/API is subject to
+ change. Current conversion filter supports Japanese only.
- There are many languages that all characters cannot be expressed
+ There are many languages in which not all characters can be expressed
by single byte. Multi-byte character codes are used to express
many characters for many languages. mbstring
is developed to handle Japanese characters. However, many
mbstring functions are able to handle
- character codes other than Japanese.
+ character encoding other than Japanese.
- Multi-byte character encoding represents single character with
+ A multi-byte character encoding represents a single character with
consecutive bytes. Some character encoding has shift(escape)
- sequences to start/end multi-byte character string. Therefore,
+ sequences to start/end multi-byte character strings. Therefore, a
multi-byte character string may be destroyed when it is divided
- and/or counted, unless multi-byte character encoding safe method
- is used. mbstring functions support multi-byte
- character safe string functions and other utility functions such
- as conversion functions.
+ and/or counted unless a multi-byte character encoding safe method
+ is used. This module provides multi-byte character safe string
+ functions and other utility functions such as conversion
+ functions.
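The difference between byte-oriented and multi-byte safe functions can be sketched as follows. This is a minimal illustration, assuming UTF-8 input and a PHP build with mbstring enabled; mb_substr and mb_check_encoding are mbstring functions that are not described in this hunk, and the variable names are made up for the example.

```php
<?php
// "日本語" written as explicit UTF-8 bytes so the example does not
// depend on how this file is saved.
$text = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

$byteLength = strlen($text);             // counts bytes
$charLength = mb_strlen($text, 'UTF-8'); // counts characters

// Cutting by bytes stops in the middle of the second character.
$brokenCut = substr($text, 0, 4);
// Cutting by characters keeps every character whole.
$safeCut = mb_substr($text, 0, 1, 'UTF-8');

echo $byteLength, " bytes, ", $charLength, " characters\n";
var_dump(mb_check_encoding($brokenCut, 'UTF-8')); // is the byte cut valid UTF-8?
var_dump(mb_check_encoding($safeCut, 'UTF-8'));   // is the character cut valid UTF-8?
```

The byte cut leaves a dangling lead byte at the end of the string, which is the kind of damage the paragraph above warns about.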
+
+ Since PHP is basically designed for ISO-8859-1, some multi-byte
+ character encoding does not work well with PHP. Therefore, it is
+ important to set mbstring.internal_encoding to
+ a character encoding that works with PHP.
+
+
+ PHP4 Character Encoding Requirements
+
+
+
+
+
+ Per byte encoding
+
+
+
+
+ Single byte characters in range of 00h-7fh
+ which is compatible with ASCII
+
+
+
+
+ Multi-byte characters without 00h-7fh
+
+
+
+
+
+ These are examples of internal character encoding that works with
+ PHP and does NOT work with PHP.
+
+
-
- Basics for Japanese multi-byte character
+Character encodings that work with PHP:
+ISO-8859-*, EUC-JP, UTF-8
+
+
+Character encodings that do NOT work with PHP:
+JIS, SJIS
+
+
+
+
+ A character encoding that does not work with PHP may be converted
+ with mbstring's HTTP input/output conversion
+ feature/function.
+
+
+
+ SJIS should not be used for internal encoding unless the reader
+ is familiar with parser/compiler, character encoding and
+ character encoding issues.
+
+
+
- Most Japanese characters need more than 1 byte for a
- character. In addition to this, several character encodings are
- used under Japanese environment. There are EUC-JP, Shift_JIS and
- ISO-2022-JP character encoding. As Unicode is getting popular,
- UTF-8 is used also. To develop Web application for Japanese
- environment, it is important to use these character codes depend
- on its purpose, HTTP input/output, RDBMS and E-mail.
+ If you use a database with PHP, it is recommended that you use the
+ same character encoding for both the database and the internal
+ encoding for ease of use and better performance.
+
+
+ If you are using PostgreSQL, it supports a character
+ encoding that is different from the backend character encoding. See
+ the PostgreSQL manual for details.
+
+
+
+ How to Enable mbstring
+ mbstring is an extended module. You must
+ enable the module with the configure script. Refer
+ to the Install section for
+ details.
+
+
+ The following configure options are related to
+ mbstring module.
+
+
-
-
- Storage for a character can be upto four bytes
-
-
-
- A multi-byte character usually has twice of width compare to
- single byte characters. Wider character is called "zen-kaku"
- - meaning full width, narrower character called "han-kaku" -
- meaning half width. "zen-kaku" characters are fixed width
- usually.
-
+
+ : Enable
+ mbstring functions. This option is
+ required to use mbstring functions.
+
-
- Some character encoding defines shift sequence for
- entering/exiting multi-byte character strings.
-
+
+ :
+ Enable HTTP input character encoding conversion using
+ mbstring conversion engine. If this
+ feature is enabled, HTTP input character encoding may be
+ converted to mbstring.internal_encoding
+ automatically.
+
+
+
+
+
+
+ HTTP Input and Output
+
+ HTTP input/output character encoding conversion may convert
+ binary data also. Users are supposed to control character
+ encoding conversion if binary data is used for HTTP
+ input/output.
+
+
+ If enctype for HTML form is set to
+ multipart/form-data,
+ mbstring does not convert character encoding
+ in POST data. In that case, strings need to be
+ converted to the internal character encoding.
+
+
+
- Database may allocate storage for characters that differs
- from size used in PHP even if the same character encoding is
- used. (For example, PostgreSQL)
+ HTTP Input
+ There is no way to control HTTP input character
+ conversion from a PHP script. HTTP input character
+ conversion can only be disabled in php.ini.
+
+
+ Disable HTTP input conversion in php.ini
+
+
+
+;; Disable HTTP Input conversion
+mbstring.http_input = pass
+
+
+
+
+ When using PHP as an Apache module, it is possible to
+ override PHP ini setting per Virtual Host in
+ httpd.conf or per directory with
+ .htaccess. Refer to the Configuration section and
+ Apache Manual for details.
+
- E-mail is supposed to use ISO-2022-JP.
+ HTTP Output
-
-
- "i-mode" web site is supposed to use Shift_JIS.
+ There are several ways to enable output character encoding
+ conversion. One is using php.ini, another
+ is using ob_start with
+ mb_output_handler as
+ ob_start callback function.
+
+
+ For PHP3-i18n users, mbstring's output
+ conversion differs from PHP3-i18n. Character encoding is
+ converted using output buffer.
+
+
+
+
+ php.ini setting example
+
+
+;; Enable output character encoding conversion for all PHP pages
+
+;; Enable Output Buffering
+output_buffering = On
+
+;; Set mb_output_handler to enable output conversion
+output_handler = mb_output_handler
+
+
+
+
+
+ Script example
+
+
+<?php
+
+// Enable output character encoding conversion only for this page
+
+// Set HTTP output character encoding to SJIS
+mb_http_output('SJIS');
+
+// Start buffering and specify "mb_output_handler" as
+// callback function
+ob_start('mb_output_handler');
+
+?>
+
+
+
- Supported character encodings
+ Supported Character Encoding
+
+ Currently, the following character encoding is supported by
+ mbstring module. Character encoding may
+ be specified for mbstring functions'
+ encoding parameter.
+
+ The following character encoding is supported in this PHP
+ extension :
+
- Following character encodings are supported in this PHP
- extension : UCS-4,
- UCS-4BE, UCS-4LE,
- UCS-2, UCS-2BE,
- UCS-2LE, UTF-32,
- UTF-32BE, UTF-32LE,
- UCS-2LE, UTF-16,
- UTF-16BE, UTF-16LE,
- UTF-8, UTF-7,
- ASCII, EUC-JP,
- SJIS, eucJP-win,
- SJIS-win,
- ISO-2022-JP(JIS),
+ UCS-4, UCS-4BE,
+ UCS-4LE, UCS-2,
+ UCS-2BE, UCS-2LE,
+ UTF-32, UTF-32BE,
+ UTF-32LE, UCS-2LE,
+ UTF-16, UTF-16BE,
+ UTF-16LE, UTF-8,
+ UTF-7, ASCII,
+ EUC-JP, SJIS,
+ eucJP-win, SJIS-win,
+ ISO-2022-JP, JIS,
ISO-8859-1, ISO-8859-2,
ISO-8859-3, ISO-8859-4,
ISO-8859-5, ISO-8859-6,
ISO-8859-7, ISO-8859-8,
ISO-8859-9, ISO-8859-10,
ISO-8859-13, ISO-8859-14,
- ISO-8859-15.
+ ISO-8859-15, byte2be,
+ byte2le, byte4be,
+ byte4le, BASE64,
+ 7bit, 8bit and
+ UTF7-IMAP.
+
+
+ A php.ini entry that accepts an encoding name
+ also accepts "auto" and
+ "pass".
+ mbstring functions that accept an encoding
+ name also accept "auto".
+
+
+ If "pass" is set, no character
+ encoding conversion is performed.
+
+
+ If "auto" is set, it is expanded to
+ "ASCII,JIS,UTF-8,EUC-JP,SJIS".
+
+
+ See also mb_detect_order
+
+
+ "Supported character encoding" does not mean that it
+ works as internal character code.
+
+
- php.ini settings
+ php.ini settings
@@ -122,63 +310,311 @@
- mbstring.http_input defines default HTTP input
- character encoding.
+ mbstring.http_input defines default HTTP
+ input character encoding.
- mbstring.http_output defines default HTTP output
- character encoding.
+ mbstring.http_output defines default HTTP
+ output character encoding.
- mbstring.detect_order defines default character
- encoding detection order.
+ mbstring.detect_order defines default
+ character code detection order. See also
+ mb_detect_order.
- mbstring.substitute_character defines character
- to substitute for invalid character codes.
+ mbstring.substitute_character defines
+ character to substitute for invalid character encoding.
+ Web Browsers are supposed to use the same character encoding
+ when submitting form. However, browsers may not use the same
+ character encoding. See mb_http_input to
+ detect character encoding used by browsers.
+
+
+ If enctype is set to
+ multipart/form-data in HTML forms,
+ mbstring does not convert character encoding
+ in POST data. The user must convert them in the script, if
+ conversion is needed.
+
+
+ Although browsers are smart enough to detect the character encoding
+ in HTML, it is better to set charset in the HTTP
+ header. Change default_charset according to
+ the character encoding.
+
+ php.ini setting example
-
+
+
;; Set default internal encoding
+;; Note: Make sure to use a character encoding that works with PHP
mbstring.internal_encoding = UTF-8 ; Set internal encoding to UTF-8
-;; Set default HTTP input character code
-mbstring.http_input = auto ; Set HTTP input to auto
-; or
-; mbstring.http_input = SJIS ; Set HTTP input to SJIS
-; mbstring.http_input = eucjp-win, sjis-win, UTF-8 ; Specify order
-
-;; Set default HTTP output character code
-mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
-
-;; Set default character code detection order
-mbstring.detect_order = auto ; Set HTTP output to auto
-; or
-; mbstring.detect_order = eucjp-win, sjis-win, UTF-8 ; Specify order
+;; Set default HTTP input character encoding
+;; Note: Script cannot change http_input setting.
+mbstring.http_input = pass ; No conversion.
+mbstring.http_input = auto ; Set HTTP input to auto
+ ; "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS"
+mbstring.http_input = SJIS ; Set HTTP input to SJIS
+mbstring.http_input = UTF-8,SJIS,EUC-JP ; Specify order
+
+;; Set default HTTP output character encoding
+mbstring.http_output = pass ; No conversion
+mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
+
+;; Set default character encoding detection order
+mbstring.detect_order = auto ; Set detect order to auto
+mbstring.detect_order = ASCII,JIS,UTF-8,SJIS,EUC-JP ; Specify order
;; Set default substitute character
-mbstring.substitute_character = 12307 ; Specify character code
-; or
-; mbstring.substitute_character = none ; Null character
-; mbstring.substitute_character = long ; Long
+mbstring.substitute_character = 12307 ; Specify Unicode value
+mbstring.substitute_character = none ; Do not print character
+mbstring.substitute_character = long ; Long Example: U+3000,JIS+7E7E
+
+
+ php.ini setting for EUC-JP users
+
+
+;; Disable Output Buffering
+output_buffering = Off
+
+;; Set HTTP header charset
+default_charset = EUC-JP
+
+;; Set HTTP input encoding conversion to auto
+mbstring.http_input = auto
+
+;; Convert HTTP output to EUC-JP
+mbstring.http_output = EUC-JP
+
+;; Set internal encoding to EUC-JP
+mbstring.internal_encoding = EUC-JP
+
+;; Do not print invalid characters
+mbstring.substitute_character = none
+
+
+
+
+
+ php.ini setting for SJIS users
+
+
+;; Enable Output Buffering
+output_buffering = On
+
+;; Set mb_output_handler to enable output conversion
+output_handler = mb_output_handler
+
+;; Set HTTP header charset
+default_charset = Shift_JIS
+
+;; Set http input encoding conversion to auto
+mbstring.http_input = auto
+
+;; Convert to SJIS
+mbstring.http_output = SJIS
+
+;; Set internal encoding to EUC-JP
+mbstring.internal_encoding = EUC-JP
+
+;; Do not print invalid characters
+mbstring.substitute_character = none
+
+
+
+
+
+ Basics for Japanese multi-byte character
+
+ Most Japanese characters need more than 1 byte per character. In
+ addition, several character encoding schemas are used under a
+ Japanese environment. There are EUC-JP, Shift_JIS(SJIS) and
+ ISO-2022-JP(JIS) character encoding. As Unicode becomes popular,
+ UTF-8 is used also. To develop Web applications for a Japanese
+ environment, it is important to use the character set for the
+ task in hand, whether HTTP input/output, RDBMS or E-mail.
+
+
+
+
+ Storage for a character can be up to four
+ bytes
+
+
+
+ A multi-byte character is usually twice the width of
+ single-byte characters. Wider characters are called
+ "zen-kaku" - meaning full width, narrower characters are
+ called "han-kaku" - meaning half width. "zen-kaku" characters
+ are usually fixed width.
+
+
+
+
+ Some character encoding defines shift(escape) sequence for
+ entering/exiting multi-byte character strings.
+
+
+
+
+ ISO-2022-JP must be used for SMTP/NNTP.
+
+
+
+
+ "i-mode" web site is supposed to use SJIS.
+
+
+
+
+
+
+
+ References
+
+ Multi-byte character encoding and its related issues are very
+ complex. It is impossible to cover them in sufficient detail
+ here. Please refer to the following URLs and other resources for
+ further reading.
+
+
+
+ Unicode/UTF/UCS/etc
+
+
+ http://www.unicode.org/
+
+
+
+
+ Japanese/Korean/Chinese character
+ information
+
+
+
+ ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
+
+
+
+
+
+
+
+
+
+ mb_language
+
+ Set/Get current language
+
+
+
+ Description
+
+
+ string
+ mb_language
+ string
+ language
+
+
+
+ mb_language sets language. If
+ language is omitted, it returns current
+ language as string.
+
+
+ language setting is used for encoding
+ e-mail messages. Valid languages are "Japanese",
+ "ja","English","en" and "uni"
+ (UTF-8). mb_send_mail uses this setting to
+ encode e-mail.
+
+ The encoding used for each language is ISO-2022-JP/Base64 for
+ Japanese, UTF-8/Base64 for uni, and ISO-8859-1/quoted printable for
+ English.
+
+
+ Return Value: If language is set and
+ language is valid, it returns
+ TRUE. Otherwise, it returns FALSE. When
+ language is omitted, it returns language
+ name as string. If no language is set previously, it returns
+ FALSE.
+
+
+ See also mb_send_mail.
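Setting and reading back the language can be sketched as follows. This is a hedged illustration that assumes a PHP build with mbstring enabled; it only touches the setting and does not send any mail, and the variable names are made up for the example.

```php
<?php
// Select the "uni" language, which the text above maps to UTF-8/Base64.
$ok = mb_language('uni');

// Calling the function without an argument returns the current setting.
$current = mb_language();

var_dump($ok);
var_dump($current);
```

A later mb_send_mail call would pick up this setting when encoding the message.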
+
+
+
+
+
+
+ mb_parse_str
+
+ Parse GET/POST/COOKIE data and set global variable
+
+
+
+ Description
+
+
+ string
+ mb_parse_str
+
+ string
+ encoded_string
+
+ array
+ result
+
+
+
+
+ mb_parse_str parses GET/POST/COOKIE data and
+ sets global variables. Since PHP does not provide raw POST/COOKIE
+ data, it can only be used for GET data for now. It parses URL
+ encoded data, detects the encoding, converts it to the internal
+ encoding and sets the values in the result array or
+ in global variables.
+
+
+ encoded_string: URL encoded data.
+
+
+ result: Array containing the decoded and
+ character encoding converted values.
+
+
+ Return Value: It returns TRUE for success or FALSE for failure.
+
+
+ See also mb_detect_order,
+ mb_internal_encoding.
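Parsing a URL encoded query string into an array can be sketched as follows. This is a minimal example assuming UTF-8 as both the input and the internal encoding; the query string and variable names are invented for illustration, and whether any conversion actually happens depends on the mbstring configuration in use.

```php
<?php
// Assumption: the script works in UTF-8 internally.
mb_internal_encoding('UTF-8');

// "name" carries "日本" as URL encoded UTF-8 bytes.
$query = 'name=%E6%97%A5%E6%9C%AC&lang=ja';

// Always pass a result array so that no global variables are created.
$result = [];
$ok = mb_parse_str($query, $result);

var_dump($ok);
print_r($result);
```

Passing the result array keeps the decoded values out of the global scope, which is easier to reason about than the global variable form.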
+
+
+
+
mb_internal_encoding
@@ -211,7 +647,7 @@
encoding: Character encoding name
- Return Value: If encoding is
+ Return Value: If encoding is
set,mb_internal_encoding returns
TRUE for success, otherwise returns
FALSE. If encoding is
@@ -232,7 +668,7 @@
See also mb_http_input,
mb_http_output,
- mb_detect_order
+ mb_detect_order.
@@ -270,7 +706,7 @@
See also mb_internal_encoding,
mb_http_output,
- mb_detect_order
+ mb_detect_order.
@@ -294,9 +730,10 @@
If encoding is set,
mb_http_output sets HTTP output character
encoding to encoding. Output after this
- function is converted to encoding.
- mb_http_output returns TRUE for success and
- FALSE for failure.
+ function is converted to encoding.
+ mb_http_output returns
+ TRUE for success and FALSE
+ for failure.
If encoding is omitted,
@@ -306,7 +743,7 @@
See also mb_internal_encoding,
mb_http_input,
- mb_detect_order
+ mb_detect_order.
@@ -331,11 +768,12 @@
mb_detect_order sets automatic character
encoding detection order to encoding-list.
- It returns TRUE for success, FALSE for failure.
+ It returns TRUE for success,
+ FALSE for failure.
encoding-list is array or comma separated
- list of character encodings. ("auto" is expanded to
+ list of character encoding. ("auto" is expanded to
"ASCII, JIS, UTF-8, EUC-JP, SJIS")
@@ -346,6 +784,42 @@
This setting affects mb_detect_encoding and
mb_send_mail.
+
+
+ mbstring currently implements the following
+ encoding detection filters. If there is an invalid byte sequence
+ for the following encodings, encoding detection will fail.
+
+
+ UTF-8, UTF-7,
+ ASCII,
+ EUC-JP,SJIS,
+ eucJP-win, SJIS-win,
+ JIS, ISO-2022-JP
+
+
+ For ISO-8859-*, mbstring
+ always detects as ISO-8859-*.
+
+
+ For UTF-16, UTF-32,
+ UCS2 and UCS4, encoding
+ detection will always fail.
+
+
+
+ Useless detect order example
+
+; Always detect as ISO-8859-1
+detect_order = ISO-8859-1, UTF-8
+
+; Always detect as UTF-8, since ASCII/UTF-7 values are
+; valid for UTF-8
+detect_order = UTF-8, ASCII, UTF-7
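By contrast, a useful order lists the more restrictive encoding first. The sketch below assumes a PHP version in which mb_detect_encoding accepts an encoding list and a strict flag as its second and third arguments; those arguments are not described in this hunk, and the variable names are made up for the example.

```php
<?php
// "日本" as UTF-8 bytes; none of these bytes is valid ASCII.
$japanese = "\xE6\x97\xA5\xE6\x9C\xAC";

// ASCII is tried first and rejected because of the high bytes,
// so detection falls through to UTF-8.
$detected = mb_detect_encoding($japanese, ['ASCII', 'UTF-8'], true);

var_dump($detected);
```

Putting ASCII before UTF-8 lets plain ASCII input be reported as ASCII, while anything with multi-byte sequences is still recognised as UTF-8.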
+
+
+
+ mb_detect_order examples
@@ -368,7 +842,7 @@
See also mb_internal_encoding,
mb_http_input,
mb_http_output
- mb_send_mail
+ mb_send_mail.
@@ -393,7 +867,7 @@
substitution character when input character encoding is invalid
or character code is not exist in output character
encoding. Invalid characters may be substituted null(no output),
- string or hex value (Unicode character code value).
+ string or integer value (Unicode character code value).
This setting affects mb_detect_encoding
@@ -410,16 +884,17 @@
- "long" : Output hex value (Example: U+3000,JIS+7E7E)
+ "long" : Output character code value (Example:
+ U+3000,JIS+7E7E)
Return Value: If substchar is set, it
- returns TRUE for success, otherwise returns FALSE. If
- substchar is not set, it returns Unicode
- value or
+ returns TRUE for success, otherwise returns
+ FALSE. If substchar is
+ not set, it returns Unicode value or
"none"/"long".
@@ -461,9 +936,29 @@
ob_start callback
function. mb_output_handler converts
characters in output buffer from internal character encoding to
- HTTP output character encoding.
+ HTTP output character encoding.
+
+
+ In 4.0.7 or later, this handler adds a charset HTTP header
+ when the following conditions are met:
+
+
+ Content-Type is not set by
+ header()
+
+
+ Default MIME type begins with
+ text/
+
+
+ http_output setting is other than
+ pass
+
+
+
+ contents : Output buffer contents
@@ -483,8 +978,8 @@
- If you want to output some binary data such as image from php
- script, you must set output encoding to "pass" using
+ If you want to output some binary data such as image from PHP
+ script, you must set output encoding to "pass" using
mb_http_output.
@@ -520,7 +1015,7 @@
$outputenc = "sjis-win";
mb_http_output($outputenc);
ob_start("mb_output_handler");
-Header("Content-Type: text/html; charset=" . mb_preferred_mime_name($outputenc));
+header("Content-Type: text/html; charset=" . mb_preferred_mime_name($outputenc));
@@ -550,6 +1045,11 @@
counted as 1.
+ encoding is character encoding for
+ str. If encoding is
+ omitted, internal character encoding is used.
+
+
See also mb_internal_encoding,
strlen.
@@ -567,7 +1067,7 @@
Description
- string mb_strpos
+ int mb_strpos string haystack string needle int
@@ -605,7 +1105,7 @@
encoding is character encoding name. If it
- is not specified, internal character encoding is used.
+ is omitted, internal character encoding is used.
See also mb_strpos,
@@ -626,7 +1126,7 @@
Description
- string mb_strrpos
+ int mb_strrpos string haystack string needle string
@@ -649,7 +1149,7 @@
0. Second character position is 1.
- If encoding is not set, internal encoding
+ If encoding is omitted, internal encoding
is assumed. mb_strrpos accepts
string for needle where
strrpos accepts only character.
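Character positions versus byte positions can be sketched as follows. This is a minimal example assuming UTF-8 input; the argument order (offset before encoding) follows current PHP releases and may differ from the signature documented in this revision, and the variable names are made up for the example.

```php
<?php
// "日本語日" as UTF-8 bytes; the needle "日" occurs at the start and the end.
$haystack = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E\xE6\x97\xA5";
$needle = "\xE6\x97\xA5";

// Byte-oriented search reports a byte offset.
$bytePos = strrpos($haystack, $needle);

// Multi-byte safe searches report character positions, starting at 0.
$first = mb_strpos($haystack, $needle, 0, 'UTF-8');
$last = mb_strrpos($haystack, $needle, 0, 'UTF-8');

var_dump($bytePos, $first, $last);
```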
@@ -709,7 +1209,7 @@
omitted, internal character encoding is used.
- See also mb_struct,
+ See also mb_strcut,
mb_internal_encoding.
@@ -822,7 +1322,7 @@
Description
- string mb_strmwidth
+ string mb_strimwidth string str int start int width
@@ -833,7 +1333,7 @@
- mb_strmwidth truncates string
+ mb_strimwidth truncates string
str to specified
width. It returns truncated string.
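Truncating by display width rather than by character count can be sketched as follows. This is a hedged example assuming UTF-8 input and the trim marker and encoding arguments found in current PHP releases, which are not shown in this hunk; the variable names are invented for illustration.

```php
<?php
// Half-width text: every character has width 1, and the trim marker
// counts toward the requested width.
$ascii = mb_strimwidth('Hello World', 0, 10, '...', 'UTF-8');

// Full-width text: "日本語" as UTF-8 bytes, every character has width 2,
// so a width of 4 leaves room for two characters.
$wide = mb_strimwidth("\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E", 0, 4, '', 'UTF-8');

var_dump($ascii, $wide);
```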
@@ -1164,6 +1664,12 @@
before conversion for success, FALSE for failure.
+ mb_convert_variables joins strings in an Array
+ or Object to detect encoding, since encoding detection tends to
+ fail for short strings. Therefore, it is impossible to mix
+ encodings in a single array or object.
+
+
It from-encoding is specified by
array or comma separated string, it tries to detect encoding from
from-coding. When
@@ -1172,7 +1678,9 @@
vars (3rd and larger) is reference to
- variable to be converted. String, Array and Object are accepted.
+ variable to be converted. String, Array and Object are accepted.
+ mb_convert_variables assumes all parameters
+ have the same encoding.
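Converting several variables in one call can be sketched as follows. This is a minimal example assuming both variables hold SJIS data, as the text above requires; the variable names and the array key are made up for the example.

```php
<?php
// "日本" as SJIS bytes, once as a plain string and once inside an array.
$title = "\x93\xFA\x96\x7B";
$row = ['city' => "\x93\xFA\x96\x7B"];

// Both variables are passed by reference and converted in place.
// The return value is the encoding before conversion, or FALSE on failure.
$from = mb_convert_variables('UTF-8', 'SJIS', $title, $row);

var_dump($from);
var_dump(bin2hex($title), bin2hex($row['city']));
```

Because the variables are converted in place, the original SJIS bytes are gone after the call; copy them first if they are still needed.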
@@ -1296,7 +1804,8 @@
convert.
- encoding is character encoding.
+ encoding is character encoding. If it is
+ omitted, internal character encoding is used.
@@ -1323,7 +1832,7 @@
mb_send_mail
- Send mail with ISO-2022-JP character code. (Japanese specific)
+ Send encoded mail.
@@ -1344,7 +1853,8 @@
mb_send_mail sends email. Headers and
- message are converted and encoded in ISO-2022-JP.
+ message are converted and encoded according to
+ mb_language setting.
mb_send_mail is wrapper
function of mail. See
mail for details.
@@ -1361,21 +1871,23 @@
message is mail message.
- string additional_headers is inserted at
- the end of the header. This is typically used to add
- extra headers. Multiple extra headers are separated with a
+ additional_headers is inserted at
+ the end of the header. This is typically used to add extra
+ headers. Multiple extra headers are separated with a
newline(\n).
- It returns TRUE for success, otherwise it returns FALSE.
+ additional_parameter is an MTA command line
+ parameter. It is useful when setting the correct Return-Path
+ header when using sendmail.
- additional_parameter is added this
- data to the call to the mailer by PHP. This is useful when
- setting the correct Return-Path header when using sendmail.
+ It returns TRUE for success, otherwise it
+ returns FALSE.
- See also: mail.
+ See also: mb_language,
+ mail.