[PHP-DEV] CVS update: php3/doc/functions From: andrey (php-dev <email protected>)
Date: 06/14/99

Date: Monday June 14, 1999 @ 9:36
Author: andrey

Update of /repository/php3/doc/functions
In directory php:/tmp/cvs-serv30698/functions

Modified Files:
        pcre.sgml
Log Message:
More docs.

Index: php3/doc/functions/pcre.sgml
diff -u php3/doc/functions/pcre.sgml:1.12 php3/doc/functions/pcre.sgml:1.13
--- php3/doc/functions/pcre.sgml:1.12 Sun Jun 13 02:56:52 1999
+++ php3/doc/functions/pcre.sgml Mon Jun 14 09:36:48 1999
@@ -18,7 +18,7 @@
    <para>
     The ending delimiter may be followed by various options that
     affect the matching.
- See <link linkend="pattern.options">Pattern Options</link>.
+ See <link linkend="pcre.pattern.options">Pattern Options</link>.
 
    <para>
     <example>
@@ -307,7 +307,7 @@
    </refsect1>
   </refentry>
 
- <refentry id="pattern.options">
+ <refentry id="pcre.pattern.options">
    <refnamediv>
     <refname>Pattern Options</refname>
     <refpurpose>describes possible options in regex
@@ -456,7 +456,7 @@
    </refsect1>
   </refentry>
 
- <refentry id="pattern.syntax">
+ <refentry id="pcre.pattern.syntax">
    <refnamediv>
     <refname>Pattern Syntax</refname>
     <refpurpose>describes PCRE regex syntax</refpurpose>
@@ -471,7 +471,7 @@
    </literallayout>
 
    <refsect1>
- <title>Differences from Perl</title>
+ <title>Differences From Perl</title>
     <literallayout>
      The differences described here are with respect to Perl
      5.005.
@@ -553,6 +553,992 @@
      (d) If PCRE_UNGREEDY is set, the greediness of the repeti-
      tion quantifiers is inverted, that is, by default they are
      not greedy, but if followed by a question mark they are.
+ </literallayout>
+ </refsect1>
+
+ <refsect1>
+ <title>Regular Expression Details</title>
+ <literallayout>
+ The syntax and semantics of the regular expressions sup-
+ ported by PCRE are described below. Regular expressions are
+ also described in the Perl documentation and in a number of
+ other books, some of which have copious examples. Jeffrey
+ Friedl's "Mastering Regular Expressions", published by
+ O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
+ The description here is intended as reference documentation.
+
+ A regular expression is a pattern that is matched against a
+ subject string from left to right. Most characters stand for
+ themselves in a pattern, and match the corresponding charac-
+ ters in the subject. As a trivial example, the pattern
+
+ The quick brown fox
+
+ matches a portion of a subject string that is identical to
+ itself. The power of regular expressions comes from the
+ ability to include alternatives and repetitions in the pat-
+ tern. These are encoded in the pattern by the use of <I>meta</I>-
+ <I>characters</I>, which do not stand for themselves but instead
+ are interpreted in some special way.
+
+ There are two different sets of meta-characters: those that
+ are recognized anywhere in the pattern except within square
+ brackets, and those that are recognized in square brackets.
+ Outside square brackets, the meta-characters are as follows:
+
+ \ general escape character with several uses
+ ^ assert start of subject (or line, in multiline
+ mode)
+ $ assert end of subject (or line, in multiline mode)
+ . match any character except newline (by default)
+ [ start character class definition
+ | start of alternative branch
+ ( start subpattern
+ ) end subpattern
+ ? extends the meaning of (
+ also 0 or 1 quantifier
+ also quantifier minimizer
+ * 0 or more quantifier
+ + 1 or more quantifier
+ { start min/max quantifier
+
+ Part of a pattern that is in square brackets is called a
+ "character class". In a character class the only meta-
+ characters are:
+
+ \ general escape character
+ ^ negate the class, but only if the first character
+ - indicates character range
+ ] terminates the character class
+
+ The following sections describe the use of each of the
+ meta-characters.
+
+BACKSLASH
+ The backslash character has several uses. Firstly, if it is
+ followed by a non-alphameric character, it takes away any
+ special meaning that character may have. This use of
+ backslash as an escape character applies both inside and
+ outside character classes.
+
+ For example, if you want to match a "*" character, you write
+ "\*" in the pattern. This applies whether or not the follow-
+ ing character would otherwise be interpreted as a meta-
+ character, so it is always safe to precede a non-alphameric
+ with "\" to specify that it stands for itself. In particu-
+ lar, if you want to match a backslash, you write "\\".
+
+ If a pattern is compiled with the PCRE_EXTENDED option, whi-
+ tespace in the pattern (other than in a character class) and
+ characters between a "#" outside a character class and the
+ next newline character are ignored. An escaping backslash
+ can be used to include a whitespace or "#" character as part
+ of the pattern.
+
+ A second use of backslash provides a way of encoding non-
+ printing characters in patterns in a visible manner. There
+ is no restriction on the appearance of non-printing charac-
+ ters, apart from the binary zero that terminates a pattern,
+ but when a pattern is being prepared by text editing, it is
+ usually easier to use one of the following escape sequences
+ than the binary character it represents:
+
+ \a alarm, that is, the BEL character (hex 07)
+ \cx "control-x", where x is any character
+ \e escape (hex 1B)
+ \f formfeed (hex 0C)
+ \n newline (hex 0A)
+ \r carriage return (hex 0D)
+ \t tab (hex 09)
+ \xhh character with hex code hh
+ \ddd character with octal code ddd, or backreference
+
+ The precise effect of "\cx" is as follows: if "x" is a lower
+ case letter, it is converted to upper case. Then bit 6 of
+ the character (hex 40) is inverted. Thus "\cz" becomes hex
+ 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
+
+ After "\x", up to two hexadecimal digits are read (letters
+ can be in upper or lower case).
+
+ After "\0" up to two further octal digits are read. In both
+ cases, if there are fewer than two digits, just those that
+ are present are used. Thus the sequence "\0\x\07" specifies
+ two binary zeros followed by a BEL character. Make sure you
+ supply two digits after the initial zero if the character
+ that follows is itself an octal digit.
+
+ The handling of a backslash followed by a digit other than 0
+ is complicated. Outside a character class, PCRE reads it
+ and any following digits as a decimal number. If the number
+ is less than 10, or if there have been at least that many
+ previous capturing left parentheses in the expression, the
+ entire sequence is taken as a <I>back</I> <I>reference</I>. A description
+ of how this works is given later, following the discussion
+ of parenthesized subpatterns.
+
+ Inside a character class, or if the decimal number is
+ greater than 9 and there have not been that many capturing
+ subpatterns, PCRE re-reads up to three octal digits follow-
+ ing the backslash, and generates a single byte from the
+ least significant 8 bits of the value. Any subsequent digits
+ stand for themselves. For example:
+
+ \040 is another way of writing a space
+ \40 is the same, provided there are fewer than 40
+ previous capturing subpatterns
+ \7 is always a back reference
+ \11 might be a back reference, or another way of
+ writing a tab
+ \011 is always a tab
+ \0113 is a tab followed by the character "3"
+ \113 is the character with octal code 113 (since there
+ can be no more than 99 back references)
+ \377 is a byte consisting entirely of 1 bits
+ \81 is either a back reference, or a binary zero
+ followed by the two characters "8" and "1"
+
+ Note that octal values of 100 or greater must not be intro-
+ duced by a leading zero, because no more than three octal
+ digits are ever read.
+
+ All the sequences that define a single byte value can be
+ used both inside and outside character classes. In addition,
+ inside a character class, the sequence "\b" is interpreted
+ as the backspace character (hex 08). Outside a character
+ class it has a different meaning (see below).
+
+ The third use of backslash is for specifying generic charac-
+ ter types:
+
+ \d any decimal digit
+ \D any character that is not a decimal digit
+ \s any whitespace character
+ \S any character that is not a whitespace character
+ \w any "word" character
+ \W any "non-word" character
+
+ Each pair of escape sequences partitions the complete set of
+ characters into two disjoint sets. Any given character
+ matches one, and only one, of each pair.
+
+ A "word" character is any letter or digit or the underscore
+ character, that is, any character which can be part of a
+ Perl "word". The definition of letters and digits is con-
+ trolled by PCRE's character tables, and may vary if locale-
+ specific matching is taking place (see "Locale support"
+ above). For example, in the "fr" (French) locale, some char-
+ acter codes greater than 128 are used for accented letters,
+ and these are matched by \w.
+
+ These character type sequences can appear both inside and
+ outside character classes. They each match one character of
+ the appropriate type. If the current matching point is at
+ the end of the subject string, all of them fail, since there
+ is no character to match.
+
+ The fourth use of backslash is for certain simple asser-
+ tions. An assertion specifies a condition that has to be met
+ at a particular point in a match, without consuming any
+ characters from the subject string. The use of subpatterns
+ for more complicated assertions is described below. The
+ backslashed assertions are
+
+ \b word boundary
+ \B not a word boundary
+ \A start of subject (independent of multiline mode)
+ \Z end of subject or newline at end (independent of
+ multiline mode)
+ \z end of subject (independent of multiline mode)
+
+ These assertions may not appear in character classes (but
+ note that "\b" has a different meaning, namely the backspace
+ character, inside a character class).
+
+ A word boundary is a position in the subject string where
+ the current character and the previous character do not both
+ match \w or \W (i.e. one matches \w and the other matches
+ \W), or the start or end of the string if the first or last
+ character matches \w, respectively.
+
+ The \A, \Z, and \z assertions differ from the traditional
+ circumflex and dollar (described below) in that they only
+ ever match at the very start and end of the subject string,
+ whatever options are set. They are not affected by the
+ PCRE_NOTBOL or PCRE_NOTEOL options. The difference between
+ \Z and \z is that \Z matches before a newline that is the
+ last character of the string as well as at the end of the
+ string, whereas \z matches only at the end.
+
+CIRCUMFLEX AND DOLLAR
+ Outside a character class, in the default matching mode, the
+ circumflex character is an assertion which is true only if
+ the current matching point is at the start of the subject
+ string. Inside a character class, circumflex has an entirely
+ different meaning (see below).
+
+ Circumflex need not be the first character of the pattern if
+ a number of alternatives are involved, but it should be the
+ first thing in each alternative in which it appears if the
+ pattern is ever to match that branch. If all possible alter-
+ natives start with a circumflex, that is, if the pattern is
+ constrained to match only at the start of the subject, it is
+ said to be an "anchored" pattern. (There are also other con-
+ structs that can cause a pattern to be anchored.)
+
+ A dollar character is an assertion which is true only if the
+ current matching point is at the end of the subject string,
+ or immediately before a newline character that is the last
+ character in the string (by default). Dollar need not be the
+ last character of the pattern if a number of alternatives
+ are involved, but it should be the last item in any branch
+ in which it appears. Dollar has no special meaning in a
+ character class.
+
+ The meaning of dollar can be changed so that it matches only
+ at the very end of the string, by setting the
+ PCRE_DOLLAR_ENDONLY option at compile or matching time. This
+ does not affect the \Z assertion.
+
+ The meanings of the circumflex and dollar characters are
+ changed if the PCRE_MULTILINE option is set. When this is
+ the case, they match immediately after and immediately
+ before an internal "\n" character, respectively, in addition
+ to matching at the start and end of the subject string. For
+ example, the pattern /^abc$/ matches the subject string
+ "def\nabc" in multiline mode, but not otherwise. Conse-
+ quently, patterns that are anchored in single line mode
+ because all branches start with "^" are not anchored in mul-
+ tiline mode. The PCRE_DOLLAR_ENDONLY option is ignored if
+ PCRE_MULTILINE is set.
+
+ Note that the sequences \A, \Z, and \z can be used to match
+ the start and end of the subject in both modes, and if all
+ branches of a pattern start with \A is it always anchored,
+ whether PCRE_MULTILINE is set or not.
+
+
+
+FULL STOP (PERIOD, DOT)
+ Outside a character class, a dot in the pattern matches any
+ one character in the subject, including a non-printing
+ character, but not (by default) newline. If the PCRE_DOTALL
+ option is set, then dots match newlines as well. The han-
+ dling of dot is entirely independent of the handling of cir-
+ cumflex and dollar, the only relationship being that they
+ both involve newline characters. Dot has no special meaning
+ in a character class.
+
+
+
+SQUARE BRACKETS
+ An opening square bracket introduces a character class, ter-
+ minated by a closing square bracket. A closing square
+ bracket on its own is not special. If a closing square
+ bracket is required as a member of the class, it should be
+ the first data character in the class (after an initial cir-
+ cumflex, if present) or escaped with a backslash.
+
+ A character class matches a single character in the subject;
+ the character must be in the set of characters defined by
+ the class, unless the first character in the class is a cir-
+ cumflex, in which case the subject character must not be in
+ the set defined by the class. If a circumflex is actually
+ required as a member of the class, ensure it is not the
+ first character, or escape it with a backslash.
+
+ For example, the character class [aeiou] matches any lower
+ case vowel, while [^aeiou] matches any character that is not
+ a lower case vowel. Note that a circumflex is just a con-
+ venient notation for specifying the characters which are in
+ the class by enumerating those that are not. It is not an
+ assertion: it still consumes a character from the subject
+ string, and fails if the current pointer is at the end of
+ the string.
+
+ When caseless matching is set, any letters in a class
+ represent both their upper case and lower case versions, so
+ for example, a caseless [aeiou] matches "A" as well as "a",
+ and a caseless [^aeiou] does not match "A", whereas a case-
+ ful version would.
+
+ The newline character is never treated in any special way in
+ character classes, whatever the setting of the PCRE_DOTALL
+ or PCRE_MULTILINE options is. A class such as [^a] will
+ always match a newline.
+
+ The minus (hyphen) character can be used to specify a range
+ of characters in a character class. For example, [d-m]
+ matches any letter between d and m, inclusive. If a minus
+ character is required in a class, it must be escaped with a
+ backslash or appear in a position where it cannot be inter-
+ preted as indicating a range, typically as the first or last
+ character in the class.
+ It is not possible to have the literal character "]" as the
+ end character of a range. A pattern such as [W-]46] is
+ interpreted as a class of two characters ("W" and "-") fol-
+ lowed by a literal string "46]", so it would match "W46]" or
+ "-46]". However, if the "]" is escaped with a backslash it
+ is interpreted as the end of range, so [W-\]46] is inter-
+ preted as a single class containing a range followed by two
+ separate characters. The octal or hexadecimal representation
+ of "]" can also be used to end a range.
+
+ Ranges operate in ASCII collating sequence. They can also be
+ used for characters specified numerically, for example
+ [\000-\037]. If a range that includes letters is used when
+ caseless matching is set, it matches the letters in either
+ case. For example, [W-c] is equivalent to [][\^_`wxyzabc],
+ matched caselessly, and if character tables for the "fr"
+ locale are in use, [\xc8-\xcb] matches accented E characters
+ in both cases.
+
+ The character types \d, \D, \s, \S, \w, and \W may also
+ appear in a character class, and add the characters that
+ they match to the class. For example, [\dABCDEF] matches any
+ hexadecimal digit. A circumflex can conveniently be used
+ with the upper case character types to specify a more res-
+ tricted set of characters than the matching lower case type.
+ For example, the class [^\W_] matches any letter or digit,
+ but not underscore.
+
+ All non-alphameric characters other than \, -, ^ (at the
+ start) and the terminating ] are non-special in character
+ classes, but it does no harm if they are escaped.
+
+
+
+VERTICAL BAR
+ Vertical bar characters are used to separate alternative
+ patterns. For example, the pattern
+
+ gilbert|sullivan
+
+ matches either "gilbert" or "sullivan". Any number of alter-
+ natives may appear, and an empty alternative is permitted
+ (matching the empty string). The matching process tries
+ each alternative in turn, from left to right, and the first
+ one that succeeds is used. If the alternatives are within a
+ subpattern (defined below), "succeeds" means matching the
+ rest of the main pattern as well as the alternative in the
+ subpattern.
+
+
+
+
+INTERNAL OPTION SETTING
+ The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
+ and PCRE_EXTENDED can be changed from within the pattern by
+ a sequence of Perl option letters enclosed between "(?" and
+ ")". The option letters are
+
+ i for PCRE_CASELESS
+ m for PCRE_MULTILINE
+ s for PCRE_DOTALL
+ x for PCRE_EXTENDED
+
+ For example, (?im) sets caseless, multiline matching. It is
+ also possible to unset these options by preceding the letter
+ with a hyphen, and a combined setting and unsetting such as
+ (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
+ unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
+ If a letter appears both before and after the hyphen, the
+ option is unset.
+
+ The scope of these option changes depends on where in the
+ pattern the setting occurs. For settings that are outside
+ any subpattern (defined below), the effect is the same as if
+ the options were set or unset at the start of matching. The
+ following patterns all behave in exactly the same way:
+
+ (?i)abc
+ a(?i)bc
+ ab(?i)c
+ abc(?i)
+
+ which in turn is the same as compiling the pattern abc with
+ PCRE_CASELESS set. In other words, such "top level" set-
+ tings apply to the whole pattern (unless there are other
+ changes inside subpatterns). If there is more than one set-
+ ting of the same option at top level, the rightmost setting
+ is used.
+
+ If an option change occurs inside a subpattern, the effect
+ is different. This is a change of behaviour in Perl 5.005.
+ An option change inside a subpattern affects only that part
+ of the subpattern that follows it, so
+
+ (a(?i)b)c
+
+ matches abc and aBc and no other strings (assuming
+ PCRE_CASELESS is not used). By this means, options can be
+ made to have different settings in different parts of the
+ pattern. Any changes made in one alternative do carry on
+ into subsequent branches within the same subpattern. For
+ example,
+
+ (a(?i)b|c)
+
+ matches "ab", "aB", "c", and "C", even though when matching
+ "C" the first branch is abandoned before the option setting.
+ This is because the effects of option settings happen at
+ compile time. There would be some very weird behaviour oth-
+ erwise.
+
+ The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
+ be changed in the same way as the Perl-compatible options by
+ using the characters U and X respectively. The (?X) flag
+ setting is special in that it must always occur earlier in
+ the pattern than any of the additional features it turns on,
+ even when it is at top level. It is best put at the start.
+
+
+
+SUBPATTERNS
+ Subpatterns are delimited by parentheses (round brackets),
+ which can be nested. Marking part of a pattern as a subpat-
+ tern does two things:
+
+ 1. It localizes a set of alternatives. For example, the pat-
+ tern
+
+ cat(aract|erpillar|)
+
+ matches one of the words "cat", "cataract", or "caterpil-
+ lar". Without the parentheses, it would match "cataract",
+ "erpillar" or the empty string.
+
+ 2. It sets up the subpattern as a capturing subpattern (as
+ defined above). When the whole pattern matches, that por-
+ tion of the subject string that matched the subpattern is
+ passed back to the caller via the <I>ovector</I> argument of
+ pcre_exec(). Opening parentheses are counted from left to
+ right (starting from 1) to obtain the numbers of the captur-
+ ing subpatterns.
+
+ For example, if the string "the red king" is matched against
+ the pattern
+
+ the ((red|white) (king|queen))
+
+ the captured substrings are "red king", "red", and "king",
+ and are numbered 1, 2, and 3.
+
+ The fact that plain parentheses fulfil two functions is not
+ always helpful. There are often times when a grouping sub-
+ pattern is required without a capturing requirement. If an
+ opening parenthesis is followed by "?:", the subpattern does
+ not do any capturing, and is not counted when computing the
+ number of any subsequent capturing subpatterns. For example,
+ if the string "the white queen" is matched against the
+ pattern
+
+ the ((?:red|white) (king|queen))
+
+ the captured substrings are "white queen" and "queen", and
+ are numbered 1 and 2. The maximum number of captured sub-
+ strings is 99, and the maximum number of all subpatterns,
+ both capturing and non-capturing, is 200.
+
+ As a convenient shorthand, if any option settings are
+ required at the start of a non-capturing subpattern, the
+ option letters may appear between the "?" and the ":". Thus
+ the two patterns
+
+ (?i:saturday|sunday)
+ (?:(?i)saturday|sunday)
+
+ match exactly the same set of strings. Because alternative
+ branches are tried from left to right, and options are not
+ reset until the end of the subpattern is reached, an option
+ setting in one branch does affect subsequent branches, so
+ the above patterns match "SUNDAY" as well as "Saturday".
+
+
+
+REPETITION
+ Repetition is specified by quantifiers, which can follow any
+ of the following items:
+
+ a single character, possibly escaped
+ the . metacharacter
+ a character class
+ a back reference (see next section)
+ a parenthesized subpattern (unless it is an assertion -
+ see below)
+
+ The general repetition quantifier specifies a minimum and
+ maximum number of permitted matches, by giving the two
+ numbers in curly brackets (braces), separated by a comma.
+ The numbers must be less than 65536, and the first must be
+ less than or equal to the second. For example:
+
+ z{2,4}
+
+ matches "zz", "zzz", or "zzzz". A closing brace on its own
+ is not a special character. If the second number is omitted,
+ but the comma is present, there is no upper limit; if the
+ second number and the comma are both omitted, the quantifier
+ specifies an exact number of required matches. Thus
+
+ [aeiou]{3,}
+
+ matches at least 3 successive vowels, but may match many
+ more, while
+
+ \d{8}
+
+ matches exactly 8 digits. An opening curly bracket that
+ appears in a position where a quantifier is not allowed, or
+ one that does not match the syntax of a quantifier, is taken
+ as a literal character. For example, {,6} is not a quantif-
+ ier, but a literal string of four characters.
+
+ The quantifier {0} is permitted, causing the expression to
+ behave as if the previous item and the quantifier were not
+ present.
+
+ For convenience (and historical compatibility) the three
+ most common quantifiers have single-character abbreviations:
+
+ * is equivalent to {0,}
+ + is equivalent to {1,}
+ ? is equivalent to {0,1}
+
+ It is possible to construct infinite loops by following a
+ subpattern that can match no characters with a quantifier
+ that has no upper limit, for example:
+
+ (a?)*
+
+ Earlier versions of Perl and PCRE used to give an error at
+ compile time for such patterns. However, because there are
+ cases where this can be useful, such patterns are now
+ accepted, but if any repetition of the subpattern does in
+ fact match no characters, the loop is forcibly broken.
+
+ By default, the quantifiers are "greedy", that is, they
+ match as much as possible (up to the maximum number of per-
+ mitted times), without causing the rest of the pattern to
+ fail. The classic example of where this gives problems is in
+ trying to match comments in C programs. These appear between
+ the sequences /* and */ and within the sequence, individual
+ * and / characters may appear. An attempt to match C com-
+ ments by applying the pattern
+
+ /\*.*\*/
+
+ to the string
+
+ /* first command */ not comment /* second comment */
+
+ fails, because it matches the entire string due to the
+ greediness of the .* item.
+
+ However, if a quantifier is followed by a question mark,
+ then it ceases to be greedy, and instead matches the minimum
+ number of times possible, so the pattern
+
+ /\*.*?\*/
+
+ does the right thing with the C comments. The meaning of the
+ various quantifiers is not otherwise changed, just the pre-
+ ferred number of matches. Do not confuse this use of ques-
+ tion mark with its use as a quantifier in its own right.
+ Because it has two uses, it can sometimes appear doubled, as
+ in
+
+ \d??\d
+
+ which matches one digit by preference, but can match two if
+ that is the only way the rest of the pattern matches.
+
+ If the PCRE_UNGREEDY option is set (an option which is not
+ available in Perl) then the quantifiers are not greedy by
+ default, but individual ones can be made greedy by following
+ them with a question mark. In other words, it inverts the
+ default behaviour.
+
+ When a parenthesized subpattern is quantified with a minimum
+ repeat count that is greater than 1 or with a limited max-
+ imum, more store is required for the compiled pattern, in
+ proportion to the size of the minimum or maximum.
+
+ If a pattern starts with .* or .{0,} and the PCRE_DOTALL
+ option (equivalent to Perl's /s) is set, thus allowing the .
+ to match newlines, then the pattern is implicitly anchored,
+ because whatever follows will be tried against every charac-
+ ter position in the subject string, so there is no point in
+ retrying the overall match at any position after the first.
+ PCRE treats such a pattern as though it were preceded by \A.
+ In cases where it is known that the subject string contains
+ no newlines, it is worth setting PCRE_DOTALL when the pat-
+ tern begins with .* in order to obtain this optimization, or
+ alternatively using ^ to indicate anchoring explicitly.
+
+ When a capturing subpattern is repeated, the value captured
+ is the substring that matched the final iteration. For exam-
+ ple, after
+
+ (tweedle[dume]{3}\s*)+
+
+ has matched "tweedledum tweedledee" the value of the cap-
+ tured substring is "tweedledee". However, if there are
+ nested capturing subpatterns, the corresponding captured
+ values may have been set in previous iterations. For exam-
+ ple, after
+ /(a|(b))+/
+
+ matches "aba" the value of the second captured substring is
+ "b".
+
+
+
+BACK REFERENCES
+ Outside a character class, a backslash followed by a digit
+ greater than 0 (and possibly further digits) is a back
+ reference to a capturing subpattern earlier (i.e. to its
+ left) in the pattern, provided there have been that many
+ previous capturing left parentheses.
+
+ However, if the decimal number following the backslash is
+ less than 10, it is always taken as a back reference, and
+ causes an error only if there are not that many capturing
+ left parentheses in the entire pattern. In other words, the
+ parentheses that are referenced need not be to the left of
+ the reference for numbers less than 10. See the section
+ entitled "Backslash" above for further details of the han-
+ dling of digits following a backslash.
+
+ A back reference matches whatever actually matched the cap-
+ turing subpattern in the current subject string, rather than
+ anything matching the subpattern itself. So the pattern
+
+ (sens|respons)e and \1ibility
+
+ matches "sense and sensibility" and "response and responsi-
+ bility", but not "sense and responsibility". If caseful
+ matching is in force at the time of the back reference, then
+ the case of letters is relevant. For example,
+
+ ((?i)rah)\s+\1
+
+ matches "rah rah" and "RAH RAH", but not "RAH rah", even
+ though the original capturing subpattern is matched case-
+ lessly.
+
+ There may be more than one back reference to the same sub-
+ pattern. If a subpattern has not actually been used in a
+ particular match, then any back references to it always
+ fail. For example, the pattern
+
+ (a|(bc))\2
+
+ always fails if it starts to match "a" rather than "bc".
+ Because there may be up to 99 back references, all digits
+ following the backslash are taken as part of a potential
+ back reference number. If the pattern continues with a digit
+ character, then some delimiter must be used to terminate the
+ back reference. If the PCRE_EXTENDED option is set, this can
+ be whitespace. Otherwise an empty comment can be used.
+
+ A back reference that occurs inside the parentheses to which
+ it refers fails when the subpattern is first used, so, for
+ example, (a\1) never matches. However, such references can
+ be useful inside repeated subpatterns. For example, the pat-
+ tern
+
+ (a|b\1)+
+
+ matches any number of "a"s and also "aba", "ababaa" etc. At
+ each iteration of the subpattern, the back reference matches
+ the character string corresponding to the previous itera-
+ tion. In order for this to work, the pattern must be such
+ that the first iteration does not need to match the back
+ reference. This can be done using alternation, as in the
+ example above, or by a quantifier with a minimum of zero.
+
+
+
+ASSERTIONS
+ An assertion is a test on the characters following or
+ preceding the current matching point that does not actually
+ consume any characters. The simple assertions coded as \b,
+ \B, \A, \Z, \z, ^ and $ are described above. More compli-
+ cated assertions are coded as subpatterns. There are two
+ kinds: those that look ahead of the current position in the
+ subject string, and those that look behind it.
+
+ An assertion subpattern is matched in the normal way, except
+ that it does not cause the current matching position to be
+ changed. Lookahead assertions start with (?= for positive
+ assertions and (?! for negative assertions. For example,
+
+ \w+(?=;)
+
+ matches a word followed by a semicolon, but does not include
+ the semicolon in the match, and
+
+ foo(?!bar)
+
+ matches any occurrence of "foo" that is not followed by
+ "bar". Note that the apparently similar pattern
+
+ (?!foo)bar
+
+ does not find an occurrence of "bar" that is preceded by
+ something other than "foo"; it finds any occurrence of "bar"
+ whatsoever, because the assertion (?!foo) is always true
+ when the next three characters are "bar". A lookbehind
+ assertion is needed to achieve this effect.
+ Lookbehind assertions start with (?&lt;= for positive asser-
+ tions and (?&lt;! for negative assertions. For example,
+
+ (?&lt;!foo)bar
+
+ does find an occurrence of "bar" that is not preceded by
+ "foo". The contents of a lookbehind assertion are restricted
+ such that all the strings it matches must have a fixed
+ length. However, if there are several alternatives, they do
+ not all have to have the same fixed length. Thus
+
+ (?&lt;=bullock|donkey)
+
+ is permitted, but
+
+ (?&lt;!dogs?|cats?)
+
+ causes an error at compile time. Branches that match dif-
+ ferent length strings are permitted only at the top level of
+ a lookbehind assertion. This is an extension compared with
+ Perl 5.005, which requires all branches to match the same
+ length of string. An assertion such as
+
+ (?&lt;=ab(c|de))
+
+ is not permitted, because its single top-level branch can
+ match two different lengths, but it is acceptable if rewrit-
+ ten to use two top-level branches:
+
+ (?&lt;=abc|abde)
+
+ The implementation of lookbehind assertions is, for each
+ alternative, to temporarily move the current position back
+ by the fixed width and then try to match. If there are
+ insufficient characters before the current position, the
+ match is deemed to fail. Lookbehinds in conjunction with
+ once-only subpatterns can be particularly useful for match-
+ ing at the ends of strings; an example is given at the end
+ of the section on once-only subpatterns.
+
+ Several assertions (of any sort) may occur in succession.
+ For example,
+
+ (?&lt;=\d{3})(?&lt;!999)foo
+
+ matches "foo" preceded by three digits that are not "999".
+ Furthermore, assertions can be nested in any combination.
+ For example,
+
+ (?&lt;=(?&lt;!foo)bar)baz
+
+ matches an occurrence of "baz" that is preceded by "bar"
+ which in turn is not preceded by "foo".
+
+ Assertion subpatterns are not capturing subpatterns, and may
+ not be repeated, because it makes no sense to assert the
+ same thing several times. If an assertion contains capturing
+ subpatterns within it, these are always counted for the pur-
+ poses of numbering the capturing subpatterns in the whole
+ pattern. Substring capturing is carried out for positive
+ assertions, but it does not make sense for negative asser-
+ tions.
+
+ Assertions count towards the maximum of 200 parenthesized
+ subpatterns.
+
+
+
+ONCE-ONLY SUBPATTERNS
+ With both maximizing and minimizing repetition, failure of
+ what follows normally causes the repeated item to be re-
+ evaluated to see if a different number of repeats allows the
+ rest of the pattern to match. Sometimes it is useful to
+ prevent this, either to change the nature of the match, or
+ to cause it fail earlier than it otherwise might, when the
+ author of the pattern knows there is no point in carrying
+ on.
+
+ Consider, for example, the pattern \d+foo when applied to
+ the subject line
+
+ 123456bar
+
+ After matching all 6 digits and then failing to match "foo",
+ the normal action of the matcher is to try again with only 5
+ digits matching the \d+ item, and then with 4, and so on,
+ before ultimately failing. Once-only subpatterns provide the
+ means for specifying that once a portion of the pattern has
+ matched, it is not to be re-evaluated in this way, so the
+ matcher would give up immediately on failing to match "foo"
+ the first time. The notation is another kind of special
+ parenthesis, starting with (?&gt; as in this example:
+
+ (?&gt;\d+)bar
+
+ This kind of parenthesis "locks up" the part of the pattern
+ it contains once it has matched, and a failure further into
+ the pattern is prevented from backtracking into it. Back-
+ tracking past it to previous items, however, works as nor-
+ mal.
+
+ An alternative description is that a subpattern of this type
+ matches the string of characters that an identical stan-
+ dalone pattern would match, if anchored at the current point
+ in the subject string.
+
+ Once-only subpatterns are not capturing subpatterns. Simple
+ cases such as the above example can be thought of as a max-
+ imizing repeat that must swallow everything it can. So,
+ while both \d+ and \d+? are prepared to adjust the number of
+ digits they match in order to make the rest of the pattern
+ match, (?&gt;\d+) can only match an entire sequence of digits.
+
+ This construction can of course contain arbitrarily compli-
+ cated subpatterns, and it can be nested.
+
+ Once-only subpatterns can be used in conjunction with look-
+ behind assertions to specify efficient matching at the end
+ of the subject string. Consider a simple pattern such as
+
+ abcd$
+
+ when applied to a long string which does not match it.
+ Because matching proceeds from left to right, PCRE will look
+ for each "a" in the subject and then see if what follows
+ matches the rest of the pattern. If the pattern is specified
+ as
+
+ ^.*abcd$
+
+ then the initial .* matches the entire string at first, but
+ when this fails, it backtracks to match all but the last
+ character, then all but the last two characters, and so on.
+ Once again the search for "a" covers the entire string, from
+ right to left, so we are no better off. However, if the pat-
+ tern is written as
+
+ ^(?&gt;.*)(?&lt;=abcd)
+
+ then there can be no backtracking for the .* item; it can
+ match only the entire string. The subsequent lookbehind
+ assertion does a single test on the last four characters. If
+ it fails, the match fails immediately. For long strings,
+ this approach makes a significant difference to the process-
+ ing time.
+
+
+
+CONDITIONAL SUBPATTERNS
+ It is possible to cause the matching process to obey a sub-
+ pattern conditionally or to choose between two alternative
+ subpatterns, depending on the result of an assertion, or
+ whether a previous capturing subpattern matched or not. The
+ two possible forms of conditional subpattern are
+
+ (?(condition)yes-pattern)
+ (?(condition)yes-pattern|no-pattern)
+
+ If the condition is satisfied, the yes-pattern is used; oth-
+ erwise the no-pattern (if present) is used. If there are
+ more than two alternatives in the subpattern, a compile-time
+ error occurs.
+
+ There are two kinds of condition. If the text between the
+ parentheses consists of a sequence of digits, then the con-
+ dition is satisfied if the capturing subpattern of that
+ number has previously matched. Consider the following pat-
+ tern, which contains non-significant white space to make it
+ more readable (assume the PCRE_EXTENDED option) and to
+ divide it into three parts for ease of discussion:
+
+ ( \( )? [^()]+ (?(1) \) )
+
+ The first part matches an optional opening parenthesis, and
+ if that character is present, sets it as the first captured
+ substring. The second part matches one or more characters
+ that are not parentheses. The third part is a conditional
+ subpattern that tests whether the first set of parentheses
+ matched or not. If they did, that is, if subject started
+ with an opening parenthesis, the condition is true, and so
+ the yes-pattern is executed and a closing parenthesis is
+ required. Otherwise, since no-pattern is not present, the
+ subpattern matches nothing. In other words, this pattern
+ matches a sequence of non-parentheses, optionally enclosed
+ in parentheses.
+
+ If the condition is not a sequence of digits, it must be an
+ assertion. This may be a positive or negative lookahead or
+ lookbehind assertion. Consider this pattern, again contain-
+ ing non-significant white space, and with the two alterna-
+ tives on the second line:
+
+ (?(?=[^a-z]*[a-z])
+ \d{2}[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
+
+ The condition is a positive lookahead assertion that matches
+ an optional sequence of non-letters followed by a letter. In
+ other words, it tests for the presence of at least one
+ letter in the subject. If a letter is found, the subject is
+ matched against the first alternative; otherwise it is
+ matched against the second. This pattern matches strings in
+ one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
+ letters and dd are digits.
+
+
+
+COMMENTS
+ The sequence (?# marks the start of a comment which
+ continues up to the next closing parenthesis. Nested
+ parentheses are not permitted. The characters that make up a
+ comment play no part in the pattern matching at all.
+
+ If the PCRE_EXTENDED option is set, an unescaped # character
+ outside a character class introduces a comment that contin-
+ ues up to the next newline character in the pattern.
+
+
+
+PERFORMANCE
+ Certain items that may appear in patterns are more efficient
+ than others. It is more efficient to use a character class
+ like [aeiou] than a set of alternatives such as (a|e|i|o|u).
+ In general, the simplest construction that provides the
+ required behaviour is usually the most efficient. Jeffrey
+ Friedl's book contains a lot of discussion about optimizing
+ regular expressions for efficient performance.
+
+ When a pattern begins with .* and the PCRE_DOTALL option is
+ set, the pattern is implicitly anchored by PCRE, since it
+ can match only at the start of a subject string. However, if
+ PCRE_DOTALL is not set, PCRE cannot make this optimization,
+ because the . metacharacter does not then match a newline,
+ and if the subject string contains newlines, the pattern may
+ match from the character immediately following one of them
+ instead of from the very start. For example, the pattern
+
+ (.*) second
+
+ matches the subject "first\nand second" (where \n stands for
+ a newline character) with the first captured substring being
+ "and". In order to do this, PCRE has to retry the match
+ starting after every newline in the subject.
+
+ If you are using such a pattern with subject strings that do
+ not contain newlines, the best performance is obtained by
+ setting PCRE_DOTALL, or starting the pattern with ^.* to
+ indicate explicit anchoring. That saves PCRE from having to
+ scan along the subject looking for a newline to restart at.
     </literallayout>
    </refsect1>
   </refentry>

-- 
PHP Development Mailing List (http://www.php.net/)
To unsubscribe, e-mail: php-dev-unsubscribe <email protected>
For additional commands, e-mail: php-dev-help <email protected>
To contact the list administrators, e-mail: php-list-admin <email protected>