Introduction
Hello hello hello, and welcome
back. We've looked at strings, numbers and all sorts of
types of data, but we've not yet seen how to do something
really important: look for and pull interesting parts out
of the data we have. To do that we're going to use some
magic from the Perl world called "Regular Expressions".
Huh, Regular What?
Put simply, a regular expression
is a string in its own right, but one that has a special
meaning. In some ways it's like a little mini program that
tells the regexp engine what to look for and how to find it.
Look at the first line of my
opening paragraph above.
If you wanted to look for that
"Hello hello hello" and treat it as a spelling error to
correct, how would you find it?
Well, you could use
if ($text == "Hello hello hello")
or you might use the string function "str_replace":
$text = str_replace("Hello hello hello", "Hello", $text);
and they would work fine, but what
if we now made the phrase "Hello hello hullo"? Hmmm, it
looks like we now don't have a match. This is where the
power of regular expressions comes to the rescue.
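To see the problem in action, here's a minimal sketch. The variable names are my own, picked purely for illustration.

```php
<?php
// Two versions of the opening line: the original and the "hullo" variant.
$original = "Hello hello hello, and welcome back.";
$variant  = "Hello hello hullo, and welcome back.";

// str_replace() only replaces exact, literal copies of the search string.
$fixedOriginal = str_replace("Hello hello hello", "Hello", $original);
$fixedVariant  = str_replace("Hello hello hello", "Hello", $variant);

echo $fixedOriginal, "\n"; // Hello, and welcome back.
echo $fixedVariant, "\n";  // Hello hello hullo, and welcome back. (unchanged)
```

The second call finds nothing to replace, so the text comes back untouched.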
How do Regular Expressions Work Then?
OK, so you're asking yourself: how
can I match something that's not matchable unless I change
what I'm looking for, which is not what I want to do?
The key is not to change what you're
searching for, but to allow for the differences.
A regular expression is a set of searching rules
that allows for variations in the text to be searched.
What exactly is the difference
between "Hello hello hello" and "Hello hello hullo"? Well,
in this case it's only one letter, and that letter can be
either an 'e' or a 'u'. If we had a way of saying "search
for this phrase, but be aware that the 4th letter from the
end could change", then we've pretty much defined what a
regular expression is.
Now in our example here, we could
actually expand that very easily to accept any character
at that position and not just an 'e' or 'u'. We do this by
using the full stop meta-character '.', so to show
you what I mean we could write our search pattern like this:
"Hello hello h.llo"
That will match any character at
that position (other than a newline, by default), but only
one character. All the rest have to match exactly.
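Here's a small sketch of that pattern at work, using preg_match to test it and preg_replace to correct the text. The sample strings and variable names are just my own test data.

```php
<?php
// The dot stands in for exactly one character at that position.
$pattern = '/Hello hello h.llo/';

$matchesHello = preg_match($pattern, "Hello hello hello"); // 1: '.' matched 'e'
$matchesHullo = preg_match($pattern, "Hello hello hullo"); // 1: '.' matched 'u'
$matchesHllo  = preg_match($pattern, "Hello hello hllo");  // 0: a letter is missing

// preg_replace() swaps whatever the pattern matched for the replacement text.
$fixed = preg_replace($pattern, "Hello", "Hello hello hullo, and welcome back.");
echo $fixed, "\n"; // Hello, and welcome back.
```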
There is however much more to the
power of regular expressions than just single letters. We can
search for whole groups of numbers, letters and combinations
of words and symbols, and also for certain counts and lengths.
There's far more than we can cover
in this article; developing a true mastery of regular
expressions takes years. We only have time to cover what you
need to know to get you started in PHP.
One snippet of advice I will
give: find a program that will allow you to see what you're
doing as you construct regular expressions. I use the
wonderful Regex Coach, available from
http://weitz.de/regex-coach/
There is an older version for Linux, but the
latest versions are only maintained under Windows now. It
does however run perfectly fine under Wine.
So What Else Can Regular Expressions do?
The best way for me to describe
that is to show you a few examples:
Let's say we have the string:
"Long live PHP Builder in 2009"
We can find and extract the 2009 using:
^.*(\d\d\d\d)$
If we use this in PHP with the
preg_match function:
$found = preg_match('/^.*(\d\d\d\d)$/', "Long live PHP Builder in 2009", $matches);
$found will be 1
if the text provided had 4 digits at the end of the string,
and 0 if it did not. The / at either end of the pattern are
delimiters; they are how the regular expression engine knows
the start and finish of the search (more on that in just a
moment). If a match is found
then the array $matches will contain the following:
$matches[0] = "Long live PHP Builder in 2009"
$matches[1] = "2009"
Here's how the regex pattern reads:
^ = at the start of the string
. = read any character
* = as many times as you can, until
\d\d\d\d = you encounter 4 digits in a row
$ = at the end of the string
() = keeps the part of the text matched by any rule between
these brackets separate, in this case the 4 digits.
Or in English: look for 4
consecutive digits that occur at the end of the string, and
retrieve them.
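As a rough sketch, this is how I'd check the result before using it. The second sample string is my own, added to show what happens when the digits are not at the end.

```php
<?php
// Single quotes keep the backslashes and the $ exactly as typed.
$pattern = '/^.*(\d\d\d\d)$/';

// preg_match() returns 1 on a match, 0 on no match.
$found = preg_match($pattern, "Long live PHP Builder in 2009", $matches);
if ($found === 1) {
    echo $matches[1], "\n"; // 2009
}

// Here the 4 digits are not at the end of the string, so nothing matches
// and the matches array is left empty.
$notFound = preg_match($pattern, "In 2009 we wrote about PHP", $noMatches);
```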
Here's another one:
$text = "Peter Shaw";
$regex = "/(Peter)\s(Sh(aw|ore))/";
I'll not repeat the preg_match line this time.
The rule here says: return the
first word before the space, and after the space match the
second word if it's "Shaw" or the common misspelling "Shore".
The pattern reads:
\s = look for a white space character with
Peter = on the left side of it and
Sh = on the right side, followed by either
(aw|ore) = 'aw' OR 'ore'
In all cases keep the 2 found words.
The result in $matches will be
$matches[0] = "Peter Shaw" (or "Peter Shore")
$matches[1] = "Peter"
$matches[2] = "Shaw" (or "Shore")
$matches[3] = "aw" (or "ore")
Pay attention above to the
(aw|ore) bit. This has to be in () to group
the 2 parts either side of the OR decision, so even if you
don't intend to look for that part, it still uses up a slot
in the results.
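If that extra slot gets in your way, PCRE also has a "group but don't capture" form, (?: ... ). It isn't covered elsewhere in this article, so treat the following as a small sketch for comparison; the variable names are mine.

```php
<?php
$text = "Peter Shore";

// Plain brackets: the (aw|ore) group takes up slot 3.
preg_match('/(Peter)\s(Sh(aw|ore))/', $text, $withSlot);
// $withSlot is ["Peter Shore", "Peter", "Shore", "ore"]

// (?: ... ) still groups the OR decision but keeps nothing for itself.
preg_match('/(Peter)\s(Sh(?:aw|ore))/', $text, $withoutSlot);
// $withoutSlot is ["Peter Shore", "Peter", "Shore"]

echo count($withSlot), " vs ", count($withoutSlot), "\n"; // 4 vs 3
```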
One more example:
$text = "the letter a is a vowel";
$regex = "/the\sletter\s[aeiou]\sis\sa\svowel/i";
This reads:
Search for "the letter " followed by one of the letters a, e, i, o, u and none other,
followed by " is a vowel". The i after the closing / makes the match ignore upper and lower case.
On a positive match,
$matches[0] will hold "the letter a is a
vowel". There will be no other parts in
$matches as there are no bracketed sections.
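Here's a quick sketch of that pattern being run. The test sentences are my own; the upper-case one is there to show the effect of the i on the end of the pattern.

```php
<?php
$pattern = '/the\sletter\s[aeiou]\sis\sa\svowel/i';

// The trailing i makes the whole pattern case-insensitive.
$hit = preg_match($pattern, "The letter E is a vowel", $matches);
echo $matches[0], "\n";     // The letter E is a vowel
echo count($matches), "\n"; // 1, as there are no brackets in the pattern

// 'b' is not in the [aeiou] character class, so this one fails.
$miss = preg_match($pattern, "the letter b is a vowel");
```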
In case you're wondering,
\s is a special character called a meta-character,
and it means anything classed as white space.
The * symbol is also
a meta-character and means match "0 or more occurrences" of
whatever comes just before it, e.g.:
A*
On its own that only matches a run of zero or more 'A'
characters, so to match text starting with 'A' we combine it
with the full stop:
A.*
The above example will match from the first 'A' it encounters through to the end of the line. The ^ and $ meta-characters mean start and end of the text, so:
^A.*
will match any and all the text in
a phrase as long as it starts with an 'A' right
at the beginning, which is different to the previous
pattern, because that one will match on the first 'A' it
encounters anywhere in the text, then match on the rest of
the line. And that brings us to my next point.
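It's worth running these to see exactly what each one grabs. In this sketch (sample text and variable names are mine), note that A* with nothing after it is happy to match zero 'A's, which gives an empty match right at the start, while A.* and ^A.* behave as anchored and unanchored versions of "an 'A' then the rest of the line".

```php
<?php
$text = "My friend Alan went home";

// A* on its own: zero or more 'A's, so it matches nothing at position 0.
preg_match('/A*/', $text, $bare);
// $bare[0] is "" (an empty match)

// A.* : the first 'A' anywhere, then everything up to the end of the line.
preg_match('/A.*/', $text, $unanchored);
// $unanchored[0] is "Alan went home"

// ^A.* : only matches when the 'A' is the very first character.
$anchoredMiss = preg_match('/^A.*/', $text);                          // 0
$anchoredHit  = preg_match('/^A.*/', "Alan went home", $anchored);    // 1
```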
Regular expressions are greedy.
They will try to match the largest amount possible at any
given time in any given match string, which is why you
really only want to use * if it's really necessary. If you
can, always try to narrow your search as much as possible.
For example, take the text:
"Alan went to meet marsha"
To get the word 'Alan' use an expression of:
/^(A.*)\swent/
Or use the count control meta-characters:
/^(A.{3})\s.+/
What this expression says is: look
for a 4 character word beginning with 'A' right at the
beginning of the line, followed by a space and at least 1
more character. In both cases the brackets keep the word
in $matches[1].
The {3} means 3 characters of any
description, and only 3 characters, which together with the
'A' makes 4. It's also possible to
specify ranges. Take a look at this example:
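Before moving on to ranges, here's a sketch that checks the 'Alan' patterns against the sample sentence. I've put brackets around the part we want to keep, and since 'Alan' is an 'A' plus three more characters, the count used is {3}; a count of {4} is included to show that it fails. Variable names are my own.

```php
<?php
$text = "Alan went to meet marsha";

// Greedy in action: .* grabs as much as it can before the last space.
preg_match('/^(A.*)\s/', $text, $greedy);
// $greedy[1] is "Alan went to meet"

// Narrowing the search with the word that follows.
preg_match('/^(A.*)\swent/', $text, $byWord);
// $byWord[1] is "Alan"

// Narrowing the search with an exact count: 'A' plus 3 characters.
preg_match('/^(A.{3})\s.+/', $text, $byCount);
// $byCount[1] is "Alan"

// 'A' plus 4 characters would need a 5 letter word, so this does not match.
$tooMany = preg_match('/^A.{4}\s.+/', $text); // 0
```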
/^A.{1,4}/
This example would specify an A
followed by between 1 and 4 characters, but no less than 1
and no more than 4. And this snippet:
/^A.{4,}/
This code would mean an 'A' at the
beginning followed by at least 4 characters, possibly more.
You can also combine the counts with other rules.
The thing being counted does not just have to be a '.';
you can use a
character class like this:
/^A[aeiou]{4,}/
This would match on a line
beginning with 'A' and at least 4 of any of the
characters in the square brackets in any order, but only the
characters in the square brackets.
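To round off the counts, here's a small sketch that tries the three patterns above against a few strings. The test strings are made up by me purely to hit each case.

```php
<?php
// 'A' followed by between 1 and 4 characters.
$range = preg_match('/^A.{1,4}/', "Alan went", $rangeMatch);
// $rangeMatch[0] is "Alan " (greedy, so it takes all 4 it is allowed)
$rangeTooShort = preg_match('/^A.{1,4}/', "A"); // 0: nothing after the 'A'

// 'A' followed by at least 4 characters.
$atLeastShort = preg_match('/^A.{4,}/', "Alan");      // 0: only 3 follow the 'A'
$atLeastLong  = preg_match('/^A.{4,}/', "Alan went"); // 1

// 'A' followed by at least 4 characters, all from the class [aeiou].
$vowels   = preg_match('/^A[aeiou]{4,}/', "Aaeiou", $vowelMatch);
// $vowelMatch[0] is "Aaeiou"
$brokenUp = preg_match('/^A[aeiou]{4,}/', "Aaeixou"); // 0: the 'x' cuts the run to 3
```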
Summary
We've really only just seen the
tip of the iceberg with regular expressions. It's a huge
subject about which many books have been written. I urge you
to read more about them, and you can always look to the PHP
manual; the expressions section is at
http://phpbuilder.com/manual/en/language.expressions.phpt
Next time will be the final part
in our series, in which we wrap up and look at some
practical examples of what we've learned so far.
It's also your chance to tell me
what you'd like to cover. If there is a particular thing
you've been trying to do, or a technique you're not sure how
to make work, then please leave a comment using the form at
the bottom of this page.
Between now and the final article,
I'll be checking these comments, and I'll use them as a
basis for what I put in the last article. Please note,
however, I'm not going to complete your project or
your homework assignment for you, so please don't put things in like
"please show me how to make a project that does xxxx". All
I'm looking for are real world ideas based on common
scenarios that you guys are currently learning.
Until next time
May your expressions remain regular
Shawty