Like most PHP developers out there, I started off using databases to
store relatively simple data structures for my dynamic sites. PHP's
slickness and ease of database connectivity is no doubt a major reason
for its wild success, and also probably a reason why databases get
created and used without a huge amount of design and documentation.
In my next couple of articles, I'm going to attempt to talk about
the best way to use databases with
PHP. I'm not going to cover normalization, as Barry Wise wrote a
fantastic article on that topic.
Instead, I'm going to cover 3 other, arguably less important, advanced
topics:
- Proper use of indices (indexes) - this article
- Transactions - next week
- Foreign Keys - two weeks
Indexing
Let's start with indexing. I was asked the other day by a fairly advanced
programmer what indices are, and that scared me. I knew I had to write
up an article as this programmer is clearly not alone. Tens of thousands of
developers (most probably using MySQL) have little formal training with
databases and yet make a living developing on them for clients.
The very first thing you should do, if you do nothing more, is build an
index on any column in a table that is included in a WHERE clause.
CREATE TABLE mytable (
id serial primary key,
category_id int not null default 0,
user_id int not null default 0,
adddate int not null default 0
);
That's a pretty simple table that will suffice for most of this discussion.
SELECT * FROM mytable WHERE category_id=1;
Even the most rudimentary indexing strategy means you will create a simple
index on category_id:
CREATE INDEX mytable_categoryid
ON mytable (category_id);
Very simple right? Now what if you are going to use multiple criteria to select
from the database?
SELECT * FROM mytable
WHERE category_id=1 AND user_id=2;
You might be tempted to just create another index on user_id, but that
would be sub-optimal. You need to create a multi-column index.
CREATE INDEX mytable_categoryid_userid
ON mytable (category_id,user_id);
Are you noticing my naming convention yet? I use tablename_field1name_field2name.
You'll understand why I do that in a minute.
Now that you've built indices on the appropriate columns, you should verify
that your database is using them as expected. This is easy to do in most databases -
just use the EXPLAIN command:
EXPLAIN
SELECT * FROM mytable
WHERE category_id=1 AND user_id=2;
This is what Postgres 7.1 returns (exactly as I expected).
NOTICE: QUERY PLAN:
Index Scan using mytable_categoryid_userid on
mytable (cost=0.00..2.02 rows=1 width=16)
EXPLAIN
If you boil that down, essentially postgres is using an index (always a
good start), and furthermore it is using the second index I created.
Because of the way I named my index, I can tell immediately that it is
using the right index.
Now what if your query involves an ORDER BY clause? Believe it or not,
most databases will take advantage of an index on the ordered by field.
SELECT * FROM mytable
WHERE category_id=1 AND user_id=2
ORDER BY adddate DESC;
Confused yet? Don't be. Add another multi-column index that includes
the fields in the WHERE clause as well as the ORDER BY clause and you're
in business:
CREATE INDEX mytable_categoryid_userid_adddate
ON mytable (category_id,user_id,adddate);
NOTICE: identifier "mytable_categoryid_userid_adddate"
will be truncated to "mytable_categoryid_userid_addda"
CREATE
EXPLAIN SELECT * FROM mytable
WHERE category_id=1 AND user_id=2
ORDER BY adddate DESC;
NOTICE: QUERY PLAN:
Sort (cost=2.03..2.03 rows=1 width=16)
-> Index Scan using mytable_categoryid_userid_addda
on mytable (cost=0.00..2.02 rows=1 width=16)
EXPLAIN
Now if you look at the explain output, it did something that we were
dreading. We don't want a sort in there. I was hoping that the database
would be smart enough to do the query right without any extra hints, but
I guess I was overly optimistic.
To get around the sort step, we don't need any extra indices, but we
do need to modify the query a bit. I'm going to give postgres an extra
clue about what I want it to do - I'm going to add the criteria fields to the
ORDER BY clause. This should technically not be necessary, because there is no
sorting to be done on those two fields, but it gives postgres the extra
kick in the pants that it needs sometimes.
EXPLAIN SELECT * FROM mytable
WHERE category_id=1 AND user_id=2
ORDER BY category_id DESC,user_id DESC,adddate DESC;
NOTICE: QUERY PLAN:
Index Scan Backward using
mytable_categoryid_userid_addda on mytable
(cost=0.00..2.02 rows=1 width=16)
EXPLAIN
There, now that's perfect. Postgres now uses the index I wanted it to,
plus it's smart enough to read the index backwards to avoid any sorting
at all. This may sound like a petty detail, but I guarantee you will
appreciate it when your database gets huge and/or you get millions of
page views per day.
Things can get fairly complex when you are joining multiple tables together,
especially if you are using selection criteria where
the fields are in more than one table. Generally, I try to avoid this as
much as possible, because it may mean the database has to join everything
together, then go through and eliminate rows that don't match. That
can be incredibly expensive if you're not careful.
In a situation like that, you should use the indexing strategy above
for each of the joined tables and then verify with an EXPLAIN command
that your indices are being used. If they are, you should be OK. If not,
you may want to create temp tables which are pre-joined and indexed properly.
An hourly cron job could be used to populate this temp table.
Well now you should have the fundamentals of database indexing down.
Please post your questions, comments, successes and failures.
--Tim