picture of Tim Perdue
Like most PHP developers out there, I started off using databases to store relatively simple data structures for my dynamic sites. PHP's slickness and ease of database connectivity is no doubt a major reason for its wild success, and also probably a reason why databases get created and used without a huge amount of design and documentation.
In my next couple of articles, I'm going to attempt to talk about the best way to use databases with PHP. I'm not going to cover normalization, as Barry Wise wrote a fantastic article on that topic.
Instead, I'm going to cover 3 other, arguably less important, advanced topics:
  1. Proper use of indices (indexes) - this article
  2. Transactions - next week
  3. Foreign Keys - two weeks


Let's start with indexing. I was asked the other day by a fairly advanced programmer what indices are, and that scared me. I knew I had to write up an article as this programmer is clearly not alone. Tens of thousands of developers (most probably using MySQL) have little formal training with databases and yet make a living developing on them for clients.
The very first thing you should do, if you do nothing more, is build an index on any column in a table that is included in a WHERE clause.
CREATE TABLE mytable (
id serial primary key,
category_id int not null default 0,
user_id int not null default 0,
adddate int not null default 0
That's a pretty simple table that will suffice for most of this discussion.
SELECT * FROM mytable WHERE category_id=1;
Even the most rudimentary indexing strategy means you will create a simple index on category_id:
CREATE INDEX mytable_categoryid 
	ON mytable (category_id);
Very simple right? Now what if you are going to use multiple criteria to select from the database?
SELECT * FROM mytable 
	WHERE category_id=1 AND user_id=2;
You might be tempted to just create another index on user_id, but that would be sub-optimal. You need to create a multi-column index.
CREATE INDEX mytable_categoryid_userid 
	ON mytable (category_id,user_id);
Are you noticing my naming convention yet? I use tablename_field1name_field2name. You'll understand why I do that in a minute.
Now that you've built indices on the appropriate columns, you should verify that your database is using them as expected. This is easy to do in most databases - just use the EXPLAIN command:
	SELECT * FROM mytable 
	WHERE category_id=1 AND user_id=2;
This is what Postgres 7.1 returns (exactly as I expected).

Index Scan using mytable_categoryid_userid on 
	mytable  (cost=0.00..2.02 rows=1 width=16)

If you boil that down, essentially postgres is using an index (always a good start), and furthermore it is using the second index I created. Because of the way I named my index, I can tell immediately that it is using the right index.
Now what if your query involves an ORDER BY clause? Believe it or not, most databases will take advantage of an index on the ordered by field.
SELECT * FROM mytable 
	WHERE category_id=1 AND user_id=2
	ORDER BY adddate DESC;
Confused yet? Don't be. Add another multi-column index that includes the fields in the WHERE clause as well as the ORDER BY clause and you're in business:
CREATE INDEX mytable_categoryid_userid_adddate
	ON mytable (category_id,user_id,adddate);

NOTICE:  identifier "mytable_categoryid_userid_adddate" 
	will be truncated to "mytable_categoryid_userid_addda"

	WHERE category_id=1 AND user_id=2
	ORDER BY adddate DESC;


Sort  (cost=2.03..2.03 rows=1 width=16)
  ->  Index Scan using mytable_categoryid_userid_addda 
	on mytable  (cost=0.00..2.02 rows=1 width=16)

Now if you look at the explain output, it did something that we were dreading. We don't want a sort in there. I was hoping that the database would be smart enough to do the query right without any extra hints, but I guess I was overly optimistic.
To get around the sort step, we don't need any extra indices, but we do need to modify the query a bit. I'm going to give postgres an extra clue about what I want it to do - I'm going to add the criteria fields to the ORDER BY clause. This should technically not be necessary, because there is no sorting to be done on those two fields, but it gives postgres the extra kick in the pants that it needs sometimes.
	WHERE category_id=1 AND user_id=2
	ORDER BY category_id DESC,user_id DESC,adddate DESC;


Index Scan Backward using 
	mytable_categoryid_userid_addda on mytable  
	(cost=0.00..2.02 rows=1 width=16)

There, now that's perfect. Postgres now uses the index I wanted it to, plus it's smart enough to read the index backwards to avoid any sorting at all. This may sound like a petty detail, but I guarantee you will appreciate it when your database gets huge and/or you get millions of page views per day.
Things can get fairly complex when you are joining multiple tables together, especially if you are using selection criteria where the fields are in more than one table. Generally, I try to avoid this as much as possible, because it may mean the database has to join everything together, then go through and eliminate rows that don't match. That can be incredibly expensive if you're not careful.
In a situation like that, you should use the indexing strategy above for each of the joined tables and then verify with an EXPLAIN command that your indices are being used. If they are, you should be OK. If not, you may want to create temp tables which are pre-joined and indexed properly. An hourly cron job could be used to populate this temp table.
Well now you should have the fundamentals of database indexing down. Please post your questions, comments, successes and failures.