To register for an Internet.com membership to receive newsletters and white papers, use the Register button ABOVE.
To participate in the message forums BELOW, click here
PHPBuilder.com  
 

 

Go Back   PHPBuilder.com > PHP Help > Database

Database Conversation regarding PHP and SQL

Reply
 
Thread Tools Search this Thread Rate Thread Display Modes
Old 04-14-2007, 11:26 PM   #1
Sxooter
Chamberlain
 
Sxooter's Avatar
 
Join Date: Aug 2002
Location: Denver, CO
Posts: 4,067
Fast Random Results - "ORDER BY RAND()" is bad

Quite often we need fast random results. A lot of new developers do it this way:

select * from table order by random() limit 1;

Seems simple enough. And it works just fine, as long as your table isn't too large. But, it doesn't scale well, either in terms of table size or number of users hitting the system at the same time.

At work, I've got a table that summarizes all the requests our systems see every minute. It has data since last August or so. Every service (100 or so) times every machine (4 to 8) times every minute since last august. Running something like order by random() limit 1 on such a database will simply never return. In the month of April alone, for 2 weeks so far, it has 3,046,182 rows.

In the last hour, on a quiet saturday night, it's gotten about 5,000 rows. Doing the order by random() limit 1 on the last hour's data takes 132 mS. The last 4 hours data takes 484 mS. 24 hours takes 3 seconds. 48 hours takes 9 seconds. You can see where this is going. What if I needed a random number from the last month or something? And what if I need it several times a second? order by random() is not going to work for that.

Here's a generalized method to let me get random entries fast. It's not perfect, but most compromises aren't. First, I'll grab the last 48 hours worth of ids into a summary table, and randomize them on the way. Since I'm only materializing the id, and not the whole row, the select for that is only about 4 seconds on this machine. Next I do a select count(*) on the summary table to find out how many rows it has. This takes 178 mS and I can store it. In this case, the summary table has 352,532 rows. We can run these two queries every hour or 5 minutes, or whatever we need, and store the result of that count(*) somewhere handy. Here's what I'm gonna do:

create table summary (id serial primary key, refid int);
insert into summary (refid) select summaryid from bigtable where lasttime > now() - interval'48 hour' order by random(); -- NOTE the order by random() is costly and uneeded... Thanks laserlght
INSERT 0 352538

Now I have a lookup table with a sequence from 1 to 352538 in the id column, and a number to match the master table in the refid column.

Let's grab a random row now! Using either php or your db, you can create a random number.

$id = random(1,352538);
// $id = 183965;

We run a query like this:

select * from bigtable where summaryid=(select refid from summary where id=183965);
Time: 49.478 ms

We'll call it 50 ms. That's a little faster than the 9 seconds it took with order by random() limit 1. 180 times faster.

Wait, I just ran it again, since the buffer wasn't full, and it now takes 0.993 ms to run...

Let's call that 1 ms. that's 9000 times faster.

Ran a little more testing. Seems the time to pull any one row out of the db unprimed is about 20-30 ms. The overhead of the select from the random table is very very small, somewhere around 0.4 ms. All the rest of the time is looking up the indexed entry fro 30+million rows and returning it.
__________________
PostgreSQL version 8.5 is now in alpha!

Last edited by Sxooter; 04-14-2007 at 11:47 PM. Reason: wrong number at the end... part 2...
Sxooter is offline   Reply With Quote
Old 04-14-2007, 11:41 PM   #2
laserlight
PHP Witch
 
laserlight's Avatar
 
Join Date: Apr 2003
Location: Singapore
Posts: 13,055
hmm... but if you are caching primary keys, why bother ordering them randomly in the summary table? After all, the randomness can come from generating a random id in PHP.
__________________
Use Bazaar for your version control system
Read the PHP Spellbook
Learn How To Ask Questions The Smart Way
laserlight is offline   Reply With Quote
Old 04-14-2007, 11:44 PM   #3
Sxooter
Chamberlain
 
Sxooter's Avatar
 
Join Date: Aug 2002
Location: Denver, CO
Posts: 4,067
Because there might be holes in the sequence.

OTOH, you can cache the start / stop numbers and use something like:

select * from table where id >= $id limit 1;
__________________
PostgreSQL version 8.5 is now in alpha!
Sxooter is offline   Reply With Quote
Old 04-14-2007, 11:45 PM   #4
Sxooter
Chamberlain
 
Sxooter's Avatar
 
Join Date: Aug 2002
Location: Denver, CO
Posts: 4,067
Just read what you wrote (second time) and you're absolutely right.
Will post an adendum
__________________
PostgreSQL version 8.5 is now in alpha!
Sxooter is offline   Reply With Quote
Old 04-15-2007, 12:05 AM   #5
NogDog
High Energy Magic Dept.
 
NogDog's Avatar
 
Join Date: Aug 2006
Location: Ankh-Morpork
Posts: 12,621
Just a thought I had, no idea if it would perform any better:

1. Do a SELECT COUNT(*) to get the number of records

2. Do a <?php $offset = rand(0, $count - 1); ?> where $count comes from the above query.

3. Do a SELECT * FROM table LIMIT 1 OFFSET $offset

I have no empirical evidence how this would perform versus the ORDER BY RAND() method, but figured I'd throw it out there in case you want to try it.
__________________
"That's what the gods are! An answer that will do! Because there's food to be caught and babies to be born and life to be lived and so there is not time for big, complicated, and worrying answers! Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be." -- from Nation, by Terry Pratchett

Email me
NogDog is offline   Reply With Quote
Old 04-15-2007, 12:05 AM   #6
Sxooter
Chamberlain
 
Sxooter's Avatar
 
Join Date: Aug 2002
Location: Denver, CO
Posts: 4,067
Thanks to laserlight's point, I ran the same query to populate the summary table, but with the last month's data instead of the last 48 hours. It took quite a while to build the summary table, 450 seconds. And it inserted 6787330 rows. Select time still hovers around 50 milliseconds for the first time on a row, 1 millisecond the second time when it's block is in memory.
__________________
PostgreSQL version 8.5 is now in alpha!
Sxooter is offline   Reply With Quote
Old 04-15-2007, 12:10 AM   #7
Sxooter
Chamberlain
 
Sxooter's Avatar
 
Join Date: Aug 2002
Location: Denver, CO
Posts: 4,067
Post

Quote:
Originally Posted by NogDog
Just a thought I had, no idea if it would perform any better:

1. Do a SELECT COUNT(*) to get the number of records

2. Do a <?php $offset = rand(0, $count - 1); ?> where $count comes from the above query.

3. Do a SELECT * FROM table LIMIT 1 OFFSET $offset

I have no empirical evidence how this would perform versus the ORDER BY RAND() method, but figured I'd throw it out there in case you want to try it.
Tried it before. It's halfway between order by rand() and my method in terms of performance. Problem is that the db still has to generate the whole data set before hitting the offset and stopping, so the higher the number, the longer it will take to return.

Note that select count(*) is NOT fast in most databases (Oracle, PostgreSQL, MySQL with innodb tables) when the tables are very large.

for instance, on the table listed above, hitting the last month's data, Here's what I go for varying offsets, in milliseconds

1: 1.3
100: 3.5
1000: 4.4
10,000: 23
100,000: 210
500,000: 1280
1,000,000: 3000

You can see where that's heading. At 40+ million rows, it's not gonna scale.
Quote:
COUNT(*) pretty much has to traverse through the entire table, right?
laserlight, yes count(*) has to traverse the whole table in many databases, such as postgresql and oracle and mysql with innodb tables...
__________________
PostgreSQL version 8.5 is now in alpha!

Last edited by Sxooter; 04-15-2007 at 12:24 AM.
Sxooter is offline   Reply With Quote
Old 04-15-2007, 12:12 AM   #8
laserlight
PHP Witch
 
laserlight's Avatar
 
Join Date: Apr 2003
Location: Singapore
Posts: 13,055
Quote:
Note that select count(*) is NOT fast in most databases (Oracle, PostgreSQL, MySQL with innodb tables) when the tables are very large.
COUNT(*) pretty much has to traverse through the entire table, right?
__________________
Use Bazaar for your version control system
Read the PHP Spellbook
Learn How To Ask Questions The Smart Way
laserlight is offline   Reply With Quote
Old 04-15-2007, 03:12 AM   #9
NogDog
High Energy Magic Dept.
 
NogDog's Avatar
 
Join Date: Aug 2006
Location: Ankh-Morpork
Posts: 12,621
I was remembering this regarding MySQL:
Quote:
COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause.
However, it then notes that this is only applicable in MyISAM tables.
__________________
"That's what the gods are! An answer that will do! Because there's food to be caught and babies to be born and life to be lived and so there is not time for big, complicated, and worrying answers! Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be." -- from Nation, by Terry Pratchett

Email me
NogDog is offline   Reply With Quote
Old 04-15-2007, 06:48 AM   #10
Roger Ramjet
Senior Member
 
Roger Ramjet's Avatar
 
Join Date: Jul 2004
Location: Leeds, UK
Posts: 4,311
Instead of COUNT(*) you can use SHOW TABLE STATUS with myisam tables to get the number of rows. Unfortunately this does not work for other table types which only return an approximation of the row count.. Be interesting if someone could benchmark this against count(*). My assumption is that show status will be independent of table size.
__________________
David Soussan
Roger Ramjet is offline   Reply With Quote
Old 04-15-2007, 08:45 AM   #11
Sxooter
Chamberlain
 
Sxooter's Avatar
 
Join Date: Aug 2002
Location: Denver, CO
Posts: 4,067
There's something similar in most databases for estimating rows cheaply. In PostgreSQL it's

select reltuples from pg_class where relname='tablename';

klunky but it works. It tells you how many tuples were in the table at the end of the last vacuum.
__________________
PostgreSQL version 8.5 is now in alpha!
Sxooter is offline   Reply With Quote
Old 04-15-2007, 10:20 AM   #12
Roger Ramjet
Senior Member
 
Roger Ramjet's Avatar
 
Join Date: Jul 2004
Location: Leeds, UK
Posts: 4,311
[edit]cleaned up for sticky[/edit]

When I take those 3 steps back and look at the problem, I see that the solution is the domain itself. What we have all forgot is that the web is a random environment. We see so many folk trying to use this RAND() thing for good reasons on their websites because they want to display random advertising banners, random users, random messages, etc. What we all forget is a) Random does not mean unique, but more importantly b) it is the users who are already random.

The solution to displaying random events to individual users is already taken care of by their random page hits. All you need to do is to cycle sequentially through your items and that is a simple algorithm with easy optimisation.

Those cases where RAND() is a necessary and applicable solution are complex apps like MRP/JIT, BI, Risk Analysis/Financial Modelling, Market Research, Scientific Analysis, etc: domains where complex and powerful analysis is involved and performance is secondary to results. When I worked in market research we did not mind if a report took 20 mins to process - just go to lunch, or run it overnight.
__________________
David Soussan

Last edited by Roger Ramjet; 04-22-2007 at 07:37 AM.
Roger Ramjet is offline   Reply With Quote
Old 04-15-2007, 11:42 AM   #13
Sxooter
Chamberlain
 
Sxooter's Avatar
 
Join Date: Aug 2002
Location: Denver, CO
Posts: 4,067
Quote:
Because with all their wealth of experience and knowledge, this RAND() problem is not one that they have had to solve before. If it were then the solution would have been presented already.
I have run into it quite a bit before, and I believe I have presented the best solution for randomly selecting one row from a large data set.

If you've got a better solution, I'd love to see it.

About the 20 minute reports. I've got some reports that take the better part of a day to run, and that's after massive optimization. When you're aggregating 40 million rows of data, it can only be done so fast. If that same report had an order by random() clause in it, it would, literally, eat every byte of RAM, gigs of storage, and probably not return for weeks, maybe even months.
__________________
PostgreSQL version 8.5 is now in alpha!
Sxooter is offline   Reply With Quote
Old 04-15-2007, 12:21 PM   #14
Roger Ramjet
Senior Member
 
Roger Ramjet's Avatar
 
Join Date: Jul 2004
Location: Leeds, UK
Posts: 4,311
Hey, cool down Sxooter, I was not knocking you. On the contrary, I was trying to say that generally you would not even think of doing this. Those cases where you would have done it would be apps where rand() was essential and the only way to do the job; and you would accept the performance hit just because it was the ONLY way to do the job.

If I've upset you, sorry. All I can say is that sitting drinking beer in the sunshine has got the better of me.

I do have a modification to your method that will not get the performance hit with using LIMIT base,offset on large data sets.

PHP Code:
// get the highest 2 ids
$res1 = mysql_query("SELECT id FROM table ORDER BY id DESC LIMIT 2");
// generate random no between 0 and second highest id
$top = mysql_result($res1,1);
$rand = rand(0,$top);
// now query for id greater then random no
$sql = "SELECT * FROM table WHERE id>$rand ORDER BY id DESC LIMIT 1"
Size of dataset should not matter, but I stand to be corrected if you care to benchmark it.
__________________
David Soussan
Roger Ramjet is offline   Reply With Quote
Old 04-15-2007, 12:35 PM   #15
Sxooter
Chamberlain
 
Sxooter's Avatar
 
Join Date: Aug 2002
Location: Denver, CO
Posts: 4,067
I'm not upset in the least, honestly. I sorta kinda figured that was where you were heading.

I've posted something similar to your solution and that one also runs very quickly.

I think I might be due for a pub run myself!
__________________
PostgreSQL version 8.5 is now in alpha!
Sxooter is offline   Reply With Quote
Reply

Bookmarks


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes Rate This Thread
Rate This Thread:

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Forum Jump


All times are GMT -4. The time now is 11:03 PM.








Acceptable Use Policy

Internet.com
The Network for Technology Professionals

Search:

About Internet.com

Legal Notices, Licensing, Permissions, Privacy Policy.
Advertise | Newsletters | E-mail Offers


Powered by vBulletin® Version 3.7.2
Copyright ©2000 - 2010, Jelsoft Enterprises Ltd.