Click to See Complete Forum and Search --> : [RESOLVED] Script for filtiing records! (In use now) looking for improvments!


spiritssight
01-23-2008, 02:38 PM
Hello Everyone,

I am looking for better ways to do the below script which does:
get count of table (irs_rawdata)
gets first record in the table (irs_rawdata)
checks to see if its in block table if so deletes it from irs_rawdata
if not, then it checks if matches some WHERE rules if match adds to query table, if its in query table already then it updates the record
then loops again to to get the first record again and does this for the full table count.

This script is like the other FILTERING script that I have posted but I had to make a number of changes to it as I was not thing clearly and code a little to much into the first one that was not needed (I got carried away with it)

This script is very slow when it comes to processing over 1,000 records, I need it to be able to handle over 1m.

here is the script:

<?php
include '/home/dev/www/lib/php_header.php';

if(isset($_SESSION[ "user_id" ]) & isset($_SESSION[ "user_access_level" ]))
{
include '/home/dev/www/lib/db_config_cr-dev.php';
include '/home/dev/www/lib/db_conn-select.php';

// Counts record in table
$query_s = "SELECT COUNT(*) FROM irs_rawdata";
$result_s = mysql_query($query_s) OR die("Sorry, unable to count the records in the table!");
$result_c = mysql_result($result_s, 0);
$record_count = $result_c;

echo "There is ". $record_count ." records in the irs_rawdata table<br /><br />";

// checks to see if sesult_c is greater then zero
// if so then it runs script
// otherwise it says there is no more records to process
for($i=1; $i <= $result_c; $i++)
{
//used to change $i to a meanful var name
$record_id = $i;

// selects and limits it to the first row only and retrives the EIN then return a 1 or 0
$query_s = "SELECT EIN FROM irs_rawdata LIMIT 0,1";
$result_s = mysql_query($query_s) OR die("Sorry, unable to to select EIN for the first record! <br /> ". mysql_error());
$record = mysql_fetch_assoc($result_s);
$count = mysql_num_rows($result_s);

if($count)
{
// changes long var to short var
$fin = $record['EIN'];

// gets record if there is one in block returns 1 or 0
echo "We are now processing record ". $record_id. " with Fed ID #". $fin ."!<br />";
$query_s = "SELECT fin FROM block WHERE fin = '$fin'";
$result_s = mysql_query($query_s) OR die("Sorry, unable to select record: " . mysql_error());
$record = mysql_fetch_assoc($result_s);
$count = mysql_num_rows($result_s);

if ($count == 0)
{
echo "There was no record found in the \"block\" table!<br />";
echo "We are now checking if record ". $record_id ." matchs the required criteria!<br />";
$query_is = "INSERT INTO query SELECT * FROM irs_rawdata
WHERE EIN = '$fin' AND
(
PNO LIKE '%blind%' OR
PNO LIKE '%deaf%' OR
PNO LIKE '%braille%' OR
PNO LIKE '%sight%' OR
PNO LIKE '%vision%' OR
PNO LIKE '%sign%' OR
PNO LIKE '%ear%' OR
PNO LIKE '%eye%' OR
PNO LIKE '%dog%' OR
Activity_Code LIKE '%031%' OR
NTEE_Code LIKE '%G41%' OR
NTEE_Code LIKE '%G42%' OR
NTEE_Code LIKE '%H41%' OR
NTEE_Code LIKE '%H42%' OR
NTEE_Code LIKE '%P86%' OR
NTEE_Code LIKE '%P87%'

)";
$result_is = mysql_query($query_is);
$count = mysql_affected_rows();

if($count)
{
echo "We have successfully added the record to the \"query\" table, it match the criteria!<br />";
echo "We are now deleting record ". $record_id . " from \"irs_rawdata\" table!<br />";
$query_d = "DELETE FROM irs_rawdata WHERE EIN = '$fin'";
$result_d = mysql_query($query_d);
$count = mysql_affected_rows();

if($count)
{
echo "We have successfully deleted the record from the \"irs_rawdata\" table!<br /><br />";
}
}
else
{
echo "We was unable to match the record ". $record_id ." to the criteria!<br />";
echo "We are now adding record ". $record_id . " to \"block\" table!<br />";
$query_i = "INSERT INTO block (fin, who_blocked) VALUES ('$fin', 'S-123456789')";
$result_i = mysql_query($query_i) OR die("Sorry was unable to insert ".$fin." into the database table allowed! <br />" . mysql_error());
$count = mysql_affected_rows();

if($count)
{
echo "We have successfully added the record to the \"block\" table!<br />";
echo "We are now deleting record ". $record_id . " from \"irs_rawdata\" table!<br />";
$query_d = "DELETE FROM irs_rawdata WHERE EIN ='$fin'";
$result_d = mysql_query($query_d) OR die("Was unable to delete ".$fin." from the db table irs_rawdata! <br />" . mysql_error());
$count = mysql_affected_rows();

if($count)
{
echo "We have successfully deleted the record from the \"irs_rawdata\" table!<br /><br />";
}
}
else
{
echo "We was unable to successfully added the record to the \"block\" table!<br />";
}
}

}
else
{
echo "There was one record found in the \"block\" table!<br />";
echo "We are now deleting record ". $record_id . " from \"irs_rawdata\" table!<br />";
$query_d = "DELETE FROM irs_rawdata WHERE EIN = '$fin'";
$result_d = mysql_query($query_d) OR die("Was unable delete ".$fin." from the db table irs_rawdata! <br />" . mysql_error());
$count = mysql_affected_rows();

if($count)
{
echo "We have successfully deleted the record from the \"irs_rawdata\" table!<br /><br />";
}
}
}
}
echo "The computer has finished processing the records in irs_rawdata!";

include_once ('/home/dev/www/lib/db_close.php');
}
else
{
echo "You are not allowed on this page!";
}

include '/home/dev/www/lib/php_footer.php';
?>



Thanks for your advice and help with this script! I am learning stuff it may not seem it but I am learning slowly,

Sincerely,
Christopher

bpat1434
01-23-2008, 09:12 PM
For one, I'd just query irs_rawdata for all records selecting the EIN field. If there are results (i.e. mysql_num_rows($result) > 0) then use a while loop to execute a query.

I'd even go so far as to use an inner join of block onto irs_rawdata and use a SQL conditional to figure out if `fin` is "null" or not. I'm guessing `fin` can not be a negative number (seeing as how it's a federal ID, but it is the government we're talking about here ;) ). So you could do something like:
SELECT ir.`ein`, (CASE b.`fin` WHEN NULL THEN '-1' ELSE b.`fin` END) AS `fin`
FROM `irs_rawdata` AS ir
INNER JOIN `block` AS b
ON ir.`ein` = b.`ir`

THen it's just a matter of using PHP to determine if $result[row]['fin'] is "-1" or not. If it isn't, then you can skip it, otherwise, you can insert. That's not going to be much of a speed up, but instead of running 1 million queries, you'd run 1 "large" query and a bunch of smaller queries.

Oh and that "INSERT INTO `query` SELECT..." query should probably be shaved down. Just select the items from irs_rawdata to begin with, then do a quick PHP comparison to see if it should be inserted. That way your insert can be quicker ;) NOt a must do, but could speed things up.

What you probably should do is take each query, and see how "optimized" it is by running it via the command line (or setting up a timer in your PHP script). That way you can see where exactly the delay is. My guess would be your lengthy delays come from running the same SELECT query over and over again. As well as the INSERT INTO `query` SELECT statement being rather large. Just a suggestion though.

spiritssight
01-23-2008, 10:40 PM
I am totally lost with what you did my knowlege is very minial with this stuff, would you mind explaining this to me please I do wish to learn this stuff so I don't have to relie on people as much as I do right now.

Thanks for your post and help with this script!

Sincerely,
Christopher

spiritssight
01-26-2008, 10:51 AM
Could I just write one query that does

selects the first record in irs_rawdata only field (EIN)

select the record with the ein and check if it matches in the block table if not then checks match WHERE clause insert into query if matched and if not match insert only ein into block

there must be a way, this would make the script much much faster I believe!

Thank you very much for you help ahead of time!

Sincerely,
Christopher

bpat1434
01-29-2008, 12:15 AM
The thing you need to think about with SQL is that while you think combining 3 queries into one will make it faster, it may not. Every "join" or "union" creates a temporary table. So MySQL could really eat up some memory if you don't optimize your queries, and also the tables could get quite large.

Some things to help speed up your queries are to:
- Select only the fields you need to work with
Don't select all the fields of a table if you're going to work with only a small subset. It's just like eating, try not to take more than you're going to need.
- Use defined joins, don't use Cartesian joins
Using two tables in the "FROM" clause like: "SELECT * FROM tabeA, tableB" creates a Cartesian Join where every row is included in the result set. Using a defined join like INNER JOIN or LEFT JOIN will allow you to specify which column(s) to match up, and those that don't match up will be left out (with some caveats).
- Keep the query simple
If you have a query with a sub-query, with sub-query with a huge CASE statement, that can slow things down. Each sub-query is a temporary table, and CASE statements can slow down processing of large tables because MySQL has to evaluate each record. Keep them simple and to the point. Don't try to do too much with them.

So let's go through your code here.


SELECT COUNT(*) FROM irs_rawdata
So you want the total number of rows in the table "irs_rawdata". Okay, this is a quick simple way to do it. Nothing wrong here.
Loop through from 0 to the max rows executing this: SELECT EIN FROM irs_rawdata LIMIT 0,1
Well, here is some trouble. While it's a short simple query, it's executing thousands of times, maybe more. This can really cause a slow-down in execution since there are so many records, and so many queries to the MySQL server. You're asking for overload.
With each row from the query in #2, you execute this: SELECT fin FROM block WHERE fin = '$fin'
Okay, so you want the "fin" from the block table. This is a simple query which does this, but at what cost? If we have thousands of rows, once again, we're running this exact query thousands of times. Could make things very very slow.
If a row is not found in the "block" table from query in #3, then you want to INSERT INTO the "query" table information from the irs_rawdata table.
This is a hefty query, and could cause some backup, but not as much as the others. Once again, this is run a lot. Maybe not 100% of the times, but still, it has the possibility for each loop to be run.
If you insert successfully in #4, then you delete from irs_rawdata
Nothing wrong here. No real comments :)
If you don't insert successfully in #4, then you want to add the current row info in irs_rawdata to the "block" table.
Nothing wrong here. It's just run a lot, that's all.
If the insert in #6 works, then you delete from irs_rawdata
This is fine, nothing major here.
If there is a record found in the block table from the query in #3, then delete that row from irs_rawdata
Okay, simple enough.


Now, if we take this code and refactor it, I think we could actually trim it down to a few steps:

Find those rows from irs_rawdata that are currently in "block" table
Delete those rows from irs_rawdata
Take remaining rows in irs_rawdata and see if they match a specific set of criteria. If so, insert those items into the "block" table.
Remove those inserted into "block" from "irs_rawdata"


Notice how your 8 steps which are replicated thousands of times are trimmed down to 4? Now using this setup, we can "batch process" items. Here's some sample code:
<?php

// 1.) Find rows in irs_rawdata that are currently in "block" table:
$query1 = "SELECT ir.`EIN`
FROM `irs_rawdata` ir
INNER JOIN `block` b
ON ir.`EIN` = b.`fin`";
$in_block = @mysql_query($query1) or die('#1.) Error finding rows in `block`: ' . mysql_error());
$eins = array();
while($row = mysql_fetch_row($in_block))
{
$eins[] = $row['EIN'];
}

// 2.) Delete rows returned from #1:
$query2 = "DELETE FROM `irs_rawdata` WHERE EIN IN(" . implode(', ', $eins) . ")";
$delete = @mysql_query($query2) or die('#2.) Error deleting rows in irs_rawdata: ' . mysql_error());

// 3.) Find those rows which match our criteria:
$query3 = "SELECT EIN
FROM `irs_rawdata`
WHERE
PNO LIKE '%blind%' OR
PNO LIKE '%deaf%' OR
PNO LIKE '%braille%' OR
PNO LIKE '%sight%' OR
PNO LIKE '%vision%' OR
PNO LIKE '%sign%' OR
PNO LIKE '%ear%' OR
PNO LIKE '%eye%' OR
PNO LIKE '%dog%' OR
Activity_Code LIKE '%031%' OR
NTEE_Code LIKE '%G41%' OR
NTEE_Code LIKE '%G42%' OR
NTEE_Code LIKE '%H41%' OR
NTEE_Code LIKE '%H42%' OR
NTEE_Code LIKE '%P86%' OR
NTEE_Code LIKE '%P87%'";
$need_insert = @mysql_query($query3) or die('#3.) Error finding rows which match criteria: ' . mysql_error());
$eins = array();
$query3a = "INSERT INTO `block` (`find`, `who_blocked`)
VALUES";
while($row = mysql_fetch_row($need_insert))
{
$eins[] = $row['EIN'];
$query3a .= "
('" . $row['EIN'] . "', 'S-123456789'),";
}
$query3a = substr($query3a, 0, -1); // Remove the final "," in the SQL string
$inserted = @mysql_query($query3a) or die('#3.) Error inserting data into block table: ' . mysql_error());

// 4.) Delete those inserted rows from the irs_rawdata table:
$query4 = "DELETE FROM `irs_rawdata` WHERE EIN IN(" . implode(', ', $eins) . ")";
$deleted = @mysql_query($query4) or die('#4.) Error deleting rows from irs_rawdata: ' . mysql_error());

That's a basic idea of what I would do. It is really short and to the point. You can add your error print out display and what not as you need, but this is just a skeleton of what you probably need.

Do note that what's left in the `irs_rawdata` table will be those rows that didn't match criteria, and that weren't already in the `block` table. Everything else should be deleted.

Hope that helps explain a few things.

spiritssight
01-30-2008, 05:20 PM
Wow, that is much better, in size and will try tomorrow when I come back home. Thanks just pointing out one thing throw, when there is a match it gets added to the query table, which is the same layout as irs_rawdata

Thanks very very much.

Sincerely,
Christopher

Weedpacket
01-31-2008, 05:51 AM
- Use defined joins, don't use Cartesian joins

Using two tables in the "FROM" clause like: "SELECT * FROM tabeA, tableB" creates a Cartesian Join where every row is included in the result set. Using a defined join like INNER JOIN or LEFT JOIN will allow you to specify which column(s) to match up, and those that don't match up will be left out (with some caveats).Select * notwithstanding, I've found most DBMSs will build exactly the same execution plan for
Select bsc5p.glat,pg.bii from bsc5p, pg where bsc5p.glat=pg.biias forSelect bsc5p.glat, pg.bii from bsc5p inner join pg on bsc5p.glat=pg.bii

spiritssight
02-03-2008, 11:22 PM
Hello bpat1434,

I was wondering, could the following work and elimnat the query3a by doing a insert into query select ....,
and then whats left is added to the block table and deleted from irs_rawdata

so say do another insert into block select ein I just am not sure using that way how I would add the who_blocked colom.

does this make sence?

Sincerely,
Christopher

bpat1434
02-04-2008, 12:20 AM
If you wanted to do an INSERT INTO ... SELECT query, you could specify the "who_blocked" column by just adding a string to the end of the SELECT statement:
INSERT INTO `table` (`column`, `column`, `who_blocked`)
SELECT `column`, `column`, 'myUserID'
FROM `table1`
WHERE (
-- some fancy where clause
)

I would do some testing though. That creates a temp table with the select statement, and it may be faster (depending on your server, size of tables, data being selected, etc.) to just use two queries rather than combine them into one. But you'd have to do some testing to find out. I can't say defniitively that INSERT INTO ... SELECT will be faster than SELECT ... then INSERT ...