Version: 1.0

Type: Full Script

Category: Algorithms

License: GNU General Public License

Description: A program that processes large dataset files one record at a time, discarding certain records and retaining the rest. It is fast and uses very little memory because only one record is held at a time, and it has been tested on files of several hundred megabytes. We use it here to filter log files, discarding uninteresting entries for usage reporting.



#!/usr/local/bin/php -q
<?php
# Program name: Strip
# Purpose:      Delete records that contain certain characters from a
#               file/datastream
#
# Requirements: Has to handle VERY large files or datastreams.
# Therefore it should hold only the minimum dataset necessary at
# a time to keep memory usage low.  That minimum is a single
# record, which is about 40 lines in our data.
#
# Data Format:  Each record must start with some distinguishing
# characteristic.  In this case it is that the first line of a record
# starts at character zero, while all subsequent lines begin with one or
# more spaces.  This continues until we come to another line that has
# a letter from a-z in the left column, which starts a new record.
#
# The minimum storage unit will be called $rec.  This record will be
# filled, processed, then emptied and refilled with the next record.

# SETUP STUFF

# Increase this to some huge number, but zero can be a bad
# idea on something that could theoretically hang forever.
set_time_limit(1200); 

# Check to make sure we were passed an argument, then open the file.
# fopen() does not go through a shell, so the name needs no shell
# escaping; escapeshellcmd() would only mangle names that contain
# characters such as & or ;.  Stop if the file cannot be opened.
if (empty($argv[1])) die ("Usage: strip filename\n");
$filename=$argv[1];
$fp = fopen ($filename,"r");
if (!$fp) die ("Cannot open $filename\n");

# Theory of operation:

# Read in a line.  
# Check to see if it is the beginning of a record.  If it starts
# with a letter from A to Z, then it is; otherwise it's just a
# continuation line within the current record.
# If it is the start of a record, then check the old record to
# see if it contains the "magic cookie".  If it does, then we
# discard the old record and do nothing.  If it does not, then
# we print the old record out.  Either way, we then throw away
# the record.
# 
# Whether or not the line is the beginning of a record or a line
# in the middle of one, we append it to the $rec variable.  If the 
# line is the beginning of a record, then the $rec var will be empty
# and it will be the first line added to it, thus starting a new 
# record.

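$magic_cookie="4140244";
$rec="";
# Read whole lines; a fixed-length read could split a long line and
# make the leftover chunk look like the start of a record.
while (($buffer = fgets($fp)) !== false) {
	# A letter in column zero marks the start of a new record.
	# (ereg/eregi were removed in PHP 7, so use preg_match/strpos.)
	if (preg_match('/^[a-z]/i',$buffer)) {
		# Process last record, then reset record...
		if ($rec !== "" && strpos($rec,$magic_cookie) === false) {
			print $rec."\n";
		}
		$rec="";
	}
	$rec.=$buffer;
}
# Process the final record, which has no following record to trigger it.
if ($rec !== "" && strpos($rec,$magic_cookie) === false) {
	print $rec."\n";
}
fclose($fp);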
?>
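
The record-at-a-time approach described in the script's comments can also be sketched as a reusable function. This is a minimal sketch, not the script above: filter_records(), read_lines(), and the sample data are hypothetical names made up for illustration, and it assumes PHP 7.1 or later (for generators with return types and the iterable type).

```php
<?php
// Sketch only: filter_records() and read_lines() are hypothetical helpers,
// not part of the original Strip script.

/**
 * Yield each record from $lines that does not contain $cookie.
 * A record starts at a line whose first character is a letter;
 * every following line that starts otherwise is a continuation.
 */
function filter_records(iterable $lines, string $cookie): Generator
{
    $rec = '';
    foreach ($lines as $line) {
        // A letter in column zero closes the previous record.
        if (preg_match('/^[a-z]/i', $line) === 1) {
            if ($rec !== '' && strpos($rec, $cookie) === false) {
                yield $rec;
            }
            $rec = '';
        }
        $rec .= $line;
    }
    // The last record has no successor to trigger it, so check it here.
    if ($rec !== '' && strpos($rec, $cookie) === false) {
        yield $rec;
    }
}

/** Yield lines from an open stream one at a time (keeps memory use low). */
function read_lines($fp): Generator
{
    while (($line = fgets($fp)) !== false) {
        yield $line;
    }
}

// Made-up sample input: three records, the second carries the cookie.
$sample = [
    "alpha first record\n",
    "  detail line\n",
    "beta second record\n",
    "  id 4140244\n",
    "gamma third record\n",
];

foreach (filter_records($sample, '4140244') as $rec) {
    print $rec;
}
```

To filter a file instead of an array, the same function could be fed from a stream, for example filter_records(read_lines($fp), $magic_cookie). Because both helpers are generators, only the current record is held in memory at any moment.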