[PHP3] HTML Parsing - No XML thingy, here...
From: Allen Francom (aef <email protected>)
Date: 05/08/00

I'm struggling with this a bit...

It is ALMOST elegant.

Perhaps if I knew my regular expressions better...

Anyway, I call this "odd.php3", and there's a BROKEN test case attached.
Oddly, it WILL FIND "FORM METHOD=POST" in the TEXT of a document if that text
starts right after a tag... dangnabit! But good enough for now!

-------------------------------------------------------------------------
<HTML><HEAD><TITLE>ODD</TITLE></HEAD> <BODY>
<br>

<?

$tags = " !-- !DOCTYPE A ADDRESS APPLET AREA B BASE BASEFONT BGSOUND BIG BLINK H1 H2 H3 H4 H5 H6";
$tags .= " BLOCKQUOTE BODY BR BUTTON CAPTION CENTER CITE CODE COL COLGROUP DD DEL DIV DL DT EM EMBED FIELDSET";
$tags .= " FONT FORM FRAME FRAMESET HEAD HR HTML I IFRAME IMG INPUT INS KBD LABEL LAYER LEGEND LI LINK MAP";
$tags .= " MARQUEE META NOBR NOFRAMES NOSCRIPT OBJECT OL OPTGROUP OPTION P PRE Q S SAMP SCRIPT SELECT SMALL SPAN STRIKE";
$tags .= " STRONG STYLE SUB SUP TABLE TBODY TD TH TEXTAREA TFOOT THEAD TITLE TR TT U UL WBR ";

$formtags = " FIELDSET FORM INPUT SELECT TEXTAREA ";

// Imperfectly decide if it's a tag or TEXT
// ----------------------------------------
function imperfectValidTagTest($whatnow) {
  global $tags;

  if (substr($whatnow, 0, 1) == "/" ) {
    $whatnow = substr($whatnow, 1, strlen($whatnow) - 1);
  }

  // put spaces around FINDME so we don't get false positives
  // when we use the strstr() function, we want WHOLE word matches
  // not like "I" in "CAPTION" matching the <I> tag...
  $whatnow = " " . $whatnow . " ";

  $bob = strstr($tags, $whatnow);
  if ( $bob == false) { $retval = false; }
  else $retval = true;

  return $retval;
}
// ========================================

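// Take $what, UPPERCASE every token that does not start with a double quote
// (tag name, attribute names, unquoted values) and print it as a table row
// --------------------------------------------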
function normalized($what) {

  $attribray = split('[ =]', $what);

// echo "<b>" . $what . "</b><br>";

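  $bad = false;
  $OhThis = "";        // returned unchanged when $what is a recognized tag
  $cellopen = false;   // true once the attribute <td> has been opened

  echo "<TR>";

    for ($x=0; $x<(count ($attribray)); $x++) {
      if (substr($attribray[$x],0,1) != '"') {
        $tmp = strtoupper($attribray[$x]);
        $attribray[$x] = $tmp;

        if ($x == 0) {
          if (imperfectValidTagTest($attribray[$x]) == false) {
            $bad = true;
            break;
          } else {
            echo "<td>$attribray[$x]</td><td>";
            $cellopen = true;
          }
        }

      }
      if ($x > 0) {
        echo " [" . $attribray[$x] . "]";
      }
    }

  if ($bad == true) {
    $OhThis = "META NAME=\"TEXT\" " . "[" . $what . "]";
    echo "<td colspan=2>$OhThis</td>";
  }

  // close the attribute cell so the row stays well-formed
  if ($cellopen == true) {
    echo "</td>";
  }

  echo "</TR>";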

// print ("$OhThis<br>");
// echo "<hr>";

  return $OhThis;
}
// ============================================

// MAIN
// ---------
  $file = fopen("html.html", "r");
  if (!$file) {
    echo "<p>Unable to open file.\n";
    exit;
  }

  $line = fread($file, 65535);
  fclose($file);

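  // < and > need no escaping inside a bracket expression; a backslash
  // there is taken literally and would become a split character too
  $parsedarray = split("[<>]", $line);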

  echo "<TABLE>";
  for ($x=0; $x<(count ($parsedarray)); $x++) {
    $parsedarray[$x] = Chop($parsedarray[$x]);
    if ($parsedarray[$x] != "") {
      $bob = normalized($parsedarray[$x]);
    }
  }
  echo "</TABLE>";
// ========== end MAIN =============================
?>

</BODY>
</HTML>
------------------------------------------------------------------
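
The whole-word trick from the comments in imperfectValidTagTest() can be tried on its own. This is only a sketch: isKnownTag() is a made-up name, the tag list is shortened for the example, and the !== comparison assumes PHP 4 or later.

```php
<?php
// Sketch only: isKnownTag() is a hypothetical stand-in for
// imperfectValidTagTest(); the tag list is shortened for the example.
$tags = " A B CAPTION I IMG ";

function isKnownTag($name, $tags) {
    // Pad the name with spaces so only whole words in the list can match.
    return strpos($tags, " " . strtoupper($name) . " ") !== false;
}

var_dump(isKnownTag("i", $tags));    // I is in the list as a whole word
var_dump(isKnownTag("APT", $tags));  // APT is only a fragment of CAPTION
```

Without the padding, "APT" would match inside "CAPTION", which is exactly the false positive the spaces are there to prevent.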

And the broken test case...
-------------------------------------------
<html><head><title>boo</title></head>
 <body>
  <FORM METHOD="POST" action="http://localhost/">
   <h1>CUSTOMER PROFILE and let's make this quite a bit bigger shall we ???</h1>
   <h1>FORM METHOD=POST</h1>
   <input type="text" name="bob" value="testing">
   <input type="text" name="bob1" value="testing">
   <input type="text" name="bob2" value="testing">
   <input type="text" name="bob3" value="testing">
   <input type="text" name="bob4" value="testing">
   <input type="text" name="bob5" value="testing"></FORM>
  <p>What happens with <i>regular</i> text ???</p>
 </body>
</html>
-------------------------------------------
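
One possible way past this, as a rough sketch rather than a fix to odd.php3: splitting on "[<>]" throws the brackets away, so afterwards nothing says which pieces were inside them. Splitting on whole tags and keeping them means a piece only counts as a tag if it still carries its angle brackets. The name classifyPieces() is made up for illustration, and the sketch assumes a PHP that has preg_split() with PREG_SPLIT_DELIM_CAPTURE, which PHP 3 does not.

```php
<?php
// Sketch only: classifyPieces() is a hypothetical helper, not part of odd.php3.
// Assumes the PCRE functions and the PREG_SPLIT_* flags are available.

function classifyPieces($html) {
    // Capture whole tags, brackets included, so they survive the split.
    $pieces = preg_split('/(<[^>]*>)/', $html, -1,
                         PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $out = array();
    foreach ($pieces as $piece) {
        if (substr($piece, 0, 1) == "<" && substr($piece, -1) == ">") {
            // Still has its brackets, so it is a tag: strip them off.
            $out[] = array("TAG", trim(substr($piece, 1, -1)));
        } else {
            // Everything else is text, whatever words it starts with.
            $text = trim($piece);
            if ($text !== "") {
                $out[] = array("TEXT", $text);
            }
        }
    }
    return $out;
}

$result = classifyPieces('<h1>FORM METHOD=POST</h1>');
foreach ($result as $r) {
    echo $r[0] . ": " . $r[1] . "\n";
}
```

On the troublesome line from the test case this reports h1 as TAG, FORM METHOD=POST as TEXT and /h1 as TAG. A stray < or > in the text, or a > inside a quoted attribute value, would still confuse it.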

Any regex gurus know how to get past some of the dangnabit ?

THX
-AEF

-- 
"If you think the Universe is big, you should see the source code..."
-Frank & Ernest
