On Thursday, June 6, 2002, at 06:57  PM, Anthony Ritter wrote:

> I understand that one can open a page off an existing website to extract
> text data using a PHP script by using the fopen and fread functions.
>
> And by using the strip_tags() function, one can extract data without the
> html markup as a literal string.
>
> Here's my question...
>
> Let's say that a html page from a website out on the 'net consists of
> three paragraphs of text and four html tables.
>
> Within each of those tables is text.
>
> And let's say that the second table down has three columns - water
> temperature, water level and cfs flow along with numerous records.
>
> Is there any way to extract *specific* data off a page - in this case -
> the second table - while leaving the balance of the text and the other
> tables (1, 3 and 4) alone?

Yeah, you'll have to parse the page.  Use regexes.  Unlike a lot of 
functionality in PHP, there's no single built-in function that does 
this for you; you'll have to roll up your sleeves and write some 
parsing routines.

Just remember that an HTTP resource (such as an HTML document or 
whatever data is sent by a web server) is a stream of bytes, or 
characters, and you can use regexes to search for certain patterns in 
those characters and perform certain operations if certain patterns are 
"matched" in the bytestream.
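
Here is a rough sketch of that idea, assuming the page has already been 
read into a string (with fopen and fread, as you describe).  The function 
name extract_table and the sample markup are made up for illustration, 
and the patterns assume the tables are not nested inside each other, so 
treat it as a starting point rather than a robust parser.

```php
<?php
// Hypothetical helper: return the rows of the Nth table (1-based) found
// in $html, each row being an array of cell text.
// Assumption: tables are not nested inside one another.
function extract_table(string $html, int $n): array
{
    // Capture the contents of every <table>...</table> block.
    // 'i' = case-insensitive, 's' = let '.' match newlines too.
    preg_match_all('/<table\b[^>]*>(.*?)<\/table>/is', $html, $tables);
    if (!isset($tables[1][$n - 1])) {
        return [];
    }

    $rows = [];
    preg_match_all('/<tr\b[^>]*>(.*?)<\/tr>/is', $tables[1][$n - 1], $trs);
    foreach ($trs[1] as $tr) {
        // Cells may be <td> or <th>.
        preg_match_all('/<t[dh]\b[^>]*>(.*?)<\/t[dh]>/is', $tr, $cells);
        // Drop any markup left inside the cell and trim whitespace.
        $rows[] = array_map(
            fn($cell) => trim(strip_tags($cell)),
            $cells[1]
        );
    }
    return $rows;
}

// Made-up sample page with three tables; the data is illustrative only.
$html = <<<HTML
<p>Intro text</p>
<table><tr><td>first table</td></tr></table>
<table>
  <tr><th>Water temp</th><th>Water level</th><th>CFS flow</th></tr>
  <tr><td>58</td><td>3.2</td><td>410</td></tr>
</table>
<table><tr><td>third table</td></tr></table>
HTML;

$rows = extract_table($html, 2);
print_r($rows);
```

Only the second table is touched; the paragraphs and the other tables 
are simply never matched.  If the real page nests tables or has sloppy 
markup, the patterns will need adjusting.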

Fundamentally, this is why there is a movement toward XML -- it is a way 
of identifying certain chunks of data in a standards-based fashion so 
that it can be parsed more easily.  There are tons of libraries for 
doing this*, so you may want to look into it.  If your target data is in 
XML format you will have a much easier time of it than if it is an 
HTML-based web page.  That's why some sites are providing the option of 
an XML page or an HTML page, because they recognize that sometimes the 
agent requesting the page is not a human but rather a program.

You may be interested in something called RDF (Resource Description 
Framework, I think); I believe PHP libraries have already been written 
for it that will do most of the work for you.


Erik

* two popular ways of handling XML are:
- the SAX methodology: you set event handlers which activate and perform 
specified actions when a certain tag is encountered in the course of 
reading the bytestream
- the DOM methodology: the entire bytestream is read into memory and 
then can be treated as a tree with nodes, and you can access these nodes 
directly
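
To make the two approaches concrete, here is a small sketch that pulls 
the same values out of an XML document both ways.  The XML, its element 
names (readings, reading, temp, level, cfs) and the numbers are all 
made up, and the code assumes PHP's xml and dom extensions are enabled. 
It is one way to do it, not the only one.

```php
<?php
// Made-up XML in which each reading is explicitly tagged.
$xml = <<<XML
<readings>
  <reading><temp>58</temp><level>3.2</level><cfs>410</cfs></reading>
  <reading><temp>57</temp><level>3.4</level><cfs>455</cfs></reading>
</readings>
XML;

// --- SAX style: handlers fire as tags are encountered in the stream ---
$saxFlows = [];
$inCfs = false;

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    // Called on every opening tag.
    function ($p, $name, $attrs) use (&$inCfs, &$saxFlows) {
        if ($name === 'cfs') {
            $inCfs = true;
            $saxFlows[] = '';   // start collecting text for this element
        }
    },
    // Called on every closing tag.
    function ($p, $name) use (&$inCfs) {
        if ($name === 'cfs') {
            $inCfs = false;
        }
    }
);
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$inCfs, &$saxFlows) {
        if ($inCfs) {
            // Character data can arrive in several chunks, so append.
            $saxFlows[count($saxFlows) - 1] .= $data;
        }
    }
);
xml_parse($parser, $xml, true);

// --- DOM style: load the whole document, then walk the tree ---
$doc = new DOMDocument();
$doc->loadXML($xml);
$domFlows = [];
foreach ($doc->getElementsByTagName('cfs') as $node) {
    $domFlows[] = $node->textContent;
}

print_r($saxFlows);
print_r($domFlows);
```

Both arrays end up holding the cfs values.  The SAX version never keeps 
the whole document in memory, which matters for large feeds; the DOM 
version is shorter and lets you jump straight to the nodes you want.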


----

Erik Price
Web Developer Temp
Media Lab, H.H. Brown
[EMAIL PROTECTED]


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
