Re: [PHP] Regular expression, parsing bad html in a xml document (strange)

Francis Fillion Thu, 12 Jul 2001 12:00:27 -0700
Lazy me. After a short break (always helpful), I found out that it has
to be:
/\<(?!\?xml|\!DOCTYPE|\!ENTITY|image|item|\/item)/


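The (?! ... ) is a negative lookahead: it rejects a < that is followed
by any of the listed texts. I thought that I had to put ?! in front of
every alternative, like this: (?!\?xml|?!\!ENTITY ... but no. Putting
it once at the start applies it to all the alternatives (k.i.s.s.,
Francis).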
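
To answer the original question (turning the bad < into &lt;), the same
pattern can be handed to preg_replace. This is only a minimal sketch:
the sample string is a shortened, hypothetical version of the quoted
data, and $pattern and $clean are names made up for the example. Because
the lookahead consumes nothing, only the < itself gets replaced.

```php
<?php
// Pattern from this message: a "<" that is NOT followed by one of the
// accepted tag starts. The lookahead matches no characters of its own.
$pattern = '/\<(?!\?xml|\!DOCTYPE|\!ENTITY|image|item|\/item)/';

// Hypothetical sample, shortened from the quoted message below.
$data = "<?xml version='1.0'?>\n"
      . "<item>\n"
      . "text\n"
      . "<bad stuff>\n"
      . "text<text-1\n"
      . "<image title=\"t\"/>\n"
      . "</item>\n";

// Assumption: escaping the stray "<" as &lt; is enough for the XML parser.
$clean = preg_replace($pattern, '&lt;', $data);

print $clean;
```

Note that image and item are matched as prefixes here, so a tag such as
<items> would also be accepted; a \b after each name would tighten that
if it matters.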


Cu

Francis Fillion wrote:
> 
> I'm having a problem with regular expressions. Not a good week: it
> seems like I always hit a wall of problems. I know this has surely
> been asked 1000 times; I looked around and didn't find anything. If
> you find something, please point me to it.
> 
> So here is what I want to do: I need to parse an XML document, but
> before parsing it I need to get rid of the bad HTML that I don't want.
> The document also requires some markup that I do need, so I don't want
> to get rid of all the HTML.
> 
> I already wrote a little bit of code that picks out my good elements
> and checks for bad stuff. The only problem is that "text<text-1" is
> good content, but I need to change < to &lt; or it will do bad things
> to my XML parser.
> 
> Here is what I tried:
> 
> $simple = <<<XMLDATA
> <?xml version='1.0'?>
>  <!DOCTYPE chapter SYSTEM "/just/a/test.dtd" [
>  <!ENTITY plainEntity "FOO entity">
>  <!ENTITY systemEntity SYSTEM "xmltest2.xml">
>  ]>
>  <item>
> text
>    <bad stuff>
> text<text-1
>    text
>  <image  title="Ceci est mon titre2" description="Ceci est ma
> description"
> link="http://www.windplanet.com/"
> url="http://www.windplanet.com/images/news/988991159.gif"
> align="left" width="235"  height="131"  size="13310"/>
> text
>         text
>  <image title="Ceci est mon titre" description="Ceci est ma description"
> link="http://www.windplanet.com/"
> url="http://www.windplanet.com/images/news/988991159.gif" align="left"
> width="235"  height="131"  size="13310"/>
>  </item>
> 
> XMLDATA;
> //$simple = str_replace("\n\n"," &lt;br/>  &lt;br/> ",$simple);
> 
>                                 /* find all the < except those followed by this ... */
> $data = $simple;
> print $data;
> 
> if(preg_match_all("/\<(?:(?:\!|\/|\?|)(?:<!xml|<!DOCTYPE|<!ENTITY|<!image|<!item|))/",$data,$cbadhtml)){
>   foreach( $cbadhtml as $key => $myarray){
>       foreach( $myarray as $key2 => $myarray2){
>         print "<p><font color='red'>You can't use HTML here so ".
> htmlentities($myarray2) ." is not allowed</font></p>\n";
>       }
>     }
>                                 // what html? we exit
>     //exit;
> 
> }
> 
> It finds all the < but doesn't exclude the ones that I accept, so how
> can I find the bad < and transform them to &lt;?
> 
> Thank you and have a nice day.
> 
> --
> Francis Fillion, BAA SI
> Broadcasting live from his linux box.
> And the maintainer of http://www.windplanet.com
> 
> --
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> To contact the list administrators, e-mail: [EMAIL PROTECTED]

-- 
Francis Fillion, BAA SI
Broadcasting live from his linux box.
And the maintainer of http://www.windplanet.com
