Boyd, Todd M. wrote:
>> -----Original Message-----
>> From: farn...@googlemail.com [mailto:farn...@googlemail.com] On Behalf
>> Of Edmund Hertle
>> Sent: Thursday, January 15, 2009 4:13 PM
>> To: PHP - General
>> Subject: [PHP] Parsing HTML href-Attribute
>>
>> Hey,
>> I want to "parse" an href attribute in a given string to check whether
>> it contains a relative link, and if so prepend an absolute path.
>> Example:
>> $string  = '<a class="sample" [...additional attributes...]
>> href="/foo/bar.php" >';
>>
>> I tried using regular expressions, but my knowledge of RegEx is very
>> limited.
>> Things to consider:
>> - $string could be quite long, but my only concern is the href
>>   attributes (so working with explode() would not be very handy)
>> - It should also work if href= is not using quotes or is using single
>>   quotes
>> - The link could already be an absolute path, so just searching for
>>   href= and then inserting the absolute path could mess up the link
>>
>> Any ideas? Or can someone create a RegEx to use?
> 
> Just spitballing here, but this is probably how I would start:
> 
> RegEx pattern: /<a.*? href=(.+?)>/i
> 
> Then, using the capture group, determine if the href attribute uses quotes 
> (single or double, doesn't matter). If it does, you don't need to worry about 
> splitting the capture group at the first white space. If it doesn't, then you 
> must assume the first whitespace is the end of the URL and the beginning of 
> additional attributes, and just grab the URL up to (but not including) the 
> first whitespace.
> 
> So...
> 
> <?php
> 
> # here is where $anchorText (text for the <a> tag) would be assigned
> # here is where $curDir (text for the current directory) would be assigned
> 
> # find the href attribute (PHP's preg functions have no "g" modifier)
> $matches = Array();
> if(!preg_match('#<a.*? href=(.+?)>#i', $anchorText, $matches))
>       die('no href attribute found');
> $href = trim($matches[1]);
> 
> # determine if it has surrounding quotes
> if($href[0] == '\'' || $href[0] == '"')
> {
>       # pull everything between the opening quote and its matching close
>       $endPos = strpos($href, $href[0], 1);
>       $href = ($endPos === false) ? substr($href, 1) : substr($href, 1, $endPos - 1);
> }
> else
> {
>       # pull up to the first space (if there is one)
>       $spacePos = strpos($href, ' ');
>       if($spacePos !== false)
>               $href = substr($href, 0, $spacePos);
> }
> 
> # now, check to see if it is relative or absolute
> # (regex pattern searches for protocol spec (i.e., http://), which will be
> # treated as an absolute path for the purpose of this algorithm)
> if($href != '' && $href[0] != '/' && preg_match('#^\w+://#', $href) == 0)
> {
>       # add current directory to the beginning of the relative path
>       # (nothing is done to absolute paths or URLs with protocol spec)
>       $href = $curDir . '/' . $href;
> }
> 
> echo $href;
> 
> ?>
> 
> ...UNTESTED.
> 
> HTH,
> 
> 
> // Todd

Wow, that's a lot!  This should work with or without quotes and assumes
no spaces in the URL:

$prefix = "http://example.com/";
$html = preg_replace("|(href=['\"]?)(?!$prefix)([^>'\"\s]+)(\s)?|",
"$1$prefix$2$3", $html);
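
A one-liner like the one above prefixes every href that does not already start with $prefix, so a link to some other site would be prefixed too, and a leading slash in the link would give a double slash after the prefix. Here is a rough sketch of one way around that with preg_replace_callback(). The function name absolutize_hrefs() and the rule for what counts as "already absolute" (a scheme such as http: or mailto:, a leading //, or a bare #fragment) are my own assumptions, not something from this thread, and like the one-liner it assumes no spaces in the URL and touches href= in any tag, not just <a>.

```php
<?php
// Hypothetical helper (not from the thread): prefix relative hrefs with $base.
function absolutize_hrefs(string $html, string $base): string
{
    // Group 1: "href=", group 2: optional opening quote, group 3: the URL.
    // The closing quote (if any) is not matched, so it is left untouched.
    $pattern = '~(\bhref\s*=\s*)(["\']?)([^"\'\s>]+)~i';

    return preg_replace_callback($pattern, function (array $m) use ($base) {
        $url = $m[3];

        // Assumption: URLs with a scheme (http:, mailto:), protocol-relative
        // URLs (//host/...) and bare fragments (#top) are left as they are.
        if (preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $url)) {
            return $m[0];
        }

        // Join base and path with exactly one slash between them.
        return $m[1] . $m[2] . rtrim($base, '/') . '/' . ltrim($url, '/');
    }, $html);
}

echo absolutize_hrefs('<a class="sample" href="/foo/bar.php" >', 'http://example.com/'), "\n";
// prints: <a class="sample" href="http://example.com/foo/bar.php" >
```

Note that this resolves every relative link against the site root you pass in; if links should be resolved against the current directory instead, pass that directory's URL as $base.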


-- 
Thanks!
-Shawn
http://www.spidean.com

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php