Shawn McKenzie wrote:
> Boyd, Todd M. wrote:
>>> -----Original Message-----
>>> From: farn...@googlemail.com [mailto:farn...@googlemail.com] On Behalf
>>> Of Edmund Hertle
>>> Sent: Thursday, January 15, 2009 4:13 PM
>>> To: PHP - General
>>> Subject: [PHP] Parsing HTML href-Attribute
>>>
>>> Hey,
>>> I want to "parse" the href attribute in a given string to check whether
>>> it holds a relative link and, if so, add an absolute path to it.
>>> Example:
>>> $string  = '<a class="sample" [...additional attributes...]
>>> href="/foo/bar.php" >';
>>>
>>> I tried using regular expressions but my knowledge of RegEx is very
>>> limited.
>>> Things to consider:
>>> - $string could be quite long, but my only concern is the href
>>> attributes (so working with explode() would not be very handy)
>>> - Should also work if href= is not using quotes or is using single
>>> quotes
>>> - The link could already be an absolute path, so just searching for
>>> href= and then inserting the absolute path could mess up the link
>>>
>>> Any ideas? Or can someone create a RegEx to use?
>> Just spitballing here, but this is probably how I would start:
>>
>> RegEx pattern: /<a.*? href=(.+?)>/i
>>
>> Then, using the capture group, determine if the href attribute uses quotes 
>> (single or double, doesn't matter). If it does, you don't need to worry 
>> about splitting the capture group at the first white space. If it doesn't, 
>> then you must assume the first whitespace is the end of the URL and the 
>> beginning of additional attributes, and just grab the URL up to (but not 
>> including) the first whitespace.
>>
>> So...
>>
>> <?php
>>
>> # here is where $anchorText (text for the <a> tag) would be assigned
>> # here is where $curDir (text for the current directory) would be assigned
>>
>> # find the href attribute
>> # (PHP's preg functions have no "g" modifier, so only "i" is used)
>> $matches = array();
>> if(!preg_match('#<a.*? href=(.+?)>#i', $anchorText, $matches))
>> {
>>      exit('no href attribute found');
>> }
>>
>> # work on the captured attribute value, not on the whole tag
>> $href = trim($matches[1]);
>>
>> # determine if it has surrounding quotes
>> if($href[0] == '\'' || $href[0] == '"')
>> {
>>      # pull everything between the opening quote and its matching close
>>      $closePos = strpos($href, $href[0], 1);
>>      $href = ($closePos === false)
>>              ? substr($href, 1)
>>              : substr($href, 1, $closePos - 1);
>> }
>> else
>> {
>>      # pull up to the first space (if there is one)
>>      $spacePos = strpos($href, ' ');
>>      if($spacePos !== false)
>>              $href = substr($href, 0, $spacePos);
>> }
>>
>> # now, check to see if it is relative or absolute
>> # (regex pattern searches for protocol spec (i.e., http://), which will be
>> # treated as an absolute path for the purpose of this algorithm)
>> if($href[0] != '/' && preg_match('#^\w+://#', $href) == 0)
>> {
>>      # add current directory to the beginning of the relative path
>>      # (nothing is done to absolute paths or URLs with protocol spec)
>>      $href = $curDir . '/' . $href;
>> }
>>
>> echo $href;
>>
>> ?>
>>
>> ...UNTESTED.
>>
>> HTH,
>>
>>
>> // Todd
> 
> Wow, that's a lot!  This should work with or without quotes and assumes
> no spaces in the URL:
> 
> $prefix = "http://example.com/";
> $html = preg_replace("|(href=['\"]?)(?!$prefix)([^>'\"\s]+)(\s)?|",
> "$1$prefix$2$3", $html);
> 
> 
Might need to keep a leading slash out of there, so it isn't doubled
after the prefix:

$html = preg_replace("|(href=['\"]?)(?!$prefix)[/]?([^>'\"\s]+)(\s)?|",
"$1$prefix$2$3", $html);

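For what it's worth, here is a slightly more defensive sketch of the same
idea, not a drop-in answer. The function name absolutize_hrefs() and the
sample markup are made up for illustration. Beyond the pattern above, it
assumes you also want to leave alone any link that already has a scheme
(http:, https:, mailto:, ...) or starts with "//", and it runs the prefix
through preg_quote() so the dots in it are matched literally.

```php
<?php
// Hypothetical helper: the name and the sample input below are illustrative.
function absolutize_hrefs(string $html, string $prefix): string
{
    // Escape regex metacharacters in the prefix; '|' is the pattern delimiter.
    $skip = preg_quote($prefix, '|');

    // Group 1: "href=" plus an optional opening quote.
    // Group 2: the URL with one leading slash (if any) left out.
    // The lookaheads skip links that already start with the prefix, links
    // with any scheme, and protocol-relative "//" links.
    $pattern = "|(href=['\"]?)(?!$skip)(?![a-z][a-z0-9+.-]*:)(?!//)/?([^>'\"\\s]+)|i";

    // Assumes $prefix holds no "$" or "\" that preg_replace could take
    // for a backreference.
    return preg_replace($pattern, '${1}' . $prefix . '${2}', $html);
}

$prefix = 'http://example.com/';
echo absolutize_hrefs('<a class="sample" href="/foo/bar.php" >', $prefix), "\n";
// prints: <a class="sample" href="http://example.com/foo/bar.php" >
```

Like the one-liner, this still assumes no spaces in the URL, and it will
also prefix fragment-only links such as href="#top".
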
-- 
Thanks!
-Shawn
http://www.spidean.com

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
