Shawn McKenzie wrote:
> Boyd, Todd M. wrote:
>>> -----Original Message-----
>>> From: farn...@googlemail.com [mailto:farn...@googlemail.com] On Behalf
>>> Of Edmund Hertle
>>> Sent: Thursday, January 15, 2009 4:13 PM
>>> To: PHP - General
>>> Subject: [PHP] Parsing HTML href-Attribute
>>>
>>> Hey,
>>> I want to "parse" an href attribute in a given string to check whether
>>> it is a relative link and, if so, add an absolute path.
>>> Example:
>>> $string = '<a class="sample" [...additional attributes...]
>>> href="/foo/bar.php" >';
>>>
>>> I tried using regular expressions, but my knowledge of RegEx is very
>>> limited.
>>> Things to consider:
>>> - $string could be quite long, but my only concern is those href
>>>   attributes (so working with explode() would not be very handy)
>>> - Should also work if href= is not using quotes or is using single
>>>   quotes
>>> - The link could already be an absolute path, so just searching for
>>>   href= and then inserting the absolute path could mess up the link
>>>
>>> Any ideas? Or can someone create a RegEx to use?
>>
>> Just spitballing here, but this is probably how I would start:
>>
>> RegEx pattern: /<a.*? href=(.+?)>/i
>>
>> Then, using the capture group, determine if the href attribute uses
>> quotes (single or double, doesn't matter). If it does, you don't need
>> to worry about splitting the capture group at the first whitespace. If
>> it doesn't, then you must assume the first whitespace is the end of the
>> URL and the beginning of additional attributes, and just grab the URL
>> up to (but not including) the first whitespace.
>>
>> So...
>>
>> <?php
>>
>> # here is where $anchorText (text for the <a> tag) would be assigned
>> # here is where $curDir (text for the current directory) would be assigned
>>
>> # find the href attribute (PCRE in PHP has no "g" modifier)
>> $matches = Array();
>> preg_match('#<a.*? href=(.+?)>#i', $anchorText, $matches);
>>
>> # from here on, work with the captured attribute text only
>> $anchorText = trim($matches[1]);
>>
>> # determine if it has surrounding quotes
>> if($anchorText[0] == '\'' || $anchorText[0] == '"')
>> {
>>     # pull everything between the opening quote and its matching close
>>     $endPos = strpos($anchorText, $anchorText[0], 1);
>>     if($endPos !== false)
>>         $anchorText = substr($anchorText, 1, $endPos - 1);
>> }
>> else
>> {
>>     # pull up to the first space (if there is one)
>>     $spacePos = strpos($anchorText, ' ');
>>     if($spacePos !== false)
>>         $anchorText = substr($anchorText, 0, $spacePos);
>> }
>>
>> # now, check to see if it is relative or absolute
>> # (regex pattern searches for protocol spec (i.e., http://), which will be
>> # treated as an absolute path for the purpose of this algorithm)
>> if($anchorText[0] != '/' && preg_match('#^\w+://#', $anchorText) == 0)
>> {
>>     # add current directory to the beginning of the relative path
>>     # (nothing is done to absolute paths or URLs with protocol spec)
>>     $anchorText = $curDir . '/' . $anchorText;
>> }
>>
>> echo $anchorText;
>>
>> ?>
>>
>> ...UNTESTED.
>>
>> HTH,
>>
>>
>> // Todd
>
> Wow, that's a lot!  This should work with or without quotes and assumes
> no spaces in the URL:
>
> $prefix = "http://example.com/";
> $html = preg_replace("|(href=['\"]?)(?!$prefix)([^>'\"\s]+)(\s)?|",
>     "$1$prefix$2$3", $html);
>
> Might need to keep a preceding slash out of there:

$html = preg_replace("|(href=['\"]?)(?!$prefix)[/]?([^>'\"\s]+)(\s)?|",
    "$1$prefix$2$3", $html);

--
Thanks!
-Shawn
http://www.spidean.com
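
For anyone who wants to experiment with the one-liner above, here is a rough sketch of the same idea using preg_replace_callback(), so the relative/absolute decision can be made in plain PHP instead of in a lookahead. The function name absolutize_hrefs() and the sample inputs are made up for illustration and are not from the thread. It assumes PHP 7 or later, no whitespace inside the URL (same as the one-liner), and that every relative link should hang off the site root in $base rather than off the current directory. It is still a regex over HTML, so treat it as a starting point only.

```php
<?php
// Hypothetical helper (not from the thread): prefix relative href values
// with $base, leaving links that are already absolute untouched.
function absolutize_hrefs(string $html, string $base): string
{
    // Group 1: "href=" plus an optional opening quote.
    // Group 2: the URL, up to the next quote, whitespace, or ">".
    $pattern = '~(\bhref\s*=\s*["\']?)([^"\'\s>]+)~i';

    return preg_replace_callback($pattern, function (array $m) use ($base): string {
        $url = $m[2];

        // Skip URLs that carry a scheme (http:, mailto:, ...), protocol-relative
        // URLs (//host/path), and fragment-only links (#top).
        if (preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $url)) {
            return $m[0];
        }

        // Join base and path with exactly one slash between them.
        return $m[1] . rtrim($base, '/') . '/' . ltrim($url, '/');
    }, $html);
}

$base = 'http://example.com/';
echo absolutize_hrefs('<a class="sample" href="/foo/bar.php" >', $base), "\n";
echo absolutize_hrefs("<a href='foo.php'>", $base), "\n";
echo absolutize_hrefs('<a href=foo.php>', $base), "\n";
echo absolutize_hrefs('<a href="http://other.example/x">', $base), "\n";
```

The callback form also avoids interpolating $prefix into the pattern, where the dots in the host name would otherwise act as regex wildcards.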