Google Changing Adwords URL Structure
Well, yesterday was a very fun day. It appears that Google has changed the URL structure of AdWords links. If you collect AdWords data, I'm sure you were as thrilled as I was to see your script suddenly racing through 600 keywords per minute, since it was no longer finding any ads to parse.
I was making changes to my server when it happened, so I didn't catch on immediately, but after two cron runs turned up no ads I decided to investigate.
It appears that Google has decided to become a little more standards-compliant and has added quotes around class declarations. They also changed the URL parameter from &adurl to the easily recognizable &q.
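To make the parameter change concrete, here is a rough sketch of pulling the destination out of either style of link. The extractDestUrl() helper and both sample hrefs are made up for illustration and are not real Google markup, so treat this as a starting point rather than a drop-in fix.

```php
<?php
// Sketch only: extractDestUrl() and the sample hrefs are hypothetical.
function extractDestUrl($href)
{
    // Undo HTML entity encoding so "&amp;q=" becomes "&q=".
    $href = html_entity_decode($href);

    // Accept the old "adurl" parameter as well as the new "q" parameter.
    // The value stops at the next "&", so a destination URL that carries
    // its own unencoded query string would be cut short.
    if (preg_match('/[?&](?:adurl|q)=([^&"\'\s>]+)/', $href, $m)) {
        return urldecode($m[1]);
    }
    return null;
}

$old = '/pagead/iclk?sa=l&amp;ai=XYZ&amp;adurl=http://www.example.com/';
$new = '/aclk?sa=l&amp;ai=XYZ&amp;q=http://www.example.com/';

echo extractDestUrl($old), "\n";
echo extractDestUrl($new), "\n";
```

Matching on the parameter name instead of the whole link means a script like this would have kept working through the rename, at least as long as nothing else in the href uses q= first.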
I've decided to release my rewritten function, which correctly matches ads on the new URL structure. Please note that you'll have to add a way to remove slashes if you're going to insert the results into a database. (I've been fiddling around with that, but it doesn't seem to be working.) I should mention that this function is designed for http://google.com/sponsoredlinks?q=keyword. I haven't gotten around to fixing the natural search result functions yet. Enjoy.
Here is the working code:
<?php
/**
 * Parse the sponsored ads out of a Google sponsoredlinks results page.
 *
 * @param string $str Raw HTML of the results page.
 * @return array List of ads, each with desturl, subject, body and dispurl keys.
 */
function getSponsoredAds($str)
{
    // Markers that delimit each piece of an ad in the new markup.
    $spartstart       = '<div id="tpa';
    $spartend         = '</div>';
    $slinkstart       = '<a id="pa';
    $slinkend         = '</a>';
    $sdestlink        = "&q=";
    $scontentstart    = '<font size="-1">';
    $scontentend      = '</font>';
    $stxtcontentstart = '</span>';
    $stxtcontentend   = '</font>';
    $sspanstart       = '<span class="a">';
    $sspanend         = '</span>';
    $gad = array();

    // Strip line breaks and tabs, and turn "&amp;" into "&".
    $str = str_replace(array("\n", "\r", "\t", "amp;"), "", $str);

    // Each ad sits in its own <div id="tpa..."> block.
    preg_match_all("|(" . $spartstart . "(.*)" . $spartend . ")|U", $str, $out);
    for ($x = 0; $x < count($out[1]); $x++) {
        $ad = array();

        // Headline link of the ad.
        preg_match_all("|(" . $slinkstart . "(.*)" . $slinkend . ")|U", $out[1][$x], $out_1);
        $anchor = isset($out_1[1][0]) ? $out_1[1][0] : "";

        // The destination URL is whatever follows the &q= parameter in the href.
        // (The original ran this same pattern a second time as a fallback,
        // which could never match anything new, so it is run once here.)
        preg_match_all("|<[aA].+[hH][rR][eE][fF]=.+" . $sdestlink . "([^[>\s'\"]+)[\'\" >]|U", $anchor, $link, PREG_PATTERN_ORDER);
        preg_match_all("|<[aA].+>(.+)</[aA]>|U", $anchor, $linktext, PREG_PATTERN_ORDER);

        if (isset($link[1][0]) && $link[1][0] != "") {
            $ad["desturl"] = urldecode($link[1][0]);
        } else {
            $ad["desturl"] = "No URL";
        }
        $ad["subject"] = isset($linktext[1][0]) ? $linktext[1][0] : "";

        // Ad body: the text after the first </span> inside the <font size="-1"> block.
        preg_match_all("|(" . $scontentstart . "(.*)" . $scontentend . ")|U", $out[1][$x], $out_1);
        $content = isset($out_1[1][0]) ? $out_1[1][0] : "";
        preg_match_all("|(" . $stxtcontentstart . "(.*)" . $stxtcontentend . ")|U", $content, $out_1);
        $ad["body"] = isset($out_1[1][0]) ? $out_1[1][0] : "";

        // Display URL lives in <span class="a">.
        preg_match_all("|(" . $sspanstart . "(.*)" . $sspanend . ")|U", $out[1][$x], $out_1);
        $ad["dispurl"] = strip_tags(html_entity_decode(isset($out_1[2][0]) ? $out_1[2][0] : ""));
        $ad["subject"] = strip_tags(html_entity_decode($ad["subject"]));

        // Turn <br> and <br /> into spaces before stripping the remaining tags.
        $ad["body"] = preg_replace("(<[bB][rR]([ ]+)?(/)?>)", " ", $ad["body"]);
        $ad["body"] = strip_tags(html_entity_decode($ad["body"]));

        // Drop Google's trailing &usg= parameter from the destination URL.
        $usgcode = strpos($ad["desturl"], "&usg=");
        if ($usgcode !== false) {
            $ad["desturl"] = substr($ad["desturl"], 0, $usgcode);
        }

        $gad[] = $ad;
    }
    return $gad;
}
?>
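On the database point: one way to sidestep the slash problem altogether might be prepared statements, where the values travel separately from the SQL and never need escaping by hand. This is only a minimal sketch, assuming PDO with the SQLite driver is available. The saveAds() helper, the ads table and its columns are all made up for illustration, and the sample row is hand-written in the shape getSponsoredAds() returns.

```php
<?php
// Sketch only: saveAds(), the "ads" table and the in-memory SQLite
// database are assumptions, not part of the original script.
function saveAds(PDO $db, $keyword, array $ads)
{
    $stmt = $db->prepare(
        'INSERT INTO ads (keyword, subject, body, dispurl, desturl)
         VALUES (?, ?, ?, ?, ?)'
    );
    foreach ($ads as $ad) {
        // Bound parameters are sent separately from the SQL text, so quotes
        // in the ad copy need no addslashes()/stripslashes() handling.
        $stmt->execute(array(
            $keyword,
            $ad['subject'],
            $ad['body'],
            $ad['dispurl'],
            $ad['desturl'],
        ));
    }
    return count($ads);
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE ads (keyword TEXT, subject TEXT, body TEXT, dispurl TEXT, desturl TEXT)');

// Hand-written row with quotes in it, shaped like one getSponsoredAds() result.
$ads = array(array(
    'subject' => "Bob's \"Cheap\" Widgets",
    'body'    => 'Free shipping on all orders',
    'dispurl' => 'www.example.com',
    'desturl' => 'http://www.example.com/',
));

saveAds($db, 'widgets', $ads);
echo $db->query('SELECT subject FROM ads')->fetchColumn(), "\n";
```

For MySQL, the same approach should work by swapping the DSN passed to the PDO constructor, though I haven't tested it against my own tables.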