How To Identify & Solve URL Canonicalization Issues

October 25th, 2010 • By: Jason Hendricks • On-Site SEO

Canonicalization is a term that search engines and SEOs borrowed from computer science, where it means converting data that has more than one possible representation into a standard, or “canonical”, form.

When search marketers talk about canonicalization, they mean, in plain English, how to deal with web content that has more than one possible Uniform Resource Locator (URL). URL canonicalization is an important aspect of onsite SEO that every webmaster needs not only to understand but also to act on. Having more than one URL that resolves to the same content can cause problems for websites in several ways, including:

  • duplicate content problems with search engines, which may filter out all but one of several URLs that show the same content
  • preventing search engines from determining and showing the correct URL in search results
  • weakening the authority of a page by splitting PageRank across two or more URLs
  • not gaining the ranking benefits of inbound links due to inconsistent URL usage

The most common canonicalization issue that webmasters encounter without a doubt involves a website’s homepage. In fact, it’s so common that even Google themselves don’t have it quite right! If you point your browser to google.com (without the www), you’ll see that they correctly 301 (permanently) redirect to the version with the www in front. However, if you browse to google.com/index.html, you’ll see the exact same content at a different address. Physician, heal thyself.

The following URLs can all potentially point to the same page, so try these using your own website. If your pages use a server-side technology such as PHP, ASP, ASP.NET, or ColdFusion, substitute index.html with your own homepage URL (index.php, default.asp, default.aspx, or index.cfm, respectively):

  • website.com
  • website.com/
  • www.website.com
  • www.website.com/
  • website.com/index.html
  • www.website.com/index.html
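To make the check above repeatable, the list can be generated for any domain. The following is only a minimal sketch: the function name `homepage_variants` and its `index_file` parameter are made up for this illustration, and the output is simply the six forms listed above, ready to paste into a browser or a crawler.

```python
def homepage_variants(domain, index_file="index.html"):
    """Return the six homepage URL forms for a bare domain such as 'website.com'.

    `index_file` is the homepage file name; pass 'index.php', 'default.asp',
    and so on if the site uses a server-side technology.
    """
    hosts = [domain, "www." + domain]
    variants = []
    # Bare host, with and without the trailing slash
    for host in hosts:
        variants.append(host)
        variants.append(host + "/")
    # Explicit homepage file on each host
    for host in hosts:
        variants.append(host + "/" + index_file)
    return variants


for url in homepage_variants("website.com"):
    print(url)
```

If more than one of these addresses shows the homepage without redirecting, the site has the canonicalization issue described here.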

URL inconsistency doesn’t just apply to homepages, though. In many cases it’s a site-wide nightmare caused by unsavvy e-commerce suites, content management systems, and blogging software. Pages may be accessible via several different URLs on sites that use session IDs (website.com/widgets/index.php?sessionid=123), parameters to sort and drill down to specific products (website.com/widgets.php?product_id=321&color=green&cat_id=1&price=50), and tracking IDs (website.com/?source=blog).

I won’t go into too much detail here, but the important takeaway is that session IDs, parameters, and tracking IDs cause duplicate content issues because search engines see different URLs with the same content every time they crawl your site. Instead of tracking and session IDs, learn to use your web analytics referrer and navigation path reports. If you absolutely must use session IDs, parameters, and/or tracking IDs, change your software to put them after a hash mark (a ‘#’ sign) instead of a question mark. Search engines generally ignore everything after the hash, so you’ll avoid confusion. Keep in mind that browsers never send the part after the hash to the web server, so this only works for values that are read by client-side scripts.
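To see how many of a site's URLs are really the same page, it can help to strip the session and tracking parameters and compare what is left. This is a rough sketch using Python's standard `urllib.parse` module. The `IGNORED_PARAMS` set and the `strip_tracking` function are assumptions for this example, and the parameter names are taken from the sample URLs above; a real site would need its own list.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list: parameters that identify a visit, not a page.
IGNORED_PARAMS = {"sessionid", "source"}


def strip_tracking(url):
    """Return the URL without session/tracking parameters and without a fragment."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in IGNORED_PARAMS
    ]
    # Rebuild the URL; the empty string drops anything after '#'.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(strip_tracking("http://website.com/widgets/index.php?sessionid=123"))
print(strip_tracking("http://website.com/?source=blog"))
```

Running a crawl export or a list of URLs from a log file through a function like this shows which addresses collapse to the same page once the visit-specific parts are removed.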

It’s not only easy to identify these issues, it’s also easy to implement solutions to tell search engines which version of your URLs you prefer and remove duplicates from their index. There are several options available to webmasters, which I’ll cover briefly below.

301 Redirects

Probably the fastest and most widely used method of correcting URL inconsistencies, 301 (NOT 302) redirects tell search engines that the content has been permanently moved to another address. If your web server runs Apache, a simple rewrite rule added to your .htaccess file will handle everything for you. Here’s an example of a rule that redirects all non-www requests to the www version:

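RewriteEngine On

# Apply the rule only to requests for the bare domain (dot escaped, case-insensitive)
RewriteCond %{HTTP_HOST} ^website\.com [NC]
# Permanently (301) redirect to the same path on the www host
RewriteRule ^(.*)$ http://www.website.com/$1 [L,R=301]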

So what does this rule do? Basically, the ‘^(.*)$’ pattern captures the entire requested path, that is, anything that comes after http://website.com/, and the ‘$1’ in the target appends that captured text to the end of http://www.website.com/. The RewriteCond line restricts the rule to requests for the non-www host name, and R=301 makes the redirect permanent. For more details on exactly how this works and how you can create custom rewrite rules for your website, search the web for “regular expressions”.
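The capture-and-append behavior can be tried out without touching a web server. The sketch below imitates the rule with Python's `re` module. It is an illustration only, not how Apache works internally: `redirect_target` and the two constants are names invented for this example, and the host check uses an escaped dot, which is an assumption about what the condition is meant to match.

```python
import re

# Assumed stand-ins for the two patterns in the .htaccess rule.
HOST_CONDITION = re.compile(r"^website\.com", re.IGNORECASE)  # RewriteCond with [NC]
PATH_PATTERN = re.compile(r"^(.*)$")                           # RewriteRule pattern


def redirect_target(host, path):
    """Return the redirect URL for a request, or None if the rule does not apply.

    `path` is the requested path without its leading slash, which is how
    rules in an .htaccess file see it.
    """
    if not HOST_CONDITION.search(host):
        return None  # already on the www host: no redirect
    captured = PATH_PATTERN.match(path).group(1)  # this is what $1 holds
    return "http://www.website.com/" + captured


print(redirect_target("website.com", "widgets/index.php"))
print(redirect_target("www.website.com", "widgets/index.php"))
```

The first call returns the www address with the same path, and the second returns None because the host condition does not match.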

Alternatively, you can set up 301 redirects in the main Apache configuration file, httpd.conf:

<VirtualHost xx.xx.xx.xx>
ServerName www.website.com
ServerAdmin webmaster@website.com
DocumentRoot /home/website/public_html
</VirtualHost>

<VirtualHost xx.xx.xx.xx>
ServerName website.com
RedirectMatch permanent ^/(.*) http://www.website.com/$1
</VirtualHost>
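Whichever method you use, it is worth confirming that the server really answers with a 301 and not a 302. The following sketch uses Python's standard `http.client`, which does not follow redirects on its own, so the first response can be inspected directly. The function names are made up for this example, and website.com is the same placeholder domain used throughout.

```python
import http.client
from urllib.parse import urlsplit


def fetch_redirect(host, path="/"):
    """Send one HEAD request and return (status code, Location header or None)."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def is_permanent_redirect_to(status, location, preferred_host):
    """True only for a 301 whose Location points at the preferred host."""
    if status != 301 or not location:
        return False
    return urlsplit(location).netloc.lower() == preferred_host.lower()


# Example use against a live site (placeholder domain, not run here):
# status, location = fetch_redirect("website.com", "/widgets/")
# print(is_permanent_redirect_to(status, location, "www.website.com"))
```

A result of False for the non-www address means the redirect is missing, temporary, or pointing at the wrong host.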

Google Webmaster Tools

It won’t fix the problem of splitting authority or PageRank, but from the Webmaster Tools Console you can specify the URL that you prefer for Google to use. To set the preferred domain for a site, click Site configuration, and then click Settings. In the Preferred domain section, pick the option you prefer.

Canonical Tag

In early 2009, Google, Yahoo, and Microsoft announced support for a new link element, commonly called the canonical tag, to make correcting duplicate URLs a little bit easier. The search engines treat canonical tags only as “suggestions”, but it’s still considered a best practice to add them to pages when needed. An example of when canonical tags can help would be sites that use pagination, where visitors can click links numbered 1, 2, 3, etc. to jump to later pages in search results, product lists, or articles. An example of a paginated URL would be website.com/products.php?page=2. Use the tag only where the URLs really show the same content, though: pointing every page in a series at page 1 tells search engines to disregard whatever appears only on the later pages.

To set up your preferred URL version via canonical tags, create a link element as follows:

<link rel="canonical" href="http://www.website.com/folder-name/page-name.html">

After the tag is created, you’ll need to add this new link to the <head> section of non-canonical URLs.
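Once the tag is in place, a quick script can confirm which canonical URL a page actually declares. This is a small sketch built on Python's standard `html.parser`. The `CanonicalFinder` class and `find_canonical` function are hypothetical names for this example, and the sample page is invented around the link element shown above.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # rel may hold several space-separated values, or be missing entirely
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


page = (
    "<html><head>"
    '<link rel="canonical" href="http://www.website.com/folder-name/page-name.html">'
    "</head><body>Widgets</body></html>"
)
print(find_canonical(page))
```

Feeding the HTML of each duplicate URL through `find_canonical` should give the same preferred address every time.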

Site Architecture

This one is pretty obvious, but if you set up your URL structure consistently throughout your entire website (in internal links, navigational links in the header and footer, etc.), you will prevent issues like these from arising. This is the most important method of all, and it doesn’t require innovation, just some research on the best standard to follow. With proper planning and a consistent linking convention from the start, canonicalization problems can be avoided entirely. However, if you find that your site has any or all of the issues listed above, the methods described here can quickly get things back on track.

For more information on URL canonicalization, check out the references below.

Further Reading:

Matt Cutts on URL Canonicalization

Matt Cutts on the Canonical Tag

Google Webmaster Tools Help article about Canonicalization

What major websites that should know better have you seen with unresolved canonicalization issues?

Jason Hendricks

Jason got his start in search engine optimization with his first company, Tidal Wave Media, and achieved top rankings for his clients and his own websites since 2001 before joining Vertical Measures. He handles technical SEO as well as web development projects for the company.


This entry was posted on Monday, October 25th, 2010 at 5:42 am and is filed under On-Site SEO.

5 Responses to “How To Identify & Solve URL Canonicalization Issues”

  1. james@english editing Says:

    I personally have embraced the new technologies and also the CMS platforms, I believe the new tools only make the internet designs greater. I’m glad that new technologies are coming out in internet style that make things less difficult, improved, and much far better seeking layout.

  2. Melvin Haruta Says:

    Cheers for the post, I enjoyed the read. Bookmarked.

  3. chan Says:

    thanks it helped me .. but in webmaster tool it asking for some more verification .. not sure about tht ..

  4. Proofreading Sam Says:

    Really useful and clear information about this issue. I believe that Google now don’t actually penalise sites for duplicate information; however, they do choose what they see as the key URL from the list of canonical URLs that they detect on a site. So if they are choosing to display a page which is not particularly optimised for searches, then sort out the canonical issues. Otherwise, concentrate on the other things first, such as getting the SPG (spelling, punctuation and grammar) and site navigation structure right for visitors, and leave this aspect until last.

  5. Billy Says:

    This guide has been so useful. Thank you so much!
