What is URL canonicalization and how to use it
- Abhinay Kumar
- WordPress
- May 15, 2022
Canonicalization is the process of selecting one representative URL from several URLs that serve the same content. When the same content is found on multiple pages, search engines look for the most authoritative version to show in search results, and pick one URL from the set of duplicates. The same page may also be stored under different URLs for different systems; for example, the mobile and desktop versions of a page may have different URLs. In that case a mobile search may show the mobile version even though the canonical URL is the desktop one. It is up to the search engine to select the most appropriate version based on certain factors. For example, consider the following pages:
https://www.test.com/area.html
https://www.amp.test.com/area.html
https://test.com/rectangle/area.html
http://www.test.com/area.html
https://test.com/area.html
Although all these URLs refer to the same page, search engines treat them as duplicate pages. They differ in factors such as HTTPS/HTTP, www/non-www, and mobile/desktop version, so each is treated as a different page. The search engine has to choose the most authoritative version to show in search results, and this choice is called canonicalization. The website owner should indicate the preferred canonical URL, but search engines may ignore that indication for their own reasons. Generally, it is indicated through canonical link tags, canonical links in HTTP headers, or sitemaps. All URLs listed in a sitemap are treated as suggested canonicals. The search engine may still override them based on factors like HTTP/HTTPS, www/non-www, type of device, and others.
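As a minimal sketch of the sitemap route, the Python snippet below builds a sitemap that lists only the preferred URL from the example set. The helper name build_sitemap is an illustrative choice, not part of any tool mentioned in this article; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(canonical_urls):
    """Return sitemap XML (as a string) with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in canonical_urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# List only the version you want treated as canonical, not its duplicates.
sitemap_xml = build_sitemap(["https://www.test.com/area.html"])
print(sitemap_xml)
```

Listing only one URL per piece of content keeps the sitemap consistent with the other canonical signals on the site.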
How can a canonical page be specified?
A canonical page can be specified in the following ways.
1) Canonical link tag through rel=canonical.
<link rel="canonical" href="https://www.test.com/area.html">
If the canonical page has a mobile variant, you can specify it using rel=alternate, with href pointing to the mobile URL.
<link rel="alternate" media="only screen and (max-width: 640px)" href="https://www.amp.test.com/area.html">
The specified URL should be absolute and not relative.
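If your pages are generated by a script, a small helper can emit the tag and refuse relative URLs at the same time. This is only a sketch: canonical_link_tag is a hypothetical helper name, and the check simply requires an http or https scheme plus a host.

```python
from html import escape
from urllib.parse import urlsplit


def canonical_link_tag(url):
    """Build a rel="canonical" link tag, rejecting relative URLs."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"canonical URL must be absolute: {url!r}")
    # Escape the URL so characters like & or " cannot break the attribute.
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'


tag = canonical_link_tag("https://www.test.com/area.html")
print(tag)
# <link rel="canonical" href="https://www.test.com/area.html">
```

Calling canonical_link_tag("/area.html") raises ValueError, which catches a relative path before it reaches the page.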
2) rel=canonical in the HTTP response header
For non-HTML documents such as PDF files, you can specify the canonical URL in the HTTP response header, in the same way as with the link tag, as below.
Link: <http://www.test.com/download/test-paper.pdf>; rel="canonical"
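To check what a server actually sends, you can read the Link header value and pull out the canonical target. The sketch below is a simplified parser written for this article: the function name canonical_from_link_header is an assumption, and it does not handle quoted parameter values that contain commas or semicolons.

```python
import re

# Matches one link value: <target> followed by its ";name=value" parameters.
_LINK_VALUE = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)')


def canonical_from_link_header(header_value):
    """Return the target of the first link whose rel includes "canonical"."""
    for target, params in _LINK_VALUE.findall(header_value):
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.lower() == "rel":
                # rel may hold several space-separated relation types.
                if "canonical" in value.strip('"').lower().split():
                    return target
    return None


header = '<http://www.test.com/download/test-paper.pdf>; rel="canonical"'
print(canonical_from_link_header(header))
# http://www.test.com/download/test-paper.pdf
```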
3) By using 301 redirects
For pages on your blog that you are looking to deprecate, you can set up 301 redirects to the page that replaces them.
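On a real blog the redirect is usually configured in the web server or in a plugin, but the mechanism itself is simple. The sketch below uses Python's built-in http.server to show it; the REDIRECTS table and the path in it are hypothetical examples.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical mapping from deprecated paths to their replacement URLs.
REDIRECTS = {
    "/rectangle/area.html": "https://www.test.com/area.html",
}


def redirect_target(path):
    """Return the replacement URL for a deprecated path, or None."""
    return REDIRECTS.get(path)


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = redirect_target(self.path)
        if target is not None:
            # 301 tells clients and crawlers the move is permanent.
            self.send_response(301)
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_error(404)
```

To try it locally, pass RedirectHandler to http.server.HTTPServer(("localhost", 8000), RedirectHandler) and call serve_forever() on the result.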
By default, all the URLs listed in the sitemap are treated as suggested canonicals, but Google may ignore them based on the conditions stated earlier. Wherever you specify a canonical URL, use an absolute URL rather than a relative one. Google also accepts relative URLs, but relative paths are easy to get wrong.
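To see why relative paths are error-prone, the sketch below resolves three relative values against a page URL using Python's urllib.parse.urljoin, which follows the standard URL resolution rules. The helper name resolve is an illustrative choice.

```python
from urllib.parse import urljoin


def resolve(page_url, href):
    """Resolve an href the way a crawler would, relative to the page URL."""
    return urljoin(page_url, href)


page_url = "https://www.test.com/rectangle/area.html"

print(resolve(page_url, "area.html"))
# https://www.test.com/rectangle/area.html
print(resolve(page_url, "/area.html"))
# https://www.test.com/area.html
print(resolve(page_url, "//test.com/area.html"))
# https://test.com/area.html
```

A missing leading slash or a stray double slash silently changes which URL is declared canonical, which is why spelling out the absolute URL is the safer habit.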
Reasons for having a duplicate page
1) Your content management system may store different URLs for different systems, such as different device types.
For example, desktop and mobile:
https://www.test.com/area.html
https://www.amp.test.com/area.html
2) Dynamic URLs that carry parameters such as a session ID.
https://www.test.com/area.html?ses=123
https://www.test.com/area.html?ses=5467
3) Your blog may store different URLs for the same item when it is placed in different categories or sections.
https://www.test.com/area.html
https://www.test.com/rectangle/area.html
4) The server is configured to serve the same content for the HTTP/HTTPS or www/non-www versions of the blog.
http://www.test.com/area.html
https://www.test.com/area.html
https://test.com/area.html
5) The content of your blog is replicated in part or in full through syndication with other sites.
https://www.news.com/area.html
https://www.test.com/area.html
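Some of these duplicates can be detected mechanically. The sketch below normalizes the scheme, the www prefix, and the session parameter so that variants collapse to one URL. The names SESSION_PARAMS, PREFERRED_HOST, and normalize are assumptions made for this example; category paths, mobile hosts, and syndicated copies need site-specific knowledge and are not handled here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAMS = {"ses"}          # assumed name of the session parameter
PREFERRED_HOST = "www.test.com"   # assumed preferred host
HOST_ALIASES = {"test.com", "www.test.com"}


def normalize(url):
    """Collapse scheme, www/non-www, and session-ID variants of a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""   # lowercased; any port is dropped in this sketch
    if host in HOST_ALIASES:
        host = PREFERRED_HOST
    # Keep every query parameter except the session identifiers.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in SESSION_PARAMS]
    return urlunsplit(("https", host, parts.path, urlencode(kept), ""))


variants = [
    "http://www.test.com/area.html",
    "https://test.com/area.html",
    "https://www.test.com/area.html?ses=123",
]
print({normalize(u) for u in variants})
# {'https://www.test.com/area.html'}
```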
Reasons for selecting a canonical URL
1. You need to tell search engines which page from a set of duplicate pages should appear in search results.
You can indicate this using a sitemap or one of the other canonical methods described above.
2. Metrics for the content can be viewed in one place. With a canonical tag in place, the metrics are gathered under the canonical URL, since it is the one visited most.
3. Links from other sites to the duplicate and canonical copies of the content can be consolidated, because all duplicates point to the canonical page.
For example, links from other sites to http://www.test.com/area.html?ses=1987 will be consolidated into links to the canonical page https://www.test.com/area.html.
4. To avoid spending crawl time on duplicate pages. This saves the crawl budget for your site, because unnecessary pages are not crawled.
5. Declaring a page canonical also helps with syndicated content, as the correct URLs appear in search results.
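The link consolidation in point 3 can be pictured as a simple tally. The sketch below is only an illustration of the idea, not how any search engine computes it; the CANONICAL_OF mapping and the list of inbound links are made-up sample data.

```python
from collections import Counter

# Made-up mapping from duplicate URLs to their declared canonical URL.
CANONICAL_OF = {
    "http://www.test.com/area.html?ses=1987": "https://www.test.com/area.html",
    "https://test.com/area.html": "https://www.test.com/area.html",
}


def consolidate(inbound_links):
    """Count inbound links per canonical URL instead of per duplicate."""
    counts = Counter()
    for target in inbound_links:
        counts[CANONICAL_OF.get(target, target)] += 1
    return counts


# Made-up inbound link targets found on other sites.
inbound = [
    "http://www.test.com/area.html?ses=1987",
    "https://test.com/area.html",
    "https://www.test.com/area.html",
]
print(consolidate(inbound))
# Counter({'https://www.test.com/area.html': 3})
```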
How is a page chosen to be canonical by a search engine like Google?
A page is chosen to be canonical based on factors like the following.
1. Whether it is served over HTTPS or HTTP. HTTPS is given preference unless the HTTP version is better suited.
For example, if the HTTPS version does not have a valid SSL certificate, the HTTP version is chosen.
2. Whether the page quality of the HTTP version is better than that of the HTTPS one.
3. Whether the performance of the HTTP version is better than that of the HTTPS one. If so, the HTTP version is chosen to be canonical.
4. Whether the sitemap includes the page URL. Google may still ignore the sitemap entry if another URL is better suited.
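The toy sketch below encodes only two of the factors listed above (HTTPS with a valid certificate, and sitemap inclusion) as a sort order. It is emphatically not Google's algorithm: the Candidate class, its fields, and the ranking are assumptions made to illustrate how such preferences combine, and page quality and performance are left out.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    valid_certificate: bool = True   # only meaningful for https URLs
    in_sitemap: bool = False


def preference_key(candidate):
    """Rank: valid HTTPS first, then HTTP, then HTTPS with a bad certificate."""
    if candidate.url.startswith("https://"):
        scheme_rank = 2 if candidate.valid_certificate else 0
    else:
        scheme_rank = 1
    # Sitemap inclusion breaks ties between equally ranked schemes.
    return (scheme_rank, candidate.in_sitemap)


def pick_preferred(candidates):
    return max(candidates, key=preference_key)


pages = [
    Candidate("http://www.test.com/area.html"),
    Candidate("https://www.test.com/area.html", in_sitemap=True),
]
print(pick_preferred(pages).url)
# https://www.test.com/area.html
```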
Guidelines while specifying canonical
1. Do not specify the canonical URL in the robots.txt file.
2. Do not make category pages canonical to a featured article.
3. Do not declare more than one canonical URL for a page.
4. Do not point the canonical of every page in a paginated series to the first page; otherwise only the first page will be shown in search results.
5. The canonical tag should appear only in the <head> of the document.
6. When linking within your site, link to the canonical URL rather than a duplicate URL.
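Guidelines 3 and 5, together with the absolute-URL advice, can be checked automatically. The sketch below uses Python's built-in html.parser to flag multiple canonical declarations, canonical tags outside the head, and relative canonical URLs. The names CanonicalAudit and audit are hypothetical, and the check is a starting point rather than a complete validator.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class CanonicalAudit(HTMLParser):
    """Collect every rel="canonical" link and whether it sits inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = []  # list of (href, inside_head) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link":
            attr = dict(attrs)
            rels = (attr.get("rel") or "").lower().split()
            if "canonical" in rels:
                self.found.append((attr.get("href") or "", self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def audit(html_text):
    """Return a list of human-readable problems found in the page."""
    parser = CanonicalAudit()
    parser.feed(html_text)
    problems = []
    if len(parser.found) > 1:
        problems.append("multiple canonical declarations")
    for href, inside_head in parser.found:
        if not inside_head:
            problems.append(f"canonical outside <head>: {href}")
        if not urlsplit(href).scheme:
            problems.append(f"relative canonical URL: {href}")
    return problems


page = ('<html><head><link rel="canonical" href="/area.html"></head>'
        '<body><link rel="canonical" href="https://www.test.com/area.html">'
        '</body></html>')
for problem in audit(page):
    print(problem)
```

A page with a single absolute canonical tag inside the head produces an empty list.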