Why do we use sitemap.xml?


Find out exactly why we use sitemaps and why they are essential for on-site SEO.

Article outline:

  • Types of sitemap
  • Reasons for sitemap
  • Relation with robots.txt

Types of sitemap

Search engines index the web by using web crawlers (also called bots or spiders). Web crawlers are nothing fancy: they are scripts that access your website, load all the content and send it to a specific server, often one belonging to a search engine company. Web crawlers can also be used for malicious activities such as vulnerability scanning.

The first type of sitemap is addressed directly to web crawlers and is used to communicate with them. Usually written in XML, it lists your links in a structured fashion and often includes information about each of them, such as its importance compared to other pages, its last update and its change frequency. It is a tool that Google introduced in 2005 to enhance web crawling and give webmasters some control over how their websites are crawled.
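As a rough illustration of the XML sitemap described above, the sketch below builds one with Python's standard library. The element names (urlset, url, loc, lastmod, changefreq, priority) come from the sitemaps.org protocol; the example.com pages, their dates and their values are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Return a sitemap XML string for a list of page dicts."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        # Optional hints: last update, change frequency, relative importance.
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages, used only for illustration.
pages = [
    {"loc": "https://www.example.com/", "lastmod": "2024-01-15",
     "changefreq": "weekly", "priority": 1.0},
    {"loc": "https://www.example.com/about",
     "changefreq": "yearly", "priority": 0.3},
]

sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

Only loc is required for each entry; the other three tags are hints that a crawler is free to ignore.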

The second type of sitemap is for visitors: it is a classic directory containing all your links. It is usually displayed in an orderly fashion and in HTML rather than XML, so that users can read it easily.
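A visitor-facing sitemap can be as simple as a list of links. The sketch below renders one from a hypothetical list of (title, URL) pairs; a real site would normally produce this page from its CMS or templates.

```python
from html import escape

def build_html_sitemap(links):
    """Render (title, url) pairs as a plain HTML list of links."""
    items = [
        f'  <li><a href="{escape(url)}">{escape(title)}</a></li>'
        for title, url in links
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"

# Hypothetical pages, used only for illustration.
links = [("Home", "/"), ("Blog", "/blog"), ("About & contact", "/about")]

html_sitemap = build_html_sitemap(links)
print(html_sitemap)
```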



Reasons for sitemap

Sitemaps are supported by all the major search engines, so having one is a must. They can also help when:

  • Your website is being slowed down by web crawlers and you want them to avoid running heavy scripts, or to visit at longer intervals.
  • Your website is not entirely accessible to crawlers but you still want your pages to be listed on search engines.
    • Ex: Flash, AJAX and Silverlight content is not normally visited by web crawlers
    • Poorly linked websites, isolated pages, hidden content
  • You produce a large quantity of content, or update it regularly, and you want web crawlers to catch everything.
  • You want to list only a few pages and avoid others
    • This is best done with robots.txt


Relation with robots.txt

A sitemap is a URL inclusion protocol while robots.txt is a URL exclusion protocol; both are used to communicate with web crawlers. When you don’t want a search engine to list some pages, you can leave them out of your sitemap, but that alone doesn’t exclude them.
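The difference between "not included" and "excluded" can be checked with Python's standard urllib.robotparser module. This is only a sketch: the robots.txt content, the URLs and the ExampleBot name are invented, and the Sitemap line (part of the sitemaps.org protocol, not something this article relies on elsewhere) is included to show one way the two files can reference each other.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for example.com. The Sitemap line lets a
# robots.txt file point crawlers at the sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml
"""

# Hypothetical set of URLs listed in sitemap.xml.
SITEMAP_URLS = {
    "https://www.example.com/",
    "https://www.example.com/blog",
}

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

def crawl_status(url, agent="ExampleBot"):
    """Return (listed_in_sitemap, allowed_by_robots) for a URL."""
    return url in SITEMAP_URLS, robots.can_fetch(agent, url)

for url in (
    "https://www.example.com/blog",         # listed and allowed
    "https://www.example.com/old-page",     # not listed, but still allowed
    "https://www.example.com/admin/login",  # not listed and disallowed
):
    print(url, crawl_status(url))

# site_maps() requires Python 3.8 or later.
print(robots.site_maps())
```

The second URL is the case the paragraph above warns about: leaving a page out of the sitemap does not stop a crawler from fetching it.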

To exclude pages, you need to use a robots.txt file, and most site owners do want to exclude some pages. There are several reasons for that:

  • Privacy: you don’t want some content to be public
  • Excluding sections that are misleading or not relevant to your keywords
  • Preventing the execution of heavy scripts
  • Blocking back-end and administrative sections, which contain no information for the search engine but are somewhat sensitive
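A robots.txt covering the reasons above might look like the one embedded in this sketch, which also checks how a compliant crawler would interpret it. The paths and the ExampleBot name are hypothetical, and Crawl-delay is a non-standard directive that not every crawler honours, so treat it as a hint at best.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the admin area, a private section and a
# heavy script, and ask crawlers to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /cgi-bin/report.php
Crawl-delay: 10
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

def allowed(path, agent="ExampleBot"):
    """Return True if a compliant crawler may fetch this path."""
    return rules.can_fetch(agent, "https://www.example.com" + path)

for path in ("/", "/blog/sitemaps", "/admin/login",
             "/private/notes.html", "/cgi-bin/report.php"):
    print(path, "->", "crawl" if allowed(path) else "skip")

print("requested delay:", rules.crawl_delay("ExampleBot"))
```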

The image below was taken from a Google Webmaster Tools account for a website of only 60 pages. It shows how significant the crawl activity of robots can be: an average of 10 000 kilobytes a day!


Google Webmaster’s crawl activity for a website of 60 pages.

Even if a page is disallowed in robots.txt, it can still show up in search results if it is linked from a page that was crawled. Well-behaved search engines, however, will respect robots.txt and will not crawl the sections it disallows. Malicious software might actually start with these sections, so keep in mind that blocking access to certain pages via robots.txt does not provide any security.

Although it is rare, you can also exclude specific robots, or give instructions to robots in meta tags or even in an HTTP response header.
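These three options can be sketched as follows. The ExampleBadBot name and the paths are invented; the robots meta tag and the X-Robots-Tag response header are the forms commonly documented by search engines, but support varies from one crawler to another, so take this as an assumption-laden illustration and not a complete reference.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out one named crawler entirely
# while leaving the rest of the site open to everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBadBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("ExampleBadBot", "https://www.example.com/"))
print(parser.can_fetch("OtherBot", "https://www.example.com/"))

# Page-level instructions: a meta tag placed in the HTML <head> ...
ROBOTS_META_TAG = '<meta name="robots" content="noindex, nofollow">'

# ... or the same directives sent as an HTTP response header, which is
# useful for non-HTML files such as PDFs.
ROBOTS_HEADER = ("X-Robots-Tag", "noindex, nofollow")
```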