Get to know exactly why we use sitemaps and why they are essential for on-site SEO.
Search engines index the web using web crawlers (also called bots or spiders). Web crawlers are nothing fancy: they are scripts that access your website, load its content, and send it back to a specific server, usually one run by a search engine company. Web crawlers can also be used for malicious activities such as vulnerability scanning.
The first type of sitemap is aimed directly at web crawlers; it is used to communicate with them. Usually written in XML, it lists your links hierarchically and often includes information about each one, such as its importance relative to other pages, its last update date, and its change frequency. The protocol was introduced by Google in 2005 to improve web crawling and give webmasters some control over how their websites are crawled.
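As an illustration, a minimal XML sitemap following the Sitemaps protocol might look like this (the domain and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2023-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/about</loc>
    <lastmod>2022-11-02</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```

Only `<loc>` is required; `<lastmod>`, `<changefreq>`, and `<priority>` are the optional hints mentioned above, and crawlers treat them as suggestions rather than commands.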
The second type of sitemap is for visitors: a classic directory containing all your links. It is usually displayed in an orderly fashion and written in plain HTML rather than XML, so it can be easily browsed by users.
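An HTML sitemap can be as simple as a nested list of links; a minimal sketch (the page names here are made up for illustration):

```html
<h1>Sitemap</h1>
<ul>
  <li><a href="/">Home</a></li>
  <li><a href="/products">Products</a>
    <ul>
      <li><a href="/products/widgets">Widgets</a></li>
    </ul>
  </li>
  <li><a href="/contact">Contact</a></li>
</ul>
```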
Sitemaps are universally supported by the major search engines, so having one is a must, but it can also help when:
A sitemap is a URL inclusion protocol, while robots.txt is a URL exclusion protocol; both are used to communicate with web crawlers. When you don't want a search engine to list certain pages, you leave them out of your sitemap, but that alone does not exclude them.
To actually exclude pages, you need a robots.txt file, and most webmasters have pages they want to exclude. There are several reasons for that:
The image below was taken from a Google Webmaster Tools account for a website of only 60 pages. It shows how heavy crawler activity can be: an average of 10,000 kilobytes downloaded per day!
Even though a page is disallowed in robots.txt, it can still appear in search results if it is linked from a page that was crawled. Legitimate search engines, however, will respect your robots.txt and stay out of the sections it blocks. Malicious software, on the other hand, may start its scan with exactly those sections, so keep in mind that blocking access to pages via robots.txt provides no security.
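For example, a robots.txt that blocks a couple of sections for every crawler while pointing crawlers at the sitemap could look like this (the paths are illustrative):

```text
# Rules below apply to every crawler
User-agent: *
Disallow: /admin/
Disallow: /private/

# Sitemap location must be an absolute URL
Sitemap: https://www.example.com/sitemap.xml
```

Note how the file itself advertises the very paths it is trying to hide, which is why it should never be your only line of defense.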
Although rarely done, you can also exclude specific robots, or give instructions to robots through meta tags or even an HTTP response header.
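As a sketch, the same kind of instruction can be expressed per page in meta tags, either for all robots or for one specific crawler:

```html
<!-- In the page's <head>: keep this page out of the index -->
<meta name="robots" content="noindex">
<!-- Address one specific crawler (here Googlebot): do not follow links -->
<meta name="googlebot" content="nofollow">
```

The HTTP equivalent is the `X-Robots-Tag` response header, e.g. `X-Robots-Tag: noindex, nofollow`, which is useful for non-HTML resources such as PDFs that cannot carry a meta tag.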