Google Analytics Experts

From EpikOne

Creating an XML Sitemap

An up-to-date sitemap.xml file is important for any web site. It helps search engines find all of your pages and keep their listings current with your latest content. In addition, SiteScan uses your sitemap.xml file to generate a list of pages to scan for Google Analytics code.

 

Sitemap Generator Tools

A number of free tools exist that will automatically generate a sitemap.xml file for you. If you have a site with many pages, this is a real time-saver. If you have a smaller site with only a few dozen (or fewer) pages, it may be better to create this file manually, because the automatic tools do not always find 100% of your pages.
A decent online tool can be found here: http://www.xml-sitemaps.com/
This tool only works for sites that are already online, and has a 500-page limit (most other online tools have similar limits).
Google has an offline tool that can generate a sitemap if you can run Python scripts: http://code.google.com/p/sitemap-generators/

 

Manually Creating A Sitemap

Create a plain text file called “sitemap.xml”.

At the top of the file, first line, add the following:

<?xml version='1.0' encoding='UTF-8'?>

This declares that the file is XML and that it is encoded as UTF-8.
On the second line, add this:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

This lets web crawlers know that this is a sitemap file.

Now, for every page on your site that you want to appear in search engine results, you must explicitly list it in the following manner:

<url>
<loc>http://www.yoursite.com/</loc>
<lastmod>2008-08-17</lastmod>
<changefreq>weekly</changefreq>
<priority>0.5</priority>
</url>

You must put the full URL (protocol + hostname + path + page) in the <loc></loc> tags. This field is required.

The <lastmod>, <changefreq>, and <priority> fields are optional, but are useful if you want more control over how often your pages will be updated in search results.

The <lastmod> field specifies the date on which that page was last modified. This lets crawlers know whether or not the page has been updated since they last crawled it. Use the W3C Datetime format for this field; its simplest form is the date alone, written as “yyyy-mm-dd”.
The <changefreq> field specifies approximately how often you update the page, and therefore how often crawlers should visit your page to ensure up-to-date data.
You may use the following values for this field: always, hourly, daily, weekly, monthly, yearly, never

The <priority> field indicates how important a page is relative to the other pages on your site, so crawlers know which pages to crawl first. This value is between 0.0 and 1.0, with 0.5 being the default value. Note: assigning a high value to all your pages is not likely to help you, since this priority value is a relative number.
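The rules for the three optional fields can be checked before a sitemap is published. The following is a minimal Python sketch under stated assumptions: the function name check_optional_fields is hypothetical, and it only checks the date-only “yyyy-mm-dd” form of <lastmod>, not the longer W3C Datetime forms.

```python
from datetime import date

# Values listed above as valid for <changefreq>.
CHANGEFREQ_VALUES = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def check_optional_fields(lastmod=None, changefreq=None, priority=None):
    """Return a list of problems found in the optional fields (empty if none)."""
    problems = []

    if lastmod is not None:
        try:
            # Round-trip through date so only the exact yyyy-mm-dd form passes.
            ok = date.fromisoformat(lastmod).isoformat() == lastmod
        except ValueError:
            ok = False
        if not ok:
            problems.append("lastmod is not a yyyy-mm-dd date: %r" % lastmod)

    if changefreq is not None and changefreq not in CHANGEFREQ_VALUES:
        problems.append("changefreq is not an allowed value: %r" % changefreq)

    if priority is not None and not 0.0 <= priority <= 1.0:
        problems.append("priority is outside 0.0 to 1.0: %r" % priority)

    return problems


print(check_optional_fields("2008-08-17", "weekly", 0.5))
print(check_optional_fields("17/08/2008", "sometimes", 1.5))
```

The first call reports no problems; the second reports one problem per field.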

Keep in mind that these optional fields are merely recommendations for web crawlers; they are not commands. For example, setting a <changefreq> to daily does not guarantee that the page will be crawled every day. Similarly, if you forget to update the <lastmod> field, crawlers will still crawl the new version of your page, although it may take slightly longer.

Enclose all these fields in the <url></url> tags.

Once you have created a <url> entry for every page you wish to be indexed, you must end the file with the </urlset> tag.
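If you would rather script the steps above than type them by hand, the same file can be produced with Python's standard library. This is a minimal sketch, not the method used by any of the tools mentioned earlier; the function name build_sitemap and the layout of the page dictionaries are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return the sitemap as UTF-8 bytes.

    pages is a list of dicts (hypothetical layout). Each dict needs a "loc"
    key; "lastmod", "changefreq" and "priority" are optional.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for field in ("loc", "lastmod", "changefreq", "priority"):
            if field in page:
                # ElementTree escapes &, < and > in the text for us.
                ET.SubElement(url, field).text = str(page[field])
    # xml_declaration requires Python 3.8 or later.
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


example_pages = [
    {
        "loc": "http://www.yoursite.com/",
        "lastmod": "2008-08-17",
        "changefreq": "weekly",
        "priority": 0.5,
    },
    {"loc": "http://www.yoursite.com/about.html"},
]
print(build_sitemap(example_pages).decode("utf-8"))
```

To save the result, write the returned bytes to a file opened in binary mode, for example open("sitemap.xml", "wb").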

Note: A single sitemap file can list at most 50,000 URLs. If your website exceeds this number of pages, you will need to create multiple sitemap files and list them in a sitemap index file. More details on how to do this can be found here.
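For very large sites, the splitting can be sketched as follows. This is an illustrative Python example only: the helper names chunk_urls and build_sitemap_index and the file names sitemap1.xml, sitemap2.xml and so on are assumptions, while the <sitemapindex> and <sitemap> elements come from the sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50000


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into lists of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index (UTF-8 bytes) pointing at each sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = sitemap_url
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


# 120,001 made-up page URLs need three sitemap files.
all_urls = ["http://www.yoursite.com/page%d.html" % n for n in range(120001)]
chunks = chunk_urls(all_urls)
# Hypothetical file names for the individual sitemap files.
names = ["http://www.yoursite.com/sitemap%d.xml" % (i + 1) for i in range(len(chunks))]
print(len(chunks))
print(build_sitemap_index(names).decode("utf-8"))
```

Each chunk would then be written to its own sitemap file, and the index file is what you point crawlers at.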

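Finally, you should also make sure the XML file is UTF-8 encoded and that special characters are escaped. The characters &, ', ", <, and > must be written as XML entities (for example, & becomes &amp;), and any non-ASCII characters in your URLs must be percent-encoded. This will ensure your sitemap.xml file can be properly parsed.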
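When writing <loc> values by hand, both kinds of escaping can be applied with Python's standard library. The sketch below is one possible approach; the function name escape_loc and the choice of characters left unencoded are assumptions. If you generate the file with an XML library, the entity escaping is normally done for you, so do not apply it twice.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape


def escape_loc(url):
    """Percent-encode a URL, then entity-escape it for use inside <loc>."""
    # Leave URL delimiters and existing %XX escapes alone; everything else
    # outside the unreserved set (including non-ASCII) becomes %XX in UTF-8.
    encoded = quote(url, safe=":/?#[]@!$&'()*+,;=%")
    # escape() handles &, < and >; the extra mapping covers both quote marks.
    return escape(encoded, {"'": "&apos;", '"': "&quot;"})


# \u00fc is a u with an umlaut.
print(escape_loc("http://www.example.com/\u00fcmlat.php&q=name"))
# prints http://www.example.com/%C3%BCmlat.php&amp;q=name
```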

You can view a complete sitemap.xml file here.

Once you have finished creating the sitemap.xml file, you must upload it to the root directory of your web site (e.g. if your site is http://www.mysite.com, make it available at http://www.mysite.com/sitemap.xml). The location of your sitemap.xml file determines which pages can be listed in it (it can only list pages in its directory and subdirectories), so by placing it in the root directory, you will ensure that any page on your site can be added to it.
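The directory rule can be checked with a few lines of Python. This is a rough sketch; the function name in_sitemap_scope is hypothetical, and the additional check that the page uses the same protocol and hostname as the sitemap is based on the sitemap protocol rather than on the text above.

```python
from urllib.parse import urlsplit


def in_sitemap_scope(sitemap_url, page_url):
    """Return True if page_url may be listed in the sitemap at sitemap_url."""
    sitemap = urlsplit(sitemap_url)
    page = urlsplit(page_url)

    # The page must be on the same protocol and host as the sitemap.
    if (sitemap.scheme, sitemap.netloc.lower()) != (page.scheme, page.netloc.lower()):
        return False

    # Directory that contains the sitemap file, always ending in "/".
    base_dir = sitemap.path.rsplit("/", 1)[0] + "/"
    # Treat a bare hostname such as http://www.mysite.com as the path "/".
    page_path = page.path or "/"
    return page_path.startswith(base_dir)


root_sitemap = "http://www.mysite.com/sitemap.xml"
blog_sitemap = "http://www.mysite.com/blog/sitemap.xml"
print(in_sitemap_scope(root_sitemap, "http://www.mysite.com/about.html"))
print(in_sitemap_scope(blog_sitemap, "http://www.mysite.com/about.html"))
```

The first check succeeds because the sitemap sits in the root directory; the second fails because /about.html is outside /blog/.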

Now, the next time a web crawler visits your site, it will see the sitemap file and know exactly which pages to visit. If you want to make sure a crawler visits your site as soon as possible, you should read our tutorial on adding your sitemap to Google Webmaster Tools. Your site will generally be crawled within a few days.
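To see roughly what a crawler or scanning tool gets out of the file, you can read the page list back yourself. This is a minimal Python sketch and does not describe how any particular crawler works; the function name list_pages is an assumption.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map so the sitemap elements can be found by name.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_pages(sitemap_xml):
    """Return the <loc> value of every <url> entry in a sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]


sample = b"""<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.yoursite.com/</loc>
<lastmod>2008-08-17</lastmod>
</url>
<url>
<loc>http://www.yoursite.com/about.html</loc>
</url>
</urlset>"""

for page in list_pages(sample):
    print(page)
```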

For more detailed information on sitemap.xml files, read the official documentation.