XML Sitemap Best Practices
On this page:
- What Is an XML Sitemap
- Why XML Sitemaps Matter for SEO
- XML Sitemap Structure and Format
- What to Include in Your Sitemap
- What to Exclude from Your Sitemap
- Sitemap Index Files for Large Sites
- Submitting and Monitoring Sitemaps
- Common XML Sitemap Mistakes
An XML sitemap is one of the most underutilized tools in SEO. Many site owners create an XML sitemap, submit it to Google, and never think about it again. However, a well-maintained sitemap can improve your SEO performance by helping Google discover and index your important content efficiently, while a poorly maintained one can work against you.
An XML sitemap is a structured file that lists all the pages on your website that you want Google to crawl and index. It’s separate from the HTML sitemap users see on your site—this is a machine-readable format designed specifically for search engines. While XML sitemaps don’t guarantee indexation, they help Google discover pages, especially on large sites with complex navigation, and they signal which pages are most important.
In this comprehensive guide, we’ll explain what XML sitemaps are, why they matter for SEO performance, how to structure them correctly, and what common mistakes to avoid. By the end, you’ll have the knowledge to create and maintain an effective sitemap strategy that maximizes your site’s discoverability and indexing efficiency.
What Is an XML Sitemap
An XML sitemap is a file formatted in XML that contains a list of URLs from your website along with metadata about those URLs. The most common XML sitemap format follows the protocol defined at sitemaps.org. Google also recognizes alternative formats like RSS feeds and plain-text sitemaps, but XML is the standard.
A basic XML sitemap looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
Each entry contains one required element, the URL (<loc>), and up to three optional ones: the last modification date, how frequently the page typically changes, and the priority of the page relative to other pages on your site. Of the optional elements, Google has said it uses only lastmod, and only when the dates are consistently accurate; it ignores changefreq and priority, although other search engines may still read them.
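As a rough illustration of how such a file can be produced programmatically, here is a minimal sketch using Python's standard library. The function name build_sitemap and the (loc, lastmod) tuple shape are assumptions made for this example, not part of the sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Serialize the sitemap namespace as the default (unprefixed) namespace.
ET.register_namespace("", SITEMAP_NS)


def build_sitemap(pages):
    """Build sitemap XML from (loc, lastmod) tuples; lastmod may be None.

    Hypothetical helper for illustration. Returns UTF-8 encoded bytes.
    """
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # <loc> is the only required child; ElementTree escapes the text.
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

Only loc and lastmod are emitted here; changefreq and priority could be added the same way if you decide to maintain them.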
Key Characteristics of XML Sitemaps
XML sitemaps are optional—Google can discover your site without one. However, they’re valuable, especially for sites that are large, new, have poor internal linking, or have pages that aren’t easily discoverable through links. Sitemaps act as a roadmap that helps Google understand your site structure and find all pages you want indexed.
Unlike robots.txt files or meta robots tags, sitemaps don’t prevent Google from crawling or indexing pages. They only provide information about what pages exist and their relative importance. Google still makes independent decisions about whether to crawl and index each page based on its technical SEO quality and other signals.
Why XML Sitemaps Matter for SEO
While optional, XML sitemaps provide several SEO benefits that make them worth implementing:
Improved Discovery
Sitemap best practices start with helping Google discover all your pages. Sitemaps are particularly valuable for large sites or those with complex navigation where not every page can be easily reached through internal links. Without a sitemap, some deep pages might never be discovered.
Faster Indexing
By explicitly listing your important pages in a sitemap, you signal to Google that these are pages you want indexed. Google can prioritize crawling and indexing these pages. For new content especially, a sitemap can accelerate the time from publishing to indexation.
Crawl Budget Efficiency
A well-curated sitemap focuses Google’s crawl budget on important pages. By excluding low-value pages from your sitemap, you ensure that your crawl budget goes toward pages that matter for your business goals.
Explicit Priority Signals
The priority element lets you state which pages you consider most important relative to the rest of your site. Google has said it ignores this value, so treat it as guidance for other crawlers and for your own curation (for example, on large ecommerce sites with thousands of similar products) rather than as a ranking lever.
Update Frequency Information
The lastmod and changefreq elements describe when and how often pages change. Google uses lastmod to help schedule re-crawls when the dates are consistently accurate, and it ignores changefreq. In practice, pages that change daily tend to be crawled more often, while evergreen content is crawled less frequently.
Support for Alternative Media Types
XML sitemaps can include metadata for images, videos, news, and other content types. This helps Google understand and crawl non-HTML content more effectively, improving your ability to rank for visual and video searches.
XML Sitemap Structure and Format
Understanding the proper structure of an XML sitemap ensures that Google parses it correctly and uses the information effectively:
Root Element
Every XML sitemap must start with a <urlset> element that defines the XML namespace. This wraps all the individual URL entries. The xmlns attribute tells Google which version of the sitemap protocol you’re using.
URL Entries
Within the urlset, each page has a <url> element containing the following possible sub-elements:
- <loc> (required): The full URL of the page. Must be properly URL-encoded and include the protocol (http or https). This is the only required element.
- <lastmod> (optional): The date the page was last modified, in W3C Datetime format: either YYYY-MM-DD or a full ISO 8601 timestamp with time and timezone (for example, 2024-01-15T09:30:00+00:00). Only include it if you can reliably determine and maintain this.
- <changefreq> (optional): How often the page typically changes. Valid values are: always, hourly, daily, weekly, monthly, yearly, or never. This is advisory only; Google has said it ignores this value and makes its own crawl frequency decisions.
- <priority> (optional): Relative priority of the page on your site, from 0.0 to 1.0. Default is 0.5. This only matters relative to other pages on your site, not compared to other sites, and Google has said it ignores this value as well.
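The constraints in the list above can be checked before an entry is written out. The sketch below is one possible approach; entry_problems is a hypothetical helper, and for brevity it validates only the date portion of lastmod.

```python
from datetime import date
from urllib.parse import urlsplit

VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def entry_problems(loc, lastmod=None, changefreq=None, priority=None):
    """Return a list of human-readable problems for one sitemap entry."""
    problems = []
    parts = urlsplit(loc)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        problems.append("loc must be an absolute http(s) URL")
    if lastmod is not None:
        try:
            # Simplification: only the leading YYYY-MM-DD part is validated.
            date.fromisoformat(lastmod[:10])
        except ValueError:
            problems.append("lastmod does not start with a valid YYYY-MM-DD date")
    if changefreq is not None and changefreq not in VALID_CHANGEFREQ:
        problems.append("changefreq is not one of the allowed values")
    if priority is not None and not 0.0 <= float(priority) <= 1.0:
        problems.append("priority must be between 0.0 and 1.0")
    return problems
```

An empty list means the entry passed these particular checks; it does not guarantee the URL resolves or is indexable.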
File Size and Limits
An individual XML sitemap file can contain up to 50,000 URLs and cannot exceed 50 MB (uncompressed). For larger sites, you use a sitemap index file that references multiple sitemap files, allowing you to organize your content logically.
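The arithmetic behind these limits is simple enough to automate. The following sketch assumes you already know the URL count and the uncompressed size in bytes; within_limits and files_needed are illustrative names, and 50 MB is taken as 52,428,800 bytes, the figure given in the sitemaps.org protocol.

```python
import math

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 52,428,800 bytes, uncompressed


def within_limits(url_count, uncompressed_bytes):
    """True if a single sitemap file respects both protocol limits."""
    return url_count <= MAX_URLS and uncompressed_bytes <= MAX_BYTES


def files_needed(url_count):
    """Minimum number of sitemap files needed by URL count alone."""
    return max(1, math.ceil(url_count / MAX_URLS))
```

Note that files_needed considers only the URL count; a file with very long URLs or extra metadata could hit the size limit first.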
Proper XML Formatting
Ensure your XML is well-formed: all elements are properly closed, special characters are entity-escaped (for example, &amp; instead of &), and the file is valid XML. Invalid XML sitemaps won’t parse correctly, and Google will ignore them.
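If you assemble sitemap XML by string concatenation rather than with an XML library, escaping has to be done by hand. A minimal sketch using the standard library's escape function follows; escape_loc is a hypothetical wrapper name.

```python
from xml.sax.saxutils import escape


def escape_loc(url):
    """Entity-escape a URL for use as text inside a <loc> element.

    escape() handles &, < and > by default; quotes are added explicitly.
    """
    return escape(url, {'"': "&quot;", "'": "&apos;"})


print(escape_loc("https://example.com/search?q=a&page=2"))
```

XML libraries such as ElementTree perform this escaping automatically, so apply it only once.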
What to Include in Your Sitemap
Every page you want Google to crawl and index should generally be in your sitemap. Prioritize pages that drive business value, pages with unique content, and pages without many internal links pointing to them:
Primary Navigation Pages
Include all pages accessible through your main navigation: home page, main category pages, primary service pages, and main content hubs. These are core pages that typically already get discovered through links but should be in the sitemap with high priority.
Deep Content Pages
Include pages several levels deep in your site structure, especially if they don’t get many internal links. Blog posts, detailed product pages, and documentation pages that don’t have many inbound links benefit from being in the sitemap.
Alternate Language Versions
If your site serves multiple language versions, include all versions in your sitemap. Add hreflang annotations to each entry (as xhtml:link elements with rel="alternate" and an hreflang attribute) to indicate language relationships, helping Google understand your multi-language structure.
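In a sitemap, language alternates are expressed as xhtml:link child elements inside each url entry, with each entry listing every language version including itself. The sketch below shows one way to build such an entry; url_with_alternates and the example URLs are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def url_with_alternates(loc, alternates):
    """Build one <url> element with hreflang alternates.

    alternates maps an hreflang code to a URL and should include loc itself.
    """
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    for lang, href in alternates.items():
        ET.SubElement(
            url,
            f"{{{XHTML_NS}}}link",
            rel="alternate",
            hreflang=lang,
            href=href,
        )
    return url


entry = url_with_alternates(
    "https://example.com/en/",
    {"en": "https://example.com/en/", "de": "https://example.com/de/"},
)
print(ET.tostring(entry, encoding="unicode"))
```

The same set of alternates should appear on the entry for every language version, so that the annotations are reciprocal.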
Important Evergreen Content
Include pages that are important to your business and rarely change: key service pages, major product categories, pillar content, and frequently referenced pages. These should have higher priority values.
Recently Updated Pages
Include pages you’ve recently significantly updated. The lastmod date signals to Google that the page has fresh content worth re-crawling and potentially re-ranking.
Video and Image-Heavy Pages
Include pages with important videos or images, as these can get discovered through video and image search. You can include video and image sitemaps separately or include this metadata within your main sitemap.
What to Exclude from Your Sitemap
Just as important as what to include is what to deliberately exclude. A focused sitemap with only your important pages is more valuable than a bloated sitemap with everything:
Pages You’ve Blocked with Robots.txt
Never include pages in your sitemap that you’ve blocked in your robots.txt file. This creates a conflict: the sitemap says "index this" while robots.txt says "don’t crawl this." Google respects robots.txt, so the blocking takes precedence, but the conflict is confusing and wastes space in your sitemap.
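One way to catch this conflict is to test every sitemap URL against your robots.txt rules before publishing. The sketch below uses Python's built-in robots.txt parser; blocked_urls is a hypothetical helper, and the parser's matching rules may differ from Googlebot's in edge cases such as wildcards.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, sitemap_urls, user_agent="Googlebot"):
    """Return the sitemap URLs that robots_txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in sitemap_urls if not parser.can_fetch(user_agent, u)]


robots = "User-agent: *\nDisallow: /admin/\n"
urls = ["https://example.com/", "https://example.com/admin/users"]
print(blocked_urls(robots, urls))
```

Any URL this returns should either be removed from the sitemap or unblocked in robots.txt, depending on whether you want it indexed.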
Pages with Noindex Tags
Don’t include pages with noindex meta tags in your sitemap. Again, this creates a conflict. The noindex tag wins, but the contradiction signals careless implementation to Google.
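A simple audit can flag pages whose HTML carries a noindex directive. The following is a minimal sketch with the standard library's HTML parser; has_noindex is an illustrative name, and the check covers only meta tags, not the X-Robots-Tag HTTP header, which can also carry noindex.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a robots or googlebot meta tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name in ("robots", "googlebot") and "noindex" in content:
            self.noindex = True


def has_noindex(html):
    """True if the HTML contains a meta robots noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```

Running this over the fetched HTML of each sitemap URL gives a list of entries to remove or to un-noindex.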
Duplicate Content
Don’t include multiple URLs for the same content in your sitemap. If you have duplicate pages, use canonical tags to designate the preferred version and exclude duplicates from your sitemap entirely.
Pagination Pages
Consider carefully whether to include pagination pages (like page=2, page=3 of results). Many sites leave them out and rely on consolidated views or ordinary links to reach the deeper items. If you do paginate, note that rel="next" and rel="prev" are link annotations in the page HTML, not sitemap elements, and that Google announced in 2019 that it no longer uses them as an indexing signal.
Session IDs and Parameter Variations
Exclude URLs with session IDs, tracking parameters, or other variable parameters that create duplicate content. Only include the canonical version of each page without tracking parameters.
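Stripping these parameters can be automated before URLs reach the sitemap. The sketch below is one possible normalizer; strip_tracking is a hypothetical helper, and the TRACKING_PARAMS set is an assumed example list that you would adapt to the parameters your own site uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed example list; adjust to the parameters your site actually emits.
TRACKING_PARAMS = {"sessionid", "sid", "phpsessid", "gclid", "fbclid"}


def strip_tracking(url):
    """Drop session/tracking query parameters and any fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS and not key.lower().startswith("utm_")
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Parameters that change page content, such as a product color, are kept; whether those variants belong in the sitemap is a separate canonicalization decision.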
Auto-Generated Pages with Little Value
Exclude filter results, search results, or auto-generated pages that have minimal unique value. These pages waste crawl budget and dilute the importance of your real content pages. A focused list serves you better than thousands of low-value pages.
Temporary Pages
Exclude thank-you pages, password reset pages, checkout pages, and other temporary pages that aren’t meant for search visibility. These don’t need to be indexed and shouldn’t consume sitemap space.
Private or Protected Pages
Exclude pages that require authentication to access, like user accounts, admin panels, or password-protected content. Including pages that Google can’t access creates confusion.
Sitemap Index Files for Large Sites
For sites with more than 50,000 pages, you need a sitemap index file. This special file references multiple individual sitemaps, allowing you to organize your content logically:
Sitemap Index Structure
A sitemap index file looks similar to a regular sitemap but contains <sitemap> entries instead of <url> entries. Each sitemap entry lists the location of an individual sitemap file:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2024-01-20</lastmod>
  </sitemap>
</sitemapindex>
Organization Strategies
Organize your multiple sitemaps logically: by content type (products, blog posts, pages), by date range, by language, or by section of your site. This organization makes your sitemap strategy clear and maintainable.
Limits on Sitemap Index
A sitemap index file can reference up to 50,000 individual sitemaps. Very few sites come close to this limit, and if you’re managing thousands of sitemaps, consider whether your site is properly organized.
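Splitting a long URL list and generating the matching index can be scripted. This is a minimal sketch under stated assumptions: chunk and build_index are illustrative names, and how you name and host the individual sitemap files is up to you.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000
ET.register_namespace("", SITEMAP_NS)


def chunk(urls, size=MAX_URLS):
    """Split a list of URLs into consecutive groups of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_locs):
    """Build a sitemap index (UTF-8 bytes) referencing the given sitemap URLs."""
    root = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for loc in sitemap_locs:
        sitemap = ET.SubElement(root, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = loc
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

A typical flow would write one sitemap file per chunk, then pass the public URLs of those files to build_index. Chunking by content type or section instead of by raw position keeps the result easier to maintain.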
Submitting and Monitoring Sitemaps
Creating a sitemap is only half the work. You need to submit it to Google and monitor its performance:
Submit to Google Search Console
The primary way to submit your sitemap is through Google Search Console. Add your property, then use the "Sitemaps" section to submit your sitemap URL (yourdomain.com/sitemap.xml). Google will crawl and parse your sitemap, and you’ll see status information within a day or two.
Reference in Robots.txt
You can also reference your sitemap in your robots.txt file:
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
This helps Google discover your sitemap even if it’s not submitted through Search Console, though Search Console submission is still recommended.
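To confirm that the Sitemap line in robots.txt is readable, you can parse the file the way a crawler might. The sketch below relies on RobotFileParser.site_maps(), which is available in Python 3.8 and later; sitemaps_from_robots is a hypothetical wrapper name.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


robots = "User-agent: *\nDisallow: /admin/\nSitemap: https://example.com/sitemap.xml\n"
print(sitemaps_from_robots(robots))
```

The Sitemap directive must use an absolute URL, and a robots.txt file may list more than one.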
Monitor Indexing Status
In Google Search Console, check the Sitemaps report regularly. It shows how many URLs Google found in your sitemap, how many it’s indexed from that sitemap, and any warnings or errors parsing the file. Compare these numbers to understand what percentage of your sitemap URLs are actually being indexed.
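The comparison is a simple ratio of indexed to submitted URLs. As a small sketch, with index_coverage as a hypothetical helper and the figures made up purely for illustration:

```python
def index_coverage(submitted, indexed):
    """Fraction of submitted sitemap URLs that are indexed (0.0 to 1.0)."""
    if submitted == 0:
        return 0.0
    return indexed / submitted


# Illustrative numbers only, not real Search Console data.
coverage = index_coverage(submitted=1200, indexed=900)
print(f"{coverage:.0%} of submitted URLs are indexed")
```

Tracking this figure per sitemap file, rather than for the site as a whole, makes it easier to see which section of the site has indexing problems.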
Track Coverage Over Time
If you see that Google isn’t indexing all (or most) of your sitemap URLs, investigate why. Are there technical issues? Are pages too similar? Are they poorly linked from the rest of your site? Understanding why pages in your sitemap aren’t indexed helps you address the underlying problems.
Update Dynamically
For sites that frequently add new content, implement dynamic sitemap generation rather than manually updating a static sitemap file. Most CMS platforms and hosting providers have built-in sitemap generators that automatically include new content.
Regenerate Periodically
Even with dynamic generation, regenerate your sitemap periodically to ensure it reflects your current site structure. Remove pages you’ve deleted, update lastmod dates for modified content, and adjust priorities as your business priorities change.
Common XML Sitemap Mistakes
| Common Mistake | Why It’s a Problem | How to Fix It |
|---|---|---|
| Including pages with noindex tags | Creates a conflict signal—sitemap says index, noindex says don’t. This confusion wastes sitemap space and signals poor implementation to Google. | Remove noindex pages from your sitemap. Include only pages you actually want indexed. |
| Including duplicate or paginated pages | Wastes sitemap space on low-value URLs and dilutes the importance of your real content pages. Google might crawl these instead of your important pages. | Exclude duplicate versions, pagination pages, and filtered result pages. Keep your sitemap focused on core content. |
| Not updating lastmod dates | If lastmod is inaccurate, Google doesn’t know which pages are freshly updated and might crawl them less frequently than ideal. Over-reporting freshness (every page lastmod = today) is also deceptive. | Only include lastmod if you can reliably maintain it. Update it only for pages you’ve actually significantly modified. Consider removing the element if you can’t maintain it accurately. |
| Setting all priority values to 1.0 | If everything is highest priority, nothing is. Priority only matters relatively, and Google has said it ignores the value altogether. Setting all to 1.0 signals you aren’t curating your sitemap. | If you use priority at all, reserve 1.0 for your truly most important pages. Use a tiered approach: 0.9 for important pages, 0.7 for regular content, 0.5 for supporting pages. |
| Exceeding file size limits | Google won’t process a sitemap that exceeds 50,000 URLs or 50 MB. Large sites might miss hundreds of important pages if the file is too large. | Use a sitemap index file to split content across multiple sitemaps. Organize by content type or date range for clarity. |
| Invalid XML formatting | Malformed XML prevents Google from parsing the sitemap. Special characters not properly encoded cause parsing failures, and Google ignores the entire file. | Validate your XML using an online XML validator. Ensure all special characters are properly encoded (&amp; for &, &lt; for <, etc.). Use proper UTF-8 encoding. |
| Including URLs that redirect | Sitemaps should list final destination URLs, not URLs that redirect elsewhere. Including redirect URLs wastes Google’s crawl budget following unnecessary redirects. | List the final URL after all redirects in the sitemap. Keep the redirects in place for visitors and old links, but list only the destination. |
| Never updating the sitemap | A static sitemap that doesn’t reflect your current site structure becomes increasingly inaccurate. Old deleted pages remain, new pages aren’t included, and Google gets confused. | Implement dynamic sitemap generation for sites with frequently changing content. For static sites, regenerate the sitemap whenever you add or remove pages, and at least monthly. |
| Including pages blocked by robots.txt | Creates a conflict with robots.txt blocking. The block takes precedence, but the contradiction in your directives signals poor site management. | Never include pages in your sitemap if they’re blocked in robots.txt. Decide whether each page should be indexed or blocked, then either list it in the sitemap or block it, not both. |
| Not monitoring sitemap performance | If you submit a sitemap and never check whether Google is actually indexing the URLs in it, you won’t know if your sitemap is effective or if there are problems preventing indexation. | Check Google Search Console’s Sitemaps report regularly. Monitor the ratio of submitted URLs to indexed URLs. Investigate and fix pages that aren’t being indexed. |
Additional Mistakes
Using incorrect URL formatting (relative URLs instead of absolute, HTTP instead of HTTPS) can keep Google from reaching the right pages. Forgetting to include hreflang annotations for multi-language sites loses ranking opportunities in alternate languages. Including session IDs or tracking parameters in URLs creates duplicate content problems. These seemingly small errors significantly impact sitemap effectiveness.
Building a Comprehensive Sitemap Strategy
An effective XML sitemap isn’t something you create once and forget. It’s a living document that should evolve as your site grows:
Start with Comprehensive Inclusion
Begin by including all pages you want indexed. For new sites, this might be every page. As your site grows, you can refine and focus your sitemap on higher-value content.
Implement Automated Generation
Use tools (CMS plugins, server-side scripts, or third-party services) to automatically generate your sitemap. This ensures your sitemap stays current as you add, modify, and delete pages without requiring manual updates.
Monitor and Adjust
Review your sitemap’s performance in Google Search Console monthly. If you see large discrepancies between submitted and indexed URLs, investigate and fix the underlying problems causing pages not to be indexed.
Refine Based on Performance
Over time, your sitemap becomes more effective as you learn what works. Remove pages from your sitemap that aren’t indexing. Increase priority for pages that rank well and drive value. Adjust changefreq based on actual update patterns.
Align with Your Broader SEO Strategy
Your sitemap should support your overall SEO strategy. Prioritize pages supporting your business goals. Link these important pages prominently from your homepage and key hub pages. Create topic-relevant internal link structures that reinforce the importance of key pages.
An XML sitemap is a fundamental technical SEO tool that becomes more valuable as your site grows in size and complexity. By implementing proper sitemap structure, focusing on important pages, and maintaining your sitemap as your site evolves, you ensure that Google discovers, crawls, and indexes your content efficiently. The investment in sitemap optimization directly translates to improved organic visibility and the ability to rank for important keywords that drive your business.