Extract Website URLs from Sitemap.XML for SEO Analysis
Overview
This n8n workflow automates the process of crawling a website's sitemap to extract URLs, which is particularly useful for SEO analysis, website auditing, or content monitoring. By leveraging n8n's nodes, the workflow fetches the sitemap from a specified URL, processes the XML data, and extracts individual URLs, which can then be converted into a downloadable file or integrated with tools like Google Sheets.
How It Works
The workflow operates in a sequential manner, utilizing a series of nodes to fetch, parse, and process sitemap data:
- Trigger: Initiates when the user clicks "Test workflow" (Manual Trigger node).
- Set URL: Defines the base domain (e.g., https://phu.io.vn/) for the sitemap (Set URL node).
- Crawl Sitemap: Fetches the main sitemap file (sitemap.xml) from the specified domain using an HTTP request (Crawl sitemap node).
- Parse XML: Converts the sitemap XML into JSON for easier processing (XML node).
- Split Sitemap: Extracts individual sitemap entries (e.g., <sitemap> tags) from the parsed data (Split Out node).
- Crawl Sub-Sitemap: Fetches each sub-sitemap URL listed in the main sitemap (Crawl sitemap 2 node).
- Parse Sub-Sitemap XML: Converts the sub-sitemap XML into JSON (XML 2 node).
- Split URLs: Extracts individual URLs (e.g., <url> tags) from the sub-sitemap (Split Out 2 node).
- Convert to File: Saves the extracted URLs into a file for download or further use (Convert to File node).
This workflow supports both single sitemap files and sitemap indexes that reference multiple sub-sitemaps, ensuring comprehensive URL extraction.
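If it helps to see the same flow outside n8n, the short Python sketch below mirrors the node chain: fetch sitemap.xml, follow any sub-sitemaps listed in a sitemap index, collect the <loc> entries, and write them to a file. It is an illustration only, not part of the workflow; the domain, output file name, and standard sitemap namespace are assumptions, and the 10-second timeout simply mirrors the HTTP Request nodes described above.

```python
# Illustrative stand-alone sketch of the workflow's logic (not part of the n8n template).
import urllib.request
import xml.etree.ElementTree as ET

DOMAIN = "https://example.com/"  # placeholder; must end with "/" (mirrors the Set URL node)
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_xml(url: str) -> ET.Element:
    """Fetch a sitemap and parse it into an XML tree (10 s timeout, like the HTTP Request nodes)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return ET.fromstring(resp.read())

def extract_urls(sitemap_url: str) -> list[str]:
    root = fetch_xml(sitemap_url)
    # A sitemap index lists further sitemaps; a urlset lists page URLs directly.
    if root.tag.endswith("sitemapindex"):
        urls: list[str] = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(extract_urls(loc.text.strip()))
        return urls
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

if __name__ == "__main__":
    urls = extract_urls(DOMAIN + "sitemap.xml")
    with open("urls.txt", "w") as f:  # stands in for the Convert to File node
        f.write("\n".join(urls))
    print(f"Extracted {len(urls)} URLs")
```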
How to Use
To implement this workflow in n8n, follow these steps:
- Set Up n8n: Ensure you have an active n8n instance (Cloud, npm, or self-hosted). Refer to the n8n documentation for setup instructions.
- Import Workflow: Copy the JSON from the provided Extract Website URLs from Sitemap.XML for SEO Analysis.json file and import it into your n8n instance via the workflow editor.
- Configure the Domain:
  - In the Set URL node, update the Domain parameter with the target website's base URL (e.g., https://example.com/). Keep the trailing slash, since the Crawl sitemap node appends sitemap.xml directly to this value.
  - Alternatively, in the Crawl sitemap node, directly paste the full sitemap URL if known (e.g., https://example.com/sitemap.xml).
- Test the Workflow:
  - Click "Test workflow" to execute the Manual Trigger node.
  - Verify that the workflow fetches the sitemap and processes the URLs correctly.
- Download or Integrate:
  - The Convert to File node generates a file containing the extracted URLs.
  - Optionally, replace this node with a Google Sheets node to append URLs to a spreadsheet. Refer to the Google Sheets node documentation for setup.
- Save and Activate: Save the workflow and activate it for production use if needed, using a trigger like a schedule or webhook (see Trigger Node).
Requirements
- n8n Instance: An active n8n instance (version 1.0 or later recommended) on n8n Cloud, npm, or self-hosted (Docker). See Choose your n8n for details.
- Technical Knowledge: Basic understanding of n8n's editor UI and node configuration. Familiarity with XML sitemaps is helpful but not mandatory.
- Permissions: For self-hosted setups, ensure the n8n process has network access to fetch the sitemap URL. For Docker deployments, verify permissions as outlined in the n8n v1.0 migration guide.
- Optional: If integrating with Google Sheets, valid Google Sheets credentials are required (see Credentials).
- Timeout Configuration: The HTTP Request nodes (Crawl sitemap and Crawl sitemap 2) have a 10-second timeout. Adjust the timeout parameter in the node settings if dealing with slow-responding servers.
FAQ
Q: What happens if the sitemap is large or contains many sub-sitemaps?
A: The workflow handles sitemap indexes by splitting and processing each sub-sitemap individually. For very large sitemaps, ensure your n8n instance has sufficient resources (memory and CPU) to avoid performance issues. See Scaling n8n for optimization tips.
Q: Can I use this workflow with a specific sitemap URL instead of a domain?
A: Yes, in the Crawl sitemap node, replace the url parameter ({{ $json.Domain }}sitemap.xml) with the direct sitemap URL (e.g., https://example.com/sitemap.xml). Update the node's notes for clarity.
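As a small illustration of how those two options resolve to the final request URL (the variable names below are placeholders, not n8n parameters):

```python
# How the Crawl sitemap request URL is formed (placeholder names, not n8n parameters).
domain = "https://example.com/"        # Domain value from the Set URL node; trailing slash required
sitemap_url = f"{domain}sitemap.xml"   # equivalent of the {{ $json.Domain }}sitemap.xml expression

# Or bypass the Domain step entirely and hard-code the full sitemap URL:
sitemap_url = "https://example.com/sitemap.xml"
```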
Q: Why am I getting a timeout error?
A: The HTTP Request nodes have a default timeout of 10 seconds. If the target server is slow, increase the timeout value in the options parameter of the Crawl sitemap or Crawl sitemap 2 nodes.
Q: How can I save the URLs to Google Sheets instead of a file?
A: Replace the Convert to File node with a Google Sheets node. Configure it with your Google Sheets credentials and map the loc field from the Split Out 2 node to the desired spreadsheet column. Refer to the Google Sheets node documentation.
Q: Is this workflow compatible with older n8n versions?
A: The workflow uses nodes compatible with n8n version 1.0 and later. For older versions, check for deprecated features (e.g., MySQL support) in the n8n v1.0 migration guide.
Q: Can I automate this workflow to run periodically?
A: Yes, replace the Manual Trigger node with a Schedule Trigger node to run the workflow at set intervals. See Trigger Nodes for configuration details.
For further assistance, consult the n8n Community Forum or submit an issue on the n8n GitHub repository.
Need help customizing?
Contact me for consulting and support, or reach out via Facebook or email.