The Goal

In the following short article, we want to import data from an existing sitemap XML file into a new Google Sheet document.

The sheet must pull the sitemap via HTTP protocol and extract all the URLs from the sitemap and insert them. into the sheet.

Implementation

Now let’s implement it .. it only takes 1 minute …​

For demonstration purpose, I’m going to use the sitemap from my old blog, to be found at https://www.hascode.com/sitemap.xml.

XML Structure

First of all, we need to have a look at the sitemap’s XML structure .. it roughly looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
		<loc>https://www.hascode.com/</loc>
		<lastmod>2021-06-18T08:23:22+00:00</lastmod>
		<changefreq>daily</changefreq>
		<priority>1.0</priority>
	</url>
	<url>
		<loc>https://www.hascode.com/snippets/</loc>
		<lastmod>2021-11-04T09:34:24+00:00</lastmod>
		<changefreq>weekly</changefreq>
		<priority>0.6</priority>
	</url>
</urlset>
We can see that all nodes are bound to the namespace http://www.sitemaps.org/schemas/sitemap/0.9 so it is important to mind this for our XPath queries!

XPath Query

The following XPath query extracts all URLs from the XML structure, ignoring the namespaces:

We may test the query using `curl` and `xmllint` in the command line:
curl https://www.hascode.com/sitemap.xml | xmllint --xpath '//*[local-name()="url"]/*[local-name()="loc"]' -

this gives us the following output (shortened):

curl -s https://www.hascode.com/sitemap.xml | xmllint --xpath '//*[local-name()="url"]/*[local-name()="loc"]' - | head -10
<loc>https://www.hascode.com/</loc>
<loc>https://www.hascode.com/snippets/</loc>
<loc>https://www.hascode.com/2011/10/testing-restful-web-services-made-easy-using-the-rest-assured-framework/</loc>
<loc>https://www.hascode.com/2019/01/using-throwaway-containers-for-integration-testing-with-java-junit-5-and-testcontainers/</loc>
<loc>https://www.hascode.com/2016/06/playing-around-with-mqtt-and-java-with-moquette-and-eclipse-paho/</loc>
<loc>https://www.hascode.com/2019/03/terminal-based-progress-bar-for-java-applications/</loc>
<loc>https://www.hascode.com/2019/02/annotation-based-kubernetes-and-openshift-manifests-for-java-applications-with-ap4k/</loc>
<loc>https://www.hascode.com/2012/07/integration-testing-imap-smtp-and-pop3-with-greenmail/</loc>
<loc>https://www.hascode.com/2018/12/next-steps-to-pattern-matching-in-java-with-java-12-and-switch-expressions-aka-jep-325/</loc>
<loc>https://www.hascode.com/2012/02/ordering-your-junit-rules-using-a-rulechain/</loc>
[...]

Sheets Import

Now we’re ready to import the data into our Google Sheet using the XMLIMPORT function.

The Syntax is XMLIMPORT("url", "xpath") so in our case, we’re entering this into the first cell: =IMPORTXML("https://www.hascode.com/sitemap.xml", "//*[local-name()='url']/*[local-name()='loc']")

It should take a short second and then the data is loaded and displayed in our sheet like this:

google sheets sitemap xmlimport
Figure 1. Sitemap’s URLs imported in Google Sheets Document