The Goal
In the following short article, we want to import data from an existing sitemap XML file into a new Google Sheet document.
The sheet must pull the sitemap via HTTP protocol and extract all the URLs from the sitemap and insert them. into the sheet.
Implementation
Now let’s implement it .. it only takes 1 minute …
For demonstration purpose, I’m going to use the sitemap from my old blog, to be found at https://www.hascode.com/sitemap.xml
.
XML Structure
First of all, we need to have a look at the sitemap’s XML structure .. it roughly looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.hascode.com/</loc>
<lastmod>2021-06-18T08:23:22+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://www.hascode.com/snippets/</loc>
<lastmod>2021-11-04T09:34:24+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.6</priority>
</url>
</urlset>
Caution
|
We can see that all nodes are bound to the namespace http://www.sitemaps.org/schemas/sitemap/0.9 so it is important to mind this for our XPath queries!
|
XPath Query
The following XPath query extracts all URLs from the XML structure, ignoring the namespaces:
We may test the query using `curl` and `xmllint` in the command line:
curl https://www.hascode.com/sitemap.xml | xmllint --xpath '//*[local-name()="url"]/*[local-name()="loc"]' -
this gives us the following output (shortened):
curl -s https://www.hascode.com/sitemap.xml | xmllint --xpath '//*[local-name()="url"]/*[local-name()="loc"]' - | head -10
<loc>https://www.hascode.com/</loc>
<loc>https://www.hascode.com/snippets/</loc>
<loc>https://www.hascode.com/testing-restful-web-services-made-easy-using-the-rest-assured-framework/</loc>
<loc>https://www.hascode.com/using-throwaway-containers-for-integration-testing-with-java-junit-5-and-testcontainers/</loc>
<loc>https://www.hascode.com/playing-around-with-mqtt-and-java-with-moquette-and-eclipse-paho/</loc>
<loc>https://www.hascode.com/terminal-based-progress-bar-for-java-applications/</loc>
<loc>https://www.hascode.com/annotation-based-kubernetes-and-openshift-manifests-for-java-applications-with-ap4k/</loc>
<loc>https://www.hascode.com/integration-testing-imap-smtp-and-pop3-with-greenmail/</loc>
<loc>https://www.hascode.com/next-steps-to-pattern-matching-in-java-with-java-12-and-switch-expressions-aka-jep-325/</loc>
<loc>https://www.hascode.com/ordering-your-junit-rules-using-a-rulechain/</loc>
[...]
Sheets Import
Now we’re ready to import the data into our Google Sheet using the XMLIMPORT function.
The Syntax is XMLIMPORT("url", "xpath")
so in our case, we’re entering this into the first cell: =IMPORTXML("https://www.hascode.com/sitemap.xml", "//*[local-name()='url']/*[local-name()='loc']")
It should take a short second and then the data is loaded and displayed in our sheet like this: