Chris Biscardi

Creating a sitemap.xml from a pile of HTML files

First, we make zero assumptions about your tooling. You've used some static site generator like Gatsby/Hugo/11ty/sapper/whatever's hot today, and it left you with a directory of HTML files. Now you need a sitemap.

Get a list of all the URLs

We need to start with a file containing all the URLs we care about. To do this we will find all the .html files in our output directory (mine is named public). We can use find to do this, but fd is easier to use and explain. Here -e specifies the extension (html) and --base-directory makes fd search from the given directory, so the paths in the output are relative to it.

shell
fd -e html --base-directory public

With --base-directory, the output doesn't include the path we took to get to the directory we care about, so the list looks the same no matter where we run the command from.

output
shell
why-use-discord-for-open-communities.html
wtf-is-kubernetes.html
yarn-workspace-nohoist-an-entire-package-s-dependencies.html
yarn-workspace-nohoist.html
your-first-crdt.html

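To get the same output from find, which is more commonly found on systems, we need to use a glob pattern with -name (it takes a shell-style pattern, not a regex) and then pipe to sed to remove the directory prefix. Note that we're using , as the separator in our sed command here so that we don't have to escape slashes. The trailing slash is part of the prefix we strip, so the paths come out without a leading /, just like the fd output.

shell
find ./public -name '*.html' | sed -e 's,^\./public/,,'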
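
To see what the find and sed pipeline produces without touching a real site, here is a minimal sketch that builds a throwaway directory first. The file names (index.html, post/something.html, style.css) and the demo_dir variable are made up for illustration, and the sed expression strips the ./public/ prefix including its trailing slash so the result has the same shape as the fd output.

```shell
# Build a throwaway "public" directory so the example can run anywhere.
# All file names below are hypothetical sample data.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/public/post"
touch "$demo_dir/public/index.html" \
      "$demo_dir/public/post/something.html" \
      "$demo_dir/public/style.css"

# Run find from the parent of public, inside a subshell so the caller's
# working directory is left alone. sort only makes the order predictable.
paths=$(cd "$demo_dir" && find ./public -name '*.html' | sed -e 's,^\./public/,,' | sort)
printf '%s\n' "$paths"
```

Only the two .html files are listed, as index.html and post/something.html; style.css is filtered out by the -name pattern.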

Now with the output of find being a list of filepaths, we need to strip .html off the filepath and prefix our site's domain to each line. I like AWK for this although you could also use more sed commands and such.

shell
fd -e html --base-directory public |
awk -F '.' '{print "https://christopherbiscardi.com/"$1}'

What this AWK script is doing is going through each line one by one. -F is being used to specify . as the field separator, because AWK chops each line up into fields, similar to a CSV (the real default separator is any run of whitespace, but it works the same way). This means from a file named post/something.html you'll get back two values: post/something and html. Note that a path containing an extra dot would be cut at the first dot.
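
The splitting is easy to try on a single line. The sketch below feeds AWK two sample paths (both invented for illustration) and prints the field count NF along with the individual fields; the second path shows what happens when a name contains more than one dot.

```shell
# Split one sample path on "." and print the field count and both fields.
fields=$(printf 'post/something.html\n' | awk -F '.' '{ print NF; print $1; print $2 }')
printf '%s\n' "$fields"

# A name with an extra dot: $1 stops at the first dot.
first=$(printf 'notes/v1.2-release.html\n' | awk -F '.' '{ print $1 }')
printf '%s\n' "$first"
```

The first command prints 2, post/something, and html on separate lines. The second prints notes/v1, which is why this approach only suits file names whose single dot is the one before the extension.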

With the separator taken care of, we then want to format some strings. This stays a "one-liner" because AWK's defaults do the rest: it reads the input line by line and runs our print block on every line. We print out our domain ahead of $1, which is the first field of the split line: our filepath and name, without the extension.

We end up with a list like this.

awk-output
shell
https://christopherbiscardi.com/what-is-dynamo-db
https://christopherbiscardi.com/what-s-next-for-react-based-products
https://christopherbiscardi.com/why-use-discord-for-open-communities
https://christopherbiscardi.com/wtf-is-kubernetes
https://christopherbiscardi.com/yarn-workspace-nohoist-an-entire-package-s-dependencies
https://christopherbiscardi.com/yarn-workspace-nohoist
https://christopherbiscardi.com/your-first-crdt
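
For the "more sed commands" route, a possible sketch is below. It is an alternative to the AWK one-liner, not the approach used in this post: it removes only a trailing .html, so names with extra dots survive intact. The sample paths and the https://example.com/ domain are placeholders.

```shell
# Hypothetical sample paths standing in for the fd/find output.
urls=$(printf '%s\n' 'wtf-is-kubernetes.html' 'notes/v1.2-release.html' |
  sed -e 's,\.html$,,' -e 's,^,https://example.com/,')
printf '%s\n' "$urls"
```

The first expression drops the extension and the second prepends the domain, giving https://example.com/wtf-is-kubernetes and https://example.com/notes/v1.2-release.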

Output URLs to a file

We can then redirect this output to a file or into our clipboard (pbcopy, xclip), etc. In this case we use > urls.txt to write the list of URLs into a file named urls.txt.

shell
fd -e html --base-directory public |
awk -F'.' '{print "https://christopherbiscardi.com/"$1}' > urls.txt

Sitemap from URLs

Given the list of URLs we just generated, we can drop in a basic sitemap using npx and an npm package called sitemap. We feed urls.txt into npx sitemap and redirect the output to sitemap.xml.

shell
npx sitemap < urls.txt > sitemap.xml

You'll end up with a sitemap.xml file that contains something similar to the following.

sitemap.xml
xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"
>
<url><loc>https://christopherbiscardi.com/30x500-notes</loc></url>
<url><loc>https://christopherbiscardi.com/30x500-safari-1</loc></url>
<url><loc>https://christopherbiscardi.com/30x500-safari-2-old-todo</loc></url>
<url><loc>https://christopherbiscardi.com/30x500-safari-2</loc></url>
<url><loc>https://christopherbiscardi.com/7guis-recoil-js-counter</loc></url>
<url>
<loc>
https://christopherbiscardi.com/7guis-recoil-js-temperature-converter
</loc>
</url>
<url><loc>https://christopherbiscardi.com/a-css-in-js-of-my-own</loc></url>
<url><loc>https://christopherbiscardi.com/a-modern-copy-button</loc></url>
<url>
<loc>https://christopherbiscardi.com/adjacency-lists-in-dynamodb</loc>
</url>
<url><loc>https://christopherbiscardi.com/amplify-and-appsync</loc></url>
<url><loc>https://christopherbiscardi.com/authoring-stylis-plugins</loc></url>
<url>
<loc>https://christopherbiscardi.com/aws-app-sync-without-amplify</loc>
</url>
<url>
<loc>
https://christopherbiscardi.com/build-time-code-blocks-with-rehype-prism-and-mdx
</loc>
</url>
<url>
<loc>
https://christopherbiscardi.com/building-an-mdx-preview-app-with-electron
</loc>
</url>
</urlset>
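
If pulling in npx isn't an option, a bare-bones sitemap can also be written with AWK alone. This is a rough sketch, not what the sitemap package does: it emits only the core sitemap namespace, escapes ampersands but no other special characters, and does no URL validation. The work directory and the example.com URLs are made up so the snippet runs on its own; in practice the input would be the urls.txt generated earlier.

```shell
# Sample input; in the article this file comes from the fd | awk step.
work=$(mktemp -d)
printf '%s\n' 'https://example.com/a' 'https://example.com/b?x=1&y=2' > "$work/urls.txt"

awk '
  BEGIN {
    print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
    print "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
  }
  NF {                       # skip blank lines
    gsub(/&/, "\\&amp;")     # escape ampersands only; other characters are left as-is
    print "  <url><loc>" $0 "</loc></url>"
  }
  END { print "</urlset>" }
' "$work/urls.txt" > "$work/sitemap.xml"

cat "$work/sitemap.xml"
```

Each non-empty input line becomes one url entry, so the number of loc elements should match the number of lines in urls.txt.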