(New) Static Site Generation

For a few years, I've been using Hugo for blog generation. Recently, I decided I wanted to take static site generation in a different direction. Specifically, I wanted to use a different source markup, and I wanted to write my own tool set for generating the actual HTML.

We'll walk through the motivation for moving the content to a different format and for replacing the generation process with a completely custom set of scripts.

Motivation

When I first sat down to build a blog 5 years ago, I had a pretty basic set of requirements.

  • Lightweight
  • Native Markdown Support
  • Minimal Dependencies

Hugo met all of these requirements quite well. The templating engine is fairly simple; it supports Markdown; and because it's written in Go, only the built artifact is necessary for site generation.

If it fits so well, why change?

It worked well for what I was asking of it. However, as I wrote more and time went on, Hugo's features became more and more complex and created a mismatch with how I wanted to express the text in the markup. I felt things were going in a direction I neither needed nor wanted. More specifically, the issues surfaced mostly on the Markdown side. Markdown simply lacks some features in its markup, and the gaps are papered over with blocks of inline HTML and other hacks that are standard only in some specific "flavor" or translator implementation. Hugo attempts to patch this over with its shortcode templates, but these still felt unnatural.

The final nail was discovering Org-mode. I liked the weight of Markdown, but I didn't like its lack of features when I needed them. I liked reStructuredText, but I felt it was always too heavy for the documents I was working on.

In finally giving Emacs a full try (another blog post), I discovered Org-mode. It was the exact middle weight I was looking for between Markdown and reST.

Thus, I was in search of a new tool for generating static HTML from a set of source files written in Org-mode.

Org-mode (within Emacs) has a native publish mode, and I had discovered several posts on how people are doing exactly this. However, Org-publish isn't exactly what I was looking for.

Therefore, let's revise the current list of requirements:

  • Makefile driven
  • Does not require Emacs to generate

That is, I wanted a Makefile that could generate the site contents. Furthermore, and no less importantly, I wanted the Makefile to not include lines like emacs --quick --batch .... This obviously creates a bit of a challenge, since Org-mode is an Emacs mode.

I decided I could probably generate the content myself with a few scripts and invocations to Pandoc.

High-Level Implementation

The core of the new site generation is blog posts written in Org-mode, processed by several shell scripts that use Pandoc to perform the translation from raw Org-mode markup to HTML, all of which is orchestrated by a Makefile.

I'm not going to extol Org-mode's capabilities in this post. There are plenty of resources on it already, not least the Org-mode Manual itself.

There are, in fact, some limitations on the Org-mode features I can use, stemming from the choice to not allow the generation process to include Emacs itself.

Throughout the tour of the implementation, it's important to note that a guiding principle of the conversion was to not break existing links. That is, I was and am satisfied with the folder and slug scheme for posts, and I didn't want the new version to break existing links.

Detailed Implementation

The easy part is generating each post. This is simply an index.html in the correct folder. The majority of the complexities stem from the summaries and main index page.

Post Content Generation

To generate a blog post's index.html page, we consider the following make target:

blog_dir = $(shell $(SCRIPTS_DIR)/org-get-slug.sh $(1))
TEMPLATE_FILES:=$(wildcard templates/*.html)

define BLOG_BUILD_DEF
$(BUILD_DIR)$(call blog_dir,$T):
        mkdir -p $$@
$(BUILD_DIR)$(call blog_dir,$T)/index.html: $T \
                $(TEMPLATE_FILES) \
                Makefile \
                | $(BUILD_DIR)$(call blog_dir,$T)
        $(SCRIPTS_DIR)/generate_post_html.sh $$< > $$@
endef

$(foreach T,$(POSTS_ORG_INPUT),$(eval $(BLOG_BUILD_DEF)))

This definition is fairly opaque on its own. However, it is expanded once for each post when the foreach macro runs. For example, the following targets will be defined for this post:

$(BUILD_DIR)/blog/2019/03/static-site-generation:
        mkdir -p $@
$(BUILD_DIR)/blog/2019/03/static-site-generation/index.html: posts/static-site-generation.org \
                $(TEMPLATE_FILES) \
                Makefile \
                | $(BUILD_DIR)/blog/2019/03/static-site-generation
        $(SCRIPTS_DIR)/generate_post_html.sh $< > $@

This will create the correct directory for each post, e.g., /blog/2019/03/static-site-generation, and place the translated HTML into this directory as index.html.

Note: the expansion doesn't actually leave $(TEMPLATE_FILES) as shown; during the expansion of the definition, the variable $(TEMPLATE_FILES) is expanded as well. This is acceptable, however, since it's a static list of files and has no bearing on which post's target is being expanded.
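
The directory itself comes from the org-get-slug.sh helper used in blog_dir above. I won't reproduce that script here, but a minimal sketch of the idea, assuming the path is derived from the post's #+DATE line and its file name (the real script may differ), looks something like this:

#!/usr/bin/env bash
# Hypothetical sketch of org-get-slug.sh: build /blog/YYYY/MM/<slug> for a post.
ORGIN=${1}
DATE=$(awk -F': ' '/^#\+DATE:/ { printf "%s", $2 }' "${ORGIN}")
SLUG=$(basename "${ORGIN}" .org)
echo -n "/blog/$(date -d "${DATE}" +'%Y/%m')/${SLUG}"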

The generate_post_html.sh script is fairly basic:

#!/usr/bin/env bash
# Generate HTML for a blog post

ORGIN=${1}
PROJ_ROOT=$(git rev-parse --show-toplevel)
# Pull in the template file locations and the post's #+ metadata (TITLE, DATE, TAGS, ...)
source ${PROJ_ROOT}/scripts/site-templates.sh
source ${PROJ_ROOT}/scripts/org-metadata.sh
DISPLAY_DATE=$(date -d "${DATE}" +'%a %b %d, %Y')
SORT_DATE=$(date -d "${DATE}" +'%Y%t%m%t%d%t')   # %t is a tab: year/month/day fields for sorting

cat ${HTML_HEADER_FILE}
cat ${HTML_SUB_HEADER_FILE}
echo -n "<h1 class=\"title\">${TITLE}</h1>"
echo -n "<div class=\"post-meta\">"
echo -n '<ul class="tags"><li><i class="fa fa-tags"></i></li>'
echo -n "${TAGS}" | awk '{ printf "<li>%s</li>", $0 }'
echo -n '</ul>'
echo -n "<h4>${DISPLAY_DATE}</h4></div>"
# Translate the post body from Org-mode markup to HTML
pandoc --from org \
       --to html \
       ${ORGIN}
cat ${HTML_FOOTER_FILE}

The org-metadata.sh script reads the Org-mode preamble, i.e., the lines starting with #+, and puts the values into variables available to the other scripts. For example, TITLE, DATE, and TAGS are pulled out and used to generate the title section of each post. Furthermore, some templates are pulled in to generate the headers and footers of each page. The templates are written directly in HTML and really serve only to factor out content that would otherwise be largely duplicated on every page.
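
Only the DATE extraction from org-metadata.sh is shown later in this post; a minimal sketch of how the other preamble values might be pulled out in the same style (the TITLE and TAGS patterns here are illustrative assumptions) looks something like:

#!/usr/bin/env bash
# Sketch of org-metadata.sh-style extraction; only the DATE line is taken from
# the post itself, the TITLE/TAGS patterns are assumptions.
ORGIN=${1}
TITLE=$(awk -F': ' '/^#\+TITLE:/ { printf "%s", $2 }' "${ORGIN}")
DATE=$(awk -F': ' '/^#\+DATE:/ { printf "%s", $2 }' "${ORGIN}")
# One tag per line, so callers can wrap each line in <li> elements
TAGS=$(awk -F': ' '/^#\+TAGS:/ { print $2 }' "${ORGIN}" | tr ' ' '\n')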

Summary Page Generation

The summary page is a bit more involved to generate. A few questions had to be answered before it was possible: how should the summary text be generated, and how should the posts be sorted and ordered?

To answer the first question, I dug into how Hugo generates these summaries. It turns out it really just takes the first couple hundred characters and calls that the "summary". This relies on the content of each post actually describing the post within those first couple hundred characters. Obviously, this led to some awkward results, especially with links and section headings mixed in.

To achieve similar results, it would be fairly easy to write a script that simply takes the first few hundred characters after the preamble and outputs them into something to be collected for the summary page. However, a better solution is available since we are taking full control over the generation process. Namely, we can put the preview content into a specific Org-mode block to be parsed out and used explicitly for this purpose. If the summary for a post is only a sentence or two, the summary generation process won't start pulling in extra text; if the summary requires a little more detail, it won't be cut short by an arbitrary read limit.
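
As a rough illustration, assuming the preview sits between #+BEGIN_PREVIEW and #+END_PREVIEW lines (the actual delimiters in my posts may differ), the block could be pulled out inside org-metadata.sh, which provides the PREVIEW variable used below, with something like:

# Sketch: extract the lines between the (assumed) preview delimiters
PREVIEW=$(awk '/^#\+END_PREVIEW/   { capture = 0 }
               capture             { print }
               /^#\+BEGIN_PREVIEW/ { capture = 1 }' "${ORGIN}")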

To generate the preview content, the generate_post_preview.sh script is used:

#!/usr/bin/env bash
# Generate the intermediate Org-mode preview for a post

ORGIN=${1}
PROJ_ROOT=$(git rev-parse --show-toplevel)

# org-metadata.sh provides the LINKS and PREVIEW variables
source ${PROJ_ROOT}/scripts/org-metadata.sh

echo "${LINKS}"
echo "${PREVIEW}"

The LINKS variable is included in this output because we are generating an intermediate Org-mode file that Pandoc will later turn into the summary content. Without the LINKS, any links used in the preview section would be broken.
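
For example, if LINKS holds the post's #+LINK: abbreviation definitions (an assumption on my part; the variable may collect other link-related lines as well), it could be gathered with something as simple as:

# Sketch: collect the (assumed) #+LINK: abbreviation lines from the preamble
LINKS=$(awk '/^#\+LINK:/ { print }' "${ORGIN}")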

The second question actually turns out to be pretty easy in practice: we parse the #+DATE: line from the preamble and prepend it to the summary content.

From the org-metadata.sh script:

ORGIN=${1}
DATE=$(awk -F': ' '/^#\+DATE:/ { printf "%s", $2 }' ${ORGIN})

Then, from the generate_post_summary_html.sh script:

#!/usr/bin/env bash
# Generate the HTML summary entry for a post

ORGIN=${1}
GENERATED_PREVIEW_FILE=${2}
PROJ_ROOT=$(git rev-parse --show-toplevel)

source ${PROJ_ROOT}/scripts/org-metadata.sh
DISPLAY_DATE=$(date -d "${DATE}" +'%a %b %d, %Y')
SORT_DATE=$(date -d "${DATE}" +'%Y%t%m%t%d%t')   # %t is a tab: year/month/day fields for sorting
PREVIEW_CONTENT=$(pandoc -f org -t html ${GENERATED_PREVIEW_FILE})

# Prefix the entry with the sortable date fields; everything else stays on one line
echo -n "${SORT_DATE}"
echo -n '<article class="post"><header>'
echo -n "<h2><a href=\"${SLUG}\">${TITLE}</a></h2>"
echo -n "<div class=\"post-meta\">${DISPLAY_DATE}</div></header>"
# $(echo ${PREVIEW_CONTENT}) collapses newlines so the summary stays on a single line
echo -n "<blockquote>$(echo ${PREVIEW_CONTENT})</blockquote>"
echo -n '<ul class="tags"><li><i class="fa fa-tags"></i></li>'
echo -n "${TAGS}" | awk '{ printf "<li>%s</li>", $0 }'
echo -n '</ul>'
echo -n '<footer>'
echo -n "<a href=\"${SLUG}\">Read More</a>"
echo -n '</footer></article>'
echo ""

Finally, this is all put together with the generate_index_html.sh script:

#!/usr/bin/env bash
# Generate the index.html page

INPUT_FILES=${@}
PROJ_ROOT=$(git rev-parse --show-toplevel)
source ${PROJ_ROOT}/scripts/site-templates.sh

cat "${HTML_HEADER_FILE}"
echo "<body>"
cat "${HTML_SUB_HEADER_FILE}"
# Sort summaries newest-first by the leading date fields, then drop those fields
cat ${INPUT_FILES} | sort -r -n -k1 -k2 -k3 | awk -F'\t' '{print $4}'
echo "</body>"
cat "${HTML_FOOTER_FILE}"

Specifically, the following line is of interest with respect to properly sorting:

cat ${INPUT_FILES} | sort -r -n -k1 -k2 -k3 | awk -F'\t' '{print $4}'

This takes the tab-separated date fields emitted earlier, sorts the post summaries by them (newest first, thanks to -r), and then strips the date fields, leaving only each summary's HTML for the index.html page.
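
A quick illustration of what that pipeline does to two hypothetical summary lines (the article bodies are placeholders):

printf '2018\t11\t02\t<article>older post</article>\n2019\t03\t28\t<article>newer post</article>\n' \
    | sort -r -n -k1 -k2 -k3 \
    | awk -F'\t' '{print $4}'
# Output:
#   <article>newer post</article>
#   <article>older post</article>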

RSS/XML Generation

I also wanted to keep the RSS/XML feed going. As it turns out, generating the RSS feed required essentially the same steps as generating the summary index.html page.
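
I won't reproduce the RSS scripts here, but a minimal sketch of a per-post <item> generator in the same style (the script name, site URL, and field choices are all assumptions, not my actual implementation) might look like:

#!/usr/bin/env bash
# Hypothetical sketch: emit one RSS <item> for a post
ORGIN=${1}
GENERATED_PREVIEW_FILE=${2}
PROJ_ROOT=$(git rev-parse --show-toplevel)
source ${PROJ_ROOT}/scripts/org-metadata.sh

PUB_DATE=$(date -d "${DATE}" -R)   # RFC 2822 date, as RSS expects
echo "<item>"
echo "  <title>${TITLE}</title>"
echo "  <link>https://example.com${SLUG}</link>"   # site URL is a placeholder
echo "  <pubDate>${PUB_DATE}</pubDate>"
echo "  <description><![CDATA[$(pandoc -f org -t html ${GENERATED_PREVIEW_FILE})]]></description>"
echo "</item>"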

Future Work

There is a fairly obvious limitation of the summary page generation, though it only becomes apparent as more content is written: there is currently no archive page. Moreover, all posts are placed onto the index.html summary page. If/when more posts are written and published, a solution for paginating the first page will be necessary. However, that would have been necessary regardless of whether the blog is generated with Hugo or with the new process.

Parting Thoughts

Like many projects, this was started because I was personally dissatisfied with the current state of the options. That said, I did not write these scripts to be used directly by someone else, and I'm not sure I would recommend this approach to anyone else, unless, of course, they wanted to do it to learn or to otherwise take control of their content. I hope this captures the essence of the scripts, their major functions, and the motivations behind them. The scripts are available, WITHOUT WARRANTY, under the GNU General Public License (version 3).

If you have questions or comments, feel free to reach out to me.