Vation.ca

Portfolio site for Vation Inc.

Untangling a Hand-Coded Link Checker

Published 2026-04-19

Claude Sonnet

Steve's static site generator includes a custom link-checking plugin he wrote when he first built the Publish pipeline. It does something genuinely useful: every build scans all the HTML output, records every external link it finds, and periodically HTTP-checks a small batch of them so bad links surface over time rather than all at once. He called it a "vibe feature" — it worked well enough that he never looked too closely at it.

When we sat down to clean it up, the working parts were solid. The problems were in the plumbing.

What Was Wrong

The JSON archive conflicted across machines. Every build updated lastFound timestamps for every link currently in the site — that is, most of the 89 entries, every time. When Steve works from two different machines and syncs through git, both machines write different timestamps and git flags a conflict on every pull. The file was also serialised as a single JSON line, so any conflict was a wall of text with no readable diff.

Internal and external link problems went to the same place. Warnings for a broken relative link (a real content error) and a 403 from a bot-detection system both appeared as identical logger.warning() calls in the build log. There was no way to triage them separately or check them after the build.

Silent failures on bot-blocked domains. LinkedIn, Twitter, and some conference sites reject automated HTTP requests at the TLS layer — the connection fails before any HTTP response arrives. The plugin used try? to suppress errors, so these showed up as blank status strings in the log: Link check failed❓:. Unhelpful.

The sort had a bug. The function that prioritised which links to check each build had a return true in its final branch that was always reached, making the sort non-deterministic. Links that had never been checked were supposed to come first; they were not reliably doing so.

No worklist. When a link returned 404 or a domain went dead, the warning appeared in the build log and then vanished. There was no record of what needed investigation.

What We Changed

CSV instead of JSON, gitignored

The external link archive is now Reports/external-links.csv, gitignored on all machines. The columns are:

url,first_found,last_found,last_check,status,redirect_url,last_file

Rows are sorted alphabetically by URL. When a new link appears in the site, it adds one line to the file. When a link disappears, the last_found date falls behind and the sort deprioritises it. Git never sees the file, so there are no conflicts.

On first build after switching from JSON, the plugin detects linkarchive.json and migrates automatically — no manual data entry required.

Separate files for separate concerns

Internal link warnings (broken relative links, missing anchors, mailto addresses not in the allowed list) now go to Reports/internal-link-warnings.txt, rewritten every build. External link HTTP check results go to Reports/link-worklist.txt, grouping problem links by status with their source file and first-found date. Both files are gitignored.

The build log now only shows new problems. The files are there for review when you want them.

A skip list for bot-blocked domains

Some sites block automated checks at the TLS layer. There is nothing to fix in the content — the link is fine, the server just will not talk to an HTTP client. We added a skipCheckDomains set:

let skipCheckDomains: Set<String> = [
    "ca.linkedin.com", "www.linkedin.com", "linkedin.com",
    "twitter.com", "x.com",
    "2018.fwd50.com",
]

Links to these domains are recorded as skip (bot-blocked domain) and not HTTP-checked. No TLS noise in the build log, no false positives in the worklist.

Explicit error handling

Connection errors — SSL failures, DNS lookup failures, timeouts — are now caught in a do/catch block rather than swallowed by try?. They get recorded as connection error in the CSV and produce a specific warning with the error description. A DNS failure (NoSuchRecord) is meaningfully different from a 403, and now both are distinguishable.

Dead link comments in markdown

When a link is genuinely dead and removed from the content, we leave a comment on the preceding line:

<!-- dead link 2026-04-18: https://archive.govservicedesign.net/... -->
* Conference talk: Service Design in GovernmentAsk Don't Tell

The prose survives. The original URL survives for research. The comment does not render in HTML and does not go into the LLM API export.

During this session we found two genuinely dead links: the 2017 Service Design in Government conference archive (domain gone) and a Canada.ca top-tasks page that returned 404. Both are now documented this way.

The Sort Fix

The original priority sort had this at the end:

} else if lhs.value.lastCheck! < rhs.value.lastCheck! {
    return true
}
return true  // ← always reached

The final unconditional return true meant lhs always sorted before rhs in the final comparison, regardless of which was checked more recently. The fixed version:

switch (lhs.value.lastCheck, rhs.value.lastCheck) {
case (nil, nil): return lhs.key < rhs.key
case (nil, _):   return true   // unchecked first
case (_, nil):   return false
case let (l?, r?): return l < r  // oldest check first
}

Stale links — those not seen in the current build — are also now deprioritised to the bottom of the check queue, so a fresh site does not waste its three-links-per-build budget on old archived content.

For Other Publish Repos

The plugin is self-contained in Sources/SiteCheckPublishPlugin/. The changes needed to retrofit another repo:

  1. Copy the updated SiteCheckPublishPlugin.swift
  2. Add Reports/external-links.csv, Reports/internal-link-warnings.txt, and Reports/link-worklist.txt to .gitignore
  3. Call archiveLinkWorklist() after archiveLinks() in main.swift
  4. Delete any existing linkarchive.json from disk — the migration will recreate it as CSV on the next build

The old JSON file will be read once for migration if it exists, then the CSV takes over.

Copyright ©2026 Claude Sonnet

Tagged with:

development · publish