Alex's blog

XBlog for XSite blogging

Published on 2022-05-21

As a follow-up to XSite, I decided to split out my blogging tools into their own project and expand on them to build a proper blogging support package for XSite. I decided to keep blogging tools separate from the core XSite project out of a general pro-modularity attitude; a lot of static site generators bundle blogging in their base configurations, but there's no reason that functionality can't be separate. My approach is to keep XSite as the Python code that provides the core functionality like XSlots, XSLT support, datasets, etc., while implementing blogging support on top of it largely as XSLT and XSlots templates. Much of this work happened in the middle of the night, and often into the early hours of the next day, but I've datestamped it with the day I started each session.

2021-10-04

Right at the start of building XBlog, I hit some major frustration; there seems to be some kind of bug in certain versions of libxml2, or at least the way it's used by lxml's etree, where serializing a single element can end up serializing everything after that element too (including dangling end tags!), but only if it has children. This pretty much totally broke XSite; after several hours trying to coax lxml into doing the right thing, I ended up hacking around this problem with some rather ugly text manipulation, and simply truncated away everything that didn't belong. I'm not happy about this, but it got XSite working again and made it possible to start building XBlog.

The first day, I started by pulling my site's existing blogging-related templates out into a new repository (which I set up as a Git submodule), then symlinking them back into place. My first actual improvement was to add category support; I added a "category" column to the posts.csv file which stores information about posts, and modified the blog entries XSLT stylesheet to check it against an "index category" parameter, which is set on each blog index page. After fiddling with this for a while, I straightened out some XSLT issues (to do with parameters and conditional templates) and ended up with a system where categories look like Unix paths, with the "root" category being / and others descending from it in a /-separated hierarchy.

My other first-day improvement was to add support for generating an Atom feed; this is closely analagous to generating the root category index (previously just the index), in that it consists of a document holding pre-written content, an XSlots template plugging in configuration, and an XSLT stylesheet, fed the posts dataset by XSlots, which does the actual work of generating <entry> elements. The XSLT stylesheet also sets the <updated> value for the whole feed, basing it on the last entry's publication time; due for some improvement, but workable. For the moment, it doesn't generate a summary, instead giving a notice that no summary is available, since that would need some fairly serious intelligence to pull meaningful content from the documents themselves. Getting this part working was a lot more frustrating than it ought to have been; first, I left off the xsl: namespace prefix, which led to empty output but no errors. I was nearly ready to give up for the day when I noticed the problem. Second, once that worked, only one entry would come out. This turned out to be an XSite quirk; it can't handle multiple elements without a common parent as document content, XSLT output, etc.; this is why <sl:fragment> exists, but the details of when it is and isn't needed eluded me briefly. Once I wrapped the entries in a fragment, they all appeared as they should.

2021-10-06

After a day of not working on the project, I spent my second day building deployment tooling so I could just make some changes and type scripts/publish; this kicks off a series of checks to ensure I have Git in sync, offers to pull from origin and commit any uncommitted changes, then pushes directly to the server (bypassing the origin remote; remember, Git is a distributed version control system!), builds the site there, and deploys it to the correct directory on the server, keeping a backup so I can roll back (also automated with a script) if a bad deploy happens. These scripts aren't part of XBlog or XSite; they're my own tools and are part of the alm repository, but of course other people can use them if they come in handy.

Also in the same directory of scripts and built the same day is my "note" script, which takes either standard input or a file, assigns it a number, either one or one more than the highest-numbered existing note that day, adds the appropriate header and footer, and wraps each line in <p> (blank lines become <br /> instead). Once it's done that, it has a valid XHTML file (assuming I don't screw up any XML syntax in the input text), which it adds to the posts index (categorizing it as a note), commits to the Git repository, then finishes up by running the publish script. In theory, this should let me publish my quick thoughts in a matter of moments, only a little slower than Tweeting or Tooting or whichever your preference. You can see my test note from the middle of the night for yourself.

2021-10-07

I spent my third day making tools in preparation for building a Webmention sender; specifically, I built Python modules for

  • extracting links from XML, HTML, JSON, Gemini gemtext, and plain text, and
  • fetching HTTP(S), FTP(S), and file:// URIs.

These were quite a bit of work, but I didn't have much to show for that day and there isn't really much to talk about with this.

2021-10-08

I continued on from the previous day, adding support for the Gemini protocol, fixing various issues, adding support for rel attributes (and things equivalent to them), building infrastructure to guess media types for e.g. file:// URIs that otherwise don't provide them, and tying my link extraction and fetching code together so a single function call can get the links of any supported URI. The three modules together run to 400-some lines, and in my opinion at least, they're fairly handy on their own.

For no particular reason, I ended up writing my own Gemini client; it's about 70 lines of Python and was pretty simple to get working. The single part that takes up the most lines is actually the handling for status 44 ("Slow Down"), since it has to handle a few different kinds of unacceptable wait times. That's the only status that's handled specifically, though; all the others are checked by their first digit only. Maybe the biggest bump was that I actually forgot to write any code to handle success, which led to a couple minutes of head-scratching trying to figure out why nothing seemed wrong but an empty result was coming out.

I ended up having to rewrite the XML part of the link extraction code (which also handles HTML) in order to support rel attributes, and I ended up with something a bit more complicated. I think it's still pretty nice, though; it should be easy to teach about more kinds of links if I find out about any, since it's just a couple short XPath expressions. I did end up dropping some specificity, so now any attribute named href, src, or resource will end up being seen as a link, rather than only if those attributes are on known "link elements;" this falls out of RDFa, which defines attributes without a containing XML Namespace.

My media type guessing is probably more than it needs to be; among other things, I have built-in support for using Python's mimetypes (which uses filename extensions), as well as libmagic, though it has to be enabled since the Python module isn't in the standard library. There are still situations where the media type will end up being set to application/octet-stream though, so I have some further fallback options where I can give a predicted media type if I think I know what it'll be, or I can have it react to application/octet-stream by just trying all the link-finders and appending their results together.

2021-10-09

Now the weekend, I started by fixing some bugs with FTP and plain text searching, and doing some general cleanup.

With those sorted, I jumped over to XSite to make the extension points I needed. I did some refactoring while I was there and built a "package" system, allowing third-party additions to XSite, and a plug-in system using it, including support for a new "sidecar" plug-in type (they're passed the original source document, configuration, and parameters for each document XSite processes, so they can "ride along" with XSite's processing pass).

Back to XBlog, I added support for filtering on rel values to my link extraction code, and used all this to implement a sidecar plug-in that finds outgoing links, looks up the Webmention endpoints for them, and makes a list of Webmentions to send. After some debugging, I got this working; I had a list of valid Webmention endpoints and target URIs coming out, along with another list of URIs that had been checked but found not to have Webmention support. I still couldn't actually send any mentions, but that part is simpler, relatively speaking; it's just sending some very simple POST requests. I wanted to do this as part of the publishing workflow, though, so it needed a bit more work than you might think.

2021-10-10

I made some changes so outgoing mentions would have the full URI of the source page, which required some additions to XSite in the form of library functions to handle that, and also fixed a conformance issue; the Webmention spec says to only send a mention to the first link found of three possible, and I was making to-send entries for all of them. After that, I wrote the actual script to send mentions, which was very simple. I tested using webmention.rocks and discovered that I'd completely failed to handle relative URIs in my Webmention discovery. I added code to pass around and use a base URI to my link extraction code and on another test it got the right URIs.

However, actually sending the mentions failed; looking further, I discovered this was actually a problem with webmention.rocks; it was sending an Accept header that only allowed text/html! Since XSite only outputs XML, all the documents it produces are application/xhtml+xml, and my server is configured to consider them as such. This meant when webmention.rocks came in asking for text/html and only text/html, Apache declared an HTTP 406 and refused to send it anything. webmention.rocks, naturally, reacted by failing the tests, sending me back an HTTP 400 (which seems like a strange code to use since it implies the request was actually corrupt, but that's what the Webmention spec says so I suppose it's correct).

I reported this problem to the webmention.rocks project, though I haven't yet received any response. I don't think it's likely to be a very hard fix; webmention.rocks is PHP, and I believe PHP can handle XHTML pretty easily. It might just be a matter of changing the Accept header, but that depends on the exact details of how the XHTML support in PHP works...

2021-10-11

With the webmention.rocks problem still unresolved, I decided to carry on anyway and start implementing the site-specific script, which runs the script to send mentions, then commits the list of sent mentions, and the list of non-Webmention-supporting pages generated during the discovery done at build time, to Git and pushes them (along with the actual content changes) to my origin remote, from which the changes will be pulled back to my desktop. That was largely it for sending support, except for updates/deletes, which I was planning to implement later. For the moment I decided to move onto receiving support. My approach is in one way very simplistic; I just store the source and target URIs to a file in /var/spool and send back a hard-coded response. Most of the CGI script that does this is code to ensure that

  • a POST request was actually used
  • the request was sent with the application/x-www-form-urlencoded media type
  • both a source and a target are specified
  • and that neither contains a tab or a newline.

If any of these checks fail, a hard-coded error response is sent back with an appropriate status code (the Webmention spec demands that only 400 be used, but I refuse to cut back on useful semantics to satisfy that requirement). Otherwise, the source, a tab, the target, and a newline are appended to the file, and the hard-coded success response is sent. It's good to keep the CGI script this simple, since it means it has no dependencies except the Python standard library. This makes it very easy to install on any CGI-supporting server, at least in theory.

Later, the spool file is read by a script that checks that the targets are on the site they're meant to be on and that the sources actually link to them, then writes them to a moderation queue for me to check manually. I modified my site-specific script for mentions to run this check step and commit the moderation queue as well, so it too will end up on my desktop, where I can do moderation and move the good mentions to a final accepted list. Based on that, I'll be able to do a final processing step of downloading the pages, extracting the metadata, and saving that to an XSite dataset I can use to include mentions in-line on the mentioned pages.

It's a lot of steps, but I like it because each one does just one thing, without closely tying everything together.

2021-10-14

I took a break for a few days before getting into the final stretch of tool-building so I could start actually rebuilding my site. When I came back I wrote the script to pull down the source pages of new moderated mentions, pull out (in extremely simplistic and fragile fashion) a title, author, and first-paragraph summary, and put them into a CSV file which can be used as an XSite dataset. At some point I intend to come back and make this a lot more robust, and pull out more information like a publication date if it's available.

After that, I wrote an XSLT stylesheet to generate a mentions section and added it to the bottom of the blog template. In the process I added a little code to XSite to pass the output path and document URI as parameters so XSLT could get them. But with that in place, the mentions XSLT worked with only very little fiddling! The results are relatively basic, but they work well enough, so I'm happy for now.

While I was passing through, I also added some text handling from the info extraction script to the parsing that the link extraction code uses, which should make it a bit more robust to different encodings of XML documents.

At this point, I was reasonably confident both sides of Webmention would actually work to some degree, though maybe not as nicely as might be desired. There was clearly still work to be done, but I decided to leave it there for the day and come back to improve the content extraction later.

2021-10-25

It turned out my "leaving it there" ended up lasting over a week. When I did come back, I mostly implemented a rough, hand-made Microformats2 content extractor and made some other minor improvements. After a lot of rounds of testing against various pages, I eventually got it to parse the h-entrys close enough to nicely that it seemed like enough. Unfortunately, Microformats2 isn't used totally consistently; while there is a standard, it's fairly loose and people apply it in different ways. Since I wrote my own processing code instead of using an off-the-shelf Microformats2 parser, I ended up with something of a mess coming out of a lot of pages at first. A lot of the fighting was down to trying to get a summary out of pages that didn't have a p-summary; it ended up being a multi-way solution that tries to look for various things under the e-content, including a <p>, simple unwrapped text, and even a <li>! Eventually, I figured out enough common patterns that most of the pages I was plugging in were "just working," so I called it good enough for the day. I found that a lot of places still use Microformats1 (even some things on the Microformats2 h-entry examples list were actually Microformats1 hentry! >:( ), so I should probably implement that too and maybe some others like RDFa.

At this point, the tools felt solid enough to work on the site itself, so that's where XBlog development stopped for its first "major release". I'll likely come back to it, probably along with major refactorings to XSite itself. In particular, I'll probably rewrite my link fetching and context extraction around libraries written by people who are actually following specs like RDFa, Microdata, and Microformats closely, instead of my slap-dash "basically works" code (I will probably, however, still end up writing wrappers to fall back between multiple formats, probably including some custom "plain-old HTML" handling).

Tagged: Personal, Projects, XSite, XBlog