
GovDiff: A tool for highlighting UK government information updates

In 2019, I made a tool called GitGovUK to track changes to documents published by the UK government on www.gov.uk, and especially to highlight exactly what changed with each update (with an inline diff). It was initially for Brexit-related advice, and I later expanded it to track other topics.

In 2021, I completely redesigned and rewrote it as GovDiff. This post is about why I made it and where I would like to take it. If there is interest, I plan to cover the design and build of both versions in more detail in future posts.

Background

It was early 2019, and Brexit was in one of its weirdest periods. UK Prime Minister Theresa May's Withdrawal Agreement had been published and agreed with the 27 other EU member states. The UK government tried 3 times to have it agreed by parliament, but it was rejected each time. There was a deadline to reach an agreement, which, if passed, would have resulted in the UK "crashing out" of the EU. That would have immediately ended over 40 years of treaties and legislation, bringing with it massive legal and regulatory uncertainty. The deadline was 29th March, then 12th April, then 31st October.

I was a British person living in Germany, and apart from finding Brexit both entertaining and appalling, I also wanted to know how it was going to affect me personally and what I could do to prepare for it. I was subscribed to email updates for the government's 'Living in Germany' advice page, but it is a long page and I didn't find the change descriptions had enough detail to feel like I could stay up to date without having to reread the whole document.

I had recently been made redundant from a job which was burning me out with frequent pivots and short deadlines. The deadlines meant working fast and the pivots meant not having anything to celebrate. I had been a team lead and had had little chance to write code, so I felt a need to practice coding again and get some of my enthusiasm back. To accomplish that, I thought it would help to build something that was all mine, and that I could approach in a way that avoided all the things which had caused development friction in my projects at that organisation.

We had short deadlines, without which I feel none of the other issues would have been so problematic. We had many contributors, all hired in a short space of time, so we didn't have enough time to converge on ways of working. We created many micro-services so that we could distribute the work between many developers, which led to more design and development time being spent on inter-service APIs than on almost anything else. We made the services polyglot (many programming languages), as we didn't have any language that everyone would agree on or that was a clear choice from our collective previous experience. We aimed at using REST APIs for everything, as was in vogue at the time, but I found that REST asks for a lot in terms of design and doesn't offer a lot of flexibility, making it a burden for internal APIs. I also had the feeling that some services became fairly anaemic: they weren't much more than CRUD services which exposed access to a database in a different format, leaving the frontend single page application to do the work of tying it all together.

My new project would be different. It wouldn't have a deadline, although I wanted to see results quickly. It would only have me working on it, which meant I would be taking responsibility for design and frontend for the first time. It would have no inter-service APIs, thus avoiding all that work. And it would use the minimal number of languages and technologies.

Attempt 1: GitGovUK

The system receives update emails, fetches the documents that have been updated and stores them along with the description of each update. A web tool then shows those updates and produces an inline diff of the document to show the changes. For this initial version I used git to store the updates, used the GitHub API to access them, and built the web tool as a single page application with Preact.

Screenshot of GitGovUK

The repo that it pushes to continues to grow, now with over 20,000 updates recorded.

I'll talk more about the reasoning behind these design decisions and their shortcomings in a future post, but in summary: although this version was quick to deliver, it wasn't very adaptable. It got to this point and then got stuck. I ended up wanting to rewrite it for three reasons. Page load performance was poor (GitGovUK needs to make N + 1 queries from the browser to the GitHub API to show the index page). Once it wasn't just about Brexit anymore, it needed filtering, and I didn't see a reasonable way to accomplish this in git. And when I added the fetching of attachments, git didn't really store that information properly.

Attempt 2: GovDiff

By late 2020, with the pandemic taking up most of the government's attention, I got around to adding updates about that. Then the ingress program failed, and I found the code I'd written nearly 2 years before almost unreadable and unfixable. So I decided to rewrite it in my new language of choice, Rust. That went pretty well. It was a rush job, as the original was broken and I actually had a job at the time, but I could reuse bits from the original and I found it much easier to write decent tests in Rust.

I later got around to designing and building a replacement for the storage and browsing tool. The new version swapped git for plain files stored in the filesystem under a particular structure, and swapped client-side rendering for server-side rendering. This version took much longer to reach feature parity with the first, but wasn't limited in the same way. Features like filtering, or showing the history of updates for an individual page or URL prefix, were uncomplicated to implement. The performance was also much better, as only one request is needed per server-rendered page.
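To give an idea of why plain files make this kind of feature easy, here is a minimal sketch. It is not GovDiff's actual code or layout: the `Update` struct, the `<root>/<url path>/<timestamp>.html` layout and the function names are all assumptions made up for illustration.

```rust
use std::path::{Path, PathBuf};

/// One recorded update to a page (hypothetical shape, not GovDiff's real type).
#[derive(Debug, Clone, PartialEq)]
struct Update {
    /// Path of the page on www.gov.uk, starting with '/'.
    url_path: String,
    /// Time of the update in a form that sorts correctly as text.
    timestamp: String,
}

/// Assumed layout: <root>/<url path>/<timestamp>.html
fn version_path(root: &Path, update: &Update) -> PathBuf {
    // Strip the leading '/' so that `join` appends instead of replacing `root`.
    root.join(update.url_path.trim_start_matches('/'))
        .join(format!("{}.html", update.timestamp))
}

/// All updates whose page path starts with `prefix`, oldest first.
fn history_for_prefix<'a>(updates: &'a [Update], prefix: &str) -> Vec<&'a Update> {
    let mut matching: Vec<&Update> = updates
        .iter()
        .filter(|u| u.url_path.starts_with(prefix))
        .collect();
    matching.sort_by(|a, b| a.timestamp.cmp(&b.timestamp));
    matching
}

fn main() {
    // Illustrative data only.
    let updates = vec![
        Update {
            url_path: "/guidance/living-in-germany".to_string(),
            timestamp: "20210302T090000Z".to_string(),
        },
        Update {
            url_path: "/news/some-announcement".to_string(),
            timestamp: "20210301T120000Z".to_string(),
        },
        Update {
            url_path: "/guidance/living-in-germany".to_string(),
            timestamp: "20210105T150000Z".to_string(),
        },
    ];

    for update in history_for_prefix(&updates, "/guidance/") {
        println!("{}", version_path(Path::new("data"), update).display());
    }
}
```

With a layout along these lines, the history of a page is just a directory listing and a prefix filter is a string comparison, neither of which needs a round trip to an external API.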

Screenshot of GovDiff

Future work

As with every project, I have a long list of improvements that I would like to make if time allows it. These seem to be the most interesting:

HTML diffing improvements

Some of the diffs that this app produces are terrible, and a massive barrier to usability. The library I used (htmldiff) uses a lot of memory in complex situations. I'm not sure how much, but it feels like O(n^2). In some cases, this diff algorithm produces invalid HTML. In others, such as the one below, it produces a valid diff, but one which is not readable:

ReadYou aboutcan applyingstill fortravel ato medicalEngland exemptionif fromyou vaccinationdo usingnot the NHS COVIDqualify Passas iffully vaccinated but you livemust infollow Englanddifferent rules.

I don't think HTML diffing can be solved perfectly, but there are definitely some improvements that could be made to the way that the diffs come out in GovDiff. I would like to do some more research here and try to improve the result. If I get around to working on this further, I'll post my findings.
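To illustrate both the interleaving and the quadratic feeling, here is a minimal sketch of a word-level diff based on a longest-common-subsequence table. This is a textbook approach and I am not claiming it is what htmldiff does internally; the `Op` enum and `word_diff` function are names made up for this example. The table has one cell per pair of words, so memory grows with the product of the two document lengths.

```rust
/// One step of a word-level diff (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Op<'a> {
    Keep(&'a str),
    Delete(&'a str),
    Insert(&'a str),
}

/// Word-level diff using a full LCS table: O(n * m) time and memory.
fn word_diff<'a>(old: &'a str, new: &'a str) -> Vec<Op<'a>> {
    let a: Vec<&str> = old.split_whitespace().collect();
    let b: Vec<&str> = new.split_whitespace().collect();

    // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..].
    let mut lcs = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in (0..a.len()).rev() {
        for j in (0..b.len()).rev() {
            lcs[i][j] = if a[i] == b[j] {
                lcs[i + 1][j + 1] + 1
            } else {
                lcs[i + 1][j].max(lcs[i][j + 1])
            };
        }
    }

    // Walk the table from the start, emitting one operation per step.
    let (mut i, mut j) = (0, 0);
    let mut ops = Vec::new();
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            ops.push(Op::Keep(a[i]));
            i += 1;
            j += 1;
        } else if lcs[i + 1][j] >= lcs[i][j + 1] {
            ops.push(Op::Delete(a[i]));
            i += 1;
        } else {
            ops.push(Op::Insert(b[j]));
            j += 1;
        }
    }
    // Whatever is left on either side is pure deletion or insertion.
    ops.extend(a[i..].iter().map(|&w| Op::Delete(w)));
    ops.extend(b[j..].iter().map(|&w| Op::Insert(w)));
    ops
}

fn main() {
    for op in word_diff("the cat sat", "the dog sat") {
        println!("{:?}", op);
    }
}
```

Any word that happens to appear in both versions acts as an anchor, so when a sentence is replaced by a completely different one that shares a few common words, the deletions and insertions end up alternating word by word. Diffing at a coarser granularity, or falling back to replacing the whole block when too little is shared, are the kinds of improvement I have in mind.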

Feeds and sitemaps as ingress

When I started out with this, I was trying to automate and improve something I was doing manually: receiving email notifications and checking what had changed. I didn't go and search to see whether there was a better ingress than email. I recently realised there is both an XML sitemap and an Atom feed, both of which would provide a better way to find updates. But it looks like there would still be a place for the email notifications, as some update descriptions only seem to be available over email.

Text search

There are of course many solutions for text search indexes, but from where I've got to so far, none fitted in with how I wanted to build this tool. I'll explain more in a future post.

My new home

I haven't lived in the UK for 10 years now, so maybe I should try to make this work for advice in my new home.

Clusters, trends and stats

Due to the quantity of data, the tool is probably only useful when targeted at particular updates or documents. It might be possible to create some stats, trends and clusters from the data, to make it a bit more interesting to look at without having a target.

Use semantic data in the documents

Most documents provide the date they were published and updated. Some provide the list of updates with descriptions, and some provide other useful information which could improve indexing.

Feedback

Please try out GovDiff. If you are having trouble finding a document with previous versions to diff against, try switching the topic filter from 'All' to 'Brexit'.

Have a use for this, or any comments or suggestions? Email platy@njk.onl