A cascading failure with the ingress of GovDiff
GovDiff was failing from 17th Jan to 27th Jan. There is still some data cleanup to do, but it is running again with the most immediate problems fixed.
Recursive download
The failure was triggered by a gov.uk document which mistakenly linked to itself (it was fixed after I reported it).
I hadn't coded the attachment crawler defensively enough, so the ingress started downloading the same document over and over again, continuing until a volume was full.
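The two missing defences are small. Below is a minimal sketch of a guarded crawl, not the actual GovDiff crawler: the `Site` map is a hypothetical stand-in for "download this URL and extract its links", and `crawl` and `max_depth` are names invented for illustration. It remembers every URL it has already fetched and stops following links beyond a fixed depth.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical stand-in for the network: maps a URL to the links found in it.
type Site = HashMap<String, Vec<String>>;

/// Breadth-first crawl that never fetches a URL twice and never follows
/// links from a document that is already `max_depth` hops from `start`.
fn crawl(site: &Site, start: &str, max_depth: usize) -> Vec<String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut downloaded: Vec<String> = Vec::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    queue.push_back((start.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        // `insert` returns false if the URL was already seen,
        // which covers a document linking to itself.
        if !visited.insert(url.clone()) {
            continue;
        }
        downloaded.push(url.clone());

        // Recursion limit: do not expand links past the allowed depth.
        if depth >= max_depth {
            continue;
        }
        if let Some(links) = site.get(&url) {
            for link in links {
                queue.push_back((link.clone(), depth + 1));
            }
        }
    }
    downloaded
}
```

With a document `a` that links to itself and to `b`, `crawl(&site, "a", 2)` fetches `a` once, then `b`, then whatever `b` links to, and stops there. Breadth-first order is used so that each URL is first reached by its shortest path, which keeps the depth limit meaningful.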
Effect on the datastores
The ingress currently writes to 2 document stores: the legacy git repo which is used by gitgovuk.njk.onl, and the newer, purpose-built repo used by govdiff.njk.onl. The new one is supposed to be the source of truth, but I've kept the git repo around for comparison and because I still quite like it. The new repo stores each version of a document, keyed by the url and then the timestamp it was downloaded at, so each of these downloads was stored with a different timestamp. The repo avoids duplicating identical versions which immediately follow each other, but I had disregarded HTML attribute ordering in my preprocessing, and because that ordering varied between downloads, the deduplication generally didn't work. So this store kept growing.
The git repo represents each document on www.gov.uk as a file in the repo and each update notification received as a commit. It just kept writing over the same file again and again, but the processing of the update never completed, so nothing was ever committed. Internally, git stores those files as content-addressed blobs, and because of the HTML attribute ordering issue above, their contents would differ slightly each time. I think that's how the volume holding the git repo ran out of space.
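To illustrate the attribute ordering problem, here is a minimal sketch of canonicalising a single start tag before comparing versions. It is not GovDiff's preprocessing: the `Tag` struct and the `canonical` function are hypothetical, and a real implementation would sit behind an HTML parser and also handle escaping, whitespace and attributes without values.

```rust
/// A simplified, hypothetical model of a parsed HTML start tag.
struct Tag {
    name: String,
    attrs: Vec<(String, String)>,
}

/// Serialise a tag with its attributes sorted by name (then value), so that
/// the order they appeared in the source no longer affects the output.
/// Assumption: attribute values need no escaping in this toy example.
fn canonical(tag: &Tag) -> String {
    let mut attrs = tag.attrs.clone();
    attrs.sort();
    let mut out = format!("<{}", tag.name);
    for (name, value) in &attrs {
        out.push_str(&format!(" {}=\"{}\"", name, value));
    }
    out.push('>');
    out
}

/// Two versions count as duplicates if their canonical forms match.
fn is_duplicate(previous: &Tag, next: &Tag) -> bool {
    canonical(previous) == canonical(next)
}
```

Here a tag downloaded as `<a href="/x" class="link">` and one downloaded as `<a class="link" href="/x">` both serialise to `<a class="link" href="/x">`, so the second download would be recognised as a duplicate of the first.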
Full volumes
Writing the data to 2 stores makes things complicated: if one write fails, the stores can end up inconsistent. I'd put the write to the new repo first as it was more important, so once the git repo volume was full, each processed update would download a doc, write it to the new repo, then attempt to write it to the git repo and fail. The new repo is not transactional (and doesn't need to be), so there is no rollback.
Here I had some very bad error handling. The failing update would be skipped and the ingress would continue processing other updates, but the update was also left in the processing inbox, so the ingress would come around to it again on the next loop. As this went on, more updates were arriving, and on each loop all of those documents would be downloaded and added to the new repo again, until that volume filled up as well.
What should have happened?
I had done several things badly which contributed to this cascading failure. Some of them originate in how urgently I rewrote the ingress in Rust when the old Node.js one failed: I cut many corners.
- When the downloaded document linked to itself, that link shouldn't have been followed
- When linked documents kept linking to other documents, there should have been a recursion limit which stopped it
- When the duplicate document versions were added, they should have been properly canonicalised so that the duplicates would not be stored
- When processing an update, the input file for the update should have been moved out of the inbox into a working directory, so that if it failed it would not be processed again automatically
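
The last point can be sketched as follows. This is only an illustration under assumptions, not the ingress code: `claim` and `run_update` are hypothetical names, and the inbox and working directory are assumed to be on the same filesystem, since `std::fs::rename` will not move a file across mount points.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Move an update file out of the inbox into the working directory
/// and return its new location.
fn claim(update: &Path, working_dir: &Path) -> io::Result<PathBuf> {
    let file_name = update.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "update path has no file name")
    })?;
    let claimed = working_dir.join(file_name);
    fs::rename(update, &claimed)?;
    Ok(claimed)
}

/// Claim the update first, then process it. If `process` fails, the file
/// stays in the working directory for inspection and is not picked up by
/// the next pass over the inbox. On success the claimed file is removed.
fn run_update<F>(update: &Path, working_dir: &Path, process: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    let claimed = claim(update, working_dir)?;
    process(&claimed)?;
    fs::remove_file(&claimed)
}
```

The design choice is that a failed update needs a deliberate action (moving the file back, or a separate retry job) before it runs again, rather than being retried on every loop by default.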
Where are we now?
I knew I was leaving technical debt when I built this, but it was better to get it working and collecting data. Since the new Rust version has been running there have only been minor failures, though that is mostly a testament to how uniform and reliable the gov.uk content is. With most sources I would have had many more difficulties.
The biggest difficulty now is cleaning up the datastore, which gives me a use case for an experiment I was doing with stream processing of HTML, as I need to parse and canonicalise several gigabytes of it. The hard part, of course, will be judging whether the cleanup is valid. The fixes for the problems above are pretty simple, and I have a list of other items of technical debt for this project. I could fix those and avoid a potential failure, or I could write a better HTML diff algorithm or add proper search, either of which would make a big difference to the usefulness of this project. In the end, if no one finds it useful, no one will care if it's failing.