Yesterday we had to rotate our main database's credentials as a precaution (they were briefly, accidently, committed to our private repository) and update the various apps using this database. Unfortunately we missed our image registry proxy which queries the database to authorize our users to push to their respective namespace.
Our scheduler might move around instances a lot throughout the day. Every time we pull the manifest from our registry. While the registry was malfunctioning, our servers were unable to pull the manifest and failed to bring up VMs. This especially affected applications that are set to autoscale. These apps will bring up and down instances much more frequently as their traffic shifts.
To detect this in the future, we are setting up alerts for the registry as well as end-to-end checks.