Degraded API Performance
Incident Report for Fly.io
Resolved
This incident has been resolved.
Posted Nov 26, 2024 - 08:15 UTC
Update
We've scaled up our systems and applied fixes to our API. Everything should be operational now.
Posted Nov 26, 2024 - 07:52 UTC
Update
We are scaling up our systems to handle the increased traffic.
Posted Nov 26, 2024 - 05:43 UTC
Update
All hosts have completed the restoration process and we are seeing our overall Corrosion cluster health and performance return to normal.

Machines API and GraphQL API error rates are improving, but some users may still see elevated rates of request timeouts and 504 errors when using the Machines API or flyctl commands. We are continuing to monitor these services as they recover.
Posted Nov 26, 2024 - 03:42 UTC
Monitoring
The restore process has completed on the majority of hosts in our fleet and we are seeing overall Corrosion cluster health and performance return to normal.

A small number of hosts are still being worked on; we aim to have them restored shortly.
Posted Nov 26, 2024 - 02:31 UTC
Update
We are running a restoration and reseed process to bring the Corrosion cluster back to a healthy, current state.
During this restoration process, you may see elevated error rates on machines or apps that have been recently updated.
Posted Nov 26, 2024 - 02:06 UTC
Update
The updates have been applied, but we are still not seeing recovery on all Corrosion nodes. We are continuing to work on a fix.

Machines API and proxy performance remain degraded, especially for newly created and updated machines.
Posted Nov 25, 2024 - 23:58 UTC
Update
The Machines API issues stem from a propagation delay in our global state store, Corrosion.

We have finished deploying a configuration change to our Corrosion cluster and will be applying it to each node shortly. We expect improvement once the change is applied.

In the meantime, users may still see degraded Machines API and proxy performance, especially with newly created machines.
Posted Nov 25, 2024 - 22:15 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 25, 2024 - 20:20 UTC
Investigating
We are investigating degraded API performance.
Posted Nov 25, 2024 - 20:10 UTC
This incident affected: Customer Applications, Dashboard, Machines API, and Corrosion.