tag:status.flyio.net,2005:/historyFly.io Status - Incident History2024-03-28T13:40:35-05:00Fly.iotag:status.flyio.net,2005:Incident/203900352024-03-28T13:10:24-05:002024-03-28T13:10:24-05:00Network Issue - NRT<p><small>Mar <var data-var='date'>28</var>, <var data-var='time'>13:10</var> CDT</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Mar <var data-var='date'>28</var>, <var data-var='time'>12:34</var> CDT</small><br><strong>Investigating</strong> - We have confirmed a network issue in the Tokyo, Japan (nrt) region. Some applications in this region may have interrupted network connectivity.</p>tag:status.flyio.net,2005:Incident/203896342024-03-28T13:10:09-05:002024-03-28T13:10:09-05:00Capacity Issues in MAD region<p><small>Mar <var data-var='date'>28</var>, <var data-var='time'>13:10</var> CDT</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Mar <var data-var='date'>28</var>, <var data-var='time'>11:46</var> CDT</small><br><strong>Identified</strong> - We're actively adding more capacity to our MAD region. Scaling events and blue-green deployments may fail until this is resolved. Please consider scaling in nearby regions (LHR, CDG, AMS).</p>tag:status.flyio.net,2005:Incident/203111922024-03-20T22:19:54-05:002024-03-20T22:19:54-05:00App Logs Delayed or Missing<p><small>Mar <var data-var='date'>20</var>, <var data-var='time'>22:19</var> CDT</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Mar <var data-var='date'>20</var>, <var data-var='time'>21:31</var> CDT</small><br><strong>Monitoring</strong> - We're monitoring recovery of our primary logging cluster.</p><p><small>Mar <var data-var='date'>20</var>, <var data-var='time'>20:22</var> CDT</small><br><strong>Update</strong> - Hardware replacement for our primary logging cluster is underway. Stored logs remain delayed or missing until this process is complete. 
Live following of logs and log-shipper functionality will not be impacted by this process.</p><p><small>Mar <var data-var='date'>20</var>, <var data-var='time'>13:57</var> CDT</small><br><strong>Identified</strong> - Live following of logs, as well as log-shipper functionality, is now restored. Stored logs remain delayed or missing while we work towards recovery of our primary logging cluster.</p><p><small>Mar <var data-var='date'>20</var>, <var data-var='time'>13:09</var> CDT</small><br><strong>Update</strong> - We continue to investigate this issue. Logs for most customer apps are delayed or missing at this time.</p><p><small>Mar <var data-var='date'>20</var>, <var data-var='time'>10:19</var> CDT</small><br><strong>Investigating</strong> - Our observability infrastructure is currently experiencing delayed log ingestion.</p>tag:status.flyio.net,2005:Incident/203147452024-03-20T19:05:23-05:002024-03-20T19:06:04-05:00Network issue - SJC<p><small>Mar <var data-var='date'>20</var>, <var data-var='time'>19:05</var> CDT</small><br><strong>Resolved</strong> - This incident has been resolved.<br />There was a power disturbance at an SJC datacenter that caused a network switch and several servers to be rebooted. Some applications hosted in this region were offline from 22:56 to 23:26 UTC.</p><p><small>Mar <var data-var='date'>20</var>, <var data-var='time'>18:36</var> CDT</small><br><strong>Monitoring</strong> - Network connectivity has been restored.</p><p><small>Mar <var data-var='date'>20</var>, <var data-var='time'>18:22</var> CDT</small><br><strong>Identified</strong> - We have confirmed a network issue in the San Jose, California (sjc) region. 
Some applications in this region may have interrupted network connectivity.</p>tag:status.flyio.net,2005:Incident/203023312024-03-19T16:28:12-05:002024-03-19T16:28:12-05:00GraphQL API and UI unavailable<p><small>Mar <var data-var='date'>19</var>, <var data-var='time'>16:28</var> CDT</small><br><strong>Resolved</strong> - We have identified and rolled out a fix.</p><p><small>Mar <var data-var='date'>19</var>, <var data-var='time'>11:10</var> CDT</small><br><strong>Monitoring</strong> - We rolled back a change that broke our API and, consequently, our UI and CLI. We are seeing recovery and will keep monitoring before we resolve the incident.</p><p><small>Mar <var data-var='date'>19</var>, <var data-var='time'>11:08</var> CDT</small><br><strong>Identified</strong> - We believe we've identified the issue and are working to roll back.</p><p><small>Mar <var data-var='date'>19</var>, <var data-var='time'>10:43</var> CDT</small><br><strong>Investigating</strong> - We're currently investigating an increase in 500 responses from our GraphQL API, which is also affecting our CLI and UI.</p>tag:status.flyio.net,2005:Incident/202454562024-03-14T12:35:08-05:002024-03-14T12:35:08-05:00Increased 500s from the dashboard<p><small>Mar <var data-var='date'>14</var>, <var data-var='time'>12:35</var> CDT</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Mar <var data-var='date'>14</var>, <var data-var='time'>12:28</var> CDT</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Mar <var data-var='date'>14</var>, <var data-var='time'>12:25</var> CDT</small><br><strong>Identified</strong> - The issue has been identified and a fix is being implemented.</p><p><small>Mar <var data-var='date'>14</var>, <var data-var='time'>12:03</var> CDT</small><br><strong>Investigating</strong> - We are currently investigating this 
issue.</p>tag:status.flyio.net,2005:Incident/202377422024-03-13T19:21:20-05:002024-03-13T19:21:20-05:00Degraded networking in HKG<p><small>Mar <var data-var='date'>13</var>, <var data-var='time'>19:21</var> CDT</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Mar <var data-var='date'>13</var>, <var data-var='time'>18:54</var> CDT</small><br><strong>Update</strong> - We are performing emergency network maintenance in HKG. We expect network connectivity to be unavailable for approximately 15 minutes.</p><p><small>Mar <var data-var='date'>13</var>, <var data-var='time'>16:37</var> CDT</small><br><strong>Identified</strong> - We are working to resolve intermittent connectivity issues in HKG</p>tag:status.flyio.net,2005:Incident/202280532024-03-12T18:16:41-05:002024-03-12T18:16:41-05:00Metrics issues<p><small>Mar <var data-var='date'>12</var>, <var data-var='time'>18:16</var> CDT</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Mar <var data-var='date'>12</var>, <var data-var='time'>15:51</var> CDT</small><br><strong>Monitoring</strong> - An application routing fix has been deployed, and metrics query response times have returned back to normal levels.</p><p><small>Mar <var data-var='date'>12</var>, <var data-var='time'>15:37</var> CDT</small><br><strong>Identified</strong> - The issue has been identified and a fix is being implemented.</p><p><small>Mar <var data-var='date'>12</var>, <var data-var='time'>15:06</var> CDT</small><br><strong>Investigating</strong> - We're investigating an issue accessing application metrics.</p>tag:status.flyio.net,2005:Incident/201665952024-03-06T00:04:50-06:002024-03-06T00:04:50-06:00Networking issue in LHR<p><small>Mar <var data-var='date'> 6</var>, <var data-var='time'>00:04</var> CST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Mar <var data-var='date'> 5</var>, <var data-var='time'>08:24</var> 
CST</small><br><strong>Monitoring</strong> - We have routed around the bad circuit and are monitoring.</p><p><small>Mar <var data-var='date'> 5</var>, <var data-var='time'>08:13</var> CST</small><br><strong>Identified</strong> - We have identified a bad circuit and are investigating</p><p><small>Mar <var data-var='date'> 5</var>, <var data-var='time'>07:51</var> CST</small><br><strong>Investigating</strong> - We are currently investigating slow network connections in LHR.</p>tag:status.flyio.net,2005:Incident/200817392024-02-27T06:23:09-06:002024-02-27T06:23:09-06:00API Unavailable<p><small>Feb <var data-var='date'>27</var>, <var data-var='time'>06:23</var> CST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Feb <var data-var='date'>26</var>, <var data-var='time'>15:07</var> CST</small><br><strong>Identified</strong> - Some security updates affected internal communication of services causing some API requests to fail.</p><p><small>Feb <var data-var='date'>26</var>, <var data-var='time'>14:13</var> CST</small><br><strong>Investigating</strong> - We are currently investigating.</p>tag:status.flyio.net,2005:Incident/200799992024-02-26T12:21:38-06:002024-02-26T12:21:38-06:00routing issues and ECONNRESET<p><small>Feb <var data-var='date'>26</var>, <var data-var='time'>12:21</var> CST</small><br><strong>Resolved</strong> - Routing is normalized.</p><p><small>Feb <var data-var='date'>26</var>, <var data-var='time'>11:26</var> CST</small><br><strong>Monitoring</strong> - Routing is normalized and we're monitoring to ensure traffic is reaching the correct regions.</p><p><small>Feb <var data-var='date'>26</var>, <var data-var='time'>10:06</var> CST</small><br><strong>Investigating</strong> - We're investigating possible routing problems, causing reset connections (visible as ECONNRESET or "Failed to fetch" client side and "client closed connection" in server 
logs).</p>tag:status.flyio.net,2005:Incident/200519032024-02-22T16:32:08-06:002024-02-22T16:32:08-06:00API errors when creating machines<p><small>Feb <var data-var='date'>22</var>, <var data-var='time'>16:32</var> CST</small><br><strong>Resolved</strong> - This incident is resolved.</p><p><small>Feb <var data-var='date'>22</var>, <var data-var='time'>16:08</var> CST</small><br><strong>Monitoring</strong> - We've deployed a fix for this problem and are monitoring to ensure everything is working normally.</p><p><small>Feb <var data-var='date'>22</var>, <var data-var='time'>15:41</var> CST</small><br><strong>Identified</strong> - We've identified the cause of this issue. We've determined it's most likely to affect deploys of newly-created organizations and/or applications. It might affect deploys even on old applications with builder failures. A mitigation for the latter is to use --local-only when deploying. Full resolution is underway.</p><p><small>Feb <var data-var='date'>22</var>, <var data-var='time'>15:18</var> CST</small><br><strong>Investigating</strong> - Machine creation operations, including deploys, currently fail with "could not launch machine: failed to launch VM: failed to get org". Other operations such as destroy work normally. 
Currently-running machines continue to work normally.</p>tag:status.flyio.net,2005:Incident/200404112024-02-21T14:57:00-06:002024-02-21T14:57:00-06:00Degraded API Performance<p><small>Feb <var data-var='date'>21</var>, <var data-var='time'>14:57</var> CST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Feb <var data-var='date'>21</var>, <var data-var='time'>12:24</var> CST</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Feb <var data-var='date'>21</var>, <var data-var='time'>12:18</var> CST</small><br><strong>Identified</strong> - The issue has been identified and a fix is being implemented.</p><p><small>Feb <var data-var='date'>21</var>, <var data-var='time'>12:09</var> CST</small><br><strong>Update</strong> - We are continuing to investigate this issue.</p><p><small>Feb <var data-var='date'>21</var>, <var data-var='time'>12:09</var> CST</small><br><strong>Investigating</strong> - We are currently investigating an issue with the API that involves deploying, creating, and updating machines.</p>tag:status.flyio.net,2005:Incident/200091942024-02-17T19:06:10-06:002024-02-17T19:06:10-06:00Application logs are not currently available<p><small>Feb <var data-var='date'>17</var>, <var data-var='time'>19:06</var> CST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Feb <var data-var='date'>17</var>, <var data-var='time'>16:50</var> CST</small><br><strong>Identified</strong> - The logs issue has been identified and we are continuing to investigate.</p><p><small>Feb <var data-var='date'>17</var>, <var data-var='time'>16:12</var> CST</small><br><strong>Monitoring</strong> - A fix is rolling out across the fleet but may take some time to unblock application logs as we run through the backlog of hosts. 
We are continuing to monitor as logs become available.</p><p><small>Feb <var data-var='date'>17</var>, <var data-var='time'>15:54</var> CST</small><br><strong>Identified</strong> - The logs issue has been identified and a fix is being rolled out across the fleet.</p><p><small>Feb <var data-var='date'>17</var>, <var data-var='time'>15:00</var> CST</small><br><strong>Investigating</strong> - Customer logs are not currently available. We are investigating.</p>tag:status.flyio.net,2005:Incident/199717102024-02-12T21:56:52-06:002024-02-12T21:56:52-06:00State Management Maintenance<p><small>Feb <var data-var='date'>12</var>, <var data-var='time'>21:56</var> CST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Feb <var data-var='date'>12</var>, <var data-var='time'>20:59</var> CST</small><br><strong>Identified</strong> - We are performing maintenance on our state management database. Recently updated resources may take longer to be accessible.</p>tag:status.flyio.net,2005:Incident/199509922024-02-09T14:37:50-06:002024-02-09T14:37:50-06:00Context timeout exceeded when deploying<p><small>Feb <var data-var='date'> 9</var>, <var data-var='time'>14:37</var> CST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Feb <var data-var='date'> 9</var>, <var data-var='time'>14:14</var> CST</small><br><strong>Identified</strong> - We identified the faulty component and are attempting remediation. Users connected to WireGuard gateways may see issues with connectivity for a brief time.</p><p><small>Feb <var data-var='date'> 9</var>, <var data-var='time'>13:43</var> CST</small><br><strong>Investigating</strong> - Some users may see an "error: context timeout exceeded" error when deploying apps. 
We are investigating.</p>tag:status.flyio.net,2005:Incident/199367272024-02-08T01:23:24-06:002024-02-08T01:23:24-06:00API errors with secrets and volumes<p><small>Feb <var data-var='date'> 8</var>, <var data-var='time'>01:23</var> CST</small><br><strong>Resolved</strong> - An issue with the secrets service that was affecting API requests has been resolved.</p><p><small>Feb <var data-var='date'> 8</var>, <var data-var='time'>00:21</var> CST</small><br><strong>Investigating</strong> - We are currently investigating issues with setting secrets or launching machines or volumes that use secrets.</p>tag:status.flyio.net,2005:Incident/199339942024-02-07T16:58:25-06:002024-02-07T16:58:25-06:00Issues allocating new public IPv6 addresses<p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>16:58</var> CST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>16:52</var> CST</small><br><strong>Monitoring</strong> - We believe the issue is mitigated. 
We are monitoring for any further issues.</p><p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>16:21</var> CST</small><br><strong>Identified</strong> - We believe we've identified the issue and are working on a fix.</p><p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>15:18</var> CST</small><br><strong>Investigating</strong> - We are investigating issues some customers are having allocating new public IPv6 addresses for their apps.</p>tag:status.flyio.net,2005:Incident/199230432024-02-06T18:04:30-06:002024-02-06T18:04:30-06:00GraphQL API timing out<p><small>Feb <var data-var='date'> 6</var>, <var data-var='time'>18:04</var> CST</small><br><strong>Resolved</strong> - We’ve applied several mitigations to account for increased load on our primary DB, and API requests are no longer timing out.</p><p><small>Feb <var data-var='date'> 6</var>, <var data-var='time'>11:42</var> CST</small><br><strong>Monitoring</strong> - We've mitigated the issue and are continuing to monitor for increased error rates.</p><p><small>Feb <var data-var='date'> 6</var>, <var data-var='time'>10:28</var> CST</small><br><strong>Investigating</strong> - We're aware of elevated timeouts from our GraphQL API.</p>tag:status.flyio.net,2005:Incident/199127192024-02-05T09:17:11-06:002024-02-05T09:17:11-06:00Public DNS Issues in SIN<p><small>Feb <var data-var='date'> 5</var>, <var data-var='time'>09:17</var> CST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Feb <var data-var='date'> 5</var>, <var data-var='time'>08:32</var> CST</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Feb <var data-var='date'> 5</var>, <var data-var='time'>08:32</var> CST</small><br><strong>Identified</strong> - New Singapore edge servers were provisioned last night and came up misconfigured. 
We've implemented a temporary fix which has restored DNS responses, but we're working on finding the root cause and issuing a permanent fix.</p><p><small>Feb <var data-var='date'> 5</var>, <var data-var='time'>08:09</var> CST</small><br><strong>Investigating</strong> - We are currently investigating this issue.</p>tag:status.flyio.net,2005:Incident/199002062024-02-03T10:02:11-06:002024-02-03T10:02:11-06:00Network connectivity issues in LHR<p><small>Feb <var data-var='date'> 3</var>, <var data-var='time'>10:02</var> CST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Feb <var data-var='date'> 3</var>, <var data-var='time'>09:04</var> CST</small><br><strong>Monitoring</strong> - All services are back online. We will continue to monitor.</p><p><small>Feb <var data-var='date'> 3</var>, <var data-var='time'>05:54</var> CST</small><br><strong>Identified</strong> - This datacenter is impacted by a large scale power failure.</p><p><small>Feb <var data-var='date'> 3</var>, <var data-var='time'>05:50</var> CST</small><br><strong>Investigating</strong> - We are currently investigating this issue.</p>tag:status.flyio.net,2005:Incident/198420552024-01-26T21:08:59-06:002024-01-26T21:08:59-06:00Partial metrics outage<p><small>Jan <var data-var='date'>26</var>, <var data-var='time'>21:08</var> CST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Jan <var data-var='date'>26</var>, <var data-var='time'>19:51</var> CST</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Jan <var data-var='date'>26</var>, <var data-var='time'>19:46</var> CST</small><br><strong>Identified</strong> - The issue has been identified and a fix is being implemented.</p><p><small>Jan <var data-var='date'>26</var>, <var data-var='time'>18:22</var> CST</small><br><strong>Investigating</strong> - Some customers may have missing metrics display on 
fly-metrics.net</p>tag:status.flyio.net,2005:Incident/198423442024-01-26T19:24:44-06:002024-01-26T19:24:44-06:00General Networking Issue<p><small>Jan <var data-var='date'>26</var>, <var data-var='time'>19:24</var> CST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Jan <var data-var='date'>26</var>, <var data-var='time'>19:19</var> CST</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Jan <var data-var='date'>26</var>, <var data-var='time'>19:11</var> CST</small><br><strong>Investigating</strong> - We are currently investigating elevated error rates from our proxy</p>tag:status.flyio.net,2005:Incident/198330842024-01-26T04:00:46-06:002024-01-26T04:00:46-06:00SJC Datacenter Maintenance<p><small>Jan <var data-var='date'>26</var>, <var data-var='time'>04:00</var> CST</small><br><strong>Completed</strong> - The scheduled maintenance has been completed.</p><p><small>Jan <var data-var='date'>26</var>, <var data-var='time'>01:01</var> CST</small><br><strong>In progress</strong> - Scheduled maintenance is currently in progress. We will provide updates as necessary.</p><p><small>Jan <var data-var='date'>25</var>, <var data-var='time'>15:53</var> CST</small><br><strong>Scheduled</strong> - Our SJC datacenter will be performing host maintenance during this time. An outage of 30-40 minutes is anticipated.</p>tag:status.flyio.net,2005:Incident/197785892024-01-20T15:28:19-06:002024-01-20T15:28:23-06:00Unplanned network maintenance in MAA<p><small>Jan <var data-var='date'>20</var>, <var data-var='time'>15:28</var> CST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Jan <var data-var='date'>20</var>, <var data-var='time'>15:14</var> CST</small><br><strong>Identified</strong> - We are currently working to mitigate the impact of unplanned network maintenance in the Chennai, India (MAA) region.</p>