At around 15:31 EDT on May 11, 2021, we began receiving alerts that our service was unresponsive. Our engineering team was notified shortly afterwards and began investigating.
Service was restored automatically at 15:36 EDT by our self-healing infrastructure. However, at 16:01 EDT, the service became unresponsive again in the same way. Once again, service was restored automatically.
Our investigation isolated the cause to the code that calculates the approximate time at which a message will next be sent to a contact. When a value used in that calculation was set very high, our web servers computed an infeasible time and eventually crashed. When the same request was repeated, our entire web server fleet became unresponsive, and our service responded with 503 errors.
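As an illustration only, a guard against this class of failure might look like the sketch below. It is not our production code: the function name `next_send_time`, the `delay_seconds` parameter, and the one-year `MAX_DELAY` cap are all assumptions made for the example. The idea is to validate and clamp the input before doing any date arithmetic, so that an extreme value cannot produce an infeasible time.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical upper bound on how far ahead a send may be scheduled.
MAX_DELAY = timedelta(days=365)


def next_send_time(now: datetime, delay_seconds: float) -> datetime:
    """Return an approximate next send time, with the delay clamped to a sane range."""
    # Reject NaN (which is never equal to itself) and negative delays.
    if delay_seconds != delay_seconds or delay_seconds < 0:
        raise ValueError("delay_seconds must be a non-negative number")
    # Clamp before building the timedelta so a huge value cannot overflow.
    capped = min(delay_seconds, MAX_DELAY.total_seconds())
    return now + timedelta(seconds=capped)
```

With this guard, an extreme input is reduced to the cap instead of raising an error partway through request handling. Whether to clamp or to reject such input outright is a design choice that depends on the product.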
This issue was first observed on May 10, 2021 at approximately 10:07 EDT, and it is also related to a previous incident on May 10 at 18:25 EDT.
Our team patched the affected code and released the fix to production at around 18:07 EDT.
We know that outages cause frustration during the work day. Please know that our team is working diligently to determine how we can better handle these issues and prevent them from occurring in the future.