During a routine system procedure, a migration job was triggered. At the time of the occurrence we were experiencing a peak in processed documents, which we attributed to customers' end-of-month/year processes. The migration job had a bug that caused a processing loop, filling our database cluster's disk faster than the auto-scaler could expand it. The downtime was due to our DB cluster nodes having to recover all data (~140 GB at crash time).
Instructions | Report |
---|---|
Leadup | |
List the sequence of events that led to the incident. | The automatic migration routine job started; a bug in the job created a processing loop that rapidly increased the data stored in the DB; the DB's growth rate was higher than the auto-scaler's scale-up rate |
Fault | |
Describe what didn't work as expected. If available, include relevant data visualizations. | The migration was not expected to increase DB size substantially, even during peak times, and the Atlas auto-scaler was expected to handle DB growth by increasing storage capacity accordingly. Neither expectation held: the migration's processing loop grew the database faster than the auto-scaler could react. |
Impact | |
Describe how internal and external users were impacted during the incident. Include how many support cases were raised. | Studio was down and all of its features were inaccessible to all customers and internal users. The Studio API was also down for the entire incident period, so customers were not able to use it. One support ticket was opened by a customer and another later by the support team. Some customers also complained by email to the Customer Success team. |
Detection | |
Report when the team detected the incident and how they knew it was happening. Describe how the team could've improved time to detection. | The first to detect the problem were internal users, who reported files stuck on upload in Studio. The Engineering team was contacted directly, and the Support team followed Engineering's analysis and investigation. Our monitoring system alerted on a large backlog of messages in our queueing mechanism, and when all systems went down, our StatusCake and Pingdom monitors also started sending alerts. |
Response | |
Report who responded to the incident and describe what they did at what times. Include any delays or obstacles to responding. | 12h02 - The first symptoms (documents stuck on upload) were reported by internal teams, who started investigating; 14h27 - first customer request arrived; 14h42 - first response sent to customers; 14h44 - an incident team was assembled (video call), with Engineering and Support working together to investigate and communicate with customers; 15h05 - the API went down for the first time; 15h57 - Atlas support was contacted and started investigating further; 16h22 - first response from Atlas support, saying that the nodes were healing and that we should let the process finish |
Recovery | |
Report how the user impact was mitigated and when the incident was deemed resolved. Describe how the team could've improved time to mitigation. | 17h44 - first secondary replica recovered; 17h54 - all DB nodes recovered; 19h10 - Studio is back up; 19h14 - monitoring phase started; 19h20 - Studio API is back up; 20h29 - incident resolved |
Five whys root cause identification | |
Run a 5-whys analysis to understand the true causes of the incident. | Why did Studio/API go down? Because the MongoDB cluster crashed; Why did the MongoDB cluster crash? Because all 3 replicas ran out of space; Why did the replicas run out of space? Because the Atlas resource manager did not have enough time to scale up and absorb the DB growth; Why was the DB growth rate higher than the scale-up rate? Because the migration caused a processing loop over huge files; Why did the migration cause a processing loop? Because the migration had a bug that was triggered during peak time |
Related records | |
Check if any past incidents could've had the same root cause. Note what mitigation was attempted in those incidents and ask why this incident occurred again. | N/A |
Lessons learned | |
Describe what you learned, what went well, and how you can improve. | Improve QA testing to ensure these types of bugs are caught before reaching production; enhance vendor response time on critical issues |
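The fault and detection sections above describe the core failure mode: disk usage grew faster than the auto-scaler could react, and the first alerts came from symptoms (stuck uploads, queue backlog) rather than from the disk itself. A growth-rate check could have fired before the cluster ran out of space. The sketch below is a hypothetical illustration, not our actual monitoring configuration; all names and thresholds are assumptions:

```python
def should_alert(samples, capacity_gb, scaleup_gb_per_hour):
    """Alert when extrapolated disk growth outpaces the scale-up rate.

    samples: list of (minutes_elapsed, used_gb) measurements,
             oldest first. Thresholds here are illustrative only.
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    # Linear growth rate over the observation window, in GB/hour.
    growth_gb_per_hour = (u1 - u0) / ((t1 - t0) / 60.0)
    # Hours until the cluster fills at the current growth rate.
    headroom_hours = (capacity_gb - u1) / max(growth_gb_per_hour, 1e-9)
    # Alert if growth exceeds what scale-up can add and the disk
    # would fill within the hour, i.e. before scaling can catch up.
    return growth_gb_per_hour > scaleup_gb_per_hour and headroom_hours < 1.0
```

A check like this targets the exact race described in the leadup ("space grow rate was higher than scale-up rate") instead of waiting for downstream symptoms.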
11h22 - Migration job started
12h02 - The first symptoms (documents stuck on upload) were communicated
14h41 - A fix was deployed to correct the migration bug
15h57 - Atlas support was contacted and started investigating further
16h22 - First response from Atlas support, saying the nodes were recovering
17h44 - First secondary replica recovered
17h54 - All DB nodes recovered
19h10 - Studio is back up
19h14 - Monitoring phase started
19h20 - Studio API is back up
20h29 - Incident resolved
Issue | Owner | Action items | Documentation |
---|---|---|---|
Improve QA testing so this type of bug doesn't reach production | Engineering | Improve QA test coverage so this type of bug is caught before deploy to production | |
Enhance vendor response time on critical issues | Engineering and Product | Review support contracts and TTRs to improve vendors' response time | |
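The root cause was a migration bug that reprocessed documents in a loop, and the first action item calls for tests that catch this class of bug. One generic guard is to make migration jobs idempotent per document. This is a hypothetical sketch (the names `run_migration` and `migrate_doc` are assumptions, not the actual Studio migration code):

```python
def run_migration(docs, migrate_doc):
    """Apply migrate_doc to each document exactly once.

    Tracking processed IDs guards against the reprocessing loop
    that caused this incident: a document that re-enters the work
    queue is skipped instead of being migrated (and stored) again.
    """
    processed = set()
    queue = list(docs)
    while queue:
        doc = queue.pop(0)
        if doc["_id"] in processed:
            continue  # already migrated; do not loop
        migrate_doc(doc)
        processed.add(doc["_id"])
    return processed
```

A QA test for this property simply feeds the job duplicate documents and asserts each one was migrated once, which would have failed loudly before this bug reached production.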