Dec 2nd 2021 - We are experiencing issues in Hypatos Studio and Studio API since 14:20 CET
Incident Report for Hypatos
Postmortem

Executive summary

During a routine system procedure, a migration job was triggered. At the time, we were experiencing a peak in processed documents, which we attribute to customers' end-of-month/year processes. The migration job had a bug that caused a processing loop, filling the storage of our database cluster faster than the auto-scaler could expand it. The downtime was prolonged because the DB cluster nodes had to recover all data (~140 GB at crash time).

Postmortem report

Leadup
The automatic migration routine job started. A bug in the job created a processing loop that kept increasing the data stored in the DB. The storage growth rate was higher than the auto-scaler's scale-up rate.
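For illustration, the sketch below shows the kind of loop guard that would have prevented the re-processing: an idempotent, batched migration that marks each document as migrated so the job can never pick it up again. Collection, field, and function names are hypothetical; this is not the actual Hypatos migration code.

```python
# Minimal sketch of an idempotent, batched migration (hypothetical names,
# not the actual migration job).
from pymongo import MongoClient

BATCH_SIZE = 500

def transform(doc: dict) -> dict:
    # Placeholder for the real migration logic.
    return doc.get("payload", {})

def run_migration(mongo_uri: str) -> None:
    client = MongoClient(mongo_uri)
    docs = client["studio"]["documents"]  # hypothetical database/collection

    while True:
        # Only select documents that were never migrated: the marker field
        # makes the job idempotent, so re-running it (or looping over the
        # same documents) cannot duplicate data and grow the DB unbounded.
        batch = list(docs.find({"migrated_v2": {"$ne": True}}, limit=BATCH_SIZE))
        if not batch:
            break
        for doc in batch:
            docs.update_one(
                {"_id": doc["_id"]},
                {"$set": {"payload": transform(doc), "migrated_v2": True}},
            )
```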
Fault
The migration was not expected to increase the DB size substantially, even during peak times, and the Atlas auto-scaler was expected to handle DB growth by increasing the disk size accordingly. Neither held: the processing loop grew the data faster than the auto-scaler could provision additional space.
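For context, disk auto-scaling is a per-cluster setting in Atlas. The sketch below shows how its status could be checked through the Atlas Admin API v1.0; the endpoint and the autoScaling.diskGBEnabled field reflect our understanding of that API version and may differ in other versions, and the project ID, cluster name, and API keys are placeholders.

```python
# Hedged sketch: check whether disk auto-scaling is enabled on the cluster
# via the Atlas Admin API v1.0 (endpoint and field names are an assumption
# for that API version; placeholders must be replaced with real values).
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://cloud.mongodb.com/api/atlas/v1.0"
GROUP_ID = "<project-id>"      # placeholder
CLUSTER = "<cluster-name>"     # placeholder
AUTH = HTTPDigestAuth("<public-key>", "<private-key>")  # Atlas API key pair

def disk_autoscaling_enabled() -> bool:
    resp = requests.get(f"{BASE}/groups/{GROUP_ID}/clusters/{CLUSTER}", auth=AUTH)
    resp.raise_for_status()
    return resp.json().get("autoScaling", {}).get("diskGBEnabled", False)
```

Even with this setting enabled, provisioning additional disk takes time, which is why the growth rate mattered more than the setting itself.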
Impact
Studio was down and all of its features were inaccessible to customers and internal users. The Studio API was also down for the entire incident period, so customers were not able to use it. One support ticket was opened by a customer and a second one by the support team. Some customers also complained by email to the customer success team.
Detection
The first to detect the problem were internal users, who reported files stuck on upload in Studio. The Engineering team was contacted directly, and the Support team followed Engineering's analysis and investigation. Our monitoring system alerted on a large backlog of messages in our queueing mechanism, and once the systems went down our StatusCake and Pingdom checks also started sending alerts.
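One way time to detection could have been shortened (a suggestion, not something that was in place) is an early-warning alert on database disk headroom, so the on-call team is paged before the cluster runs out of space. A minimal sketch, assuming pymongo, MongoDB >= 3.6 for the filesystem fields of dbStats, and a placeholder alerting hook:

```python
# Hedged sketch of a disk-headroom early warning (threshold and alerting
# hook are placeholders, not our production monitoring).
from pymongo import MongoClient

HEADROOM_ALERT_RATIO = 0.80  # warn once 80% of the filesystem is used

def send_alert(message: str) -> None:
    print("ALERT:", message)  # stand-in for the real alerting integration

def check_disk_headroom(mongo_uri: str) -> None:
    client = MongoClient(mongo_uri)
    stats = client["admin"].command("dbStats")  # fsUsedSize/fsTotalSize need MongoDB >= 3.6
    used_ratio = stats["fsUsedSize"] / stats["fsTotalSize"]
    if used_ratio >= HEADROOM_ALERT_RATIO:
        send_alert(f"DB filesystem {used_ratio:.0%} full")
```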
Response
12h02 - The first symptoms (documents stuck on upload) were reported by internal teams, who started investigating
14h27 - First customer request arrived
14h42 - First response sent to customers
14h44 - An incident team was assembled on a video call; Engineering and Support worked together to investigate and communicate with customers
15h05 - The API went down for the first time
15h57 - Atlas support was contacted and started investigating further
16h22 - First response from Atlas support, advising that the nodes were healing and that we should let the process finish
Recovery
17h44 - First secondary replica recovered
17h54 - All DB nodes recovered
19h10 - Studio is back up
19h14 - Monitoring phase started
19h20 - Studio API is back up
20h29 - Incident resolved
Five whys root cause identification
Why did Studio and the API go down? Because the MongoDB cluster crashed.
Why did the MongoDB cluster crash? Because all three replicas ran out of space.
Why did the replicas run out of space? Because the Atlas resource manager did not have enough time to scale up and absorb the DB growth.
Why was the DB growth rate higher than the scale-up rate? Because the migration caused a processing loop with huge files.
Why did the migration cause a processing loop? Because the migration had a bug that was triggered during peak load.
Related records
N/A
Lessons learned
Improve tests on QA to make sure these types of bugs are caught before going to production (a sketch of one possible regression test follows below).
Enhance vendor response time on critical issues.
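As an illustration of the first lesson, the sketch below shows the kind of regression test we have in mind: a migration must be idempotent, so running it twice on the same data must not grow the collection. mongomock stands in for a real cluster so the test can run in CI, and migrate() is an illustrative stand-in, not the actual migration.

```python
# Hedged sketch of an idempotency regression test (illustrative names;
# mongomock replaces a real MongoDB cluster so the test can run in CI).
import mongomock

def migrate(docs) -> None:
    # Stand-in for the real migration: tag every untagged document exactly once.
    for doc in list(docs.find({"migrated_v2": {"$ne": True}})):
        docs.update_one({"_id": doc["_id"]}, {"$set": {"migrated_v2": True}})

def test_migration_is_idempotent():
    docs = mongomock.MongoClient()["studio"]["documents"]
    docs.insert_many([{"payload": {"n": i}} for i in range(100)])

    migrate(docs)
    first_count = docs.count_documents({})

    # A processing loop or duplicate writes would change this count.
    migrate(docs)
    assert docs.count_documents({}) == first_count == 100
```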

 

Incident timeline

11h22 - Migration job started

12h02 - The first symptoms (documents stuck on upload) were communicated

14h41 - A fix was deployed to correct the migration bug

15h57 - Atlas support was contacted and started investigating further

16h22 - First response from Atlas support, advising that the nodes were recovering

17h44 - First secondary replica recovered

17h54 - All DB nodes recovered

19h10 - Studio is back up

19h14 - Monitoring phase started

19h20 - Studio API is back up

20h29 - Incident resolved

Follow-up tasks

Issue: Improve QA testing so that this type of bug does not reach production
Owner: Engineering
Action item: Improve tests on QA to make sure this type of bug is caught before deployment to production

Issue: Enhance vendor response time on critical issues
Owner: Engineering and Product
Action item: Review support contracts and TTRs to improve vendor response time
Posted Dec 15, 2021 - 16:56 CET

Resolved
Issue is completely resolved now.
We have tested our products and all pending requests were processed successfully; our solution is stable.
Thank you for your patience.
Posted Dec 02, 2021 - 20:29 CET
Monitoring
Our systems are back up again.
We are monitoring all systems carefully to make sure our products are stable now.
We are very sorry for this inconvenience and we are working to make sure it won't happen again!
Thank you for your patience!
Posted Dec 02, 2021 - 19:14 CET
Update
We are still in the process of bringing all of our systems back up.
We expect to have an update in 30 minutes.
Thank you for your patience.
Posted Dec 02, 2021 - 19:02 CET
Update
Our DB services are now up and running. We are bringing the remaining systems back up.
We expect to have an update in 15 minutes.
Thank you for your patience.
Posted Dec 02, 2021 - 18:05 CET
Update
We are still bringing our database back up and we expect to have our systems up as soon as possible.
Sorry for the inconvenience, next update will be in 30 minutes.
Posted Dec 02, 2021 - 17:34 CET
Update
We are still bringing our database back up and we expect to have our systems up as soon as possible.
Sorry for the inconvenience, next update will be in 30 minutes.
Posted Dec 02, 2021 - 16:51 CET
Update
We are still bringing our database back up and we expect to have our systems up as soon as possible.
Sorry for the inconvenience, next update will be in 30 minutes.
Posted Dec 02, 2021 - 16:15 CET
Identified
We are now bringing our database back up and we expect to have our systems up as soon as possible.
We will update you in 15 minutes.
Posted Dec 02, 2021 - 15:47 CET
Update
We are now bringing our database back up and we expect to have our systems up as soon as possible.
We will update you in 15 minutes.
Posted Dec 02, 2021 - 15:41 CET
Investigating
We have been experiencing issues in Hypatos Studio and the Studio API since 14:20 CET.
Our Engineering team is investigating and working on a solution.
Sorry for the inconvenience, we will update you in 15 minutes.
Posted Dec 02, 2021 - 15:22 CET