Daily DevOps Interview Questions Day #3
Daily Interview Questions for SRE and DevOps engineers
Question for today
Scenario: Let's say you are a SaaS business, and users are reporting that their account dashboard is loading very slowly. How would you troubleshoot something like this? What would be the steps? Assume in this case you do not know anything about the infrastructure. What questions would you ask other engineers or end users?
Answer #1
Trust but verify: I would never blindly believe an end user. Make sure you can replicate the problem on your end.
Once that is done, I would ask questions such as: Is the affected service talking to any other services, or is it a service on its own? How is it being monitored? What metrics are we monitoring?
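To make the "replicate it on your end" step concrete, here is a minimal sketch that times an action several times and summarizes the latency. The function name `measure_latency` is made up for illustration; in practice the action would be whatever reproduces the slow dashboard load (for example, an HTTP request to the dashboard endpoint).

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(action: Callable[[], object], attempts: int = 10) -> Dict[str, float]:
    """Call `action` repeatedly and summarize the wall-clock time per call in ms."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        action()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }
```

Passing something like `lambda: urllib.request.urlopen(dashboard_url, timeout=30).read()` as the action (with `dashboard_url` being the real endpoint, which is not given here) would tell you whether the slowness is reproducible and roughly how bad it is, before you start asking about the infrastructure.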
Scenario: continued
Let's say the service handling the account dashboard runs on an EC2 server and talks to an RDS instance; let's use PostgreSQL. The CPU graph for the EC2 instance shows:
What would be your next steps?
Answer #2
In this scenario where the CPU is bogged down, there is enough room to ask more questions. You learned that there is an RDS instance; how does the monitoring look there? What kind of metrics would help you out in this situation? Are there any intensive processing tasks? What do the logs say?
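If nobody has RDS dashboards handy, the metrics can be pulled straight from CloudWatch. The sketch below assumes a boto3 CloudWatch client is passed in (for example `boto3.client("cloudwatch")`) and that the instance identifier is known; the function name and the choice of these four metrics are my own, not something given in the scenario.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List

# Standard CloudWatch metric names in the AWS/RDS namespace.
RDS_METRICS = ["CPUUtilization", "DatabaseConnections", "ReadLatency", "WriteLatency"]


def fetch_rds_averages(cloudwatch: Any, db_instance_id: str, minutes: int = 60) -> Dict[str, List[float]]:
    """Return 5-minute averages per metric, oldest first, for the last `minutes`."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    results: Dict[str, List[float]] = {}
    for metric in RDS_METRICS:
        response = cloudwatch.get_metric_statistics(
            Namespace="AWS/RDS",
            MetricName=metric,
            Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
            StartTime=start,
            EndTime=end,
            Period=300,
            Statistics=["Average"],
        )
        # CloudWatch does not guarantee datapoint order, so sort by timestamp.
        points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
        results[metric] = [p["Average"] for p in points]
    return results
```

Comparing these series against the EC2 CPU graph over the same window is a quick way to see whether the database is struggling at the same time as the application server.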
Scenario: continued
Now you tail the application logs and see a few entries like the following:
2024-03-25 14:30:12 [WARN] Slow query detected: SELECT * FROM orders WHERE user_id = 12345; Execution time: 1500 ms
What does this mean and what would you do next?
Answer #3
The slow query could be a smoking gun, but it could also be a red herring: other applications may log the same slow query without any impact on their performance.
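One way to judge how much weight the slow query deserves is to count how often it shows up and how slow it gets. Here is a rough sketch that assumes every slow-query entry follows the exact format of the log line above; the function name is hypothetical, and numeric literals are replaced with `?` so the same query for different users is grouped together.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Matches entries shaped like the sample log line from the scenario.
SLOW_QUERY = re.compile(
    r"\[WARN\] Slow query detected: (?P<query>.+?); Execution time: (?P<ms>\d+) ms"
)


def summarize_slow_queries(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Group slow-query log entries by query shape and summarize their timings."""
    durations: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        match = SLOW_QUERY.search(line)
        if match is None:
            continue  # not a slow-query entry
        # Replace numeric literals so "user_id = 12345" and "user_id = 9" group together.
        shape = re.sub(r"\b\d+\b", "?", match.group("query"))
        durations[shape].append(int(match.group("ms")))
    return {
        shape: {"count": len(ms), "max_ms": max(ms), "avg_ms": sum(ms) / len(ms)}
        for shape, ms in durations.items()
    }
```

If one query shape dominates the counts and lines up with the time users started complaining, it is a much stronger lead than a handful of scattered warnings.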
Ask to see more monitoring graphs related to RDS
In certain situations where this could be impacting multiple customers, immediate action could be taken to locate the slow queries and terminate them just as a temporary measure to restore normal operations.
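For the "locate and terminate" stopgap, PostgreSQL exposes running queries in the `pg_stat_activity` view and lets you end a backend with `pg_terminate_backend`. The sketch below assumes you already have a DB-API cursor (for example from psycopg2) with sufficient privileges; the helper names and the 5-second default threshold are illustrative, not a recommendation.

```python
from typing import Any, List, Tuple

# Active queries running longer than a threshold, longest first.
FIND_LONG_RUNNING = """
SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND now() - query_start > %s * interval '1 second'
ORDER BY runtime DESC
"""


def find_long_running(cursor: Any, min_seconds: float = 5.0) -> List[Tuple]:
    """Return (pid, runtime, query) rows for queries active longer than min_seconds."""
    cursor.execute(FIND_LONG_RUNNING, (min_seconds,))
    return cursor.fetchall()


def terminate(cursor: Any, pid: int) -> bool:
    """Terminate one backend by pid; returns True if the signal was sent."""
    # pg_cancel_backend(pid) is a gentler option that cancels only the current query.
    cursor.execute("SELECT pg_terminate_backend(%s)", (pid,))
    return bool(cursor.fetchone()[0])
```

I would review the output of `find_long_running` by hand before terminating anything, since killing the wrong session on a mission-critical database can make the incident worse.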
If you are not the application owner, it is important to inform the development team of the issue and work together to produce a fix.
There are other things to consider as well. Is this due to a recent release in which database migrations occurred? Were new database schemas introduced that had design inefficiencies?
Once again, there are multiple approaches to take. There's no wrong answer, really; it's about seeing how the candidate thinks and whether they approach a problem thoroughly, considering the variety of situations that can come up in a mission-critical system.
A stopgap can also be to upsize the EC2 or RDS instance and see if that helps while the app team gets the app working properly.