Observability and Troubleshooting
Notes from the book DevOps for the Desperate by Bradley Smith
Observability: Any application should be observable, meaning you can tell what it is doing internally by analyzing its outputs, such as:
- Metrics: Data measured over time that describes an application’s health and performance.
- Traces: Track a request through different services.
- Logs: A historical audit trail of events.
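As a rough illustration of how the three outputs differ, the sketch below has one hypothetical request handler emit a metric sample, a trace span, and a log entry. The function name, the field names, and the `frontend` service label are all assumptions made for this example, not something the book prescribes.

```python
import time
import uuid


def handle_request(path, trace_id=None):
    """Handle one hypothetical request and return the three signals it emits."""
    # The same trace ID would follow the request across every service it touches.
    trace_id = trace_id or uuid.uuid4().hex
    start = time.monotonic()
    # ... the real work of the request would happen here ...
    duration = time.monotonic() - start

    # Metric: a numeric sample that becomes one point in a series over time.
    metric = {"name": "request_duration_seconds", "value": duration, "timestamp": time.time()}
    # Trace span: this service's part of the request's journey.
    span = {"trace_id": trace_id, "service": "frontend", "operation": path, "duration": duration}
    # Log: a timestamped record of the event for the audit trail.
    log = {"timestamp": time.time(), "level": "INFO", "message": f"handled {path}", "trace_id": trace_id}
    return metric, span, log
```

Carrying the trace ID in the log entry as well is a design choice of this sketch: it lets you jump from a log line to the matching trace.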
Monitoring: Entails recording, analyzing, and alerting on predefined metrics to understand the current state of a system.
An observable system should answer two main questions: “What?” and “Why?” The “What?” describes the symptom. The “Why?” asks for the reasons behind the symptom.
Prometheus, Alertmanager, and Grafana
Prometheus: A metric collection application that stores metric data and lets you query it.
Alertmanager: Takes alerts from Prometheus and decides where to route them based on configurable criteria.
Grafana: Provides an easy-to-use interface to create and view dashboards and graphs from the data Prometheus provides.
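To make the metric data more concrete: Prometheus collects metrics by scraping a plain-text format over HTTP. The sketch below renders one metric in that text format using only the standard library. The `render_metric` helper and the sample metric name are illustrative assumptions, and the sketch skips label-value escaping, so a real service would normally use an official client library instead.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Simplification: label values are not escaped.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between calls.
            label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric(
        "http_requests_total",           # hypothetical metric name
        "Total HTTP requests.",
        "counter",
        [({"method": "get", "code": "200"}, 1027)],
    ), end="")
```

Grafana would then graph the stored series, and an alert rule on the same data is what Alertmanager would route.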
Tips for troubleshooting a problem
- Start simple. Be methodical. The problem is usually human error.
- Build a mental model. Understanding what the system’s role is and how it interacts with other systems will help you troubleshoot faster.
- Take time developing a theory. It’s always worth checking to see if the breadcrumb trail leads any farther.
- Have consistent tools across hosts.
- Keep a journal. Keep a high-level account of problems, symptoms, and fixes so you don’t forget important details about an issue.
- Know when to ask for help.
Linux commands that might help
uptime - Displays how long a host has been running, the number of logged-in users, and the system load.
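The load figures that uptime prints are easier to judge relative to the number of CPUs. The following is a minimal sketch, assuming a Unix-like host where Python's `os.getloadavg()` is available. The `load_per_cpu` helper is a name invented for this example.

```python
import os


def load_per_cpu(load_averages, cpu_count):
    """Divide each load average by the CPU count so hosts of different sizes compare fairly."""
    return tuple(round(load / cpu_count, 2) for load in load_averages)


if __name__ == "__main__":
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    try:
        averages = os.getloadavg()  # 1-, 5-, and 15-minute load averages
    except OSError:
        print("load average is not available on this platform")
    else:
        one, five, fifteen = load_per_cpu(averages, cpus)
        print(f"load per CPU: 1m={one} 5m={five} 15m={fifteen} ({cpus} CPUs)")
```

As a common rule of thumb, a per-CPU value that stays well above 1.0 hints that work is queuing, which is a cue to look closer with the tools below.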
top - Displays information about the system and the processes running on the host.
Tools to discover more about a process’s interaction with the system:
- vmstat
- strace
- lsof
free - Quick sanity check on system memory by displaying used and available memory at the time it is run. The -h flag shows all output in human-readable format, and the -m flag shows it in mebibytes.
vmstat - Reports useful information about processes, memory, I/O, disks, and CPU activity.
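On Linux, free takes its numbers from `/proc/meminfo`, so the same sanity check can be scripted. This is a sketch under that Linux-only assumption. The helper names are made up for the example, and only the `MemTotal` and `MemAvailable` fields are used.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most fields are reported in kB (kibibytes); a few, such as the
    HugePages counters, are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def to_mebibytes(kib):
    """Convert kibibytes to whole mebibytes, similar in spirit to free -m."""
    return kib // 1024


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            info = parse_meminfo(handle.read())
    except OSError:
        print("/proc/meminfo is not available on this platform")
    else:
        print(f"total:     {to_mebibytes(info['MemTotal'])} MiB")
        print(f"available: {to_mebibytes(info['MemAvailable'])} MiB")
```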
ps - If the memory usage is high on the host, you’ll want to check all the running processes to find where the memory is being used.
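One way to do that check is to list every process with its resident memory and sort the result. The sketch below assumes a `ps` that accepts `-eo pid,rss,comm` and reports RSS in kibibytes, as the common Linux `ps` does. The `top_memory_processes` helper is a name invented for this example.

```python
import subprocess


def top_memory_processes(ps_output, limit=5):
    """Return the (pid, rss_kib, command) rows with the largest RSS from ps output."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)  # the command itself may contain spaces
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["ps", "-eo", "pid,rss,comm"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        print("could not run ps on this host")
    else:
        for pid, rss_kib, command in top_memory_processes(result.stdout):
            print(f"{pid:>8} {rss_kib // 1024:>8} MiB  {command}")
```

Sorting in Python rather than with a `ps` sort option is a deliberate choice here, since sort flags differ between `ps` implementations.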