Host Memory and CPU Troubleshooting: A Practical Playbook
Context
When a Linux host is “slow,” the hardest part is not fixing the problem — it’s figuring out where to look first.
This guide is a host-level troubleshooting playbook for:
- high CPU usage
- memory pressure
- load average confusion
- performance complaints without clear errors
It’s written for engineers who need ground truth from the OS, whether the host runs bare metal, VMs, or Kubernetes nodes.
Step 0: Make Sure You’re Collecting Data
Many useful diagnostics rely on historical metrics. If sysstat isn’t running, you’re blind to the past.
Install and enable sysstat
systemctl enable --now sysstat
systemctl status sysstat
Enable data collection
Edit:
sudo vim /etc/default/sysstat
Ensure:
ENABLED="true"
Verify cron configuration
sudo cat /etc/cron.d/sysstat
If sysstat isn’t collecting, tools like sar won’t help you retroactively.
Step 1: Understand Uptime and Load
Check uptime and load averages
uptime
This shows:
- how long the system has been running
- load averages over 1, 5, and 15 minutes
Variants worth knowing
uptime -s # system start time
uptime -p # human-readable uptime
cat /proc/uptime
Why load averages matter
Load average is not CPU usage.
It represents:
- runnable processes
- processes waiting on CPU or I/O
High load with low CPU usage often means:
- I/O contention
- memory pressure
- blocked processes
Step 2: Know Your CPU Topology
Before interpreting CPU metrics, know what “100%” actually means.
List CPU details
lscpu
Just the CPU count
lscpu | grep '^CPU(s)'
This matters on:
- multi-core systems
- hyperthreaded CPUs
- virtual machines with vCPUs
Step 3: Diagnose CPU Pressure
Real-time CPU usage
mpstat
Sample every second for a minute
mpstat 1 60
Inspect a single CPU
mpstat -P 0 1 60
Historical CPU usage
sar -u
sar -u -P 0
Identify CPU hogs
top
Look for:
- sustained high
%CPU - many runnable processes
- uneven CPU utilization
Step 4: Investigate Memory and Swap
Quick overview
free -h
Kernel memory details
cat /proc/meminfo
Historical memory usage
sar -r
Memory-heavy processes
In top:
- press
f(fields) - move to
MEM - press
s(select) qto quit
High memory pressure often manifests as:
- swap activity
- CPU spikes (due to reclaim)
- latency under load
Understanding /proc (Why This Works)
The /proc filesystem is a pseudo-filesystem exposing kernel data structures.
It’s authoritative.
If monitoring tools disagree, /proc usually wins.
Learn more:
man procfs
Common Failure Patterns
| Symptom | Likely Cause | What to Check |
|---|---|---|
| High load, low CPU | I/O wait | mpstat, sar -u |
| CPU spikes under memory pressure | Reclaim activity | free, /proc/meminfo |
| One core pegged | Single-threaded workload | mpstat -P |
| “Random” slowness | Historical saturation | sar |
Platform & Virtualization Notes
- Kubernetes nodes often hide host pressure until pods fail
- VM CPU steal time can look like “slow hardware”
- Memory overcommitment amplifies reclaim costs
Host-level visibility is still essential — even in abstracted platforms.
Takeaways
- Always enable historical metrics before you need them
- Load averages require context
- CPU and memory issues are often intertwined
/procprovides ground truth- Host-level diagnostics still matter in modern platforms
When performance feels vague or intermittent, start at the host — it usually tells the truth faster than higher layers.