Host Memory and CPU Troubleshooting: A Practical Playbook

Context

When a Linux host is “slow,” the hardest part is not fixing the problem — it’s figuring out where to look first.

This guide is a host-level troubleshooting playbook for:

high CPU usage
memory pressure
load average confusion
performance complaints without clear errors

It’s written for engineers who need ground truth from the OS, whether the host is bare metal, a VM, or a Kubernetes node.

Step 0: Make Sure You’re Collecting Data

Many useful diagnostics rely on historical metrics. If sysstat isn’t running, you’re blind to the past.

Install and enable sysstat

sudo apt-get install sysstat          # Debian/Ubuntu; use your distro's package manager elsewhere
sudo systemctl enable --now sysstat
systemctl status sysstat

Enable data collection

Edit:

sudo vim /etc/default/sysstat

Ensure:

ENABLED="true"

Verify cron configuration

sudo cat /etc/cron.d/sysstat

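On systemd-based distributions, collection may be driven by the sysstat-collect.timer unit rather than cron; systemctl list-timers shows whether it is active.

If sysstat isn’t collecting, tools like sar can’t show you anything from before you turned it on.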
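The ENABLED check above can also be scripted, which is handy when auditing several hosts. This is a minimal sketch that assumes the Debian/Ubuntu file layout shown earlier; `sysstat_enabled` is a hypothetical helper name, not something sysstat provides.

```shell
# Hypothetical helper (not part of sysstat): succeeds when the given
# defaults file contains ENABLED="true" (quotes optional).
sysstat_enabled() {
    grep -Eq '^[[:space:]]*ENABLED="?true"?[[:space:]]*$' "$1" 2>/dev/null
}

# Assumes the Debian/Ubuntu path used above; other distros differ.
if sysstat_enabled /etc/default/sysstat; then
    echo "sysstat collection is enabled"
else
    echo "sysstat collection is disabled or the file is missing"
fi
```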

Step 1: Understand Uptime and Load

Check uptime and load averages

uptime

This shows:

how long the system has been running
load averages over 1, 5, and 15 minutes

Variants worth knowing

uptime -s   # system start time
uptime -p   # human-readable uptime
cat /proc/uptime

Why load averages matter

Load average is not CPU usage.

On Linux, it counts:

processes running or waiting for a CPU (runnable)
processes in uninterruptible sleep (state D), usually waiting on I/O

High load with low CPU usage often means:

I/O contention
memory pressure
blocked processes
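Linux counts tasks in uninterruptible sleep (state D, usually waiting on I/O) toward the load average, so counting them is a quick way to test the "blocked processes" theory. The sketch below reads the State line from each /proc/<pid>/status file; `count_blocked` is a hypothetical helper name, and it looks at processes only, not at individual threads.

```shell
# Hypothetical helper: reads "State:" lines (as found in /proc/<pid>/status)
# on stdin and prints how many are in uninterruptible sleep (D).
count_blocked() {
    awk '$1 == "State:" && $2 == "D" { n++ } END { print n + 0 }'
}

# Live check: collect the State line of every process. Errors from
# processes that exit mid-scan are discarded.
grep -h '^State:' /proc/[0-9]*/status 2>/dev/null | count_blocked
```

A count that stays above zero across several runs while CPU usage is low points toward I/O or memory stalls rather than CPU contention.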

Step 2: Know Your CPU Topology

Before interpreting CPU metrics, know what “100%” actually means.

List CPU details

lscpu

Just the CPU count

lscpu | grep '^CPU(s)'

This matters on:

multi-core systems
hyperthreaded CPUs
virtual machines with vCPUs
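One way to put the CPU count to work is to normalize the load average by it: a load of 8 means something very different on 4 CPUs than on 64. The following is a rough sketch; `load_per_cpu` is a hypothetical helper name, and it assumes `nproc` (from GNU coreutils) is installed.

```shell
# Hypothetical helper: prints LOAD divided by NCPU, to two decimals.
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

# The first field of /proc/loadavg is the 1-minute load average;
# nproc prints the number of CPUs available to this process.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one_min rest < /proc/loadavg
    echo "1-minute load per CPU: $(load_per_cpu "$one_min" "$(nproc)")"
fi
```

As a rule of thumb, a sustained value well above 1.00 suggests more work is queued than the CPUs can serve, though tasks blocked on I/O inflate the number too.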

Step 3: Diagnose CPU Pressure

Real-time CPU usage

mpstat   # with no interval, this reports averages since boot; add an interval for live data

Sample every second for a minute

mpstat 1 60

Inspect a single CPU

mpstat -P 0 1 60

Historical CPU usage

sar -u
sar -u -P 0

Identify CPU hogs

top

Look for:

sustained high %CPU
many runnable processes
uneven CPU utilization
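If sysstat is not installed, the same picture can be approximated from the raw counters in /proc/stat. This sketch compares two samples of the aggregate "cpu" line and reports the busy share of the interval. `cpu_busy_pct` is a hypothetical helper name, and treating iowait as idle time is an assumption of this example, not a rule.

```shell
# Hypothetical helper: busy percentage between two aggregate "cpu" lines
# taken from /proc/stat. Fields after the label are:
# user nice system idle iowait irq softirq steal ...
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user through steal
            idle = $5 + $6                          # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle }
            else         { dt = total - t0; di = idle - i0 }
        }
        END { if (dt > 0) printf "%.1f\n", 100 * (dt - di) / dt; else print "0.0" }'
}

# Live check: two samples one second apart.
if [ -r /proc/stat ]; then
    first=$(head -n 1 /proc/stat)
    sleep 1
    second=$(head -n 1 /proc/stat)
    echo "CPU busy over the last second: $(cpu_busy_pct "$first" "$second")%"
fi
```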

Step 4: Investigate Memory and Swap

Quick overview

free -h

Kernel memory details

cat /proc/meminfo
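Of the many fields in /proc/meminfo, MemAvailable is often the most useful single number, since it estimates how much memory can be handed to new workloads without swapping. A small sketch that turns it into a percentage follows; `mem_available_pct` is a hypothetical helper name, and the MemAvailable field requires Linux 3.14 or later.

```shell
# Hypothetical helper: reads /proc/meminfo-formatted text on stdin and
# prints MemAvailable as a percentage of MemTotal.
mem_available_pct() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END { if (total > 0) printf "%.1f\n", 100 * avail / total }'
}

if [ -r /proc/meminfo ]; then
    echo "Memory available: $(mem_available_pct < /proc/meminfo)%"
fi
```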

Historical memory usage

sar -r

Memory-heavy processes

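In top:

press f (fields management)
move the cursor to %MEM
press s (set as sort field)
press q to return to the process list

Pressing M (Shift+m) in the main view sorts by memory directly.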
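For scripts or incident notes, a non-interactive equivalent can be more convenient than driving top by hand. The sketch below ranks processes by resident set size using ps output; `top_rss` is a hypothetical helper name, and the ps options assume the procps implementation found on most Linux distributions.

```shell
# Hypothetical helper: reads "PID RSS_KB COMMAND" lines on stdin and
# prints the N largest by resident set size (default 5).
top_rss() {
    sort -k2,2nr | head -n "${1:-5}"
}

# An empty "=" header on every column suppresses the header line.
if command -v ps >/dev/null 2>&1; then
    ps -e -o pid= -o rss= -o comm= 2>/dev/null | top_rss 5
fi
```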

High memory pressure often manifests as:

swap activity
CPU spikes (due to reclaim)
latency under load
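Swap activity in particular is easy to confirm from /proc/vmstat, where pswpin and pswpout count pages swapped in and out since boot. What matters is whether those counters are moving, so this sketch takes two samples and prints the difference. `swap_counters` is a hypothetical helper name, and the one-second interval is an arbitrary choice.

```shell
# Hypothetical helper: reads /proc/vmstat-formatted text on stdin and
# prints the cumulative pages swapped in and out since boot.
swap_counters() {
    awk '$1 == "pswpin"  { pin = $2 }
         $1 == "pswpout" { pout = $2 }
         END { print pin + 0, pout + 0 }'
}

# Live check: difference between two samples one second apart.
if [ -r /proc/vmstat ]; then
    before=$(swap_counters < /proc/vmstat)
    sleep 1
    after=$(swap_counters < /proc/vmstat)
    printf '%s\n%s\n' "$before" "$after" |
        awk 'NR == 1 { i = $1; o = $2 }
             NR == 2 { print "pages swapped in:", $1 - i, "out:", $2 - o }'
fi
```

Non-zero deltas that persist across samples indicate active swapping, not just leftover swap usage from an earlier spike.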

Understanding `/proc` (Why This Works)

The /proc filesystem is a pseudo-filesystem exposing kernel data structures.

It’s authoritative: tools such as free, top, and mpstat read their numbers from it.

If monitoring tools disagree, /proc usually wins.

Learn more:

man procfs

Common Failure Patterns

| Symptom | Likely Cause | What to Check |
| --- | --- | --- |
| High load, low CPU | I/O wait | `mpstat`, `sar -u` |
| CPU spikes under memory pressure | Reclaim activity | `free`, `/proc/meminfo` |
| One core pegged | Single-threaded workload | `mpstat -P` |
| “Random” slowness | Historical saturation | `sar` |

Platform & Virtualization Notes

Kubernetes nodes often hide host pressure until pods fail
VM CPU steal time can look like “slow hardware”
Memory overcommitment amplifies reclaim costs

Host-level visibility is still essential — even in abstracted platforms.
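Steal time can be checked from inside a VM without extra tooling, because the eighth counter on the aggregate "cpu" line of /proc/stat records time the hypervisor gave to other guests. The following sketch reports steal as a share of one interval; `cpu_steal_pct` is a hypothetical helper name, and what counts as "too much" steal depends on the workload.

```shell
# Hypothetical helper: steal percentage between two aggregate "cpu" lines
# taken from /proc/stat. Fields after the label are:
# user nice system idle iowait irq softirq steal ...
cpu_steal_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user through steal
            if (NR == 1) { t0 = total; s0 = $9 }
            else         { dt = total - t0; ds = $9 - s0 }
        }
        END { if (dt > 0) printf "%.1f\n", 100 * ds / dt; else print "0.0" }'
}

# Live check: two samples one second apart.
if [ -r /proc/stat ]; then
    first=$(head -n 1 /proc/stat)
    sleep 1
    second=$(head -n 1 /proc/stat)
    echo "CPU steal over the last second: $(cpu_steal_pct "$first" "$second")%"
fi
```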

Takeaways

Always enable historical metrics before you need them
Load averages require context
CPU and memory issues are often intertwined
/proc provides ground truth
Host-level diagnostics still matter in modern platforms

When performance feels vague or intermittent, start at the host — it usually tells the truth faster than higher layers.
