Context

Network issues are rarely binary.

Most of the time, the network is:

  • partially working
  • working for some clients but not others
  • fast one moment and slow the next

This makes network troubleshooting feel chaotic. The cure is structure.

This post lays out a layered, repeatable approach to diagnosing network problems without guessing.


The Core Principle: Eliminate Layers

Effective troubleshooting is about answering one question at a time:

What is the highest layer I can confidently rule out?

Each step narrows the failure domain until the problem becomes obvious—or at least localized.


Step 1: Is the Host Reachable?

Start with the simplest possible test.

ping <destination>

What this tells you:

  • basic IP connectivity exists
  • routing works round-trip (a reply made it back)
  • ICMP is not blocked

What it does not tell you:

  • application reachability
  • TCP/UDP health
  • latency under load

If ping fails, don’t go higher. Keep in mind, though, that ICMP is often filtered, so confirm with a TCP-level test (Step 3) before concluding the host is down.
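
For a concrete starting point, bound the test so it terminates and prints summary statistics (the destination is illustrative):

# Send four echo requests and stop (Linux/macOS; Windows uses -n 4)
ping -c 4 app.example.com

Read the summary, not just the replies: the loss percentage and the min/avg/max round-trip times are the signal.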


Step 2: Is Name Resolution Working?

Many “network” problems are actually DNS problems.

nslookup <hostname>
dig <hostname>

Verify:

  • the hostname resolves
  • it resolves to the expected IP
  • the result is consistent across hosts

If DNS is broken, everything above it lies.
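
A quick consistency check, using a hypothetical hostname (1.1.1.1 stands in for any second resolver):

# Resolve via the system's configured resolver
dig +short app.example.com

# Resolve via an explicit external resolver and compare the answers
dig +short app.example.com @1.1.1.1

Differing answers point at stale caches, split-horizon DNS, or a misconfigured internal zone.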


Step 3: Can You Reach the Port?

ICMP working does not mean services are reachable.

nc -vz <host> <port>

Or with curl:

curl -v http://<host>:<port>

This validates:

  • TCP connectivity
  • firewall rules
  • service listening state

If the port is unreachable, application debugging is premature.
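
How the failure manifests is itself a clue. A sketch with a bounded timeout (hostname and port are illustrative; exact wording varies by netcat variant):

# -z: probe without sending data, -v: verbose, -w 5: five-second timeout
nc -vz -w 5 app.example.com 443

# "Connection refused"  -> host reachable, but nothing listening on the port
# Hang until timeout    -> packets likely dropped by a firewall on the path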


Step 4: Inspect the Local Network State

Look at the local interface and routing table.

ip addr
ip route

Check for:

  • correct IP assignment
  • expected default route
  • multiple routes competing unexpectedly

Misrouting often looks like “random” failures.
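
To see which route the kernel would actually choose for a given destination (addresses below are documentation examples):

# Resolve the route for one specific destination
ip route get 203.0.113.10

# Typical output: next hop, egress interface, and source address
# 203.0.113.10 via 192.168.1.1 dev eth0 src 192.168.1.50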


Step 5: Identify Latency or Loss

When things are slow but not broken:

traceroute <destination>
mtr <destination>

These tools help surface:

  • where latency increases
  • where packet loss begins
  • whether the issue is local or upstream

Remember: packet loss at one hop does not always mean failure at that hop—but trends matter.
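
mtr’s report mode shows both trends in one table (destination illustrative):

# 100 probes per hop, printed as a one-shot report instead of the live UI
mtr --report --report-cycles 100 app.example.com

Loss that starts at one hop and persists through every hop after it is real; loss confined to a single middle hop is usually just that router deprioritizing ICMP.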


Step 6: Validate the Service Itself

If the network path is healthy, verify the application endpoint.

  • Is the service running?
  • Is it bound to the correct interface?
  • Is it overloaded?

Many “network outages” are healthy networks exposing failing services.
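
On Linux, ss answers the first two questions directly (the port is a placeholder):

# Listening TCP sockets, numeric, with the owning process (root needed for -p)
ss -tlnp | grep ':8080'

# A service bound to 127.0.0.1:8080 accepts only local connections;
# to be reachable remotely it must bind 0.0.0.0 or a routable address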


Common Failure Patterns

Symptom                      Likely Cause
Ping works, app fails        Port blocked or service down
Works by IP, not hostname    DNS issue
Intermittent slowness        Congestion or shared I/O
Works from some hosts        Routing or policy asymmetry
Random timeouts              Packet loss or MTU mismatch

Patterns save time.
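
The MTU case is easy to test directly with Linux ping (host illustrative): set the don’t-fragment bit and send a payload sized for a full 1500-byte frame.

# 1472-byte payload + 8-byte ICMP header + 20-byte IP header = 1500 bytes
ping -c 3 -M do -s 1472 app.example.com

# "Message too long" or silent drops mean a smaller MTU somewhere on the
# path; shrink -s until it succeeds to find the actual path MTU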


Virtualized and Platform Environments

In VMs and containers, add more layers:

  • virtual switches
  • overlay networks
  • policy engines
  • NAT and port mapping

Always ask:

Is this failure inside the guest, on the host, or in the fabric?

Troubleshooting stops being linear once virtualization is involved.
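
A practical pattern is to run the same test from each layer and watch where the answer changes. A sketch using Docker (container name hypothetical, and the image must actually include the tool):

# From the host
ping -c 2 app.example.com

# From inside the container's network namespace
docker exec my-app ping -c 2 app.example.com

If the host succeeds and the guest fails, the problem lives in the virtual layer between them, not on the physical network.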


What Not to Do

  • Don’t jump straight to packet captures
  • Don’t assume “the network is fine”
  • Don’t debug applications before validating connectivity
  • Don’t change things before you understand the failure

Structure beats heroics.


Takeaways

  • Network troubleshooting is about elimination, not intuition
  • DNS failures masquerade as everything else
  • Validate reachability before services
  • Latency and loss require different tools than outages
  • Virtualization adds layers—be explicit about where you’re looking

A calm, layered approach turns “the network is broken” into a solvable problem.