Network Troubleshooting Methodology

1. Why Methodology Matters

When a network problem occurs, the instinct is to act immediately — to try something, anything, that might fix it. This reactive, unstructured approach — sometimes called "random troubleshooting" — leads to wasted time, additional problems, and the same issue recurring because the root cause was never identified.

A systematic troubleshooting methodology provides a structured process: gather symptoms, isolate the problem to a specific layer or device, test a hypothesis, implement a fix, and verify the resolution. Every experienced network engineer develops a methodology — the approaches described in this page are the formal frameworks that underpin those instincts. They are tested in the CCNA exam and applied daily in production networks.

  Random Troubleshooting                                  | Systematic Troubleshooting
  --------------------------------------------------------+--------------------------------------------------
  Try random fixes until something works                  | Gather information, form a hypothesis, test it
  May introduce new problems (rebooting a working device) | Changes are deliberate and targeted — minimal risk
  Root cause often never identified                       | Root cause identified — prevents recurrence
  Undocumented — knowledge lost after the incident        | Documented — builds institutional knowledge
  Escalation is chaotic — no clear starting point         | Clear handoff — documented state of investigation

Related pages: Troubleshooting Connectivity | OSI Model | TCP/IP Model | ping Command | traceroute Command | Debug Commands | show interfaces | show ip route | Wireshark | End-to-End Troubleshooting Scenario Lab

2. The Structured Troubleshooting Process

Regardless of which specific approach (top-down, bottom-up, etc.) is used, all systematic troubleshooting follows the same underlying process. This process is based on the scientific method applied to network engineering.

  Structured troubleshooting process — universal framework:

  Step 1: DEFINE THE PROBLEM
  ─────────────────────────────────────────────────────────────────────
  - What exactly is not working? (specific symptom, not vague report)
  - Who is affected? (one user, one site, everyone?)
  - When did it start? (time of first occurrence)
  - Has anything changed recently? (new config, maintenance, hardware swap)
  - Is it intermittent or constant?
  - Gather output: ping results, error messages, show command output

  Step 2: GATHER INFORMATION
  ─────────────────────────────────────────────────────────────────────
  - Review relevant show commands on affected devices
  - Check syslog for error messages around the time of the problem
  - Check SNMP alerts / NMS dashboard
  - Review recent change log — what was changed and when?
  - Reproduce the problem if possible (proves it is real and consistent)

  Step 3: ANALYSE THE INFORMATION
  ─────────────────────────────────────────────────────────────────────
  - What is the baseline? (what should the output look like?)
  - What is abnormal in the collected output?
  - Which OSI layer is the problem most likely at?
  - What devices are in the path between source and destination?

  Step 4: FORM A HYPOTHESIS
  ─────────────────────────────────────────────────────────────────────
  - State a specific, testable theory: "The problem is a missing route
    on R2 because show ip route on R2 does not show 10.20.0.0/24"
  - Rank hypotheses by likelihood — test the most probable first

  Step 5: TEST THE HYPOTHESIS
  ─────────────────────────────────────────────────────────────────────
  - Design a test that either confirms or eliminates the hypothesis
  - Test one variable at a time — changing multiple things simultaneously
    makes it impossible to know which change fixed the problem

  Step 6: IMPLEMENT THE SOLUTION
  ─────────────────────────────────────────────────────────────────────
  - Apply the fix (add the missing route, correct the misconfiguration)
  - Have a rollback plan ready before making changes

  Step 7: VERIFY AND DOCUMENT
  ─────────────────────────────────────────────────────────────────────
  - Confirm the problem is fully resolved (end-to-end verification)
  - Check that no new problems were introduced
  - Document: root cause, fix applied, verification steps, prevention measures

The change-log question is critical: "What changed recently?" is the single most powerful question in network troubleshooting. The vast majority of network problems are caused by recent changes — a misconfigured command, a wrong interface, an unintended side effect. If a recent change exists, examine it first before pursuing a full systematic investigation.
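The hypothesis steps (4 to 6) can be sketched in code. This is a minimal illustrative Python sketch: the function name, the example hypotheses, and the simulated test results are invented for the illustration, not part of a real diagnostic engine.

```python
# Illustrative sketch of Steps 4-6: rank hypotheses by likelihood, test them
# one at a time, and stop at the first confirmed root cause.
def find_root_cause(hypotheses, run_test):
    """hypotheses: list of (description, likelihood) pairs.
    run_test(description) -> True if the test confirms that hypothesis."""
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    for description, _likelihood in ranked:   # most probable first
        if run_test(description):             # one variable at a time
            return description                # confirmed: implement the fix
    return None                               # nothing confirmed: gather more data

# Hypothetical example data for the sketch:
hypotheses = [
    ("faulty cable on SW1", 0.1),
    ("missing route on R2", 0.7),
    ("ACL blocking ICMP on R1", 0.2),
]
simulated = {"missing route on R2": True}     # pretend this test confirms
cause = find_root_cause(hypotheses, lambda h: simulated.get(h, False))
```

Testing the most probable hypothesis first is what keeps the process fast; testing one variable at a time is what keeps the result interpretable.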

3. The OSI Model as a Troubleshooting Framework

The OSI model provides the foundation for all systematic network troubleshooting approaches. By associating symptoms with specific OSI layers, an engineer can narrow the investigation to a specific protocol, technology, or device type without exhaustively checking everything.

  Layer 7 — Application
    Technologies: HTTP, HTTPS, DNS, DHCP, FTP, SMTP, SSH
    Symptoms:     application works for some users but not others; browser error
                  messages; authentication failures; a specific application cannot connect
    Tools:        curl, web browser, application logs, nslookup, dig

  Layer 6 — Presentation
    Technologies: SSL/TLS, encryption, data encoding
    Symptoms:     SSL certificate errors; encryption negotiation failures;
                  data corruption / garbled output
    Tools:        openssl, certificate inspection tools

  Layer 5 — Session
    Technologies: NetBIOS, RPC, SIP, H.323
    Symptoms:     sessions drop unexpectedly; cannot establish or maintain a
                  session; VoIP call setup failures
    Tools:        application logs, Wireshark session analysis

  Layer 4 — Transport
    Technologies: TCP, UDP
    Symptoms:     TCP connection resets; sessions time out; high retransmission
                  rate; ACL blocking specific ports
    Tools:        netstat, Wireshark, show ip access-lists

  Layer 3 — Network
    Technologies: IP, ICMP, OSPF, EIGRP, BGP, ACLs, NAT
    Symptoms:     ping fails but Layer 2 works; routing loop; wrong route in the
                  table; NAT misconfiguration; ACL blocking traffic
    Tools:        ping, traceroute, show ip route, show ip protocols

  Layer 2 — Data Link
    Technologies: Ethernet, 802.11, VLANs, STP, ARP, MAC address table
    Symptoms:     ping to default gateway fails; interface up but no Layer 3
                  connectivity; STP loop; duplex mismatch; VLAN misconfiguration
    Tools:        show interfaces, show mac address-table, show spanning-tree, show vlan

  Layer 1 — Physical
    Technologies: cables, connectors, transceivers, NICs, ports
    Symptoms:     interface "down/down"; no link light; high error counters;
                  intermittent connectivity; CRC errors
    Tools:        show interfaces (CRC, input errors), cable tester,
                  optical power meter, visual inspection

See: OSI Model | OSI Layer Functions | TCP/IP Model

4. Bottom-Up Troubleshooting

Bottom-up troubleshooting starts at Layer 1 (Physical) and works upward through the OSI stack layer by layer until the problem is found. The core principle: each layer depends on the layers below it. If Layer 1 has a fault, fixing it might resolve what appeared to be a Layer 3 problem — no point investigating routing if the cable is faulty.

Bottom-Up Workflow

  Bottom-up troubleshooting — start at Layer 1, work upward:

  Layer 1 — Physical: Is the cable connected? Is the link up?
  ───────────────────────────────────────────────────────────
  Check: show interfaces Gi0/0
  Look for: "GigabitEthernet0/0 is down, line protocol is down"
  Check: interface counters — input errors, CRC, giants, runts
  Check: cable, connector, SFP, media converter
  → If Layer 1 is OK, proceed to Layer 2.

  Layer 2 — Data Link: Is the correct VLAN assigned? Is STP OK?
  ───────────────────────────────────────────────────────────────
  Check: show vlan brief — is the port in the correct VLAN?
  Check: show spanning-tree — is the port in forwarding state?
  Check: show mac address-table — does the switch know the MAC?
  Check: show interfaces — duplex/speed mismatch?
  → If Layer 2 is OK, proceed to Layer 3.

  Layer 3 — Network: Is there a valid IP route?
  ─────────────────────────────────────────────────────────────
  Check: show ip route — does a route to the destination exist?
  Check: ping <default-gateway> — can the device reach its gateway?
  Check: show ip interface brief — is the IP address correct?
  Check: show ip protocols — is the routing protocol running?
  → If Layer 3 is OK, proceed to Layer 4.

  Layer 4 — Transport: Is an ACL blocking the port?
  ─────────────────────────────────────────────────────────────
  Check: show ip access-lists — is traffic being denied?
  Check: Is the specific TCP/UDP port open? (telnet <ip> <port>)
  → If Layer 4 is OK, proceed to Layer 7 (application).

  Layer 7 — Application: Is the service running?
  ─────────────────────────────────────────────────────────────
  Check: Is the server listening on the expected port?
  Check: Is DNS resolving correctly?
  Check: Application logs for errors

When to Use Bottom-Up

  - Physical or data link layer problems are suspected (new cable run, hardware
    change, port flapping). Layer 1/2 problems are common and inexpensive to
    check — eliminate them first.
  - The problem is a complete connectivity failure (no ping, interface
    down/down). Total failure usually starts at the physical layer, so working
    upward is logical.
  - The symptom is unfamiliar and the failing layer is unknown. A systematic
    layer-by-layer approach ensures nothing is missed when the problem is
    ambiguous.

Bottom-up limitation: Starting at Layer 1 when the problem is obviously at Layer 3 (e.g., a misconfigured OSPF area) wastes time checking physical connectivity that is clearly working. Bottom-up is thorough but slow — use it when you genuinely do not know which layer is at fault.

5. Top-Down Troubleshooting

Top-down troubleshooting starts at Layer 7 (Application) and works downward through the OSI stack. The rationale: the user experiences the problem at the application layer, and because every layer depends on the ones below it, a fault anywhere in the stack ultimately manifests at the top. A single application-layer test therefore exercises the entire stack end-to-end: if it succeeds, every layer below is confirmed working.

Top-Down Workflow

  Top-down troubleshooting — start at Layer 7, work downward:

  Layer 7 — Application: Does the application work?
  ─────────────────────────────────────────────────────────────────
  Test: Open a browser and navigate to the web server.
  Result: "ERR_CONNECTION_REFUSED" or timeout.
  → Application layer is failing. Is it a server issue or a network issue?
  Test: Can the user ping the server IP? (bypasses DNS and app)

  Layer 3 — Network: Can IP reach the server?
  ─────────────────────────────────────────────────────────────────
  Test: ping 192.168.10.5 from the client.
  Result: Ping succeeds (100% success, correct RTT).
  → Layer 3 is fine. The problem is at Layer 4 or above.

  Layer 4 — Transport: Is the service port reachable?
  ─────────────────────────────────────────────────────────────────
  Test: telnet 192.168.10.5 80 (test TCP port 80 reachability)
  Result: Connection refused / timeout.
  → Either the server is not listening on port 80, or an ACL
    is blocking port 80 between client and server.

  Check ACL: show ip access-lists on routers in the path.
  Found: access-list 110 deny tcp any any eq 80
  Root cause: ACL is blocking HTTP traffic.
  Fix: Remove or modify the ACL entry.

When to Use Top-Down

  - The problem is application-specific (web works but FTP fails; email is
    broken but browsing works). Application-specific problems usually have
    application- or transport-layer causes — no need to check cables.
  - The user has clearly described an application failure ("I cannot open
    Outlook") rather than total connectivity loss. Partial failures usually
    indicate upper-layer issues — the lower layers are likely working.
  - Layers 1–3 are known to be operational (users can ping and access some
    services). There is no need to re-verify lower layers that are visibly
    working — start where the problem is.

Top-down limitation: Checking application functionality requires access to the application and often to the server. In some environments this is not straightforward. Top-down also presupposes the lower layers are working — if they are not, the application test gives no useful information about the real cause. See also ACLs and Firewalls for transport-layer blocking.

6. Divide-and-Conquer Troubleshooting

Divide-and-conquer (also called half-splitting) starts in the middle of the OSI stack — typically at Layer 3 with a ping test — and uses the result to eliminate half the stack immediately. If Layer 3 works, the problem is at Layer 4 or above. If Layer 3 fails, the problem is at Layer 3 or below. Each test cuts the remaining search space in half.

Divide-and-Conquer Workflow

  Divide-and-conquer — binary search through the OSI stack:

  Start at Layer 3 (middle ground):
  Test: ping <destination IP>

  ┌─────────────────────────────────────────────────────────────────┐
  │  PING SUCCEEDS                    │  PING FAILS                 │
  │  Layer 3 and below are OK         │  Problem is Layer 3 or below│
  │  → Focus on Layer 4 and above     │  → Focus on Layer 1, 2, or 3│
  └─────────────────────────────────────────────────────────────────┘

  If PING FAILS → test Layer 3 specifically:
  → Check show ip route — does a route exist?
  → Check show ip interface brief — correct IP, interface up?

  If routes exist but ping still fails → drill down to Layer 2:
  → Check ARP — is the MAC learned?
  → Check VLAN — is the port in the correct VLAN?
  → Check STP — is the port in forwarding state?

  If Layer 2 is OK but ping still fails → check Layer 1:
  → show interfaces — CRC errors, physical down?

  If PING SUCCEEDS → test Layer 4:
  → telnet <ip> <port> — is the specific port reachable?
  → show ip access-lists — any hits on deny statements?

  If Layer 4 OK → test Layer 7:
  → Application log, service status on server

  Each test halves the remaining search space — efficient for unknown problems.

When to Use Divide-and-Conquer

  - The problematic layer is unknown and could be anywhere in the stack. This is
    the most efficient approach when there is no hypothesis — each test
    eliminates half of the possible causes.
  - Time is critical and the problem must be narrowed down quickly.
    Divide-and-conquer typically reaches the root cause in 2–3 tests versus 6–7
    tests for bottom-up on a 7-layer stack.
  - The problem spans multiple possible OSI layers. Starting in the middle
    avoids committing to a direction before any evidence points one way.

Why ping is the ideal divide-and-conquer starting point: A successful ping confirms that Layer 1 (physical), Layer 2 (data link — MAC, VLAN, STP), and Layer 3 (IP routing, ARP) are all functional end-to-end. In one test, three layers are confirmed working, narrowing the problem to Layer 4–7. A failed ping immediately focuses investigation on Layers 1–3.

7. Follow-the-Path Troubleshooting

Follow-the-path (also called path isolation or trace-the-packet) follows the actual route a packet takes from source to destination, examining each device along the path. Rather than working vertically through OSI layers on one device, this approach works horizontally across the network topology — device by device, hop by hop.

Follow-the-Path Workflow

  Scenario: PC (192.168.1.10) cannot reach Server (10.20.0.5)

  Network path:
  [PC] ─── [Switch SW1] ─── [Router R1] ─── [Router R2] ─── [Switch SW2] ─── [Server]

  Step 1: Determine the path using traceroute:
  PC# traceroute 10.20.0.5
  1  192.168.1.1  (R1) — 2 ms  ← reaches R1
  2  * * *                      ← timeout at R2 or beyond

  → Packet reaches R1 but something fails at or after R2.

  Step 2: Examine R1:
  R1# show ip route 10.20.0.5
  → Route exists: via 172.16.0.2 (R2). R1 is OK.
  R1# ping 10.20.0.5 source Lo0
  → Fails. Problem is downstream from R1.

  Step 3: Examine R2:
  R2# show ip route 10.20.0.5
  → 10.20.0.0/24 is directly connected, Gi0/1. R2 routing OK.
  R2# ping 10.20.0.5
  → Fails. Problem is at R2 or between R2 and the server.
  R2# show ip arp 10.20.0.5
  → ARP entry missing. R2 cannot resolve server MAC.
  R2# show interfaces Gi0/1
  → "GigabitEthernet0/1 is up, line protocol is down"
  ← FOUND: R2's interface toward SW2 is down (Layer 1 or 2 issue)

  Step 4: Examine SW2:
  SW2# show interfaces Gi0/24
  → Port is down — cable unplugged on SW2's uplink to R2.
  ROOT CAUSE: Physical disconnection between R2 Gi0/1 and SW2 Gi0/24.
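The hop-by-hop logic above can be expressed as a small function. This is an illustrative Python sketch: the function name and the simulated responder set are invented, and in practice the reachability data would come from ping or traceroute results rather than a hard-coded set.

```python
def first_unreachable(path, responders):
    """path: ordered hop names from source toward destination.
    responders: set of hops that answered (e.g. from ping/traceroute).
    Returns the first silent hop -- the break is at that device or on the
    link just before it -- or None if every hop responds."""
    for hop in path:
        if hop not in responders:
            return hop
    return None

# Simulated results mirroring the scenario above: R1 answers, R2 does not.
path = ["SW1", "R1", "R2", "SW2", "Server"]
broken_at = first_unreachable(path, {"SW1", "R1"})
```

The returned hop is where detailed per-device investigation (show ip route, show ip arp, show interfaces) begins.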

Key Follow-the-Path Tools

  - traceroute: identifies the last responding hop — shows exactly where the
    path breaks. Timeouts (* * *) indicate where packets stop.
  - ping with source (ping <dst> source <interface>): simulates traffic from a
    specific interface — confirms which hop the problem is on versus which hop
    is merely reporting it.
  - show ip route: confirms the route exists on each hop and points to the
    correct next hop toward the destination.
  - show ip arp: confirms ARP resolution is working at each hop — a missing ARP
    entry on the last-hop router is a common issue.
  - show interfaces: physical and data link status at each hop; error counters
    reveal transmission problems on a specific link.
  - show cdp neighbors: confirms physical adjacency — which devices are
    connected to which port, and their platform/model.

When to Use Follow-the-Path

  - The path between source and destination traverses multiple routers and
    switches. The problem could be on any device in the path — following the
    path pinpoints the exact failing device.
  - traceroute shows where the path breaks (first timeout hop). traceroute has
    already identified the approximate location — follow-the-path investigates
    that device in detail.
  - Connectivity is intermittent — some paths work and some do not. A specific
    device in a specific path is misbehaving — following the affected path
    isolates it.

8. Comparing the Four Approaches

  Bottom-Up
    Starting point:  Layer 1 (Physical)
    Direction:       upward through the OSI stack
    Best for:        unknown layer; suspected physical problem; complete failure
    Main limitation: slow when the problem is at the upper layers

  Top-Down
    Starting point:  Layer 7 (Application)
    Direction:       downward through the OSI stack
    Best for:        application-specific failure; lower layers known to work
    Main limitation: requires application access; slow if lower layers are broken

  Divide-and-Conquer
    Starting point:  Layer 3 (Network — ping)
    Direction:       binary split up or down
    Best for:        unknown layer; time-critical; most efficient method
    Main limitation: requires experience to interpret middle-layer tests

  Follow-the-Path
    Starting point:  source device
    Direction:       horizontal — hop by hop
    Best for:        multi-hop path failure; after traceroute points to a device
    Main limitation: requires access to each device in the path

Hybrid Approach — Real-World Practice

  In practice, experienced engineers use a hybrid:

  1. Start with divide-and-conquer (ping test) to determine which half of
     the OSI stack contains the problem.

  2. If physical layer is suspect (interface down/down, errors) →
     switch to bottom-up for detailed Layer 1/2 investigation.

  3. If application-specific (ping works, app fails) →
     switch to top-down for Layer 4/7 investigation.

  4. If the problem involves a multi-hop path →
     use follow-the-path (traceroute + per-device show commands)
     to identify the exact failing device, then apply bottom-up or
     divide-and-conquer on that specific device.

  The approaches are tools — use the right tool for the current phase
  of the investigation. Switching methods is not inconsistency;
  it is efficiency.
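The hybrid decision can be summarised as a tiny dispatcher. This is a hedged Python sketch: the function and its three boolean inputs are a deliberate simplification invented for illustration, not a complete diagnostic model.

```python
def choose_approach(ping_ok: bool, interface_down: bool, multi_hop: bool) -> str:
    """Simplified sketch of the hybrid workflow above: pick a troubleshooting
    approach from three coarse observations."""
    if not ping_ok and multi_hop:
        return "follow-the-path"    # traceroute + per-device show commands
    if interface_down:
        return "bottom-up"          # detailed Layer 1/2 investigation
    if ping_ok:
        return "top-down"           # ping works, app fails: Layer 4-7
    return "divide-and-conquer"     # unknown layer: start at Layer 3
```

In practice the inputs themselves come from the first divide-and-conquer ping test; the dispatcher only captures the "switch tools as evidence arrives" idea.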

9. Layer-by-Layer Diagnostic Commands

Knowing which commands to run at each OSI layer is as important as knowing which methodology to use. The following is a practical reference for the most important diagnostic commands at each layer.

Layer 1 — Physical

Key commands: show interfaces — check for up/down status, CRC errors, input errors, giants, runts. See also Cable Testing Tools.

  Router# show interfaces GigabitEthernet 0/0
  GigabitEthernet0/0 is up, line protocol is up       ← L1 and L2 status
  GigabitEthernet0/0 is down, line protocol is down   ← L1 failure (cable, SFP)
  GigabitEthernet0/0 is up, line protocol is down     ← L1 OK, L2 failure (keepalive)
  GigabitEthernet0/0 is administratively down         ← shutdown command applied

  Hardware CRC/error counters (signs of Layer 1 problems):
     5 minute input rate 1234000 bits/sec, 0 packets/sec
     Input errors: 14, CRC: 14     ← faulty cable, bad connector, electrical noise
     Giants: 0, Runts: 0           ← frame size issues
     No buffer: 0, Ignored: 0

  Key threshold: any non-zero and increasing CRC errors = physical problem.

  Router# show interfaces status     (on switches)
  Port      Name   Status       Vlan  Duplex  Speed  Type
  Gi0/1            connected    1     a-full  a-1000 10/100/1000BaseTX
  Gi0/2            notconnect   1       --    auto   10/100/1000BaseTX  ← no link

Layer 2 — Data Link

Key commands: show vlan brief, show mac address-table, show spanning-tree (STP), show arp. See also VLANs.

  ! VLAN and port assignment:
  Switch# show vlan brief
  Switch# show interfaces GigabitEthernet 0/1 switchport

  ! MAC address table — is the destination MAC known?
  Switch# show mac address-table dynamic
  Switch# show mac address-table address AA:BB:CC:DD:EE:FF

  ! Spanning Tree — is the port in forwarding state?
  Switch# show spanning-tree vlan 10
  ! Port roles:  root, designated, alternate, backup
  ! Port states: forwarding, blocking, listening, learning

  ! Duplex and speed (duplex mismatch causes high errors):
  Switch# show interfaces GigabitEthernet 0/1 | include duplex|speed

  ! ARP table — Layer 2 to Layer 3 mapping:
  Router# show arp
  Router# show ip arp 192.168.1.5    ! check specific entry

Layer 3 — Network

Key commands: show ip route, show ip interface brief, show ip protocols, ping. Routing protocols: OSPF, EIGRP, BGP. See also ACLs and NAT.

  ! Routing table — does a route exist to the destination?
  Router# show ip route
  Router# show ip route 10.20.0.0
  Router# show ip route 10.20.0.5    ! specific host lookup

  ! Interface IP configuration:
  Router# show ip interface brief    ! all interfaces, status, IP
  Router# show ip interface Gi0/0    ! detailed per-interface IP info

  ! Routing protocol status:
  Router# show ip protocols           ! which protocols are running
  Router# show ip ospf neighbor       ! OSPF adjacencies
  Router# show ip eigrp neighbors     ! EIGRP adjacencies
  Router# show ip bgp summary         ! BGP peer status

  ! Ping — end-to-end Layer 3 connectivity:
  Router# ping 10.20.0.5
  Router# ping 10.20.0.5 source GigabitEthernet 0/0   ! from specific source
  Router# ping 10.20.0.5 repeat 100                   ! extended: 100 pings

Layer 4 — Transport

Key commands: show ip access-lists (ACLs), telnet <ip> <port> to test TCP port reachability. See also Firewalls and Common Port Numbers.

  ! ACL hit counters — are packets being denied?
  Router# show ip access-lists
  ! Look for non-zero match counters on deny statements

  ! Test TCP port reachability:
  Router# telnet 10.20.0.5 80     ! test if port 80 is open
  Trying 10.20.0.5, 80 ...
  % Connection refused by remote host    ← port closed or ACL blocking

  ! On the host (Windows):
  C:\> telnet 10.20.0.5 443       ! test HTTPS port
  PS C:\> Test-NetConnection 10.20.0.5 -Port 443   ! PowerShell equivalent
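The same Layer 4 check can be scripted. A minimal Python sketch using the standard socket library (the function name is an illustrative choice):

```python
import socket

def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Scriptable equivalent of 'telnet <ip> <port>': True if the TCP
    three-way handshake to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:        # connection refused, timed out, or unreachable
        return False
```

Note that a False result does not distinguish a closed port from a filtering ACL: a refused connection fails almost immediately, while a silently dropped SYN waits out the full timeout.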

Layer 7 — Application

Key commands: nslookup, dig for DNS resolution; show ip dhcp binding for DHCP; ssh for remote access. See also HTTP/HTTPS.

  ! DNS resolution:
  C:\> nslookup www.example.com
  C:\> nslookup www.example.com 8.8.8.8    ! query specific DNS server

  Router# show hosts                 ! locally cached DNS entries

  ! DHCP troubleshooting:
  Router# show ip dhcp binding       ! current DHCP leases
  Router# show ip dhcp conflict      ! addresses with ARP conflicts
  Router# debug ip dhcp server events  ! trace DHCP request/offer process

  ! SSH connection test (confirm Layer 7 SSH service is running):
  $ ssh [email protected]
  ssh: connect to host 192.168.1.1 port 22: Connection refused
  ← SSH service not running or port blocked
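DNS resolution can likewise be checked from a script. A minimal Python sketch using the standard socket library (the function name is an illustrative choice):

```python
import socket

def resolve(name: str) -> list:
    """Scriptable equivalent of nslookup: the IP addresses a name
    resolves to, or an empty list if resolution fails."""
    try:
        infos = socket.getaddrinfo(name, None)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return []
```

An empty result with working IP connectivity (pings to IP addresses succeed) is the classic Layer 7 signature: wrong DNS server, unreachable DNS server, or UDP 53 blocked.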

10. Common Problem Patterns and Their Layer

Experience builds a mental library of problem patterns — certain symptoms that almost always point to a specific layer and cause. The following table is a reference for the most common network problems and where they live in the OSI stack.

  Symptom: Interface "down/down"
    Layer:        1 (Physical)
    Common cause: cable unplugged, broken cable, faulty SFP, speed/auto-negotiation failure
    First check:  show interfaces — physical status and error counters

  Symptom: Interface "up/down"
    Layer:        2 (Data Link)
    Common cause: keepalive failure, encapsulation mismatch (serial), no HDLC/PPP peer
    First check:  show interfaces — encapsulation, keepalives

  Symptom: High CRC / input errors
    Layer:        1 (Physical)
    Common cause: damaged cable, faulty connector, interference, duplex mismatch
    First check:  show interfaces — CRC counter; inspect cable and connectors

  Symptom: Ping to default gateway fails
    Layer:        2 or 3
    Common cause: wrong VLAN, STP blocking, wrong IP/subnet on PC or gateway interface
    First check:  show vlan, show arp, show ip int brief

  Symptom: Ping succeeds but application fails
    Layer:        4 or 7
    Common cause: ACL blocking the specific port, firewall rule, server not listening, DNS failure
    First check:  show ip access-lists, telnet <ip> <port>, nslookup

  Symptom: Routing loop / high CPU on router
    Layer:        3 (Network)
    Common cause: redistributed default route causing a loop, summary route pointing back, static route loop
    First check:  show ip route — look for recursive loops or unexpected routes

  Symptom: Intermittent connectivity (flapping)
    Layer:        1 or 2
    Common cause: marginal cable (passes some traffic, fails under load), STP topology change, routing protocol instability
    First check:  show interfaces — input/output errors over time; show spanning-tree detail — TCN events

  Symptom: Users in one VLAN cannot reach users in another
    Layer:        2 or 3
    Common cause: missing inter-VLAN routing, wrong default gateway, ACL blocking inter-VLAN traffic
    First check:  show ip route on the L3 switch/router; verify default gateways on clients

  Symptom: DNS resolution failing
    Layer:        7 (Application)
    Common cause: wrong DNS server configured, DNS server unreachable, UDP 53 blocked by ACL
    First check:  nslookup <name>; ping <DNS-server-IP>

  Symptom: DHCP not providing addresses
    Layer:        3 or 7
    Common cause: missing DHCP relay (ip helper-address) when the server is on a different subnet, DHCP pool exhausted, server misconfiguration
    First check:  show ip dhcp binding; show ip dhcp pool; verify ip helper-address on the gateway
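For scripted triage, this pattern library can be condensed into a lookup. An illustrative Python sketch: the symptom labels and the PATTERNS/triage names are simplifications invented for this example, not a complete encoding.

```python
# Condensed symptom -> (likely layer, first check) lookup, illustrative only.
PATTERNS = {
    "interface down/down":    ("Layer 1", "show interfaces"),
    "interface up/down":      ("Layer 2", "show interfaces"),
    "high crc errors":        ("Layer 1", "show interfaces"),
    "gateway ping fails":     ("Layer 2 or 3", "show vlan / show arp / show ip int brief"),
    "ping ok, app fails":     ("Layer 4 or 7", "show ip access-lists / telnet <ip> <port>"),
    "dns resolution failing": ("Layer 7", "nslookup"),
}

def triage(symptom: str):
    """Return (likely layer, first check) for a known symptom label."""
    return PATTERNS.get(symptom.lower(), ("unknown", "gather more information"))
```

The fallback entry matters: an unrecognised symptom should route back to the structured process (gather information), not to a guess.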

11. Documenting the Troubleshooting Process

Documentation is an integral part of systematic troubleshooting — not an afterthought. Real-time notes during an incident serve as a working memory aid, enable smooth handoff to colleagues, and build the institutional knowledge base that prevents the same problem from taking as long to resolve next time.

  Minimum information to document during a troubleshooting incident:

  ┌─────────────────────────────────────────────────────────────────────┐
  │ INCIDENT RECORD                                                     │
  │                                                                     │
  │ Date/Time reported: 2025-03-15 14:32 UTC                           │
  │ Reported by: NOC Engineer / User Help Desk ticket #12345           │
  │                                                                     │
  │ SYMPTOM: PC users in VLAN 10 (Finance) cannot reach server         │
  │          192.168.20.5 (Payroll). Issue started ~14:15 UTC.         │
  │          Users in VLAN 20 (HR) CAN reach the server.               │
  │                                                                     │
  │ RECENT CHANGES: Access switch SW3 was replaced at 13:45 UTC.       │
  │                                                                     │
  │ INVESTIGATION:                                                      │
  │  14:35 — Ping from PC (10.10.1.5) to server fails                  │
  │  14:36 — Ping from PC to default gateway (10.10.1.1) fails         │
  │  14:38 — show vlan brief on SW3: Fa0/1 is in VLAN 1 (should be 10)│
  │                                                                     │
  │ ROOT CAUSE: Replacement switch SW3 not configured with VLAN 10.    │
  │             Port Fa0/1 defaulted to VLAN 1.                         │
  │                                                                     │
  │ FIX: Configured SW3 Fa0/1: switchport access vlan 10              │
  │                                                                     │
  │ VERIFICATION: Ping from Finance PCs to server succeeds.            │
  │               All Finance users confirm access restored at 14:47.  │
  │                                                                     │
  │ PREVENTION: Add SW3 to configuration management — apply standard   │
  │             VLAN config via Ansible playbook on next change window. │
  └─────────────────────────────────────────────────────────────────────┘
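A record like this can also be validated programmatically before an incident is closed. An illustrative Python sketch: the field names mirror the record above but are otherwise an invented convention.

```python
# Minimum fields every incident record should contain (names are an
# illustrative convention based on the record shown above).
REQUIRED_FIELDS = ("symptom", "recent_changes", "investigation",
                   "root_cause", "fix", "verification", "prevention")

def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]

record = {
    "symptom": "VLAN 10 users cannot reach 192.168.20.5",
    "recent_changes": "Access switch SW3 replaced at 13:45 UTC",
    "investigation": "Pings fail; SW3 Fa0/1 found in VLAN 1",
    "root_cause": "Replacement switch missing VLAN 10 config",
    "fix": "switchport access vlan 10 on SW3 Fa0/1",
    "verification": "Finance pings succeed; users confirm access",
    "prevention": "Add SW3 to configuration management",
}
```

A check like this is a natural gate in a ticketing workflow: an incident with missing fields is not ready to close.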

12. Troubleshooting Methodology Summary — Key Facts

  - Structured process steps: define problem → gather info → analyse → form
    hypothesis → test hypothesis → implement fix → verify and document
  - Most powerful first question: "What changed recently?" — most problems
    follow a recent change
  - Bottom-up: start at Layer 1 and work upward; best for suspected physical
    problems or total connectivity failure
  - Top-down: start at Layer 7 and work downward; best for application-specific
    failures when the lower layers are known to work
  - Divide-and-conquer: start at Layer 3 (ping); the result divides the stack in
    half; most efficient when the problematic layer is unknown
  - Follow-the-path: horizontal hop-by-hop investigation; best for multi-hop
    paths; used after traceroute points to a suspect device
  - Ping interpretation: success = L1, L2, L3 all working end-to-end; failure =
    problem at L1, L2, or L3 — see ping Command
  - Interface "down/down": Layer 1 physical failure — cable, SFP, port — check
    show interfaces
  - Interface "up/down": Layer 2 failure — encapsulation mismatch, keepalive,
    peer — check show interfaces
  - Ping works, app fails: Layer 4 (port blocked by ACL or firewall) or Layer 7
    (server not listening, DNS failure)
  - Change one thing at a time: making multiple changes simultaneously makes
    root-cause identification impossible
  - Documentation: document in real time — symptom, timeline, investigation
    steps, root cause, fix, prevention

13. Troubleshooting Methodology Quiz

1. A user reports that "the network is slow." Using the structured troubleshooting process, what is the correct first step?

Correct answer is B. "The network is slow" is not a problem statement — it is a user experience. Before any technical investigation, the engineer must define the problem precisely: Is it one application or all applications? One user or many? Slow to where — internal servers, Internet, or both? Did it start suddenly or gradually? What changed recently? Without this information, any investigation is guesswork. A well-defined problem statement — "Users in VLAN 10 experience 5–10 second delays when accessing the ERP server at 10.20.0.5, starting at 14:00 today, following a router firmware upgrade" — immediately directs the investigation to the right place and time.

2. Which troubleshooting approach starts at Layer 3 with a ping test, uses the result to determine which half of the OSI stack the problem is in, and is considered the most time-efficient method when the problematic layer is unknown?

Correct answer is D. Divide-and-conquer (half-splitting) begins at the middle of the OSI stack — Layer 3 — using a ping test as the starting probe. The result of that ping immediately eliminates half of all possible OSI layers from investigation. If the ping succeeds, Layers 1, 2, and 3 are all confirmed working, and the problem must be at Layer 4 or above. If the ping fails, the problem is in Layers 1, 2, or 3 — the upper layers need no further investigation. This binary elimination makes divide-and-conquer the fastest approach when no prior knowledge points to a specific layer. Contrast with bottom-up, which checks all 7 layers sequentially from the bottom — inefficient when the problem is at Layer 6.

3. A network engineer observes: show interfaces GigabitEthernet0/0 outputs "GigabitEthernet0/0 is up, line protocol is down." At which OSI layer is this problem, and what are likely causes?

Correct answer is A. The interface status line in Cisco IOS has two components: "X is Y" and "line protocol is Z." The first part (X is Y) reflects the physical layer (Layer 1). "Up" means the physical signal is detected — the cable is connected and carrier detect is active. The second part (line protocol is Z) reflects the data link layer (Layer 2). "Down" means the Layer 2 protocol is not functional — keepalive exchanges are failing, or the encapsulation is mismatched. Common causes: on a serial WAN link, one end is configured HDLC and the other PPP; on an Ethernet link, a connected device is not sending keepalives (rare but possible); or the far-end interface is shut down. "Down/down" (both parts down) = Layer 1 physical failure (cable/SFP).

4. A user cannot browse the web. A ping to the default gateway succeeds. A ping to 8.8.8.8 succeeds. What is the next most logical test, and which troubleshooting approach does this demonstrate?

Correct answer is C. This is a classic divide-and-conquer result: the successful pings to the default gateway and 8.8.8.8 confirm that Layers 1, 2, and 3 are all fully functional end-to-end including Internet connectivity. Since the user cannot browse the web but IP connectivity works, the remaining question is: does the browser's DNS lookup work? A web browser uses hostnames (www.google.com) which must be resolved to IP addresses by DNS before any connection is made. nslookup www.google.com or ping www.google.com (which forces DNS resolution) will immediately confirm or rule out DNS as the cause. If DNS fails, the root cause is Layer 7 (DNS misconfiguration, DNS server unreachable, or UDP 53 blocked by an ACL).
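The "does the name resolve?" test can be reproduced with Python's standard library, as a rough stand-in for nslookup. The `dns_resolves` function name is an assumption for illustration; it uses only the documented `socket.gethostbyname` call:

```python
import socket

def dns_resolves(hostname: str) -> bool:
    """The nslookup step in code: True if the name resolves to an address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:   # NXDOMAIN, no resolver, or lookup timeout
        return False

# With IP connectivity already proven by the pings, a False here
# points the investigation squarely at DNS (Layer 7):
print(dns_resolves("localhost"))  # True on any host with a standard hosts file
```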

5. When should follow-the-path be the primary troubleshooting approach, and what tool is most useful to start a follow-the-path investigation?

Correct answer is B. Follow-the-path is the horizontal troubleshooting approach — it moves across the network topology rather than up and down the OSI stack. It is most valuable when the problem occurs somewhere in a multi-hop path (PC → switch → router → router → switch → server) and the question is: which device is failing? traceroute is the perfect starting tool: it sends probes with incrementing TTL values and collects ICMP TTL-exceeded responses from each router hop, building a map of the path. When a hop shows asterisks (no response), the previous hop is the last device that successfully handled the packet — the problem is on or immediately after that device. The engineer then logs into that specific device for detailed investigation.
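The TTL-increment mechanics can be shown with a toy simulation. Everything here is hypothetical — the path, the addresses, and the `probe` helper are invented for illustration, not a real traceroute implementation (which needs raw sockets):

```python
# Routers along a hypothetical path, in forwarding order
PATH = ["10.1.1.1", "10.2.2.1", "10.3.3.1", "10.4.4.1"]

def probe(ttl: int, responsive: set) -> str:
    """One probe with the given TTL: the router that decrements TTL to
    zero replies with ICMP TTL-exceeded, or the probe times out."""
    hop = PATH[ttl - 1] if ttl <= len(PATH) else None
    if hop is not None and hop in responsive:
        return hop
    return "* * *"   # no reply: filtered ICMP, or the packet never arrived

responsive = {"10.1.1.1", "10.2.2.1"}   # hops 3+ never answer
trace = [probe(ttl, responsive) for ttl in range(1, 5)]
print(trace)  # ['10.1.1.1', '10.2.2.1', '* * *', '* * *']
```

The last responding hop (10.2.2.1 here) is where the engineer logs in for detailed investigation.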

6. An engineer is troubleshooting a connectivity problem and makes three configuration changes simultaneously: corrects a VLAN, fixes an IP address, and removes an ACL entry. The problem is resolved. What is wrong with this approach?

Correct answer is D. Making multiple changes simultaneously is one of the cardinal sins of troubleshooting. Even though the problem was resolved, the root cause is unknown — was it the VLAN? The IP address? The ACL? Or a combination? The two "extra" changes might be unnecessary — and unnecessary changes can have unintended side effects. Removing an ACL entry that was actually correct for security reasons, for example, creates a new vulnerability. The principle is: change one variable at a time. Test after each change. This makes root cause identification unambiguous and prevents unnecessary or harmful changes. It also builds diagnostic skill: each single change either confirms or refutes a specific hypothesis.
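The one-change-at-a-time discipline can be expressed as a loop: apply one candidate fix, test, and revert it if the test still fails. This is a sketch under stated assumptions — `isolate_root_cause`, the `state` dictionary, and the lambda fixes are all invented stand-ins for real configuration actions:

```python
def isolate_root_cause(candidates, works):
    """Apply candidate fixes one at a time; revert any that do not help."""
    for name, apply_fix, revert_fix in candidates:
        apply_fix()
        if works():
            return name        # this single change resolved the problem
        revert_fix()           # undo before testing the next hypothesis
    return None

# Simulated network state: only the VLAN is actually misconfigured
state = {"vlan": "wrong", "acl": "present"}
def net_works():
    return state["vlan"] == "correct"

candidates = [
    ("remove ACL entry", lambda: state.update(acl="removed"),
                         lambda: state.update(acl="present")),
    ("correct VLAN",     lambda: state.update(vlan="correct"),
                         lambda: state.update(vlan="wrong")),
]
root_cause = isolate_root_cause(candidates, net_works)
print(root_cause)     # correct VLAN
print(state["acl"])   # present -- the unnecessary ACL change was reverted
```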

7. A network engineer notices that show interfaces FastEthernet0/1 shows increasing CRC errors on a switch port over time. The interface is "up/up" (connected). At which OSI layer is this problem, and what are the most likely causes?

Correct answer is A. CRC (Cyclic Redundancy Check) errors are a definitive Layer 1 signal. The CRC is a checksum calculated over the frame content at the sending device and recalculated at the receiving device. If they do not match, the frame was corrupted in transit — meaning the electrical signal carrying the bits was degraded between the two devices. Common causes are: damaged copper cable (bent, crushed, or cut), poor crimp on an RJ-45 connector, a cable run beyond the 100 m Ethernet limit, excessive electrical interference from nearby power cables, or a duplex mismatch (one side at half-duplex causes collisions that manifest as CRC errors on the full-duplex side). The fix starts with replacing the cable and checking duplex settings (show interfaces | include duplex). See: show interfaces Command
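The sender-computes / receiver-recomputes mechanism can be demonstrated with Python's `zlib.crc32`, which uses the same CRC-32 polynomial family as the Ethernet FCS (the frame bytes and single-bit "noise" below are invented for illustration):

```python
import zlib

frame = b"payload bits crossing the wire"
fcs = zlib.crc32(frame)              # sender appends this checksum

# Simulate electrical noise flipping a single bit in transit:
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]

print(zlib.crc32(frame) == fcs)      # True  -- intact frame passes
print(zlib.crc32(corrupted) == fcs)  # False -- receiver counts a CRC error
```

Every such mismatch increments the interface's CRC error counter and the frame is discarded — which is why a climbing counter on an up/up port means bits are being damaged on the wire, not in software.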

8. Which troubleshooting approach is most appropriate when a user reports that "email works fine, but I cannot access the new internal web application deployed yesterday"?

Correct answer is C. This scenario has two critical clues. First, email works — this confirms that the user's lower OSI layers (physical, data link, network, and transport) are all functional. An email client makes TCP connections (SMTP/IMAP) to remote servers — if email works, Layers 1–4 are working for this user. There is no need to check cables or routing. Second, the web application was deployed yesterday — this is a recent change, immediately making it the prime suspect. Top-down is the right approach: start at the application layer and work down only as far as needed. Typical checks: can the user reach the server by IP? Does DNS resolve the hostname? Is the server actually running and listening on the expected port? Did the deployment configure the correct firewall rules or ACLs?
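The top-down check sequence can be modeled as an ordered checklist that stops at the first failure. This is a simplified sketch — the check names and the hard-coded lambdas (standing in for real probes) are assumptions for illustration:

```python
# Checks ordered from the application layer downward; in this scenario
# DNS works but the new app's server is not listening on its port.
CHECKS = [
    ("L7: does DNS resolve the app hostname?",            lambda: True),
    ("L7: is the server listening on the expected port?", lambda: False),
    ("L3: is the server reachable by IP?",                lambda: True),
]

def top_down(checks):
    """Walk the checklist top-down; the first failing check is where
    the detailed investigation should focus."""
    for name, check in checks:
        if not check():
            return name
    return "all checks passed"

print(top_down(CHECKS))  # L7: is the server listening on the expected port?
```

Because email already proved Layers 1–4, the Layer 3 check at the bottom would likely never be needed — top-down stops as soon as a failure is found.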

9. A traceroute from a PC to a server shows: 1 10.1.1.1 (1ms), 2 10.2.2.1 (3ms), 3 * * *, 4 * * *. What does this output tell you, and what is the next step?

Correct answer is B. Traceroute works by sending packets with incrementing TTL values (1, 2, 3...) and collecting ICMP TTL-exceeded messages from each router that decrements the TTL to zero. In this output: hop 1 (10.1.1.1) responded = the first router is reachable. Hop 2 (10.2.2.1) responded = the second router is reachable. Hop 3 shows asterisks = no ICMP TTL-exceeded was received from the third device. This means either the third device does not respond to ICMP TTL-exceeded messages (some devices are configured this way), OR the packet never reaches a third device (the link from hop 2 to hop 3 is down). The best next step is to log into 10.2.2.1 and examine its routing table for the destination prefix, and check the status of the outgoing interface toward hop 3.
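Interpreting such output amounts to finding the last hop that answered. The `last_good_hop` helper below is a hypothetical illustration of that rule, operating on a trace represented as a simple list of strings:

```python
def last_good_hop(trace):
    """Return the last hop that replied -- the device to log into first.
    Hops that timed out are represented as the string '* * *'."""
    responders = [hop for hop in trace if hop != "* * *"]
    return responders[-1] if responders else None

trace = ["10.1.1.1", "10.2.2.1", "* * *", "* * *"]
print(last_good_hop(trace))  # 10.2.2.1
```

For the question's output, that rule points at 10.2.2.1 — check its routing table for the destination prefix and the status of its outgoing interface.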

10. After resolving a network incident where a trunk port misconfiguration caused VLAN connectivity loss for 45 minutes, which activities should the engineer complete as part of the final step of the structured troubleshooting process?

Correct answer is D. The final step of the structured troubleshooting process is "Verify and Document." Verification means more than just checking that the immediate symptom is gone — it means confirming end-to-end that all affected users have restored access, AND confirming that the fix did not introduce any new problems (for example, correcting a trunk VLAN might restore VLAN 10 but accidentally remove VLAN 20 from the trunk). Documentation is equally important: recording the root cause, timeline, and fix creates institutional knowledge that reduces time-to-resolution for similar incidents in the future. The prevention measure is the most valuable long-term output — identifying why the misconfiguration occurred (manual change without verification) and implementing a process to prevent it (automation, peer review, change management). "Documentation later" (option B) almost never happens — it must be done while the details are fresh.
