Network Diagnostics for DevOps: Troubleshooting Guide
How DevOps teams use traceroute, ping, and DNS tools to debug connectivity in multi-region deployments.
Why Network Diagnostics Matter for DevOps
In modern infrastructure, applications span multiple cloud regions, rely on third-party APIs, and serve users globally. When something breaks, the question is rarely whether the network is involved, but where in the network the problem lies. DevOps engineers who can systematically diagnose network issues resolve incidents faster, write better postmortems, and build more resilient systems.
This guide covers a practical diagnostic workflow, common failure patterns in cloud and multi-region environments, and how to integrate network testing into your operations.
The Diagnostic Workflow
When a service is unreachable or slow, follow this systematic approach. Each step narrows the problem space:
Step 1: Verify Basic Connectivity (Ping)
Start simple. Can you reach the host at all?
ping -c 10 api.example.com
If ping works, you have IP connectivity and DNS resolution. Note the latency: is it normal for the geographic distance? Use TraceMapper Ping to test from multiple locations simultaneously. If ping fails, the cause could be DNS, routing, a firewall, or the host being down; move on to the next steps.
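If you want a quick comparison from the shell across several endpoints at once, a short loop like the sketch below works; the three hostnames are placeholders for your own regional endpoints.
# Placeholder hostnames; replace with your own regional endpoints
for host in api-us.example.com api-eu.example.com api-ap.example.com; do
  avg=$(ping -c 5 -q "$host" 2>/dev/null | awk -F'/' '/^rtt|^round-trip/ { print $5 }')
  if [ -n "$avg" ]; then echo "$host: avg RTT ${avg} ms"; else echo "$host: unreachable"; fi
done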
Step 2: Trace the Path (Traceroute)
If latency is high or connectivity is intermittent, trace the path:
mtr -rwbzc 100 api.example.com
This runs mtr in report mode, sending 100 probes to each hop, and shows hop-by-hop latency, packet loss, and ASN information. Look for:
- Packet loss at a specific hop that carries through to the destination — this is a real problem, not just ICMP rate limiting.
- Unexpected geographic detours — traffic going through distant regions instead of taking a direct path.
- ASN transitions — identify where traffic leaves your cloud provider's network and enters the public internet, which is often where issues occur.
Use TraceMapper to run visual traceroutes from multiple source locations — this is essential for multi-region services where the path differs per region.
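If you prefer to triage reports from the shell, a rough filter like this flags hops whose reported loss exceeds a threshold; the 1% cutoff is illustrative, and loss at an intermediate hop still needs the carries-through-to-destination check described above.
# Flag hops reporting more than 1% loss (threshold is illustrative)
mtr -rwbzc 100 api.example.com | awk 'NF > 8 { loss = $(NF - 6) + 0; if (loss > 1) print "lossy hop:", $0 }'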
Step 3: Check DNS Resolution
DNS failures are one of the most common causes of outages. Verify resolution from multiple locations:
dig +short api.example.com @8.8.8.8
Check for: stale cached records, propagation delays after DNS changes, NXDOMAIN responses, and high DNS query latency. Use TraceMapper DNS Lookup to query multiple resolvers and record types simultaneously.
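From the shell, querying a few public resolvers side by side makes stale or inconsistent answers obvious; the resolver list below is illustrative.
# Compare answers across public resolvers (resolver list is illustrative)
for ns in 8.8.8.8 1.1.1.1 9.9.9.9; do
  echo "$ns -> $(dig +short api.example.com @"$ns" | tr '\n' ' ')"
done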
Step 4: Test HTTP Connectivity
If the host is reachable and DNS resolves but the application still is not responding, test at the HTTP level:
curl -o /dev/null -s -w "HTTP %{http_code} in %{time_total}s\n" https://api.example.com/health
This reveals TLS handshake issues, HTTP-level errors (502, 503, 504), slow application responses versus slow network, and redirect chains adding latency. Our HTTP Check tool performs this analysis with detailed timing breakdowns.
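To separate slow DNS, slow TLS, and a slow backend from the command line, curl's per-phase timers are enough on their own; this is a plain curl invocation, not the tool's internal implementation.
curl -o /dev/null -s -w "dns %{time_namelookup}s tcp %{time_connect}s tls %{time_appconnect}s ttfb %{time_starttransfer}s total %{time_total}s\n" https://api.example.com/health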
Step 5: Verify Port Accessibility
If HTTP checks fail, verify the port is open. A closed or filtered port indicates a firewall rule, security group misconfiguration, or the service not listening:
nc -zv api.example.com 443
Test from multiple networks — a port may be open from within a VPC but filtered from the public internet. Use TraceMapper Port Check to test from external locations.
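A quick loop covers the handful of ports a service typically exposes; the port list and the three-second timeout are illustrative.
# Port list and timeout are illustrative
for port in 22 80 443 8443; do
  nc -z -v -w 3 api.example.com "$port"
done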
Common Network Issues in Cloud Environments
DNS Resolution Failures
Cloud DNS (Route 53, Cloud DNS, Azure DNS) can fail or return stale records. Common causes: TTLs set too long, leaving stale records cached after a change; TTLs set too short, multiplying query load on authoritative servers; zone delegation errors after a migration; and split-horizon DNS returning internal IPs to external clients. Always have monitoring on DNS resolution from external vantage points.
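One quick check is to compare the authoritative answer with what a public resolver is still serving; ns1.example.com below stands in for whatever dig +short NS returns for your zone.
# Compare the authoritative answer with a recursive resolver's (possibly cached) answer
dig +short NS example.com
dig +short api.example.com @ns1.example.com   # authoritative; ns1.example.com is a placeholder
dig +short api.example.com @8.8.8.8           # recursive, may still serve a stale record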
Routing Changes and BGP Issues
BGP route leaks and hijacks can redirect traffic through unexpected paths. After a major cloud provider or ISP incident, run traceroutes to verify your traffic paths have returned to normal. Use TraceMapper BGP Lookup to check ASN and prefix information.
Peering Congestion
Traffic between cloud providers (e.g., AWS to GCP) or between a cloud provider and a major ISP often traverses peering points that can become congested during peak hours. Symptoms: latency increases at specific times of day, packet loss appears at the ASN boundary between two networks. Solution: use direct connect/dedicated interconnects or route through a different peering point.
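A simple way to confirm a time-of-day pattern is to capture a report every hour and compare loss against the timestamps; the cron entry and log path below are illustrative, and mtr may need elevated privileges.
# Illustrative cron entry: hourly mtr report to correlate loss with time of day
0 * * * * mtr -rwzc 100 api.example.com >> /var/log/mtr-api.log 2>&1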
MTU and Fragmentation Issues
VPN tunnels, VXLAN overlays, and GRE encapsulation reduce the effective MTU. If packets exceed the path MTU and the Don't Fragment bit is set, they are silently dropped. Symptoms: small requests work, large responses fail; TCP connections hang after the handshake. Test with ping -M do -s 1472 destination, reducing the payload size until the ping succeeds. Set your interface MTU to match the path MTU.
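A small loop automates that size walk-down on Linux; the starting payload of 1472 bytes corresponds to a 1500-byte MTU, and the 1200-byte floor is only a guard against looping forever on an unreachable host.
# Walk the payload down until an unfragmented ping succeeds (Linux ping flags)
size=1472
while [ "$size" -gt 1200 ] && ! ping -c 1 -W 2 -M do -s "$size" api.example.com >/dev/null 2>&1; do
  size=$((size - 8))
done
echo "largest unfragmented payload: $size bytes (path MTU roughly $((size + 28)))"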
Security Group and Firewall Blocks
The most common cause of "it works from my machine but not from the server." Cloud security groups are stateful but have rule-count and connection-tracking limits. Check: inbound rules on the destination, outbound rules on the source, NACLs (which are stateless), and host-level firewalls (iptables, nftables, Windows Firewall).
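When the cloud-level rules look right, confirm the basics on the host itself; these are standard Linux commands, with port 443 as the example.
# On the destination host: is anything listening, and is a host firewall interfering?
ss -tlnp | grep ':443'
sudo iptables -L INPUT -n -v
sudo nft list ruleset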
Multi-Source Tracing
A traceroute from your laptop only shows one path. Your users connect from hundreds of different networks. Multi-source tracing runs diagnostics from multiple geographic locations simultaneously, revealing:
- Regional outages that only affect certain ISPs or countries.
- Geo-routing issues where some users are sent to distant servers.
- Asymmetric problems where the path works from region A but not from region B.
TraceMapper supports multi-source tracing from data centers in Frankfurt and Paris, with more locations coming soon. Pro users can run traces from all available sources simultaneously.
Integrating Network Diagnostics into Your Workflow
Automated Health Checks
Add network connectivity checks to your deployment pipeline. Before deploying a new region, verify that traceroutes from key user locations reach your infrastructure with acceptable latency. Use TraceMapper's tools programmatically to validate connectivity as part of your CI/CD process.
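As one possible shape for such a gate (a generic curl-based sketch, not TraceMapper's API; the regional hostname and the 500 ms budget are placeholders):
# Generic pre-deploy gate: fail the pipeline if the new region's health endpoint is slow or unhealthy
url="https://eu-central.api.example.com/health"   # placeholder hostname
out=$(curl -o /dev/null -s -w "%{http_code} %{time_total}" --max-time 5 "$url")
code=${out%% *}
total=${out##* }
[ "$code" = "200" ] || { echo "health check failed: HTTP $code"; exit 1; }
awk -v t="$total" 'BEGIN { exit !(t < 0.5) }' || { echo "health check too slow: ${total}s"; exit 1; }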
Monitoring and Alerting
Set up continuous monitoring for:
- Latency thresholds: Alert when RTT to critical services exceeds your SLA.
- Packet loss: Any sustained packet loss above 0.1% warrants investigation.
- DNS resolution time: Alert if DNS queries take longer than 100 ms.
- Certificate expiry: Catch TLS certificate issues before they cause outages.
Use TraceMapper Monitoring to set up automated checks with alerts delivered to your team's notification channels.
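If you also want a lightweight self-hosted probe feeding the same channels, a cron job along these lines works; the 100 ms threshold matches the list above, and the webhook URL is a placeholder.
# Self-hosted probe: alert if DNS resolution exceeds 100 ms (webhook URL is a placeholder)
qtime=$(dig api.example.com @8.8.8.8 | awk '/Query time/ { print $4 }')
if [ "${qtime:-0}" -gt 100 ]; then
  curl -s -X POST -d "DNS slow: ${qtime} ms for api.example.com" https://hooks.example.com/alert
fi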
Incident Response Runbook
Document the diagnostic workflow above as a runbook. When an incident occurs, on-call engineers should:
- Run ping and traceroute from both the affected location and a known-good location.
- Compare results to identify where the paths diverge.
- Check DNS, HTTP, and port accessibility.
- Save results (screenshots, mtr reports) for the postmortem.
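A small collection script keeps those artifacts consistent across incidents; the target hostname and output paths are placeholders.
# Placeholder hostname and paths: capture timestamped evidence for the postmortem
ts=$(date -u +%Y%m%dT%H%M%SZ)
dir="incident-$ts"; mkdir -p "$dir"
ping -c 10 api.example.com     > "$dir/ping.txt" 2>&1
mtr -rwzc 100 api.example.com  > "$dir/mtr.txt"  2>&1
dig api.example.com @8.8.8.8   > "$dir/dig.txt"  2>&1
curl -sv -o /dev/null https://api.example.com/health 2> "$dir/curl.txt"
echo "results saved in $dir"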
Start Diagnosing
Effective network troubleshooting follows a systematic approach — from basic connectivity through path analysis to application-level checks. TraceMapper provides all the tools you need in one place: Traceroute, Ping, DNS Lookup, HTTP Check, Port Check, IP Reputation, and Monitoring. Try a free traceroute now to see your network path visualized on a map.