Server’s down: How do I find out what’s wrong?

Many different problems can creep up on a network, so network troubleshooting skills become crucial for anyone responsible for servers or services on servers attached to a network. Linux provides a large set of network troubleshooting tools, and this article discusses a few common network problems along with how to use some of the tools available for Linux to track down the root cause.

Problem: Server A can’t talk to server B

Probably the most common network troubleshooting scenario involves one server being unable to communicate with another server on the network. This section will use an example in which a server named dev1 can’t access the web service (port 80) on a second server named web1. Any number of different problems could cause this, so we’ll run step by step through tests you can perform to isolate the cause of the problem.

Normally when troubleshooting a problem like this, you might skip a few of these initial steps (such as checking the link), since tests further down the line will also rule them out. For instance, if you test and confirm that DNS works, you’ve proven that your host can communicate on the local network. For this example, though, we’ll walk through each intermediary step to illustrate how you might test each level.

Client or server problem?

One quick test you can perform to narrow down the cause of your problem is to go to another host on the same network and try to access the server. In this example, you would find another server on the same network as dev1, such as dev2, and try to access web1. If dev2 also can’t access web1, then you know the problem is more likely on web1, or on the network between dev1, dev2, and web1. If dev2 can access web1, then you know the problem is more likely on dev1. To start, let’s assume that dev2 can access web1, so we will focus our troubleshooting on dev1.

Is it plugged in?

The first troubleshooting steps to perform are on the client. You first want to verify that your client’s connection to the network is healthy. To do this you can use the ethtool program (installed via the ethtool package) to verify that your link is up (that is, that the Ethernet device is physically connected to the network). If you aren’t sure which interface you use, run the /sbin/ifconfig command to list all the available network interfaces and their settings. So if your Ethernet device were eth0, you would run:

$ sudo ethtool eth0
Settings for eth0:
     Supported ports: [ TP ]
     Supported link modes:   10baseT/Half 10baseT/Full 
                               100baseT/Half 100baseT/Full 
                               1000baseT/Half 1000baseT/Full 
     Supports auto-negotiation: Yes
     Advertised link modes:  10baseT/Half 10baseT/Full 
                               100baseT/Half 100baseT/Full 
                               1000baseT/Half 1000baseT/Full 
     Advertised auto-negotiation: Yes
     Speed: 100Mb/s
     Duplex: Full
     Port: Twisted Pair
     PHYAD: 0
     Transceiver: internal
     Auto-negotiation: on
     Supports Wake-on: pg
     Wake-on: d
     Current message level: 0x000000ff (255)
     Link detected: yes

Here, on the final line, you can see that Link detected is set to yes, so dev1 is physically connected to the network. If it were set to no, you would need to physically inspect dev1’s network connection and make sure it was connected. Since it is physically connected, you can move on.

Is the interface up?

Once you have established that you are physically connected to the network, the next step is to confirm that the network interface is configured correctly on your host. The best way to check this is to run the ifconfig command with your interface as an argument. So to test eth0’s settings, you would run:

$ sudo ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:17:42:1f:18:be  
          inet addr:10.1.1.7  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::217:42ff:fe1f:18be/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:229 (229.0 B)  TX bytes:2178 (2.1 KB)
          Interrupt:10 

Probably the most important line of this output is the second one, which tells us our host has an IP address (10.1.1.7) and subnet mask (255.255.255.0) configured. Now, whether these are the correct settings for this host is something you will need to confirm. If the interface is not configured, try running sudo ifup eth0 and then run ifconfig again to see if the interface comes up. If the settings are wrong or the interface won’t come up, inspect /etc/network/interfaces on Debian-based systems or /etc/sysconfig/network-scripts/ifcfg-<interface> on Red Hat-based systems. It is in these files that you can correct any errors in the network settings. Now, if the host gets its IP through DHCP, you will need to move your troubleshooting to the DHCP server to find out why you aren’t getting a lease.
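
For reference, a static configuration for eth0 in /etc/network/interfaces might look something like the following minimal sketch (the addresses are just this example’s settings, so substitute your own):

auto eth0
iface eth0 inet static
    # Example addresses matching the network used in this article
    address 10.1.1.7
    netmask 255.255.255.0
    gateway 10.1.1.1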

Is it on the local network?

Once you see that the interface is up, the next step is to see if a default gateway has been set and whether you can access it. The route command will display your current routing table, including your default gateway:

$ sudo route -n
Kernel IP routing table
Destination     Gateway      Genmask          Flags Metric Ref     Use Iface
10.1.1.0        *             255.255.255.0    U     0      0        0 eth0
default         10.1.1.1     0.0.0.0           UG    100    0        0 eth0

The line you are interested in is the last one, which starts with default. Here you can see that the host has a gateway of 10.1.1.1. Note that the -n option was used with route so it wouldn’t try to resolve any of these IP addresses into hostnames. For one thing, the command runs more quickly, but more important, you don’t want to cloud your troubleshooting with any potential DNS errors. If you don’t see a default gateway configured here, and the host you want to reach is on a different subnet (say, web1, which is on 10.1.2.5), that is the likely cause of your problem. To fix this, set the gateway in /etc/network/interfaces on Debian-based systems or in /etc/sysconfig/network-scripts/ifcfg-<interface> on Red Hat-based systems, or, if you get your IP via DHCP, make sure the gateway is set correctly on the DHCP server. Then reset your interface with the following on Debian-based systems:

$ sudo service networking restart

The following would be used on Red Hat-based systems:

$ sudo service network restart
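
For reference, the equivalent static settings on a Red Hat-based system live in a file such as /etc/sysconfig/network-scripts/ifcfg-eth0 and might look roughly like this sketch (again, the addresses are only this example’s):

DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
# Example addresses matching the network used in this article
IPADDR=10.1.1.7
NETMASK=255.255.255.0
GATEWAY=10.1.1.1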

On a side note, it’s amazing that these distributions have to differ even on something this fundamental.

Once you have identified the gateway, use the ping command to confirm that you can communicate with the gateway:

$ ping -c 5 10.1.1.1
PING 10.1.1.1 (10.1.1.1) 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=1 ttl=64 time=3.13 ms
64 bytes from 10.1.1.1: icmp_seq=2 ttl=64 time=1.43 ms
64 bytes from 10.1.1.1: icmp_seq=3 ttl=64 time=1.79 ms
64 bytes from 10.1.1.1: icmp_seq=5 ttl=64 time=1.50 ms
--- 10.1.1.1 ping statistics ---
5 packets transmitted, 4 received, 20% packet loss, time 4020ms
rtt min/avg/max/mdev = 1.436/1.966/3.132/0.686 ms

As you can see, we were able to successfully ping the gateway, which means that we can at least communicate with the 10.1.1.0 network. If you couldn’t ping the gateway, it could mean a few things. Your gateway could be blocking ICMP packets. If so, tell your network administrator that blocking ICMP is an annoying practice with negligible security benefits and then try to ping another Linux host on the same subnet. If ICMP isn’t being blocked, then it’s possible that the switch port on your host is set to the wrong VLAN, so you will need to further inspect the switch to which it is connected.
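
For example, to test the ICMP-blocking case, you could ping another Linux host on the same subnet (10.1.1.9 here is just a made-up neighbor; use a host you know exists):

$ ping -c 5 10.1.1.9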

Is DNS working?

Once you have confirmed that you can speak to the gateway, the next thing to test is whether DNS functions. Both the nslookup and dig tools can be used to troubleshoot DNS issues, but since you need to perform only basic testing at this point, just use nslookup to see if you can resolve web1 into an IP:

$ nslookup web1
Server: 10.1.1.3
Address: 10.1.1.3#53
Name: web1.example.net
Address: 10.1.2.5

In this example DNS is working. The web1 host expands into web1.example.net and resolves to the address 10.1.2.5. Of course, make sure that this IP matches the IP that web1 is supposed to have! In this case, DNS works, so we can move on to the next section; however, there are also a number of ways DNS could fail.
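
Since dig was mentioned above, here is roughly what the same lookup might look like with it (output trimmed to the answer section; note that dig does not use the search path from /etc/resolv.conf unless you pass +search, so query the fully qualified name):

$ dig web1.example.net

;; ANSWER SECTION:
web1.example.net.    300    IN    A    10.1.2.5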

No name server configured or inaccessible name server

If you see the following error, it could mean either that you have no name servers configured for your host or they are inaccessible:

$ nslookup web1
;; connection timed out; no servers could be reached

In either case you will need to inspect /etc/resolv.conf and see if any name servers are configured there. If you don’t see any IP addresses configured there, you will need to add a name server to the file. Otherwise, if you see something like the following, you need to start troubleshooting your connection with your name server, starting off with ping:

search example.net
nameserver 10.1.1.3

If you can’t ping the name server and its IP address is in the same subnet (in this case, 10.1.1.3 is within the subnet), the name server itself could be completely down. If you can’t ping the name server and its IP address is in a different subnet, then skip ahead to the Can I route to the remote host? section, but only apply those troubleshooting steps to the name server’s IP. If you can ping the name server but it isn’t responding, skip ahead to the Is the remote port open? section.
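
To check whether the name server is answering DNS queries at all (as opposed to just answering pings), you can point your query directly at its IP; either of the following sketches would do, and a timeout here suggests port 53 is blocked or the DNS service itself is down:

$ nslookup web1.example.net 10.1.1.3
$ dig @10.1.1.3 web1.example.net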

Missing search path or name server problem

It is also possible that you will get the following error for your nslookup command:

$ nslookup web1
Server: 10.1.1.3
Address: 10.1.1.3#53
** server can’t find web1: NXDOMAIN

Here you see that the server did respond, since it gave a response: server can’t find web1. This could mean two different things. One, it could mean that web1’s domain name is not in your DNS search path. This is set in /etc/resolv.conf in the line that begins with search. A good way to test this is to perform the same nslookup command, only use the fully qualified domain name (in this case, web1.example.net). If it does resolve, then either always use the fully qualified domain name, or if you want to be able to use just the hostname, add the domain name to the search path in /etc/resolv.conf.
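
For instance, if the search path is the problem, the fully qualified lookup should succeed even though the short name fails; the output would mirror the successful example earlier:

$ nslookup web1.example.net
Server: 10.1.1.3
Address: 10.1.1.3#53
Name: web1.example.net
Address: 10.1.2.5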

If even the fully qualified domain name doesn’t resolve, then the problem is on the name server. Here are some basic pointers for troubleshooting DNS issues. If the name server is supposed to have that record, then that zone’s configuration needs to be examined. If it is a recursive name server, then you will have to test whether recursion is working on the name server by looking up some other domain. If you can look up other domains, then you must check whether the problem is on the remote name server that does contain the zones.
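
To test recursion, look up some well-known outside domain against that name server (the domain here is arbitrary); if this succeeds but your own records don’t resolve, the problem lies with the zone or with the remote name server hosting it:

$ nslookup www.kernel.org 10.1.1.3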

Can I route to the remote host?

After you have ruled out DNS issues and see that web1 is resolved into its IP 10.1.2.5, you must test whether you can route to the remote host. Assuming ICMP is enabled on your network, one quick test might be to ping web1. If you can ping the host, you know your packets are being routed there and you can move to the next section, Is the remote port open? If you can’t ping web1, try to identify another host on that network and see if you can ping it. If you can, then it’s possible web1 is down or blocking your requests, so move to the next section. If you can’t ping any hosts on the remote network, packets aren’t being routed correctly. One of the best tools to test routing issues is traceroute. Once you provide traceroute with a host, it will test each hop between you and the host. For example, a successful traceroute between dev1 and web1 would look like this:

$ traceroute 10.1.2.5
traceroute to 10.1.2.5 (10.1.2.5), 30 hops max, 40 byte packets
1 10.1.1.1 (10.1.1.1) 5.432 ms 5.206 ms 5.472 ms
2 web1 (10.1.2.5) 8.039 ms 8.348 ms 8.643 ms

Here you can see that packets go from dev1 to its gateway (10.1.1.1), and then the next hop is web1. This means it’s likely that 10.1.1.1 is the gateway for both subnets. On your network you might see a slightly different output if there are more routers between you and your host. If you can’t ping web1, your output would look more like the following:

$ traceroute 10.1.2.5
traceroute to 10.1.2.5 (10.1.2.5), 30 hops max, 40 byte packets
1 10.1.1.1 (10.1.1.1) 5.432 ms 5.206 ms 5.472 ms
2 * * *
3 * * *

Once you start seeing asterisks in your output, you know that the problem is on your gateway. You will need to go to that router and investigate why it can’t route packets between the two networks. Instead of asterisks, you might see something more like this:

$ traceroute 10.1.2.5
traceroute to 10.1.2.5 (10.1.2.5), 30 hops max, 40 byte packets
1 10.1.1.1 (10.1.1.1) 5.432 ms 5.206 ms 5.472 ms
2 10.1.1.1 (10.1.1.1) 3006.477 ms !H 3006.779 ms !H 3007.072 ms

In this case, you know that the probe made it to the gateway but the gateway reported the host as unreachable (the !H flag), so the host is likely down or inaccessible even from the same subnet. At this point, if you haven’t tried to access web1 from a machine on the same subnet as web1, try pings and other tests now.
Note: If you have one of those annoying networks that block ICMP, don’t worry; you can still troubleshoot routing issues. You just need to install the tcptraceroute package (sudo apt-get install tcptraceroute), then run the same commands as for traceroute, only substitute tcptraceroute for traceroute.
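
For instance, to trace the path to web1’s web port using TCP instead of ICMP, you might run something like the following (the output format is similar to traceroute’s):

$ sudo tcptraceroute 10.1.2.5 80
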
Is the remote port open?

So you can route to the machine but you still can’t access the web server on port 80. The next test is to see whether the port is even open. There are a number of different ways to do this. For one, you could try telnet:

$ telnet 10.1.2.5 80
Trying 10.1.2.5...
telnet: Unable to connect to remote host: Connection refused

If you see Connection refused, then either the port is down (likely Apache isn’t running on the remote host or isn’t listening on that port) or the firewall is blocking your access. If telnet can connect, then, well, you don’t have a networking problem at all. If the web service isn’t working the way you suspected, you need to investigate your Apache configuration on web1. (Troubleshooting web server issues is covered elsewhere in the book).

Instead of telnet, I prefer to use nmap to test ports because it can often detect firewalls. If nmap isn’t installed, use your package manager to install the nmap package. To test web1, type the following:

$ nmap -p 80 10.1.2.5
Starting Nmap 4.62 ( http://nmap.org ) at 2009-02-05 18:49 PST
Interesting ports on web1 (10.1.2.5):
PORT STATE SERVICE
80/tcp filtered http

Aha! nmap is smart enough that it can often tell the difference between a port that is truly closed and one that is filtered by a firewall. Normally when a port is actually down, nmap reports it as closed. Here it reported it as filtered. What this tells us is that some firewall is in the way and is dropping the packets to the floor. This means you need to investigate any firewall rules on the gateway (10.1.1.1) and on web1 itself to see if port 80 is being blocked.

Test the remote host locally

At this point, we have either been able to narrow the problem down to a network issue or we believe the problem is on the host itself. If we think the problem is on the host itself, we can do a few things to test whether port 80 is available.

Test for listening ports

One of the first things you should do on web1 is test whether port 80 is listening. The netstat -lnp command will list all ports that are listening along with the process that has the port open. You could just run that and parse through the output for anything that is listening on port 80, or you could use grep to show only things listening on port 80:

$ sudo netstat -lnp | grep :80
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 919/apache

The first column tells you what protocol the port is using. The second and third columns are the receive and send queues (both are set to 0 here). The column you want to pay attention to is the fourth column, as it lists the local address on which the host is listening. Here the 0.0.0.0:80 tells us that the host is listening on all of its IPs for port 80 traffic. If Apache were listening only on web1’s Ethernet address, you would see 10.1.2.5:80 here.

The final column will tell you which process has the port open. Here you can see that Apache is running and listening. If you do not see this in your netstat output, you need to start your Apache server.

Firewall rules

If the process is running and listening on port 80, it’s possible that web1 has some sort of firewall in place. Use the iptables command to list all of your firewall rules. If your firewall is disabled, your output will look like this:

$  sudo /sbin/iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Notice that the default policy is set to ACCEPT. It’s possible, though, that your firewall is set to drop all packets by default, even if it doesn’t list any rules. If that is the case, you will see output more like the following:

$  sudo /sbin/iptables -L
Chain INPUT (policy DROP)
target     prot opt source               destination
Chain FORWARD (policy DROP)
target     prot opt source               destination
Chain OUTPUT (policy DROP)
target     prot opt source               destination

On the other hand, if you had a firewall rule that blocked port 80, it might look like this:

$ sudo /sbin/iptables -L -n
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
REJECT     tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:80 reject-with icmp-port-unreachable
Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Clearly, in the latter case you would need to modify the firewall rules to allow port 80 traffic from the host.
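
For instance, a rule along these lines (a sketch only; adapt it to your existing rule set and make it persistent with your distribution’s usual tools) would insert an ACCEPT for port 80 ahead of the REJECT shown above:

$ sudo /sbin/iptables -I INPUT -p tcp --dport 80 -j ACCEPT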

Troubleshoot slow networks

In a way, it’s easier to troubleshoot network problems when something doesn’t work at all. When a host is unreachable, you can perform the troubleshooting steps discussed earlier until the host is reachable again. When the network is just slow, however, sometimes it can be a bit tricky to track down why. This section discusses a few techniques you can use to track down the cause of slow networks.

Although DNS is blamed more often than it should be for network problems, when DNS does have an issue, it can often result in poor network performance. For instance, if you have two DNS servers configured for a domain and the first one you try goes down, your DNS requests will wait 30 seconds before they time out and go to the secondary DNS server. Although this will definitely be noticeable when you run tools like dig or nslookup, DNS issues can cause apparent network slowdowns in some unexpected ways; this is because so many services rely on DNS to resolve hostnames to IP addresses. Such issues can even affect your network troubleshooting tools.
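
If you can’t fix a dead name server right away, you can at least shorten how long the resolver waits before falling back to the next server by adding an options line to /etc/resolv.conf. A minimal sketch, assuming a second name server at 10.1.1.4 and example timeout values:

search example.net
nameserver 10.1.1.3
nameserver 10.1.1.4
# Wait 2 seconds per query and retry twice before giving up
options timeout:2 attempts:2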

Ping, traceroute, route, netstat, and even iptables are great examples of network troubleshooting tools that can degrade during DNS issues. By default, all of these tools will attempt to resolve IP addresses into hostnames if they can. If there are DNS problems, however, the results from each of these commands might stall while they attempt to look up IP addresses and fail. In the case of ping or traceroute, it might seem like your ping replies are taking a long time, yet when they do finally come through, the round-trip time is relatively low. In the case of route, netstat, and iptables, the results might stall for quite some time before you get output. The system is waiting for DNS requests to time out.

In all of the cases mentioned, it’s easy to bypass DNS so your troubleshooting results are accurate. All of the commands we discussed earlier accept an -n option, which disables any attempt to resolve IP addresses into hostnames. I’ve just become accustomed to adding -n to all of the commands I introduced you to in the first part of this chapter unless I really do want IP addresses resolved.
Note: DNS resolution can also affect your web server’s performance in an unexpected way. Some web servers are configured to resolve every IP address that accesses them into a hostname for logging. Although that can make the logs more readable, it can also dramatically slow down your web server at the worst times — when you have a lot of visitors. Instead of serving traffic, your web server can get busy trying to resolve all of those IPs.
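
If your web server is Apache, for example, this behavior is controlled by the HostnameLookups directive; leaving it off (the default in modern releases) avoids the per-request lookups, and log analysis tools can resolve the IPs offline later. The file name below is just the typical location and may differ on your system:

# In your Apache configuration (e.g., httpd.conf or apache2.conf)
HostnameLookups Off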

Find the network slowdown with traceroute

When your network connection seems slow between your server and a host on a different network, it can sometimes be difficult to track down where the real slowdown is, especially when the problem is latency (the time it takes to get a response) rather than overall bandwidth. This is the situation traceroute was made for. traceroute was mentioned earlier as a way to test overall connectivity between you and a server on a remote network, but it is also useful when you need to diagnose where a network slowdown might be. Since traceroute outputs the reply times for every hop between you and another machine, you can track down servers that might be on a different continent or gateways that might be overloaded and causing network slowdowns. For instance, here’s part of a traceroute between a server in the United States and a Chinese Yahoo server:

$ traceroute yahoo.cn
traceroute to yahoo.cn (202.165.102.205), 30 hops max, 60 byte packets
1 64-142-56-169.static.sonic.net (64.142.56.169) 1.666 ms 2.351 ms 3.038 ms
2 2.ge-1-1-0.gw.sr.sonic.net (209.204.191.36) 1.241 ms 1.243 ms 1.229 ms
3 265.ge-7-1-0.gw.pao1.sonic.net (64.142.0.198) 3.388 ms 3.612 ms 3.592 ms
4 xe-1-0-6.ar1.pao1.us.nlayer.net (69.22.130.85) 6.464 ms 6.607 ms 6.642 ms
5 ae0-80g.cr1.pao1.us.nlayer.net (69.22.153.18) 3.320 ms 3.404 ms 3.496 ms
6 ae1-50g.cr1.sjc1.us.nlayer.net (69.22.143.165) 4.335 ms 3.955 ms 3.957 ms
7 ae1-40g.ar2.sjc1.us.nlayer.net (69.22.143.118) 8.748 ms 5.500 ms 7.657 ms
8 as4837.xe-4-0-2.ar2.sjc1.us.nlayer.net (69.22.153.146) 3.864 ms 3.863 ms 3.865 ms
9 219.158.30.177 (219.158.30.177) 275.648 ms 275.702 ms 275.687 ms
10 219.158.97.117 (219.158.97.117) 284.506 ms 284.552 ms 262.416 ms
11 219.158.97.93 (219.158.97.93) 263.538 ms 270.178 ms 270.121 ms
12 219.158.4.65 (219.158.4.65) 303.441 ms * 303.465 ms
13 202.96.12.190 (202.96.12.190) 306.968 ms 306.971 ms 307.052 ms
14 61.148.143.10 (61.148.143.10) 295.916 ms 295.780 ms 295.860 ms


Without knowing much about the network, you can assume just by looking at the round-trip times that once you get to hop 9 (at the 219.158.30.177 IP), you have left the continent, as the round-trip time jumps from 3 milliseconds to 275 milliseconds.

Find what is using your bandwidth with iftop

Sometimes your network is slow not because of some problem on a remote server or router, but just because something on the system is using up all the available bandwidth. It can be tricky to identify what process is using up all the bandwidth, but there are some tools you can use to help identify the culprit.

top is such a great troubleshooting tool that it has inspired a number of similar tools, such as iotop, which identifies the processes consuming the most disk I/O. It turns out there is a tool called iftop that does something similar with network connections. Unlike top, iftop doesn’t concern itself with processes but instead lists the connections between your server and a remote IP that are consuming the most bandwidth. For instance, with iftop you can quickly see if your backup job is using up all your bandwidth by seeing the backup server’s IP address at the top of the output.

iftop is available in a package of the same name on both Red Hat- and Debian-based distributions, but in the case of Red Hat-based distributions, you might have to find it in a third-party repository. Once you have it installed, just run the iftop command on the command line (it will require root permissions). Like with the top command, you can hit q to quit.

At the very top of the iftop screen is a bar that shows the overall traffic for the interface. Just below that is a column with source IPs, followed by a column with destination IPs and arrows between them so you can see whether the bandwidth is being used to transmit packets from your host or receive them from the remote host. After those columns are three more columns that represent the data rate between the two hosts over 2, 10, and 40 seconds, respectively. Much like with load averages, you can see whether the bandwidth is spiking now or has spiked some time in the past. At the very bottom of the screen, you can see statistics for transmitted data (TX) and received data (RX) along with totals. Like with top, the interface updates periodically.

The iftop command run with no arguments at all is often all you need for your troubleshooting, but every now and then, you may want to take advantage of some of its options. The iftop command will show statistics for the first interface it can find by default, but on some servers you may have multiple interfaces, so if you wanted to run iftop against your second Ethernet interface (eth1), type iftop -i eth1.

By default iftop attempts to resolve all IP addresses into hostnames. One downside to this is that it can slow down your reporting if a remote DNS server is slow. Another downside is that all that DNS resolution adds extra network traffic that might show up in iftop! To disable DNS resolution, just run iftop with the -n option.
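
These options can be combined, of course; for example, to watch the second Ethernet interface without any DNS lookups:

$ sudo iftop -n -i eth1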

Normally iftop displays overall bandwidth used between hosts, but to help you narrow things down, you might want to see what ports each host is using to communicate. After all, if you knew a host was consuming most of your bandwidth over your web port, you would perform different troubleshooting than if it was connecting to an FTP port. Once iftop is launched, press p to toggle between displaying all ports and hiding them. One thing you’ll notice, though, is that sometimes displaying all the ports can cause hosts you are interested in to fall off the screen. If that happens, you can also hit either S or D to toggle between displaying ports only from the source or only from the destination host, respectively. Showing only source ports can be useful when you run iftop on a server, since for many services the destination host uses random high ports that don’t necessarily identify what service is being used, whereas the ports on your server are more likely to correspond to a service on your machine. You can then follow up with the netstat -lnp command referenced earlier to find out what service is listening on that port.

Like with most Linux commands, iftop has an advanced range of options. What we covered should be enough to help with most troubleshooting efforts, but in case you want to dig further into iftop’s capabilities, just type man iftop to read the manual included with the package.
