Troubleshooting ESXi Hosts That Are Not Responding or in a Disconnected State
An ESXi host going into a Not Responding or Disconnected state is nothing new for vSphere admins. Experienced users also know that there is no single solution or approach to the problem, because the issue can arise at multiple levels: the hardware, the networking and/or storage stack, and ultimately the vmkernel stack provided by VMware vSphere.
It is always recommended to have a thorough understanding of the problem and to look into the logs before tackling the issue.
The VMware Knowledge Base has many articles on this issue, so I thought I would put together a document that guides you through a fishbone approach to troubleshooting it. Most of this is taken from the KB articles published by VMware.
These issues can have multiple causes, starting from the hardware, network or storage and extending to vSphere’s kernel modules. The steps below are a primer to help you understand what could have gone wrong.
Start at the hardware level, then move to the network and storage levels, and then to the vSphere level.
Ask yourself the following questions:
- How many times has the ESX host experienced this condition?
- What were the exact times and dates that the host became unresponsive?
- Have any other hosts experienced this issue?
- What else was happening in your environment at the time of the events?
- Is there a pattern to the times when the host becomes unresponsive?
- Are there any regularly scheduled jobs running when the host becomes unresponsive?
- Do I have logs that cover the time frame in which the issue occurred?
The answers to the above questions, together with the logs for the host, should be provided to the VMware Technical Support team when logging a Service Request. This helps avoid going back and forth for additional information.
When submitting logs, make sure that they cover the time frame in which the issue occurred so that the Technical Support team can find the exact problem.
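As a rough illustration of that last check, the sketch below tests whether a set of log lines spans a given incident window. It assumes each line starts with a UTC timestamp in the style of the hostd.log line quoted later in this article (for example 2015-12-22T09:25:50.314Z). The function names `parse_log_timestamp` and `log_covers_window` are hypothetical helpers, not part of any VMware tool.

```python
from datetime import datetime, timezone


def parse_log_timestamp(line):
    """Return the leading UTC timestamp of a log line, or None if absent."""
    token = line.split(" ", 1)[0]
    try:
        parsed = datetime.strptime(token, "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return None
    return parsed.replace(tzinfo=timezone.utc)


def log_covers_window(lines, start, end):
    """True if the earliest timestamp is <= start and the latest is >= end."""
    stamps = [ts for ts in map(parse_log_timestamp, lines) if ts is not None]
    if not stamps:
        return False
    return min(stamps) <= start and max(stamps) >= end


if __name__ == "__main__":
    # Illustrative incident window; replace with the times you recorded.
    incident_start = datetime(2015, 12, 22, 10, 0, tzinfo=timezone.utc)
    incident_end = datetime(2015, 12, 22, 10, 30, tzinfo=timezone.utc)
    with open("hostd.log") as handle:  # assumed local copy of the log
        print(log_covers_window(handle, incident_start, incident_end))
```

Lines without a recognizable timestamp are simply skipped, so continuation lines in the log do not cause a failure.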
To collect diagnostic information for ESX/ESXi hosts and vCenter Server using the vSphere Web Client, please see http://kb.vmware.com/kb/2032892. Alternatively, when using the vSphere Client, please see http://kb.vmware.com/kb/653
To upload diagnostic information to VMware Technical Support Team, please see http://kb.vmware.com/kb/1008525
NOTE: When the ESXi Server is in a Not Responding state, you will not be able to collect logs using the vSphere Client or the vSphere Web Client. Ideally, you should troubleshoot to get the host back into a manageable state. However, if you still need logs, you can collect them by running the command /usr/lib/vmware/vm-support from an SSH session or the DCUI.
Understanding the difference between a Not Responding Host and a Disconnected Host
Not Responding: A host can become greyed out and shown as Not Responding due to an external factor that vCenter Server is unaware of. If a host is showing as Not Responding, vCenter Server no longer receives heartbeats from it. This can happen for several reasons, all of which prevent heartbeats being received from the host to vCenter. Some common reasons include:
- A network connectivity issue between the host and vCenter Server, for example UDP port 902 not open, a routing issue, bad cable, firewall rule, etc.
- hostd is not running successfully on the host.
- vpxa is not running successfully on the host.
- The host has crashed.
A host can go from Not Responding back to a normal state if the underlying issue which brought the host to the Not Responding state is resolved. However, a host that is in the Disconnected state ceases to be monitored by vCenter Server and will stay in that state regardless of the status of the underlying issue. Once the issue is resolved, the user must right-click on the host and select Connect to bring the host back to a normal state in vCenter Server.
Disconnected: Disconnected is a state initiated from the vCenter Server side and suspends vCenter Server’s management of the host, and thus all vCenter Server services ignore the host.
A disconnected host is one that has been explicitly disconnected by the user, or the license on the host has expired. Disconnected hosts also require the user to manually reconnect the host. Ultimately, a host that is disconnected can become that way for three reasons (2 of which require manual intervention):
- A user right-clicks the host and selects Disconnect.
- A user right-clicks a host that is listed as Not Responding and clicks Connect and that task fails.
- The host license expires.
When a host becomes disconnected, it still exists in the vCenter Server inventory, but vCenter Server does not get any updates from that host, does not monitor it, and therefore has no knowledge of the health of that host.
vCenter Server takes a conservative approach when considering disconnected hosts. Virtual machines on a host that is not responding affect the admission control check for vSphere HA. vCenter Server does not include those virtual machines when computing the current failover level for HA, but assumes that any virtual machines running on a disconnected host will be failed over if the host fails. Because the status of the host is not known, and because vCenter Server is not communicating with that host, HA cannot use it as a guaranteed failover target. As part of disconnecting a host, vCenter Server disables HA on that host. The virtual machines on that host are therefore not failed over in the event of a host isolation. When the host becomes reconnected, the host becomes available for failover again.
Now that we know the difference, we recommend that you start troubleshooting at the Hardware layer and progress through the Network layer, the vSphere layer and the Storage layer.
Let’s discuss them one by one.
- Verify the current state of the ESXi host hardware and power. Physically go to the VMware ESXi host hardware, and make note of any lights on the face of the server hardware that may indicate the power or hardware status. For more information regarding the hardware lights, consult the hardware vendor documentation or support.
- Depending on the configuration of your physical environment, you may consider connecting to the physical host by using a remote hardware interface provided by your hardware vendor.
- If the hardware lights indicate that there is a hardware issue, consult the hardware vendor documentation or support to identify any existing hardware issues.
- Determine the state of the user interface of the ESXi host at the physical console. Check the responsiveness of the local physical console before taking any action:
- Press the NumLock key on your keyboard and observe if the NumLock light state changes. A successful light state change indicates that the BIOS is responsive.
- Check if there is any active disk or network traffic using status lights or other hardware monitoring on the disk drive array, network interface cards or upstream switches. Active egress traffic indicates that the ESXi host is still functioning.
- Attempt to interact with the server via a baseboard management controller (BMC) interface, such as iLO, DRAC or RSA. If aspects of this interface other than the console are also unresponsive, this indicates that the issue is hardware related.
- Reboot the ESXi host. Collect the Diagnostic Information using http://kb.vmware.com/kb/1008524
- Verify that the hardware’s BIOS version is the one listed in the VMware Compatibility Guide. Go to the VMware Compatibility Guide at http://www.vmware.com/resources/compatibility/search.php, select Systems/Servers under “What are you looking for” and choose the installed ESXi version. Choose the Partner Name, type the server model in the Keyword field and click Update and View Results.
- If things are fine on the hardware side, let’s proceed to the Network layer.
- Verify ping connectivity using both the IP address and the DNS name, from the ESXi host to the vCenter Server and from the vCenter Server to the ESXi host.
- If the ping succeeds using the IP address but not the hostname, there is a problem with DNS. Fix this issue first.
- If your investigation confirms that the host in question is using the old address while the DNS server is resolving the name to its new address, clear the DNS cache on the ESXi host by running “/etc/init.d/nscd restart” from an SSH session.
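The name-versus-IP comparison above can also be scripted from a management workstation. The following is a minimal sketch that assumes you know the address the name should resolve to. `check_name_resolution` is a hypothetical helper, and the hostname and addresses in the usage section are placeholders from the documentation ranges.

```python
import socket


def check_name_resolution(hostname, expected_ip, resolve=socket.gethostbyname):
    """Compare what a hostname resolves to against the expected IP address.

    Returns a (ok, message) tuple. The resolver is injectable so the logic
    can be exercised without a live DNS server.
    """
    try:
        resolved = resolve(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, "cannot resolve %s: %s" % (hostname, exc)
    if resolved != expected_ip:
        return False, "%s resolves to %s, expected %s" % (
            hostname, resolved, expected_ip)
    return True, "%s resolves to %s" % (hostname, resolved)


if __name__ == "__main__":
    # Placeholder values; substitute your ESXi host or vCenter Server.
    print(check_name_resolution("esxi01.example.com", "192.0.2.10"))
```

A mismatch here points to a stale DNS record or cache, which matches the case where ping works by IP address but not by hostname.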
- Verify the state of the Network Cards on the ESXi Server
- Connect to the ESXi host using PuTTY or, if PuTTY is not working, use the remote console, and run the command “esxcli network nic list”. Make sure the Link Status shows “Up”.
- Verify the ESXi host networking using the command “esxcfg-vswitch -l”.
- If you can ping the ESXi host from the vCenter Server, can you open a direct vSphere Client connection to the ESXi host? If this works, remove the disconnected ESXi host from the vCenter Server’s inventory and re-add it. This should work.
- If removing and re-adding the host still doesn’t work, an incorrect managed IP address has probably been set.
- Log into the vSphere Web Client. The default URL is https://vCenter_Server_FQDN:9443/vsphere-client
- Navigate to vCenter > Inventory Lists > vCenter Server.
- Click the vCenter Server needing to be verified.
- Select Manage > Settings > General
- Expand the Runtime settings field by clicking on the arrow next to it. Make a note of the vCenter Server managed address.
- To modify the Runtime Settings:
- Click Edit in the right-hand corner of the panel
- Click Runtime settings on the left-hand side of the window
- Modify the vCenter Server managed address and vCenter Server name accordingly.
- Click OK
- Make sure that the vCenter Server managed address noted in the Runtime settings is the same as the IP reported by the command “grep -i serverIp /etc/vmware/vpxa/vpxa.cfg”.
- If the IP address is correct, the vCenter Server and the ESXi host may not be receiving each other’s heartbeat packets, which are exchanged through port 902 by default. These packets are required for the host to stay connected.
- Check the port through which the heartbeat traffic flows by running “grep -i serverport /etc/vmware/vpxa/vpxa.cfg” on the ESXi host. This returns the serverport number.
- On the Windows based vCenter Server use the telnet command, for example, “telnet <esxi_ip> 902”
- On the ESXi Host use the netcat (nc) command as “nc -z <vcenter_ip> 902”
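The grep and telnet/nc checks above can be approximated in a few lines of Python. This is only a sketch. It assumes vpxa.cfg stores the values as `<serverIp>` and `<serverPort>` XML elements (inferred from the grep commands above), and it tests TCP reachability only, as telnet and nc -z do, so it says nothing about UDP heartbeats. `read_vpxa_setting` and `tcp_port_open` are hypothetical helper names.

```python
import re
import socket


def read_vpxa_setting(cfg_text, tag):
    """Return the text of the first <tag>...</tag> element, or None.

    The match is case-insensitive, mirroring the grep -i used in the text.
    """
    pattern = r"<%s>\s*([^<]*?)\s*</%s>" % (re.escape(tag), re.escape(tag))
    match = re.search(pattern, cfg_text, re.IGNORECASE)
    return match.group(1) if match else None


def tcp_port_open(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed to run where a copy of vpxa.cfg is readable.
    with open("/etc/vmware/vpxa/vpxa.cfg") as handle:
        cfg = handle.read()
    server_ip = read_vpxa_setting(cfg, "serverIp")
    server_port = int(read_vpxa_setting(cfg, "serverPort") or 902)
    print(server_ip, server_port, tcp_port_open(server_ip, server_port))
```

Reading the port from the configuration first avoids testing 902 when the environment uses a non-default heartbeat port.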
- Is the ESXi Host Licensed correctly?
- If, after verifying that things are good from a hardware and a network perspective, you are still unable to establish a direct vSphere Client connection to the ESXi host, the management agent of the ESXi host may not be functioning correctly.
- Running services.sh restart restarts all the management-related services. Allow at least 4-5 minutes for all the services to restart after issuing the command. To verify that the management agent started successfully on ESXi 6.0.x hosts, grep for “BEGIN SERVICES” in the /var/log/hostd.log file. You should see a line like the following: “2015-12-22T09:25:50.314Z info hostd[FF8BAA70] [Originator@6876 sub=Default] BEGIN SERVICES”
- IMPORTANT: IF LACP IS CONFIGURED, DO NOT RESTART MANAGEMENT SERVICES USING services.sh restart. Instead, use the individual hostd and vpxa restart commands listed below.
- Alternatively, restart the agents individually with the following commands:
- /etc/init.d/hostd restart – This will restart the Management agent of the ESXi Host
- /etc/init.d/vpxa restart – This will restart the vCenter Agent on the ESXi Host
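To confirm that the agent came back after a restart, the “BEGIN SERVICES” check described above can be expressed as a small filter. This is a sketch under the assumption that the marker appears in hostd.log as in the sample line quoted in this article. `last_begin_services` is a hypothetical helper name.

```python
def last_begin_services(lines):
    """Return the most recent log line containing 'BEGIN SERVICES', or None."""
    found = None
    for line in lines:
        if "BEGIN SERVICES" in line:
            found = line.rstrip("\n")
    return found


if __name__ == "__main__":
    with open("/var/log/hostd.log") as handle:
        marker = last_begin_services(handle)
    if marker is None:
        print("hostd has not logged BEGIN SERVICES in this file")
    else:
        # The leading token is the timestamp; compare it to the restart time.
        print("hostd services last started at", marker.split(" ", 1)[0])
```

Comparing the reported timestamp with the time you issued the restart tells you whether the line belongs to the current start or an earlier one.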
- Note: if storage is not visible, the host might be experiencing an All Paths Down (APD) or Permanent Device Loss (PDL) condition, and the restart commands above might hang. Try them only if you are sure that there are no storage-level issues. To make sure the storage is accessible, run the following commands:
- vim-cmd vmsvc/getallvms – This will list all the VMs registered on the host along with their datastore details.
- ls /vmfs/volumes/[Datastore_Name]/ – This will list the contents of that datastore, including the VM folders.
- cd /vmfs/volumes/[Datastore_Name] – To get into the datastore directory
- touch filename – This is to check if you are able to create a file named as filename
- rm filename – This is to delete the filename file.
- If the touch and rm commands above succeed, you are able to access the storage.
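The touch and rm test can be wrapped in a function if you want to probe several datastores in one pass. The sketch below is an assumption-laden illustration: `datastore_is_writable` is a hypothetical helper, the datastore name in the usage section is a placeholder, and the call has no timeout, so on a host in an APD condition it may hang just as the shell commands would.

```python
import os
import uuid


def datastore_is_writable(datastore_path):
    """Create and remove a uniquely named probe file under datastore_path.

    Returns True if both operations succeed and False on any OS error.
    """
    probe = os.path.join(datastore_path, "probe-%s.tmp" % uuid.uuid4().hex)
    try:
        with open(probe, "w"):
            pass  # equivalent of: touch filename
        os.remove(probe)  # equivalent of: rm filename
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder datastore name; substitute one listed under /vmfs/volumes.
    print(datastore_is_writable("/vmfs/volumes/datastore1"))
```

A unique probe name avoids clobbering an existing file that happens to be called filename.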
- Are you running the right network/storage driver and/or firmware version for the version of ESXi installed on the host?
- Did you know that you can run the command “/usr/lib/vmware/vm-support/bin/swfw.sh” to display the firmware and driver versions of the hardware connected to the ESXi host? This is the versioning information from the CIM providers. Look for the VersionString of each InstanceID.
- If the output of swfw.sh gives you too much information and leaves you confused (like me), run the following commands individually to find the versions for HBAs and NICs.
- Run the command esxcfg-scsidevs -a. This lists the HBA devices with their associated modules (the second column lists the HBA module names).
- Run the command vmkload_mod -s <HBA_drivername_forexample_qlnativefc> | grep -i version. This gives you the version.
- Run the command esxcli network nic list. This lists all the network cards.
- Run the command esxcli network nic get -n vmnic# | grep -i version, where # is the NIC number. This gives you the version.
- Run the command vmware -v. This gives you the version of ESXi installed on the host.
- Go to the VMware Compatibility Guide at http://www.vmware.com/resources/compatibility/search.php and select IO devices in “What are you looking for” and choose the ESXi Version installed. Choose the Partner Name and type in the Name of the IO Card in the Keyword and click Update and View Results.
- You should be on the exact versions listed in the VMware Compatibility Guide.
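If you collect the NIC details from many hosts, a small parser makes the comparison against the Compatibility Guide less error-prone. The sketch below assumes the output of esxcli network nic get contains "Key: value" lines named Driver, Firmware Version and Version. The sample text and the expected values are illustrative only, not real output or real Compatibility Guide entries, and `parse_nic_versions` is a hypothetical helper.

```python
def parse_nic_versions(output):
    """Extract the driver name, firmware version and driver version.

    Only exact key matches are kept, so lines such as 'Driver Info:' or
    'Bus Info: ...' are ignored.
    """
    wanted = ("Driver", "Firmware Version", "Version")
    info = {}
    for raw in output.splitlines():
        key, sep, value = raw.strip().partition(":")
        if sep and key in wanted:
            info[key] = value.strip()
    return info


if __name__ == "__main__":
    # Illustrative text in the assumed layout, not captured from a real host.
    sample = """   Driver Info:
         Bus Info: 0000:02:00.0
         Driver: igb
         Firmware Version: 1.61
         Version: 5.0.5.1
"""
    found = parse_nic_versions(sample)
    # Hypothetical values you would copy from the Compatibility Guide.
    expected = {"Driver": "igb", "Firmware Version": "1.61", "Version": "5.0.5.1"}
    print("match" if found == expected else "mismatch: %r" % found)
```

Comparing whole dictionaries keeps the check strict, in line with the advice to be on the exact listed versions.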
- Are you managing remote ESXi hosts using vCenter Server? Are your ESXi hosts disconnecting after approximately 30 to 60 seconds? Due to network or ISP limitations, it may be necessary to use NAT to connect to the vCenter Server. Please note that USING NAT BETWEEN VCENTER SERVER AND ESXi HOSTS IS AN UNSUPPORTED CONFIGURATION.
- As discussed in the Network layer, port 902 is critical for heartbeat functionality. Even if the port is open, the host can still disconnect if the firewall on Windows Server 2008 blocks Edge Traversal.
- To enable Edge Traversal on the vCenter Server, go to Start -> Run, type wf.msc and press Enter. Locate VMware vCenter Server – Host Heartbeat Rule, select the Advanced tab and choose Allow Edge Traversal.
- For ESXi versions older than 5.5, it is possible that the hardware monitoring service (sfcbd) fills the /var/run/sfcb directory with files. To resolve this, stop the sfcbd-watchdog service using the command /etc/init.d/sfcbd-watchdog stop, then cd /var/run/sfcb and remove the files in the directory using the rm command. Once they are deleted, start the service using the command /etc/init.d/sfcbd-watchdog start
Storage layer: Have you tried the storage accessibility checks listed in the vSphere layer (the vim-cmd, ls, touch and rm commands)? If you have and are still facing issues, select the appropriate link in http://kb.vmware.com/kb/1003659 based on the type of storage and the scope of the problem.
Hopefully the above helps you narrow down why the host went into a Not Responding or Disconnected state. Having this information ready when filing a Support Request will help reduce the total troubleshooting time and get you to a faster resolution.