The curious case of localhost! 503 Service Temporarily Unavailable in vRA

One of my customers is running a distributed setup of vRealize Automation.

The setup is as follows:
1 Identity Appliance
2 vRA Appliances (6.2.0), also hosting Postgres
2 IaaS Web servers hosting the Web tier (the Infrastructure tab)
2 IaaS App servers (the IaaS App tier along with the Manager Service)
2 DEMs (IaaS DEM Worker and Proxy Agents)
2 vRealize Orchestrators
vShield Manager, which is the load balancer providing VIPs for the vRA Appliances, IaaS Web, IaaS App and vRO.

Everything was fine and dandy until one fine day we saw that the VAMI page of the vRA Appliances did not show any services as Registered. We also saw that the certificates had expired. Since we were using SAN certificates, this meant the certificates had expired on multiple nodes and on other products as well.
When we looked at /var/log/messages, we saw:
2015-11-05T12:26:13.000440+05:30 vcac01 vami /opt/vmware/share/htdocs/service/cafe/config-page.py: info Authenticated user: root successfully
2015-11-05T12:26:13.000666+05:30 vcac01 vami /opt/vmware/share/htdocs/service/cafe/config-page.py: info locale=en-US, id=certificateReplace, action=submit, controller=<type 'instance'>
2015-11-05T12:26:13.000714+05:30 vcac01 vami /opt/vmware/share/htdocs/service/cafe/config-page.py: info Executing shell command...
2015-11-05T12:26:13.023630+05:30 vcac01 vami /opt/vmware/share/htdocs/service/cafe/config-page.py: info Returned vCAC host: vcac.corp
2015-11-05T12:26:13.024209+05:30 vcac01 vami /opt/vmware/share/htdocs/service/cafe/config-page.py: info Removing passphrase from key...
2015-11-05T12:26:13.031701+05:30 vcac01 vami /opt/vmware/share/htdocs/service/cafe/config-page.py: info Successfully removed passphrase from key.
2015-11-05T12:26:13.031751+05:30 vcac01 vami /opt/vmware/share/htdocs/service/cafe/config-page.py: info Importing certificate in vCAC KeyStore...
2015-11-05T12:26:15.116852+05:30 vcac01 vami /opt/vmware/share/htdocs/service/cafe/config-page.py: ERROR ---BEGIN---#012Command execution failed with unexpected error: null.#012---END---#012#012Use -e option to get more details.
2015-11-05T12:26:15.116925+05:30 vcac01 vami /opt/vmware/share/htdocs/service/cafe/config-page.py: ERROR Error processing request: Error importing certificate in vCAC.
2015-11-05T12:26:15.117001+05:30 vcac01 vami /opt/vmware/share/htdocs/service/cafe/config-page.py: ERROR {'cafe.host': {'action': None, 'value': 'vcac.corp'}}

We managed to get past the certificate replacement by following http://kb.vmware.com/kb/2106583. We expected this to resolve the issue and get our Private Cloud back up and running, but to our surprise the vRA services still did not come up as registered.
So we looked into the /var/log/vmware/vcac/catalina.out file and saw a bunch of "503 Service Temporarily Unavailable" messages like:
<timestamp> vcac: [component="cafe:component-registry" priority="ERROR" thread="registryServiceNotificationExecutor-768" tenant=""] com.vmware.vcac.core.componentregistry.service.impl.StatusServiceImpl.retrieveCurrentStatus:280 - Exception during remote status retrieval for url: https://vcac.corp/catalog-service/api/status. Error Message 503 Service Temporarily Unavailable.

This made us take another look at the /etc/hosts file on the vRA Appliances and at the vShield Edge load balancer.
When we ping the vRA VIP from a vRA Appliance, the name should resolve to the local machine's IP, meaning the traffic should not go through the load balancer. This held good in our case as well, so the /etc/hosts file and the load balancer were doing their job.
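
The hosts-file check can also be done without ping. Here is a minimal Python sketch, assuming the hosts file pins the VIP name to the appliance's own address. The sample entry reuses vcac.corp and 10.110.68.28 from the wget output further down in this post; the function name lookup_in_hosts and the sample text are made up for illustration.

```python
def lookup_in_hosts(hosts_text, name):
    """Return the first IP that a hosts-format text maps `name` to, or None."""
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        ip, *names = line.split()
        if name.lower() in (n.lower() for n in names):
            return ip
    return None


# Hypothetical hosts content; on the appliance you would read /etc/hosts instead.
sample = """
127.0.0.1     localhost
10.110.68.28  vcac.corp   # VIP name pinned to this node
"""
print(lookup_in_hosts(sample, "vcac.corp"))  # prints 10.110.68.28
```

If the function returns None for the VIP name, the appliance would fall back to DNS and the traffic would most likely go to the load balancer.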

We tried rebooting the vRA Appliances and then looked into the catalina.out log file to find out whether the vRA services were getting initialized and registered. For some reason the component-registry service was not starting.

Component-Registry is one of the key services in vRA. It manages all application services as well as the 3rd party solution provider services. Other services interact with it to request data related to the service or the endpoint.
Looking carefully into the log, we saw this error: No route to host

We wanted to check whether the Apache service also got this error, so we ran this command on the vRA Appliance:
# wget https://vcac.corp/component-registry/services --no-check-certificate
We saw the following output:
Resolving vcac.corp... 10.110.68.28
Connecting to vcac.corp|10.110.68.28|:443... connected.
HTTP request sent, awaiting response... 503 Service Temporarily Unavailable
ERROR 503: Service Temporarily Unavailable

Since the name resolves to the IP address of the local vRA Appliance itself, we wanted to see whether this error originated from Apache2 on the vRA Appliance or from the vcac server.
To check this, we compared the two access logs: /var/log/vmware/vcac/access_log and /var/log/apache2/access_log.

In /var/log/vmware/vcac/access_log we saw "GET /component-registry/api/status HTTP/1.1" 200 926. In other words, the vcac access_log records an HTTP 200, which is good. Apache2's access_log told a different story:
"GET /component-registry/endpoints/types/com.vmware.csp.core.plugin.service.api/default HTTP/1.1" 503 416
"POST /component-registry/services/ HTTP/1.1" 503 416
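
Comparing the two access logs by eye gets tedious on a busy appliance. The following is a rough Python sketch that pulls the path and status code out of access log lines. It assumes the request is logged in the usual quoted "METHOD path HTTP/x.y" form followed by the status, as in the excerpts from this post. The names REQUEST_RE and statuses are invented for this example.

```python
import re

# Matches the quoted request and the status code that follows it.
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def statuses(log_text):
    """Return (path, status) pairs for every request found in the text."""
    return [(m["path"], int(m["status"])) for m in REQUEST_RE.finditer(log_text)]


# Lines copied from this post (with the usual space before HTTP/1.1).
tomcat_log = '"GET /component-registry/api/status HTTP/1.1" 200 926'
apache_log = (
    '"GET /component-registry/endpoints/types/'
    'com.vmware.csp.core.plugin.service.api/default HTTP/1.1" 503 416\n'
    '"POST /component-registry/services/ HTTP/1.1" 503 416'
)

print(statuses(tomcat_log))
print(statuses(apache_log))
```

A 200 on the Tomcat side combined with 503s on the Apache side for the same application suggests the failure sits between the two, which is where we looked next.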

Looking further at /var/log/apache2/error_log, we saw:
[error] (110) Connection timed out: proxy AJP: attempt to connect to 10.122.91.218:8009 (localhost) failed
[error] proxy AJP: disabled connection for (localhost)
[error] proxy: AJP: failed to make connection to backend: localhost
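
With hindsight, the odd part of these lines is the IP address next to (localhost). A small Python sketch like the one below could flag that automatically. It assumes the message wording matches the excerpt above exactly; the names AJP_RE and suspicious_backends are hypothetical.

```python
import ipaddress
import re

# Pattern based on the error_log wording quoted in this post.
AJP_RE = re.compile(
    r"attempt to connect to (\d{1,3}(?:\.\d{1,3}){3}):(\d+) \(([^)]+)\) failed"
)


def suspicious_backends(error_log_text):
    """Return (ip, port, name) tuples whose backend IP is not a loopback address."""
    found = set()
    for ip, port, name in AJP_RE.findall(error_log_text):
        if not ipaddress.ip_address(ip).is_loopback:
            found.add((ip, int(port), name))
    return sorted(found)


line = ("[error] (110) Connection timed out: proxy AJP: attempt to connect to "
        "10.122.91.218:8009 (localhost) failed")
print(suspicious_backends(line))  # prints [('10.122.91.218', 8009, 'localhost')]
```

Any result here for a backend named localhost means Apache is trying to reach Tomcat somewhere other than on the appliance itself.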

We restarted the apache2 service using /etc/init.d/apache2 restart and looked at /var/log/apache2/error_log again. It still stated:
[error] (110) Connection timed out: proxy AJP: attempt to connect to 10.122.91.218:8009 (localhost) failed

Wait a minute! We don't know what that IP address 10.122.91.218 is!
When we did an nslookup on that IP, we found that there is a machine on the DNS server with the name localhost.corp which points to that IP address! Isn't that strange!
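
The same sanity check can be scripted. This is only a sketch in Python, assuming the system resolver used by Python follows the same lookup path as Apache on the appliance, which we did not verify. The function names resolve and non_loopback are made up for the example.

```python
import ipaddress
import socket


def resolve(name):
    """Return the distinct addresses the system resolver gives for `name`."""
    infos = socket.getaddrinfo(name, None)
    # sockaddr is the last element; strip any IPv6 scope id such as "%lo0".
    return sorted({info[4][0].split("%")[0] for info in infos})


def non_loopback(addresses):
    """Keep only the addresses that are not loopback (127.0.0.0/8 or ::1)."""
    return [a for a in addresses if not ipaddress.ip_address(a).is_loopback]


# Deterministic demo using the address from this post.
print(non_loopback(["127.0.0.1", "10.122.91.218"]))  # prints ['10.122.91.218']
```

On a machine you want to check, non_loopback(resolve("localhost")) should come back empty. Anything else means the name localhost is being answered by something other than the local machine.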

The DNS in the customer's infrastructure was managed and maintained by the parent company of the organization, and neither the cloud architect nor any of the other technical resources had access to the DNS server.
So before escalating, we decided to confirm that this IP really was the problem child.

To do this, we modified the vcac.conf file at /etc/apache2/vhosts.d.
Under #Tomcat AJP Proxy, we changed the line "ProxyPass / ajp://localhost:8009/ nocanon" to "ProxyPass / ajp://127.0.0.1:8009/ nocanon" and then restarted apache2 using /etc/init.d/apache2 restart.
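
We made this edit by hand, but the change is simple enough to sketch in Python. The snippet below is an illustration only: it rewrites the text in memory and does not touch /etc/apache2/vhosts.d/vcac.conf. The names PROXYPASS_RE and pin_ajp_to_loopback are invented for the example.

```python
import re

# Only touch active ProxyPass directives whose AJP backend host is "localhost".
PROXYPASS_RE = re.compile(
    r"^([ \t]*ProxyPass[ \t]+\S+[ \t]+ajp://)localhost(:\d+/)", re.MULTILINE
)


def pin_ajp_to_loopback(conf_text):
    """Return the config text with ajp://localhost replaced by ajp://127.0.0.1."""
    return PROXYPASS_RE.sub(r"\g<1>127.0.0.1\g<2>", conf_text)


before = "ProxyPass / ajp://localhost:8009/ nocanon"
print(pin_ajp_to_loopback(before))  # prints ProxyPass / ajp://127.0.0.1:8009/ nocanon
```

Lines that are commented out or that already use 127.0.0.1 are left alone, so running it twice gives the same result.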

We then ran the command:
# wget https://localhost:443/component-registry/services/status --no-check-certificate
The output was:
Resolving localhost... 127.0.0.1
Connecting to localhost|127.0.0.1|:443... connected
HTTP request sent, awaiting response... 200 OK

We got an HTTP 200, meaning it had started working! So apache2 had been resolving localhost to a machine called localhost.corp instead of to the vRA Appliance itself.

We then checked the VAMI page for the list of services and voilà! The services were registered 🙂

Apache2 is a proxy in front of the Tomcat server. So instead of using localhost as the name of the Tomcat server, we used 127.0.0.1 (since localhost was resolving to the IP 10.122.91.218 on the DNS). By bypassing the DNS, we were able to start up the services.

After waking up the DNS admin in the middle of the night, removing the rogue entry and reverting the changes made to /etc/apache2/vhosts.d/vcac.conf, we got the services back up and running, and eventually the Private Cloud was back in business. It was like finding a needle in a haystack!

Lesson learnt: keep your DNS server pristine.

>vRevealed<
