OpenStack: Develop Your Troubleshooting Skills

By | April 12, 2014

If you’ve successfully built OpenStack, you probably know by now that it can take a few tries to get things working. If you’ve gone beyond one of the all-in-one devstack deployments and built yourself a dedicated controller and compute nodes, you’ve probably seen some problems along the way. Things look awfully complicated until you realize that most of the problems are due to common configuration errors. With a few techniques, we can identify the cause in short order.

The assumption is that you followed the documentation, or one of the many guides available on the web, and made some typos somewhere. They’re bound to happen. This is not a deployment guide, just a basic guide to troubleshooting techniques. If you’re looking for a deployment guide, check out my Havana Deployment Guide.

Before we can troubleshoot effectively, we need to understand the OpenStack services, what they do and how they interact. Then, we’ll be able to figure out where to look when things go wrong.

The OpenStack Services

The core services of OpenStack are:

  • Keystone (performs identity and access management functions)
  • Nova (performs deployment of instances onto hypervisors)
  • Nova-Network or Neutron (performs network management and allocates network connections to instances)
  • Glance (performs storage and retrieval of pre-built images)
  • Cinder (manages the provisioning of block storage volumes for instances)

There are more OpenStack services, but these are the core ones that you must have to deploy a working instance. A few more items that we’ll need to address are the underlying components that the OpenStack services use to store information and communicate with each other. These are:

  • The Database (usually MySQL)
  • The Message Queue Service (such as RabbitMQ or QPID)
  • Service Authentication

Configuration Issues

Most of your early problems with OpenStack will be related to configuration, and a few basic mistakes are very common. All of the services have configuration files that define various settings, but a few key settings are often overlooked, and missing them causes the services to fail immediately.

Each service must be able to communicate with its database to store its service-specific data. Each service must be able to contact the message queue server to pass messages to other services. And each service must be able to communicate with the Keystone service for authentication. The applicable settings are similar across all of the services’ configuration files. For example, in the /etc/keystone/keystone.conf file (the configuration file for Keystone), the [sql] section defines the connection to the MySQL database:

[sql]
# The SQLAlchemy connection string used to connect to the database
connection = mysql://keystone:password@localhost/keystone

The connection string indicates that MySQL is running on the same server as the keystone service (localhost). On other servers, we’d have to replace localhost with the IP address or DNS name of the MySQL host. This is often overlooked when services are deployed on additional servers. As a result, looking at the logs of the services deployed on the other servers, you’ll see database connection errors.

For example, if I deploy Nova-Network on another server and fail to specify the correct IP address of the MySQL host in the connection string, then I’d expect to see database errors in the Nova-Network log file, /var/log/nova/nova-network.log.

# tail -n5 nova-network.log
2014-04-12 09:26:19.170 31033 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 187, in __init__
2014-04-12 09:26:19.170 31033 TRACE nova.openstack.common.threadgroup super(Connection, self).__init__(*args, **kwargs2)
2014-04-12 09:26:19.170 31033 TRACE nova.openstack.common.threadgroup OperationalError: (OperationalError) (2003, "Can't connect to MySQL server on '192.168.1.127' (113)") None None
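When a log shows a connection error like this, it helps to confirm which host the connection string actually points to. The following is a minimal sketch, not part of OpenStack: db_host is a hypothetical helper that pulls the host out of a connection string in the format shown above.

```shell
# Print the host portion of a SQLAlchemy-style connection string.
# db_host is a hypothetical helper name, not an OpenStack command.
db_host() {
    hostpart="${1##*@}"               # drop "mysql://user:password@"
    hostpart="${hostpart%%/*}"        # drop "/database" and anything after it
    printf '%s\n' "${hostpart%%:*}"   # drop an optional ":port"
}

# Example use (assumes a netcat that supports -z and -w, and MySQL on
# its default port 3306):
#   nc -z -w 3 "$(db_host 'mysql://keystone:password@localhost/keystone')" 3306 \
#       && echo "database port reachable"
```

If the host printed is localhost on a server that does not run MySQL, the connection string is the first thing to fix.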

The other common issue is failure to communicate with the message queue service. Again, common configuration settings that specify where the message queue is located are often overlooked. By default, the message queue is expected to be at localhost, but if you’re running more than one host, you need to remember to add the settings. So for example, in the /etc/nova/nova.conf file on the compute node, we add the following lines:

rpc_backend = nova.rpc.impl_kombu
rabbit_host = 192.168.1.128

This indicates that we’re using RabbitMQ (through the kombu driver), and the IP address points to the server where RabbitMQ is running. Again, a check of the logs will reveal errors such as AMQP and RPC timeouts.

# tail -n1 /var/log/nova/nova-compute.log
2014-04-12 09:31:39.033 31191 ERROR nova.openstack.common.rpc.common [req-805b6a87-c86d-4d56-8443-674b6cc6b63a None None] AMQP server on 192.168.1.127:5672 is unreachable: [Errno 113] EHOSTUNREACH. Trying again in 15 seconds.
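To compare the address in the log with what the configuration file says, you can read the setting straight out of the file. This is a rough sketch under assumptions: conf_value is a hypothetical helper that ignores section headers and returns the first matching key it finds.

```shell
# Print the value of the first "key = value" line for a key in an
# INI-style file. conf_value is a hypothetical helper; it does not
# distinguish between sections and skips commented-out lines.
conf_value() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | head -n 1
}

# Example use (assumes a netcat that supports -z and -w; 5672 is the
# AMQP port shown in the log above):
#   nc -z -w 3 "$(conf_value rabbit_host /etc/nova/nova.conf)" 5672 \
#       && echo "message queue reachable"
```

If the command prints nothing, the setting is missing and the service is falling back to its default.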

The third thing that makes services fail is the inability to contact the Keystone service for authentication. These settings usually live in a secondary config file that defines the service’s API pipeline, including the authentication filter. For example, the Nova services’ authentication settings are in the /etc/nova/api-paste.ini file; the applicable section is shown below.

[filter:authtoken]
paste.filter_factory = keystoneclient.middleware.auth_token:filter_factory
auth_host = 192.168.1.128
auth_port = 35357
auth_protocol = http
auth_uri = http://192.168.1.128:5000/v2.0
admin_tenant_name = service
admin_user = test_nova
admin_password = password

Getting the database, message queue, and authentication settings for all of your services pointing to the right place goes a long way toward getting OpenStack working correctly.

Service Definitions

Keystone is the identity service. It is the central service that defines where each of the other services’ API endpoints is located, which user account each service uses, and the tenants, roles and end users that can access the system.

Before we can start a service, we need to create a user account for it, grant it the appropriate role, and define its endpoint. Defining all of the service users, roles and endpoints is one of the most bewildering processes in OpenStack. Doing this work by hand is sure to give you a sudden debilitating brain aneurysm.

The best way to do this work is by using a script. For example, you can download these two scripts and modify them for your environment. Make sure to specify the correct user names, passwords, and IP addresses.

If you create an endpoint definition that points to the wrong URL, you’ll see this show up as an error in a service, a command-line tool, or the web portal, since the service will be unreachable via the invalid URL. These errors are pretty easy to spot as well, since they usually report the URL as being unreachable. Again, getting your keystone users, roles, and endpoints set up correctly at the outset will go a long way towards a smooth OpenStack deployment.
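To give a rough idea of what such a script contains, here is a minimal sketch. It is not one of the scripts mentioned above. The function names, the default password, and the choice of the admin role in the service tenant are assumptions for illustration; the keystone subcommands are those of the keystone command-line tool. Setting DRY_RUN prints the commands without running them, which is a handy way to proofread names and URLs first.

```shell
# Placeholder value for illustration; substitute your own.
SERVICE_PASSWORD="${SERVICE_PASSWORD:-password}"

# Echo a command, then run it unless DRY_RUN is set to a non-empty value.
run() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

# Create a service user and grant it the admin role in the "service"
# tenant (role and tenant names are assumptions; match your deployment).
create_service_user() {
    run keystone user-create --name "$1" --pass "$SERVICE_PASSWORD"
    run keystone user-role-add --user "$1" --tenant service --role admin
}

# Register one endpoint URL as the public, internal, and admin URL
# of an existing service, identified by its service ID.
create_endpoint() {
    run keystone endpoint-create --service-id "$1" \
        --publicurl "$2" --internalurl "$2" --adminurl "$2"
}

# Example use:
#   DRY_RUN=1
#   create_service_user test_nova
```

Because every command is echoed before it runs, a typo in a user name or URL is visible in the script output rather than buried in a later failure.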

Basic Testing

When you’ve installed one of the OpenStack services and performed the basic configuration for that service, you’ll typically use a command-line tool to test its basic functionality. For example, after you’ve installed Keystone and set up your users, roles and endpoints, you’ll use the keystone command-line tool to test that the service is working. To get a list of users, we type:

keystone user-list

If all is well, you get a list of users back.

$ keystone user-list
+------------------+------------------+---------+-------+
|        id        |       name       | enabled | email |
+------------------+------------------+---------+-------+
|      admin       |      admin       |   True  |       |
|      cinder      |      cinder      |  False  |       |
|     demouser     |     demouser     |   True  |       |
|      glance      |      glance      |   True  |       |
|       nova       |       nova       |  False  |       |
| prod_east_cinder | prod_east_cinder |   True  |       |
| prod_east_glance | prod_east_glance |   True  |       |
|  prod_east_nova  |  prod_east_nova  |   True  |       |
|   test_cinder    |   test_cinder    |   True  |       |
|   test_glance    |   test_glance    |   True  |       |
|  test_keystone   |  test_keystone   |   True  |       |
|    test_nova     |    test_nova     |   True  |       |
+------------------+------------------+---------+-------+

If not, you might get an authentication error, or a failure to reach the endpoint.

# keystone user-list
Authorization Failed: <attribute 'message' of 'exceptions.BaseException' objects> (HTTP Unable to establish connection to http://192.168.1.127:35357/v2.0/tokens)

Authentication errors typically indicate that you haven’t set your environment variables correctly, including your OpenStack user name, password, and the URL of the Keystone service.
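A quick way to rule this out is to check that the variables are set at all before running the client. The sketch below assumes the four variables read by the keystone command-line tool (OS_USERNAME, OS_PASSWORD, OS_TENANT_NAME, OS_AUTH_URL); missing_os_vars is a hypothetical helper name.

```shell
# Print the name of each expected OpenStack environment variable that
# is unset or empty. missing_os_vars is a hypothetical helper.
missing_os_vars() {
    for name in OS_USERNAME OS_PASSWORD OS_TENANT_NAME OS_AUTH_URL; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

# Example use:
#   missing_os_vars    # prints nothing when all four variables are set
```

This only shows that the variables exist. A wrong password or a URL pointing at the wrong host still has to be checked by eye.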

If all of these are set correctly, the Keystone service may be unreachable because the service is stopped, or the firewall is denying the connection. In these cases it’s helpful to check the service status, check that the service is listening on the specified port, and that the firewall port is open. To check if the Keystone service is running, type:

service keystone status

If the service is not running, we can try starting it (service keystone start) and then checking the status again, just in case it’s failing right after startup. If you find that the service has stopped again, a look at the logs should reveal the trouble.

tail /var/log/keystone/keystone.log

Another way to check service status is to see if the service is listening on the specified port.

netstat -ntl

This will show you a list of ports that the server is listening on.

# netstat -ntl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 0.0.0.0:8774            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:8775            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:9191            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:5000            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:8776            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:3306            0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:11211         0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:9292            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:34737           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:4369            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:3260            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:35357           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:6080            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:8773            0.0.0.0:*               LISTEN     
tcp6       0      0 :::5672                 :::*                    LISTEN     
tcp6       0      0 :::80                   :::*                    LISTEN     
tcp6       0      0 :::22                   :::*                    LISTEN     
tcp6       0      0 :::3260                 :::*                    LISTEN

We should see the keystone port (35357) in the list. Finally, we need to check the firewall to verify that the port is open.

iptables --list-rules

You should find a rule that accepts connections on port 35357 or a rule allowing all connections.
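Scanning a long netstat listing by eye gets tedious, so the port check can be scripted. This is a small sketch: port_is_listening is a hypothetical helper that reads netstat -ntl output on standard input and succeeds if the given port appears in the LISTEN state.

```shell
# Succeed (exit 0) if the given port is in LISTEN state in the
# `netstat -ntl` output supplied on standard input.
# port_is_listening is a hypothetical helper name.
port_is_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            n = split($4, parts, ":")      # local address; port is the last piece
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}

# Example use:
#   netstat -ntl | port_is_listening 35357 && echo "Keystone admin port is listening"
```

Taking the last colon-separated piece of the local address handles both the IPv4 form (0.0.0.0:35357) and the IPv6 form (:::5672) seen in the listing.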

Service Interactions

Now that you have some understanding of how to look at the services, the key thing to understand is how the services interact, and when. Then you’ll be able to guess which service to look at when something goes wrong. For example, if you can’t log onto the web portal, you may want to look at the Keystone logs.

A better example is when you deploy an instance, and it fails to launch. If the deployment results in an Error state, you can sit there and scratch your head, or if you understand the service interactions, you can methodically chase down the error.

When you deploy an instance, the web portal (or command-line tool) contacts the nova-api service. Nova-api hands the request to the nova-scheduler service, which selects a compute node on which to spawn the instance and sends a message to that node. On the compute node, the nova-compute service contacts the glance service to pull down the image, and tries to deploy it. The nova-network (or neutron) service is contacted to provision an IP address, manipulate firewall ports, etc. If anything fails during the process, the instance may end up in an Error state. So you may have to look at the logs for nova-api, nova-scheduler, nova-compute, and nova-network, which may be located on different servers, to see where the failure is.

Knowing where your services run, where their logs are stored, and how to use the commands above to view and address errors gives you a general methodology for troubleshooting OpenStack.
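When several logs are in play, it saves time to pull only the error lines from each. The following is a minimal sketch: recent_errors is a hypothetical helper, and it assumes the log format shown earlier, where the level (ERROR, TRACE, and so on) appears as its own word on each line.

```shell
# Print the last few ERROR, TRACE, or CRITICAL lines from a log file.
# $1 = log file, $2 = maximum number of lines (defaults to 5).
# recent_errors is a hypothetical helper name.
recent_errors() {
    grep -E ' (ERROR|TRACE|CRITICAL) ' "$1" | tail -n "${2:-5}"
}

# Example use on one server (log paths other than nova-compute.log and
# nova-network.log are assumed to follow the same naming pattern):
#   for log in /var/log/nova/nova-api.log /var/log/nova/nova-compute.log; do
#       echo "== $log"
#       recent_errors "$log"
#   done
```

Comparing the timestamps of the first errors in each log usually shows which service failed first and which ones are only reporting the fallout.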

There’s much more to it of course. Troubleshooting the network is a topic all its own. Hopefully, this article will help you get started troubleshooting your basic OpenStack configuration.
