So far in this blog series, I’ve taken a look at how to provision and scale servers using five leading cloud providers. Now, I want to dig into support for “Day 2 operations” like troubleshooting, reactive or proactive maintenance, billing, backup/restore, auditing, and more. In this blog post, we’ll look at how to manage (long-lived) running instances at each provider and see what capabilities exist to help teams manage at scale. For each provider, I’ll assess instance management, fleet management, and account management.
There might be a few reasons you don’t care a lot about the native operational support capabilities in your cloud of choice. For instance:
- You rely on configuration management solutions for steady-state. Fair enough. If your organization relies on great tools like Ansible, Chef or CFEngine, then you already have a consistent way to manage a fleet of servers and avoid configuration drift.
- You use “immutable servers.” In this model, you never worry about patching or updating running machines. Whenever something has to change, you deploy a new instance of a gold image. This simplifies many aspects of cloud management.
- You leverage “managed” servers in the cloud. If you work with a provider that manages your cloud servers for you, then on the surface, there is less need for access to robust management services.
- You’re running a small fleet of servers. If you only have a dozen or so cloud servers, then management may not be the most important thing on your mind.
- You leverage a multi-cloud management tool. As companies chase the “multi-cloud” dream, they leverage tools like RightScale, vRealize, and others to provide a single experience across a cloud portfolio.
However, I contend that the built-in operational capabilities of a particular cloud are still relevant for a variety of reasons, including:
- Deployments and upgrades. It’s wonderful if you use a continuous deployment tool to publish application changes, but cloud capabilities still come into play. How do you open up access to cloud servers and push code to them? Can you disable operational alarms while servers are in an upgrading state? Is it easy to snapshot a machine, perform an update, and roll back if necessary? There’s no one way to do application deployments, so your cloud environment’s feature set may still play an important role.
- Urgent operational issues. Experiencing a distributed denial of service attack? Need to push an urgent patch to one hundred servers? Trying to resolve a performance issue with a single machine? Automation and visibility provided by the cloud vendor can help.
- Handle steady and rapid scale. There’s a good chance that your cloud footprint is growing. More environments, more instances, more scenarios. How does your cloud make it straightforward to isolate cloud instances by function or geography? A proper configuration management tool goes a long way to making this possible, but cloud-native functionality will be important as well.
- Audit trails. Users may interact with the cloud platform via a native UI, third party UI, or API. Unless you have a robust log aggregation solution that pulls data from each system that fronts the cloud, it’s useful to have the system of record (usually the cloud itself) capture information centrally.
- UI as a window to the API. Many cloud consumers don’t ever see the user interface provided by the cloud vendor. Rather, they only use the available API to provision and manage cloud resources. We’ll look at each cloud provider’s API in a future post, but the user interface often reveals the feature set exposed by the API. Even if you are an API-only user, seeing how the Operations experience is put together in a user interface can help you see how the vendor approaches operational stories.
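To make the snapshot-and-roll-back point concrete: on EC2, a deployment script can create an AMI restore point before pushing an update. This is just a sketch that builds the API parameters (the instance ID and application name are hypothetical); with boto3 installed and credentials configured, the dict would be passed to `ec2.create_image`.

```python
import datetime

def restore_point_params(instance_id, app_name):
    """Parameters for an EC2 create-image call taken before an upgrade.

    With boto3, these would be passed to
    boto3.client("ec2").create_image(**params); the returned ImageId
    becomes the rollback target if the deployment goes wrong.
    """
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    return {
        "InstanceId": instance_id,              # hypothetical instance
        "Name": f"{app_name}-pre-deploy-{stamp}",
        "NoReboot": True,                       # image without stopping the box
    }

params = restore_point_params("i-0abc1234", "billing-api")
print(params["Name"])
```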
Let’s get going in alphabetical order.
Users can do a lot of things with each particular AWS instance. For example, I can create copies (“Launch more like this”), convert to a template, issue power operations, set and apply tags, and much more.
AWS has a super-rich monitoring system called CloudWatch that captures all sorts of metrics and is capable of sending alarms.
AWS shows all your servers in a flat, paginated list.
You can filter the list based on tags, attributes, or keywords associated with the server(s). Amazon also just announced Resource Groups to make it easier to organize assets.
When you’ve selected a set of servers in the list, you can do things like issue power operations in bulk.
Monitoring also works this way. However, Auto Scaling does not work against arbitrary collections of servers.
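Bulk operations aren’t limited to the console: the EC2 API accepts tag filters, so a script can select a fleet by tag and issue one power operation against all of it. The snippet below only builds the request payloads (the tag values are hypothetical); with boto3, they would be handed to `ec2.describe_instances` and `ec2.stop_instances`.

```python
# Build EC2 API parameters for "stop every instance tagged env=staging".
# With boto3, these dicts would be passed straight to
# ec2.describe_instances(...) and ec2.stop_instances(...).

def tag_filters(tags):
    """Convert {'env': 'staging'} into the EC2 Filters structure."""
    return [{"Name": f"tag:{k}", "Values": [v]} for k, v in tags.items()]

def bulk_stop_params(instance_ids):
    """Parameters for a single stop call covering a whole fleet."""
    return {"InstanceIds": instance_ids}

filters = tag_filters({"env": "staging"})
# With boto3:
#   ec2 = boto3.client("ec2")
#   reservations = ec2.describe_instances(Filters=filters)["Reservations"]
#   ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
#   ec2.stop_instances(**bulk_stop_params(ids))
print(filters)
```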
It’d be negligent of me to talk about management at scale in AWS without talking about Elastic Beanstalk and OpsWorks. Beanstalk puts an AWS-specific wrapper around an “application” that may be composed of multiple individual servers. A Beanstalk application may have a load balancer and be part of an Auto Scaling group. It’s also a construct for doing rolling deployments. Once a Beanstalk app is up and running, the user can manage the fleet as a unit.
Once you have a Beanstalk application, you can terminate and restart the entire environment.
There are still individual servers shown in the EC2 console, but Beanstalk makes it simpler to manage related assets.
OpsWorks is a relatively new offering used to define and deploy “stacks” composed of application layers. Developers can associate Chef recipes with multiple stages of the lifecycle. You can also run recipes manually at any time.
AWS doesn’t offer any “aggregate” views that roll up your consumption across all regions. The dashboards are service specific, and are shown on a region-by-region basis. AWS accounts are autonomous, and you don’t share anything between them. Within an account, users can do a lot of things. For instance, the Identity and Access Management service lets you define customized groups of users with very specific permission sets.
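For example, a minimal IAM policy granting a group read-only visibility into EC2 (a common setup for an operations-viewer role) looks like this; the exact actions you allow would depend on your own security model:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:Describe*"],
      "Resource": "*"
    }
  ]
}
```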
AWS has also gotten better at showing detailed usage reports.
The invoice details are still a bit generic and don’t easily tie back to a given server.
There are a host of other AWS services that make account management easier. These include CloudTrail for API audit logs and SNS for push notifications.
For an individual virtual server in CenturyLink Cloud, the user has a lot of management options. It’s pretty easy to resize, clone, archive, and issue power commands.
Doing a deployment but want to be able to revert any changes? The platform supports virtual machine snapshots for creating restore points.
Each server details page shows a few monitoring metrics.
Users can also bind usage alert and vertical autoscale policies to a server.
CenturyLink Cloud has you organize servers into collections called “Groups.” These Groups – which behave similarly to a nested file structure – are management units.
Users can issue bulk power operations against all or some of the servers in a Group. Additionally, you can set “scheduled tasks” on a Group. For instance, power off all the servers in a Group every Friday night, and turn them back on Monday morning.
You can also choose pre-loaded or dynamic actions to perform against the servers in a Group. These packages could be software (e.g. new antivirus client) or scripts (e.g. shut off a firewall port) that run against any or all of the servers at once.
The CenturyLink Cloud also provides an aggregated view across data centers. In this view, it’s fairly straightforward to see active alarms (notice the red on the offending server, group, and data center), and navigate the fleet of resources.
Finally, the platform offers a “Global Search” where users can search for servers located in any data center.
Within CenturyLink Cloud, there’s a concept of an account hierarchy. Accounts can be nested within one another. Networks and other settings can be inherited (or separated), and user permissions cascade down.
Throughout the system, users can see the month-to-date and projected cost of their cloud consumption. The invoice data itself shows costs on a per server, and per Group basis. This is handy for chargeback situations where teams pay for specific servers or entire environments.
CenturyLink Cloud offers role-based access controls for a variety of personas. These apply to a given account, and any sub-accounts beneath it.
The CenturyLink Cloud has other account administration features like push-based notifications (“webhooks”) and a comprehensive audit trail.
Digital Ocean specializes in simplicity targeted at developers, but their experience still serves up a nice feature set. From the server view, you can issue power operations, resize the machine, create snapshots, change the server name, and more.
There are a host of editable settings that touch on networking, the Linux kernel, and recovery processes.
Digital Ocean gives developers a handful of metrics that clearly show bandwidth consumption and resource utilization.
There’s a handy audit trail below each server that clearly identifies what operations were performed and how long they took.
Digital Ocean focuses on the developer audience and API users. Their UI console doesn’t really have a concept of managing a fleet of servers. There’s no option to select multiple servers, sort columns, or perform bulk activities.
The account management experience is fairly lightweight at Digital Ocean. You can view account resources like snapshots and backups.
It’s easy to create new SSH keys for accessing servers.
The invoice experience is simple but clear. You can see current charges, and how much each individual server costs.
The account history shows a simple audit trail.
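Since Digital Ocean is so API-centric, it’s worth showing what that surface looks like. Droplets are listed with a single authenticated GET against the v2 REST API. The sketch below only constructs the request (the token is a placeholder for a real personal access token); calling `urlopen(req)` would actually send it.

```python
# Sketch: listing droplets through Digital Ocean's v2 REST API.
# Only the request object is built here; urlopen(req) would send it.
from urllib.request import Request

DO_TOKEN = "your-api-token"  # placeholder for a real personal access token

def list_droplets_request(token):
    """Build the authenticated GET request for /v2/droplets."""
    return Request(
        "https://api.digitalocean.com/v2/droplets",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = list_droplets_request(DO_TOKEN)
print(req.full_url)  # https://api.digitalocean.com/v2/droplets
```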
Google Compute Engine offers a nice set of per-server management options. You can connect to a server via SSH, reboot it, clone it, and delete it. There are also monitoring statistics clearly shown at the top of each server’s details.
Additionally, you can change settings for storage, network, and tags.
The only thing you really do with a set of Google Compute Engine servers is delete them.
Google Compute Engine offers instance groups for organizing virtual resources. The members can all be based on the same template and work together in an autoscale fashion, or you can put different types of servers into an instance group.
An instance group is really just a simple construct. You don’t manage the items as a group, and if you delete the group, the servers remain. It’s simply a way to organize assets.
Google Compute Engine offers a few different types of management roles including owner, editor, and viewer.
What’s nice is that you can also have separate billing managers. Other billing capabilities include downloading usage history, and reviewing fairly detailed invoices.
I don’t yet see an audit trail capability, so I assume that you have to track activities some other way.
Microsoft is in transition between its legacy production portal and its new blade-oriented portal. In the classic portal, Microsoft crams a lot of useful details into each server’s “details” page.
The preview portal provides even more information, in a more … unique … format.
In either environment, Azure makes it easy to add disks, change virtual machine size, and issue power ops.
Microsoft gives users a useful set of monitoring metrics on each server.
The new portal offers better cost transparency than the classic one.
There are no bulk actions in the existing portal, besides filtering which Azure subscription to show, and sorting columns. Like AWS, Azure shows a flat list of servers in your account.
The preview portal has the same experience, but without any column sorting.
Microsoft Azure users have a wide array of account settings to work with. It’s easy to see current consumption and how close to the limits you are.
The management service gives you an audit log.
The new portal gives users the ability to set a handful of account roles for each server. I don’t see a way to apply these roles globally, but it’s a start!
The pricing information is better in the preview portal, although the costs are still fairly coarse and not broken out on a per-machine basis.
Each of these providers has a distinct take on server management. Whether your virtual servers typically live for three hours or three years, the provider’s management capabilities will come into play. Think about what your development and operations staff need to be successful, and take an active role in planning how Day 2 operations in your cloud will work. Consider things like bulk management, audit trails, and security controls when crafting your strategy!