Let’s see how he stands up to the rigor of Four Questions.
Q: What principles of distributed systems do you think play an elevated role in cloud-driven software solutions? Where does “integrating with the cloud” introduce differences from “integrating within my data center”?
A: I believe we need to first differentiate “the cloud” a bit to figure out what elevated concerns are. In a pure IaaS scenario where the customer is effectively renting VM space, the architectural differences between a self-contained solution in the cloud and on-premises are commonly relatively small. That also explains why IaaS is doing pretty well right now – the workloads don’t have to change radically. That also means that if the app doesn’t scale in your own datacenter it also won’t scale in someone else’s; there’s no magic Pixie dust in the cloud. From an ops perspective, IaaS should be a seamless move if the customer is already running proper datacenter operations today. With that I mean that they are running their systems largely hands-off with nobody having to walk up to the physical box except for dealing with hardware failures.
The term “self-contained solution” that I mentioned earlier is key here since that’s clearly not always the case. We’ve been preaching EAI for quite a while now and not all workloads will move into cloud environments at once – there will always be a need to bridge between cloud-based workloads and workloads that remain on-premises or workloads that are simply location-bound because that’s where the action is – think of an ATM or a cashier’s register in a restaurant or a check-in terminal at an airport. All these are parts of a system and if you move the respective backend workloads into the cloud your ways of wiring it all together will change somewhat since you now have the public Internet between your assets and the backend. That’s a challenge, but also a tremendous opportunity and that’s what I work on here at Microsoft.
In PaaS scenarios that are explicitly taking advantage of cloud elasticity, availability, and reach – in which I include “bring your own PaaS” frameworks that are popping up here and there – the architectural differences are more pronounced. Some of these solutions deal with data or connections at very significant scale and that’s where you’re starting to hit the limits of quite a few enterprise infrastructure components. Large enterprises have some 100,000 employees (or more), which obviously first seems like a lot; looking deeper, an individual business solution in that enterprise is used by some fraction of that work-force, but the result is still a number that makes the eyes of salespeople shine. What’s easy to overlook is that that isn’t the interesting set of numbers for an enterprise that leverages IT as a competitive asset – the more interesting one is how they can deeply engage with the 10+ million consumer customers they have. Once you’re building solutions for an audience of 10+ million people that you want to engage deeply, you’re starting to look differently at how you deal with data and whether you’re willing to hold that all in a single store or to subject records in that data store to a lock held by a transaction coordinator. You also find that you can no longer take a comfy weekend to upgrade your systems – you run and you upgrade while you run and you don’t lose data while doing it. That’s quite a bit of a difference.
Q: When building the Azure AppFabric Service Bus, what were some of the trickiest things to work out, from a technical perspective?
A: There are a few really tricky bits and those are common across many cloud solutions: How do I optimize the use of system resources so that I can run a given target workload on a minimal set of machines to drive down cost? How do I make the system so robust that it self-heals from intermittent error conditions such as a downstream dependency going down? How do I manage shared state in the system? These are the three key questions. The latter is the eternal classic in architecture and the one you hear most noise about. The whole SQL/NoSQL debate is about where and how to hold shared state. Do you partition, do you hold it in a single place, do you shred it across machines, do you flush to disk or keep in memory, what do you cache and for how long, etc, etc. We’re employing a mix of approaches since there’s no single answer across all use-cases. Sometimes you need a query processor right by the data, sometimes you can do without. Sometimes you must have a single authoritative place for a bit of data and sometimes it’s ok to have multiple and even somewhat stale copies.
I think what I learned most about while working on this here were the first two questions, though. Writing apps while being conscious about what it costs to run them is quite interesting and forces quite a bit of discipline. I/O code that isn’t fully asynchronous doesn’t pass code-review around here anymore. We made a cleanup pass right after shipping the first version of the service and subsequently dropped 33% of the VMs from each deployment with the next rollout while maintaining capacity. That gain was from eliminating all remaining cases of blocking I/O. The self-healing capabilities are probably the most interesting from an architectural perspective. I published a blog article about one of the patterns a while back [here]. The greatest insight here is that failures are just as much part of running the system as successes are and that there’s very little that your app cannot anticipate. If your backend database goes away you log that fact as an alert and probably prevent your system from hitting the database for a minute until the next retry, but your system stays up. Yes, you’ll fail transactions and you may fail (nicely) even back to the end-user, but you stay up. If you put a queue between the user and the database you can even contain that particular problem – albeit you then still need to be resilient against the queue not working.
Q: The majority of documentation and evangelism of the AppFabric Service Bus has been targeted at developers and application architects. But for mature, risk-averse enterprises, there are other stakeholders like Operations and Information Security who have a big say in the introduction of a technology like this. Can you give us a brief “Service Bus for Operations” and “Service Bus for Security Professionals” summary that addresses the salient points for those audiences?
A: The Service Bus is squarely targeted at developers and architects at this time; that’s mostly a function of where we are in the cycle of building out the capabilities. For now we’re an “implementation detail” of apps that want to bet on the technology more than something that an IT Professional would take into their hands and wire something up without writing code or at least craft some config that requires white-box knowledge of the app. I expect that to change quite a bit over time and I expect that you’ll see some of that showing up in the next 12 months. When building apps you need to expect our components to fail just like any other, especially because there’s also quite a bit of stuff that can go wrong on the way. You may have no connectivity to Service Bus, for instance. What the app needs to have in its operational guidance documents is how to interpret these failures, what failure threshold triggers an alert (it’s rarely “1), and where to go (call Microsoft support with this number and with this data) when the failures indicate something entirely unexpected.
From the security folks we see most concerns about us allowing connectivity into the datacenter with the Relay; for which we’re not doing anything that some other app couldn’t do, we’re just providing it as a capability to build on. If you allow outbound traffic out of a machine you are allowing responses to get back in. That traffic is scoped to the originating app holding the socket. If that app were to choose to leak out information it’d probably be overkill to use Service Bus – it’s much easier to do that by throwing documents on some obscure web site via HTTPS. Service Bus traffic can be explicitly blocked and we use a dedicated TCP port range to make that simple and we also have headers on our HTTP tunneling traffic that are easy to spot and we won’t ever hide tunneling over HTTPS, so we designed this with such concerns in mind. If an enterprise wants to block Service Bus traffic completely that’s just a matter of telling the network edge systems.
However, what we’re seeing more of is excitement in IT departments that ‘get it’ and understand that Service Bus can act as an external DMZ for them. We have a number of customers who are pulling internal services to the public network edge using Service Bus, which turns out to be a lot easier than doing that in their own infrastructure, even with full IT support. What helps there is our integration with the Access Control service that provides a security gate at the edge even for services that haven’t been built for public consumption, at all.
Q [stupid question]: I’m of the opinion that cold scrambled eggs, or cold mashed potatoes are terrible. Don’t get me started on room-temperature french fries. Similarly, I really enjoy a crisp, cold salad and find warm salads unappealing. What foods or drinks have to be a certain temperature for you to truly enjoy them?
A: I’m German. The only possible answer here is “beer”. There are some breweries here in the US that are trying to sell their terrible product by apparently successfully convincing consumers to drink their so called “beer” at a temperature that conveniently numbs down the consumer’s sense of taste first. It’s as super-cold as the Rockies and then also tastes like you’re licking a rock. In odd contrast with this, there are rumors about the structural lack of appropriate beer cooling on certain islands on the other side of the Atlantic…
Thanks Clemens for participating! Great perspectives.