Tuesday, November 5, 2013

DOA

The President came up to Boston last week to tell us that his ACA (Affordable Care Act, aka "Obamacare") really is going to work. Maybe the eyes of the rest of America were upon him, but I think that greater Boston was more preoccupied with the impending game six of the World Series, in which the Red Sox might win the series in Fenway Park for the first time since 1918 (which did in fact come to pass). Giving a speech at the historic but unpronounceable Faneuil Hall ("FAN-ee-ull" in case you were wondering), on roughly the same spot where Mitt Romney (remember him?) signed the Massachusetts health care act into law in 2006, Mr. Obama apologized for the botched rollout of the web site and assured us all that things will get better soon, reminding us that the debut of Massachusetts' own revamped health insurance framework was also pretty shaky, although it has ultimately worked out well.

Of all the things I pontificate about in these (electronic) pages, here, at last, is something I actually feel professionally qualified to ramble on and on about. I make my living as a project manager for information technology (IT) projects. I've been doing this now for 20+ years as my primary or secondary role in a number of different companies. I have run a lot of projects, many that were big successes, and a few that, well, gave me an opportunity to learn from my mistakes. When I first heard about the problems at the launch of the web site, the words that immediately sprang to mind were the ones that no project manager ever wants to hear from the users of the system of which he or she has just led the delivery: Didn't anybody test this thing?

Well, sort of open. And sort of not.

IT projects generally follow about the same lifecycle, or sequence of activities. First you define requirements: what exactly is this system supposed to do, what technical and organizational constraints need to be addressed and so forth. Then you create the design that will deliver on those requirements. Then you actually build the system and test it to make sure that what you built follows the design and meets the requirements. If testing shows that things aren't working right, you fix those things and then test again, and keep doing that until testing confirms that the system performs as it is supposed to. Then you roll out the finished product by training the people who will operate the system, maybe training end users as well, making sure that you have some kind of helpdesk function in place to field any user issues, loading the system with whatever data is needed to initialize it, making any other logistical preparations that are needed and then finally turning users loose on it ("going live", in IT jargon).

Variants of the "classical" project execution approach (the "waterfall approach") may devote months to each of these major steps, and there are rigid rules about when you can exit each phase and enter the next. Some projects follow approaches such as the "Agile" method in which you define a basic framework of intended capabilities and then develop the details by rapidly iterating through requirements, design etc. in multiple cycles. There's a lot more to either approach than what I've hinted at here, but I won't bore you with the details. Suffice it to say that the different approaches each work best for different kinds of IT projects, although as in any profession, there are adherents to each approach who will argue with religious fervor that one or the other is the only true path to IT nirvana. Here, for example, is a writer who blames the problems with the web site on the use of the waterfall approach instead of Agile. To this I say only, beware of anyone who preaches the one-size-fits-all solution. I have seen Agile projects that were quite successful, and others that devolved into complete chaos and wasted a lot of time and money delivering a Frankenstein-like whatchamacallit that was rejected by users. You have to match the methodology to the circumstances of the project, and consider that often the problem is less in the methodology than in its execution.

[Figure: Standard IT Project Lifecycle]

Whatever the methodology, things can go wrong at any step in an IT project. If requirements are incomplete, or vague, or weren't reviewed and approved by whoever is the sponsor of the project and/or the ultimate owner of the system that will be delivered, you run a very high risk of expending a lot of effort, only to be told, "this isn't what we wanted; go back and build what we wanted, and don't expect us to pay for this other thing you built." If the design is poorly thought out, the final product won't deliver the intended functionality, or it will deliver it in a way that is incomplete, or runs too slowly, or is confusing to users, or crashes frequently, or otherwise renders it essentially unusable. If the individual components and the overall system aren't properly tested at each stage of development, you are pretty much closing your eyes and hoping for a miracle when you put it in front of end users as a supposedly finished product. And if the necessary preparations are not made as part of rollout, taking the system live with actual users can be pretty stressful.

Of all of these parts of the process, in my experience at least, testing is probably the most neglected. Requirements gathering is interesting because you get to talk to a lot of people and drink a lot of coffee and then write everything up in professional-looking documents. In design you produce cool-looking diagrams and have more meetings in which you can show off all of your technical knowledge and tell your war stories about how you cleverly solved this same problem back when you worked for General Electric or Sears or Bank of America. In development you do all the coding and configuration work and experience the satisfaction of having made some conglomeration of hard- and software bend to your will.

But testing is anything but glamorous; on the contrary, it can be mind-numbingly tedious. Testing consists of poring over all of the requirements and design documentation and boiling the intended functionality of the entire system down to a mass of test cases, each of which basically consists of this: If I do A, B is supposed to happen. Now I will do A. Did B happen? Yes? Test passed, move on to the next one. Did B not happen? Test failed, give it back to the developers and let them figure out how to fix it, then run the same test again when they claim it's fixed.
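For the programmers in the audience, the "if I do A, B is supposed to happen" idea can be written down as data plus a tiny runner. This is only a minimal sketch: the names Case, run_case and monthly_premium are made up for illustration and have nothing to do with any real system.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Case:
    name: str
    action: Callable[[], Any]  # "I do A"
    expected: Any              # "B is supposed to happen"


def run_case(case: Case) -> bool:
    """Do A, then check whether B happened. A crash counts as a failure."""
    try:
        actual = case.action()
    except Exception:
        return False
    return actual == case.expected


# Hypothetical function under test: split an annual premium into monthly payments.
def monthly_premium(annual_premium: float) -> float:
    return round(annual_premium / 12, 2)


cases = [
    Case("even split", lambda: monthly_premium(1200.0), 100.0),
    Case("rounds to cents", lambda: monthly_premium(1000.0), 83.33),
]

# Map each test case name to True (passed) or False (failed).
results = {case.name: run_case(case) for case in cases}
```

Any case that comes back False goes back to the developers, and the same case gets run again once they claim it's fixed.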

There are lots of different kinds of testing that need to be performed in a typical IT project. At the most basic level, you need to test each individual component of the system to verify that it functions correctly. Then you need to verify that components that interact with other components in some way do that as intended. Then you need to verify that the overall system functions the way it is supposed to in an "end-to-end" test wherein you simulate each of the intended "use cases", that is, specific scenarios or operations that the system is supposed to perform, expecting to see all of its individual components working together in harmony. 

At each of these levels you need to do not only "positive testing", in which you simulate ideal conditions to see how the system behaves, but also "negative testing" in which you verify that errors and unexpected events of whatever kind are properly handled. For instance, if a user enters his or her name in the field that is meant for a driver's license number, does the system return a message like "this is not a valid license number—please correct your entry"? Or does it just fill the user's screen with incomprehensible warnings, or crash the user's web browser, or maybe just do nothing, leaving the user wondering what to do next?
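Here is a rough sketch of what positive and negative tests for that license field might look like. The license format (one letter followed by eight digits) and the function name validate_license are assumptions made purely for illustration; real formats vary from state to state.

```python
import re

# Assumed format, for illustration only: one letter followed by eight digits.
LICENSE_PATTERN = re.compile(r"[A-Z]\d{8}")

ERROR_MESSAGE = "This is not a valid license number. Please correct your entry."


def validate_license(entry: str) -> str:
    """Return "ok" for a well-formed entry, otherwise a readable error message."""
    cleaned = entry.strip().upper()
    if LICENSE_PATTERN.fullmatch(cleaned):
        return "ok"
    return ERROR_MESSAGE


# Positive test: ideal input is accepted.
positive = validate_license("S12345678")

# Negative tests: bad input gets a clear message instead of a crash or silence.
negative = [validate_license(bad) for bad in ["Jane Doe", "", "12345678S"]]
```

The point of the negative tests is not that bad input gets rejected, but that it gets rejected in a way the user can understand and act on.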

As already noted, if tests fail, the offending component needs to be fixed, and then the fix needs to be tested. However, it's usually not enough to just test the one item that was fixed; in principle, you need to retest the whole system. The reason for this is that fixes to software can introduce "regression errors", which is a fancy way of saying that you may have fixed one thing, but in a way that broke some other part of the system; you do "regression testing" to verify that you didn't fix one defect and inadvertently introduce a new one in the process. In principle, the system should not be declared ready for users until every individual test case has been performed on the final version of the system (i.e., after all fixes are deployed) and no test case has failed.
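To make the regression idea concrete, here is a small made-up example. Version 1 of a hypothetical income parser fails on "45,000"; version 2 fixes that, but quietly breaks an input that used to work. All names and test data are invented for illustration.

```python
def parse_income_v1(text: str) -> int:
    """Original version: chokes on thousands separators such as '45,000'."""
    return int(float(text.lstrip("$")))


def parse_income_v2(text: str) -> int:
    """The 'fix': keep only the digits. Handles commas, but mangles cents."""
    return int("".join(ch for ch in text if ch.isdigit()))


# The full suite: name -> (input, expected whole-dollar result).
SUITE = {
    "plain digits": ("45000", 45000),
    "dollar sign": ("$45000", 45000),
    "cents": ("45000.50", 45000),
    "comma": ("45,000", 45000),
}


def run_suite(parse) -> list:
    """Run every case, not just the one that was fixed; return names of failures."""
    failures = []
    for name, (text, expected) in SUITE.items():
        try:
            passed = parse(text) == expected
        except ValueError:
            passed = False
        if not passed:
            failures.append(name)
    return failures


before_fix = run_suite(parse_income_v1)  # the defect that was reported
after_fix = run_suite(parse_income_v2)   # the defect the fix introduced
```

If you retest only the "comma" case after the fix, everything looks fine. It takes the whole suite to notice that "cents" now fails.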

Other things need to be tested besides the pure "when I do A, I want B to happen" functionality of the system. Among other things, there is usability testing, in which you basically put test users in front of the system and verify that things like the way screens are laid out or the way you progress from one step of some process to the next makes sense to them. There's also performance testing, in which you verify that your system can meet the volumes of users and transactions that it is likely to encounter in real life; if you expect to have a thousand users accessing the system at any given point in time, you want to simulate that in testing before you turn users loose on the system and discover that it bogs down to the point of being unusable beyond fifty concurrent users.
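A toy version of a load test, to give the flavor of it. FakeSystem is a stand-in I've invented that can serve a fixed number of users at once; a real performance test would drive the actual system with a proper load-testing tool and measure response times as well. The barriers simply make sure all simulated users show up at the same moment.

```python
import threading


class FakeSystem:
    """Hypothetical stand-in for the system under test, with a fixed capacity."""

    def __init__(self, capacity: int):
        self._slots = threading.Semaphore(capacity)

    def begin_session(self) -> bool:
        # Refuse immediately instead of queueing when the system is full.
        return self._slots.acquire(blocking=False)

    def end_session(self) -> None:
        self._slots.release()


def load_test(system: FakeSystem, users: int) -> tuple:
    """Hit the system with `users` simultaneous users; return (served, rejected)."""
    arrive = threading.Barrier(users)
    depart = threading.Barrier(users)
    outcomes = []
    lock = threading.Lock()

    def user() -> None:
        arrive.wait()                # everyone shows up at once
        served = system.begin_session()
        depart.wait()                # stay connected until every user has tried
        if served:
            system.end_session()
        with lock:
            outcomes.append(served)

    threads = [threading.Thread(target=user) for _ in range(users)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    served_count = sum(outcomes)
    return served_count, users - served_count


light_load = load_test(FakeSystem(capacity=50), users=40)
heavy_load = load_test(FakeSystem(capacity=50), users=200)
```

Under the light load everybody gets in; under the heavy load only the first fifty do, which is exactly the kind of thing you would rather find out in testing than on launch day.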

Am I boring you yet? I guess not, if you've read this far. I did warn you that testing is about the dullest part of any IT project. But unfortunately, it's inescapable if you hope to deliver a reliable, working system. No shortcuts allowed! Something I see again and again is that some IT project is progressing toward the end of the testing stage, at which point the prospective owners think of a half dozen new features they want to add, and maybe a few things they want to change. When this happens, you really need to redo the testing of the whole system (remember the risk of regression errors we talked about). But more often than not, these changes get bolted on toward the end of a project; nobody feels like going back and redoing all that tedious testing, everybody just wants to finish the project and let it go live. So in the end, any defects introduced by those last-minute changes get discovered not by the project's testing team, but by the system's (increasingly irate) end users, and there's a mad scramble by the project team to fix everything under massive time and cost pressure.

So, why am I telling you all this? Well, as I read all the news reports about the mess that is the web site, it appears to me that practically every one of the principles I've outlined above has been largely ignored. I'm exaggerating a little for effect here, but I think the thing speaks for itself. Reading the various analyses that have come out recently, experts who have looked at the technical design of the system think some pretty poor design decisions were made. Among other things, you can't just go onto the site to answer the simple question, "what kind of insurance is available and what does it cost?", like you would if you were shopping for, say, car insurance or a home loan or just a pair of pants of a certain size and color. Before you can get an answer to that question, you have to provide a large quantity of personal information that will be verified by the web site through a series of data look-ups in other systems. Besides giving a crappy user experience, the convoluted process requires a lot of communications between systems, and if any of these don't execute perfectly, the user is left sitting and wondering what's happening.

Shortcomings in the design of the system no doubt are partly a function of just plain poor design decisions, but also a result of the underlying system requirements being changed repeatedly, as recently as a month before the system was to go live (as reported here, among other places). And as for testing? Testimony in the recent congressional hearings on the web site's rocky rollout implies that testing was at best an afterthought. Hey, let's build a massively complex IT system that's going to provide a vital service for millions of users and only spend two weeks testing before we shove it out the door. What could possibly go wrong?

So… let me finally come to my point, which was… oh, yes: IT projects are hard. Lots of them fail abysmally and the bigger they are, the more spectacularly they fail. But this is not some new revelation; it's an established phenomenon you can read about here or here or here, or many other places. The President himself has tried to make the point that the ACA is more than a web site, but that's really missing the point. For those people who want to, or have to, sign up for insurance, the web site is the ACA, or at least the primary manifestation of it in their own lives. It's also the most visible part of the ACA for the media; surely the Obama administration understood that if this web site was not working smoothly from day one, the administration was going to be pilloried in the press and it would be, fairly or unfairly, a major I-told-you-so moment for the Republicans. From what I'm reading now, for anyone on the inside of this project it must have been pretty clear, for a pretty long time, that it wasn't going to end well, and yet there are few or no indications that any sort of measures were taken to address that. This is not a technical failure so much as a management failure of the first order.

The administration's attempts at damage control have also been fairly laughable. First they tried to downplay the problems as "glitches", a cute word that sort of implies this is just a temporary and minor inconvenience; but it's not a "glitch" when the only reliable thing about the system is that the damned thing won't work when you try to use it. And trying to put things in perspective by talking about how the Massachusetts healthcare program got off to a slow and rocky start, or about how IT projects in general often have problems, is just making excuses—if you knew about these potential pitfalls, why didn't you take measures to keep from getting tripped up by them instead of just making the same mistakes as everyone else?

Then there were apologies and expressions of frustration from on high, and assurances that the whole thing will be working fine by the end of November. We shall see, but I'm not holding my breath, because I think it will take a few months to do all the testing and rework that should have been done before the thing went live. We also heard that even if the web site isn't working, one can sign up by phone or mail, but what goes unmentioned is that the people who then do the processing for you use basically the same unreliable system to do so.

Compounding the trouble with the web site are the many reports that people who were repeatedly assured by the President that "if you like your insurance, you can keep it" are finding out that isn't true. It is true that most of the people affected are getting their coverage dropped because that coverage doesn't meet ACA standards, and what they can get to replace it is probably going to be a far better plan than the one they lost. But strictly speaking, what the President said simply wasn't true and so that becomes one more unnecessary black mark against the program in the eyes of so many.

I really want to see the ACA succeed. I suspect, or at least hope, that a year or two from now, things will be running reasonably smoothly, people who could not previously get decent, affordable health insurance will be quite happy, and like Social Security or Medicare, the ACA will be just another part of the social services landscape that nobody seriously questions. But for now, I'm just appalled at the amateurish way this thing was rolled out.