The fact that it talks about a ‘movement’ makes me a bit wary (“manifesto” anyone?), but there’s this thing called Devops that falls into my “freaking obviously good” category enough that I’m willing to ignore that.
I’ve talked about this sort of stuff a lot (though am way too busy/lazy to link all the relevant posts), but let me talk about a typical non-Devops example.
Separation of Duties
Working the last few years for all sorts of those “too big to fail” type companies, one of the standard things that you run into is “Separation of Duties.” Different groups exist that have different responsibilities, often because of some risk assessment made by some team that doesn’t really understand SOX. But, I digress.
Typically, you have at least 3 teams:
- Dev: the schmucks like myself that write code.
- Infrastructure: the people who set up the central infrastructure that runs the code.
- Migration: the people who migrate the code the Devs write into the Infrastructure.
Although I’m going to explain why this all sucks, there is at least a reasonable explanation for why this separation exists. In some typical environments, there are a lot of systems that interact with each other in often complicated ways. A *lot* of systems. Unix and Windows and mainframes. Oracle and SQL Server and sometimes Sybase. Java and .NET, and sometimes a wide range of batch processes, maybe a bit of Perl, and if you’re really unlucky, a bunch of C++. Additionally, you often have 3rd party vended apps that you have a limited amount of control over (in terms of being able to change the underlying source code, for instance). On top of this, you often have multiple environments, usually separated into categories like “System”, “Integration”, “QA” and “Production”, each of which typically has significantly different configurations.
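To make the “significantly different configurations” point a bit more concrete, here is a deliberately simplified sketch of per-environment settings. The environment names mirror the categories above, but every value (and the lookup itself) is invented for illustration, not taken from any real system:

```python
# Deliberately simplified illustration of per-environment configuration.
# The environment names mirror the categories above; every value here is
# invented for the example, not taken from any real system.
import os

CONFIG_BY_ENV = {
    "system":      {"db_host": "sysdb01",  "batch_threads": 2,  "feed": "replay"},
    "integration": {"db_host": "intdb01",  "batch_threads": 4,  "feed": "replay"},
    "qa":          {"db_host": "qadb01",   "batch_threads": 8,  "feed": "delayed"},
    "production":  {"db_host": "proddb01", "batch_threads": 32, "feed": "live"},
}

def load_config():
    """Pick the settings for whichever environment this process is running in."""
    env = os.environ.get("APP_ENV", "system").lower()
    return CONFIG_BY_ENV[env]
```

Now multiply that by dozens of applications, platforms, and vended products.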
It is highly unrealistic (or at least, statistically speaking, highly unlikely) to think that you can have a team of experts that is fully conversant with all of the different technologies, the different ways in which they are built at an infrastructure level, and the different ways in which the code that underlies it all is migrated. It could happen, but it is probably far from the norm, and unlikely to change. So, you have different teams that, ideally at least (though, statistically speaking, less likely than not), are experts in their areas and have built out sets of ‘best practices’ for how to structure their areas.
Though it doesn’t seem to happen in practice as often as one might hope, the ideal situation should be fairly obvious: the various developers write well-written code (or code that is well-written “enough”) that runs on top of an infrastructure that has been solidly designed and maintained, and that code is then migrated according to procedures that have been developed over time and proven to be stable and maintainable.
As one last thing here, you have to take into account that there are often legal compliance issues in play. The average developer is often restricted from being able to see Production data. Rightly or wrongly, there is a notion that you need to spread these duties around to different groups to prevent too much critical information from residing in any one group of people. And if you’ve ever worked in an environment where the “IT Team”, however defined, is upwards of 100 people, it’s simply unmanageable from a resource perspective *not* to split up responsibilities.
That’s the idea anyway. Let’s take a look at where it often breaks down.
It’s Not My Fault
An obvious problem can occur to the extent that the actual soldiers on the ground, so to speak, don’t match up to the ideal. There are always people in the various groups who tend not to be experts in their areas or who aren’t quite as conscientious or who aren’t team players, or whatever. I’m not going to really dwell on that here.
An area where the ideal quickly and easily breaks down is when a problem arises (and I’ll stick here to talking about Production migrations) and the cause of the problem is unclear.
Most good employees/contractors/whatever want to fix production problems, and to fix them quickly, not just because it is to their benefit, but because they want to use their problem-solving skills to identify what needs to be done. Production problems, especially in “too big to fail” scenarios, are usually highly visible. If you’ve ever worked in a situation where traders cannot do their jobs, you know how that can be.
When a problem arises and the cause of the problem is unclear, the separation of duties often makes it totally unclear who is responsible for the problem and who should take the lead in driving it to a resolution. Is the code bad? It could be. Is this a new problem that is surfacing a previously undetected flaw in the infrastructure? It could be. Is it something that the migration processes have never uncovered before? It could be.
Since it is fresh in my memory, let me give you a specific example I dealt with recently.
I worked on a set of fixes to a production application that is used to support a trading team. The fixes themselves were clearly identified and the code required to remedy the flaws was, relatively speaking, non-complicated. The standard procedure of migrating the code “up the food chain” through the various environments before the actual production migration went smoothly, as the different teams responsible for their pieces did their jobs. When it actually came time to promote the code to production, the migration failed.
Since the migrations up until the one that was to go into production went smoothly, the developer (in this case, myself) felt pretty strongly that there was nothing wrong with the code itself. If the code was flawed, it should have shown up in previous non-production migrations, and besides, how would bad code (that wasn’t being executed as part of the migration) cause a migration to fail? The migration team knew that the migration failed, but did not have full access to the infrastructure logs that might pinpoint the issue, so from their perspective, it didn’t appear to be a problem with their procedures. The infrastructure team, having successfully supported the non-production migrations, couldn’t immediately pinpoint any reason why the production one failed.
Although there was the usual vague ‘finger-pointing’, the reality was that we had a failure, and no one could exactly explain why. It was reasonable for each group, after a cursory look at their area of expertise and the facts as they could see them, to say, “I don’t see anything wrong with what we are doing here.”
As the developer, I had no access to the production systems (well, next to no access), so I couldn’t see any relevant logs. The infrastructure team and the migration team could see their own logs, but not each other’s. None of us had the blanket ability to log into any particular machine and see any and all relevant data; each team could only see the data available to it.
In the end, the cause of the problem, and its resolution, was one of those typically maddening and stupid things that, in retrospect, should have been easily identifiable at an earlier date. But more on that later.
The dark side of siloization
From the devops post:
“On most projects I’ve worked on, the project team is split into developers, testers, release managers and sysadmins working in separate silos. From a process perspective this is dreadfully wasteful. It can also lead to a 'lob it over the wall' philosophy - problems are passed between business analysts, developers, QA specialists and sysadmins.”
The problem with separation of duties is that, when enforced strictly, you set up these inevitable impasses where no one team is responsible, and no one individual who may be able to resolve an issue from a technical standpoint has the access required to make the fixes. Every problem that could be remedied before it becomes a critical issue can only be remedied after it becomes critical. This seems to be an odd paradigm. As a developer, once an issue becomes critical (and as such is raised “up the food chain”), I often then have the ability to do just about anything that I want to do (this is often called a “firecall” problem). What would have helped is the ability to have this power before it became a firecall and senior management was involved.
How does Devops help?
To a certain extent, Devops can’t help. The “separation of duties” mentality is so ingrained in so many organizations that the obvious steps that one can take to improve things will meet with some resistance. So, to a certain extent, what Devops can do is simply “raise the consciousness” of the people involved: give the different teams the ability to “fix a firecall before it is a firecall” and work together in a more proactive manner.
From the post:
“So, the Devops movement is characterized by people with a multidisciplinary skill set - people who are comfortable with infrastructure and configuration, but also happy to roll up their sleeves, write tests, debug, and ship features”
It is understandable, and probably unavoidable, that separation of duties won’t go away any time soon, but organizations can allow different members of the different groups greater input and greater access to areas that are currently blocked off. If this is allowed:
“Suddenly the technical team starts trying to pull together as one. An 'all hands on deck' mentality emerges, with all technical people feeling empowered, and capable of helping in all areas. The traditionally problematic areas of deployment and maintenance once live become tractable - and the key battlegrounds of developers ('the sysadmin built an unreliable platform') versus sysadmins ('the developers wrote unreliable code') begins to transform into a cross-disciplinary approach to maximizing reliability in all areas.”
The fact of the matter is that if this isn’t allowed, “nature finds a way.” On more than one occasion in my career, and I’m hardly unique in this, I’ve found a way to get around organizational blocks to solve a production issue. Especially when it allows a trading group to begin trading that was previously blocked, I don’t have a problem with taking the “ask forgiveness later” route, but the central point is that it shouldn’t be something that requires later forgiveness.
Devops, to me, is as much a statement that “this should not stand” as anything else. Organizations should strive to find a way to allow for a general separation of duties without making it so strict as to thwart successful and repeatable migration attempts. How this can be done will vary from organization to organization, but since most organizations allow for the strictness to be relaxed to a certain extent in firecall situations, they should be able to find some similar relaxation during migrations before they become firecall issues.
Addendum: what was the issue?
For the particular real world scenario that I mentioned, what was the cause of the problem?
As it turned out, for months and months, the production migration had been failing every single time. Because neither the migration team nor the infrastructure team could determine the actual cause, they were ‘forcing’ the migration to succeed through whatever manual steps were required to get code into production. Since they couldn’t pinpoint the issue, they didn’t officially raise it to any external group.
Well, in between the last ‘forced’ migration that they silently fixed and the most recent one that failed, the infrastructure team upgraded one of their systems that gave them additional logging that identified the issue.
For reasons that have yet to be explained, the production migration first attempts to migrate code into a “Pre-Prod” environment. The previous developer of the code had fat-fingered a “Pre-Prod” config file to have a duplicate entry that no one had noticed before. So, technically speaking, it was a code error. Making the problem exceptionally irritating is the fact that, technically speaking, there is no purely separate “Pre-Prod” environment; it’s a step carried over from other infrastructures that have separate hardware, etc. Every migration, we are asked to verify a successful “Pre-Prod” migration, but since there is nothing to test, we always automatically verify it as successful.
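In retrospect, this is exactly the kind of thing a trivial pre-migration check could have caught on the Dev side, months earlier. Here is a minimal sketch of the idea, assuming a simple key=value style config file; the file name, the format, and the check itself are my own illustration, not anything that actually existed in this project:

```python
# Hypothetical pre-migration sanity check: flag duplicate keys in a simple
# key=value config file before the package ever leaves Dev. The file name
# and format are illustrative; the real config format wasn't specified.
import sys
from collections import Counter

def find_duplicate_keys(path):
    keys = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            keys.append(line.split("=", 1)[0].strip())
    return [k for k, count in Counter(keys).items() if count > 1]

if __name__ == "__main__":
    config_path = sys.argv[1] if len(sys.argv) > 1 else "preprod.properties"
    dupes = find_duplicate_keys(config_path)
    if dupes:
        print("Duplicate config entries found:", ", ".join(dupes))
        sys.exit(1)  # fail the build early, long before a production migration
    print("No duplicate entries found.")
```

Wire something like that into the build or packaging step and a duplicate entry never even reaches the migration team.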
The person responsible for the migration, having discovered this flaw that had existed for 6+ months, reported it in an online system, demanded a new code package without explicitly telling anyone, and went home.
Fantastic. The dysfunctional corporate exercise that then ensued is a topic for another day.
But, if there had been a ‘devops’ style migration practice, the duplicate entry could have been removed without requiring a whole new build and a new migration, which required explicit management approval.
Nice.