Is Agile agile? Is SAFe safe?

In a discussion with a senior exec once, we got onto the topic of agile process. He at some point told me (rough paraphrase from memory!) “Agile is about knowing what your destination is and getting there faster”. I used a little personal discretion and let it pass, but I was an agile advocate (OK, maybe zealot!) at the time, and this helped me cement in my head what was different about agile methods vs. others.

First some base principles. There are of course a bunch of program management processes and lifecycles out there, and they’ve evolved over time. They all do some similar things: define the scope and requirements of a project, break down the required work into manageable chunks, provide a way to estimate resources and timelines, and tools for tracking progress as you go through execution.

Where agile (as well as earlier variants like spiral lifecycles, etc.) starts to differ isn’t just “scrum” – it’s in the assumptions you make in the early stages. There is a ton written on this (a good starting point is the agile manifesto), but in most approaches, the assumption is that we do know, or need to know, the key answers up front. Who is the customer? What is their problem? What are our requirements? What is our architecture? What work is required? What (and when) are the milestones? In most processes I’ve seen, the model is that we need to have substantially complete answers to these questions early: either before we start, or as the first phase of the program. This makes sense: predictability is generally a good thing, so knowing your scope/schedule/resources upfront makes projects easier to budget and manage.

Agile, though, presumes that you don’t know many of these answers upfront, and in fact that you can’t know. We often do a lot of work to get confidence in program definition early, but as I look at projects I’ve been involved in that didn’t go as planned, it was rarely due to pure execution problems. It’s not that the plan was correct and we just didn’t execute – it’s that things came up that we didn’t anticipate. The customer’s problem wasn’t what we thought it was … we missed a key requirement … a piece of tech didn’t work like we expected (or new tech showed up that was better) … integration wasn’t as smooth as we’d hoped … a competitor did something to change our plan, etc. For many programs, these disruptions were seen as problems (or “thrash”) – changes to plan that upset our predictions, and could impact schedule and/or cost in bad ways.

Where agile differs philosophically is that these disruptions aren’t seen as problems – they are learning, which is core to the process and critical for program success. Agile sets the mindset that we don’t know the answers up front, and in fact can’t know: thinking we know is dangerous self-delusion. The emphasis shifts to developing a plan (there still is one!) for how we can validate the key, expensive assumptions as quickly as possible, with the fewest resources possible. If this sounds to you like the Lean process – yep, it is! As lean processes came on the scene, it always struck me that they were effectively the same as what agile was supposed to be.

Now for control freaks (myself often included) and finance departments, this sounds scary: we’re suggesting starting a project without having firm commitment to scope/schedule/resources – and we are right. The belief driving this, though, is that most of the time that we think we know up front, we really don’t. In traditional methods, we typically lock into a plan, with the ~~assumption~~ hope that it’s right. Thinking agile, though, requires us to step in assuming that we don’t know all the answers yet. With this assumption baked in, we can get to the real correct scope/schedule/resources faster, as we are testing them early and adjusting when it’s cheaper to adjust – not at the end, when it’s extremely expensive.

Practical Agile Process

So what does this mean for our actual process? The core of most agile processes is typically “scrum”. We have small teams, led by a Product Owner, that break work up into dev/test cycles called sprints. In many ways, this is similar to most product development approaches … but the key distinction here is not just the frequency (i.e., how short your cycles are), but that the test results of each sprint cause a planned reset of the scope/schedule/resources before the next sprint. For a lot of agile organizations, though, we’re really in waterscrumfall: we are using scrum as a way to break up and schedule work, but the feedback cycle that changes our plan either doesn’t exist or is seen as a problem. If we’re getting agile right, feedback-induced change is not a problem – it’s the process by which we get to the right answer the fastest at the lowest cost.

Sprint length is often seen as a key metric for how agile teams are – i.e., the shorter your sprints, the more agile you are. This isn’t totally wrong, but I’d suggest it’s not our development cycle frequency that defines our sprint length – but rather, it should be the test/feedback/change cycle. If we are doing weekly dev sprints, but only test and adjust plan once a month, our real sprint length is a month, not a week.
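As a rough illustration of that arithmetic (the class and field names are made up for this example and don't come from any scrum tooling), the real sprint length is simply the slower of the two loops:

```python
from dataclasses import dataclass


@dataclass
class Cadence:
    """Cycle lengths in days. All field names are illustrative."""
    dev_cycle_days: int       # how often the team completes a dev sprint
    feedback_cycle_days: int  # how often test results actually change the plan


def effective_sprint_days(cadence: Cadence) -> int:
    # The plan can only adapt as fast as the slower of the two loops.
    return max(cadence.dev_cycle_days, cadence.feedback_cycle_days)


weekly_dev_monthly_feedback = Cadence(dev_cycle_days=7, feedback_cycle_days=30)
print(effective_sprint_days(weekly_dev_monthly_feedback))  # 30
```

Shortening the dev cycle alone leaves that number unchanged. Only a faster test/feedback/change loop brings it down.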

If this sounds like I’m saying testing matters more than dev, I kind of am! We of course have to do development to have something to test, but if we’re really getting the core principles right here, our schedules would be built around testing, not development. A key aspect of this is what testing is. For waterscrumfall programs, per-sprint testing early in the program is often only technical: i.e., unit tests to validate expected behavior, etc. Other testing – integration, security, privacy, much less customer feedback – happens only at the end of the program when most development is complete. For agile, this is backward … we want to get to full system test and customer feedback as early as possible, well before we’ve done much of the development. We then add to these tests progressively as we add more functionality, but we’re getting full system feedback the entire time, not just at the end. This can require some creativity in how we do testing, but it’s well worth it.

This may all sound like advocating that we don’t do any planning or architecture up front. We definitely should, but I think our mindset around this is different in agile. We do need a plan, and we do need to develop (initial) architecture. The distinction is that we think of this initial plan and architecture as a starting hypothesis – call it v0.1. It’s the starting plan and architecture, with a set of assumptions that we use to model our business success. This plan then defines the work of the first sprint(s): validate these hypotheses, and progressively advance the plan and architecture as we learn. These changes to the plan and architecture aren’t thrash – they are the key outputs of the first few sprints.

Whither SAFe?

A lot of what I’ve talked about so far has been in the context of individual scrum teams, which of course limits the scope to what a single team can accomplish. For a lot of larger enterprise programs, though, the scope is well beyond what an individual team (at least a 2 pizza team) can accomplish. SAFe is the hip approach to this for a lot of organizations these days, and at a high level, it’s great stuff: it gives us methods to apply the agile principles and approaches discussed above at that larger scale. It does, though, apply a lot of extra process and constraints on individual teams, and often brings its own flavor of waterscrumfall. Sometimes this is justified or required – but I’ve often seen it applied a lot more broadly than it needs to be. In most cases, before jumping into SAFe (or applying it across all teams), we should be looking at our organizational and technical architectures. Do all of our teams really need to be synchronized, or is this a flaw in our organizational or technical architectures? The more we can keep teams decoupled so they can develop, test, and deliver independently, the more efficient and less process bound we can be.

A Caution

Agile isn’t an excuse for bad decision making. It should not be an enabler for late, chaotic changes (though it can handle them better). The trick is for change to be based on real customer feedback/signals/data. Data drives decisions, decisions don’t choose their data (now that’s a novel thought these days!). We shouldn’t avoid decisions, or change requirements based on whim – but rather, frame decisions as hypotheses declared up front. We then test to validate (or change) them based on real customer feedback and data. Managers that use “being agile” as a way to make late, non-data based decisions aren’t really being agile – they’re working outside the process, not inside it.
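One way to picture "hypotheses declared up front" is as a record whose success criterion is fixed before any data comes in. This is only a sketch: the `Hypothesis` class, the metric name, and the numbers are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A decision framed as a testable claim. All names here are illustrative."""
    statement: str
    metric: str
    threshold: float  # success criterion, fixed before any data is collected


def evaluate(hypothesis: Hypothesis, observed: float) -> str:
    # The data decides; frozen=True keeps the threshold from being moved later.
    return "keep" if observed >= hypothesis.threshold else "revise"


onboarding = Hypothesis(
    statement="Customers can finish setup without calling support",
    metric="unassisted_setup_rate",
    threshold=0.80,
)
print(evaluate(onboarding, observed=0.62))  # revise
```

The point of freezing the record is that the threshold can't be quietly adjusted after the results arrive, which is the "decisions choosing their data" failure described above.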

My idealized flavor of agile may not be the answer for every project, and there are certainly projects where it’s just not possible (or safe!) to do “in the wild” testing of some products early in their lifecycles. That said, we should have a little self-introspection on our “agile” programs: are we really getting what we think we should be from them, or are we unconsciously harboring some cognitive dissonance with agile methods and terminology inside processes that are really not agile at all?

Making IoT Matter … not just a taped-on trinket

In my years as a systems architect for HP, I often worked on what we referred to as “cyber-physical systems” – mechanical hardware devices that get connected to online systems. While HP is known first as a HW company, the amount of software developed as part of this is pretty impressive, and though the core of many of these products remains physical, a lot of the value provided is through software, often cloud connected. HP’s “IoT fleet” is at a pretty impressive scale.

IoT as a meme has been around for a while of course, and though in some ways it’s maturing, there are some themes that are evolving. How hardware is connected to the cloud will be a big factor in any company’s long-term success. Hardware companies, which traditionally think of software in terms of firmware, face a special challenge. They may, without realizing it, carry their legacy hardware mindset into their new cloud connected systems.

I’m going to explore three ways I have seen this manifested, and how companies can flip the script on each. I’ve added a fourth pattern that I think is coming, though I’ve not seen it much (yet). None of these ideas are new, but they should be given fresh consideration as technology companies plan their next steps in a cloud-connected world.

Bugs & Quality

Whenever technology is developed, there will be bugs – stuff happens! Long before we connected our hardware to the cloud, there was a general pattern that hardware (including circuits on chips – ASICs) locks first and is the hardest to change. If we find a bug at that level too late to get into production, we push the fix up to firmware. And from there, if it’s too late to fix in firmware, we push the fix to client systems, and now to the cloud. In many ways, this makes sense: it’s often faster and cheaper to deploy fixes in the cloud, and this can save schedule and money. It does, though, have a dark side: as we push these defects upstream, we create a legacy of exceptions in our cloud systems that creates a lot of error-prone complexity and is a long term drag on delivery and test productivity. A different pattern to consider is a qualification test model: cloud systems connecting to HW create a set of qualification tests that the HW programs use during their development, so they test against cloud requirements early in their development cycles. This is similar to the Microsoft WHQL (Windows Hardware Quality Labs) model: rather than push device-specific exceptions into cloud systems, HW dev isn’t “done” until it passes the qualification tests, keeping the cloud systems clean.
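A minimal sketch of what a cloud-owned qualification suite could look like is below. Everything in it is hypothetical: the report shape, the check names, and the supported protocol versions are assumptions for illustration, not any real HP or Microsoft interface.

```python
from typing import Callable, Dict, List

# A device report is modeled as a plain dict; a real system would use whatever
# telemetry schema the cloud side publishes (this shape is assumed).
DeviceReport = Dict[str, object]
Check = Callable[[DeviceReport], bool]

# Hypothetical cloud-owned requirements. Hardware programs run these during
# development instead of asking the cloud to special-case their device.
QUALIFICATION_CHECKS: Dict[str, Check] = {
    "reports_firmware_version": lambda r: isinstance(r.get("firmware_version"), str),
    "uses_supported_protocol": lambda r: r.get("protocol_version") in {2, 3},
    "sends_device_id": lambda r: bool(r.get("device_id")),
}


def failed_checks(report: DeviceReport) -> List[str]:
    """Return the names of every qualification check the device fails."""
    return [name for name, check in QUALIFICATION_CHECKS.items() if not check(report)]


def is_qualified(report: DeviceReport) -> bool:
    # Hardware is not "done" until this returns True.
    return not failed_checks(report)


prototype = {"device_id": "proto-001", "firmware_version": "0.3.1", "protocol_version": 1}
print(failed_checks(prototype))  # ['uses_supported_protocol']
```

The design choice worth noticing is who owns the list: the cloud team publishes the checks, and a failing device gets fixed on the device side rather than earning an exception in the cloud code.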

Value Delivery

For many hardware-centric companies, the traditional product delivery model emphasizes value propositions through the HW product: this makes sense, as HW products are often the primary path of monetization. This approach, though, drives us to maximize differentiation between products, which complicates cloud development and delivery. If we look at more cloud-native systems, though, much of the core value and differentiation is system based. As an example, Amazon Kindles exist primarily to deliver system value – i.e., Amazon content consumption. The emphasis from a system architecture perspective is that hardware products deliver cloud-hosted value. This shift allows us to dynamically add value to our products over time, and limits cloud work tied to specific hardware products. We of course will continue to provide value and differentiation through hardware (e.g., the equivalent of different Kindle models), but the system architecture emphasis is on a common set of value delivered through all hardware.

Delivery Cycles

Many companies snap to hardware product release cycles today and tie their cloud deliveries to those hardware product cycles. This is obviously critical for hardware supply chains and channels, but it does leave value on the table that could be exploited, especially with long term customer engagement. If we contrast this with Tesla … their “hardware” is considered a long term investment as a platform for ongoing system delivery through software and cloud systems (as an aside … there’s a theme in here about value delivery timeframe impedance matching between long term “capital” purchases vs. short term “value” acquisition – but that’s another post!). Tesla can both push new features to in-field customers and push customer-specific patches, which flips the notion of what a “finished” product is. Rather than pushing everything onto a single schedule aligned with the hardware delivery, a “core” feature of the initial delivery is the set of mechanisms for granular, dynamic delivery in the field. The lifetime value of the product then is expanded over time, on the schedule and timing of cloud system delivery.

Extra Credit … no more firmware! (sort of)

We today treat firmware and software (or cloud) development as very different disciplines, with different tools, languages and processes. Given modern cloud (and often client) development becoming container based … what if we push certain classes of FW development into a container hosting model? There are obviously constraints here, and some parts of FW (e.g., real time mechanical control) may never move to this model. Much of what we call firmware though is really just software that happens to run on-device. If we can get a common deployment model in the cloud and in devices, we can have teams effectively develop “both sides”. Much of our system complexity (and often our defects) involves coordinating development, and codifying interfaces and data transfer specs, between firmware and cloud/client systems. Rather than spread functionality expertise across teams due to deployment impedance, teams could own and deliver full “features” deployed across the system. There is definitely a lot to work out in this kind of model, but doing so could make us much more agile in system feature delivery and increase our software development flexibility as we reduce the extent to which our developers are siloed in “firmware” and “software”.

It’s pretty clear that more and more of our devices will continue to become cloud connected, and I’m looking forward to seeing how this evolves. There’s a lot, though, to getting the value in this transition beyond just creating a cloud add-on to a hardware product. Re-framing how we do IoT from “stuff we tack onto our hardware” to being integral to the product will lead us down a much more valuable path.