Poacher Turned Gamekeeper

When the only perceived way to get things done is through centralized coordination, a program manager is typically seen as the way to get things delivered. Depending on your context, that may mean an emphasis on deliverables and output over outcomes.
We see this in programs and projects with looming deadlines, overrun budgets and a lack of output. People tend to focus on doing work and overloading the system (maximising utilization), rather than getting each item of work into a usable state and validated by a user / customer / downstream consumer.
One of the reasons for this is often a fragile, tightly coupled set of value streams, where huge, complex, and often manual testing coordination is required to give people supposed confidence that a transaction started at one end of the company will end up correctly calculated on a financial report of some kind at the other.
This is an area where people have been burnt in the past and are overcompensating by putting huge emphasis on what the financial services industry refers to as ‘front to back testing’. This often involves lining up many systems and their respective environments (UAT / QA), getting production data and effectively running the entire process end to end to make sure that the results come out as expected. It isn’t as simple as aggregating multiple feeds, processing calculations and then reviewing the results, but at a macro level that is effectively what happens.
In this case there are many aspects being tested: the functionality, the batch files being sent, the calculation logic, whether the numbers reconcile to an acceptable tolerance, and whether people can somehow explain, or at least sign off, that the results are correct. I’ve often wondered how this approach is viable, especially when it requires such a huge coordination effort and is often planned well in advance, even if the upstream systems have their changes ready to release two months earlier.
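To make that reconciliation aspect a little more concrete, here is a minimal sketch (the line items, values and tolerance are invented for illustration, not taken from any particular report): at the end of such a run, the check often boils down to comparing what the downstream report produced against what the upstream feeds implied, within an agreed tolerance.

```python
import math

# Hypothetical reconciliation check: compare the figures on the downstream
# report against the values expected from the upstream feeds, within an
# agreed absolute tolerance. Names, values and tolerance are illustrative.
TOLERANCE = 0.01

def reconcile(expected: dict[str, float], reported: dict[str, float]) -> list[str]:
    """Return the line items whose reported value is missing or outside tolerance."""
    breaks = []
    for item, expected_value in expected.items():
        reported_value = reported.get(item)
        if reported_value is None or not math.isclose(
            expected_value, reported_value, abs_tol=TOLERANCE
        ):
            breaks.append(item)
    return breaks

if __name__ == "__main__":
    expected = {"net_exposure": 1_250_000.00, "collateral": 300_000.00}
    reported = {"net_exposure": 1_250_000.004, "collateral": 299_998.50}
    print(reconcile(expected, reported))  # ['collateral'] is outside tolerance
```

The hard part is not this comparison, of course; it is getting thirty-plus systems lined up so that the inputs to it exist at all.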

In a world where people are funding programs / projects rather than products and capacity, people often do the work whenever they can rather than when it is needed. This could be for many different reasons, but we know from lean that code sitting in inventory (not released into production) is a form of overproduction and waste. Unless that code is released into the wild and validated to not break anything, it simply sits there accumulating potential technical debt, and may even be in a feature branch veering further away from the latest version of the code in production. This becomes a veritable ticking time bomb: if the only time your code (from two months ago) can be fully tested is when everyone else is ready, you have effectively constrained your system of work to the bottleneck, which is often the furthest downstream system in the value stream.
Often this is a back-office function like a finance organization that needs to generate the output for a regulator and cannot test until the code from all upstream systems is fully ready to be tested.
This puts everyone in a very tense situation, as everyone wants to get their code out of the way so that they aren’t the bottleneck for a regulatory committed date. If upstream teams complete the work early, they cannot validate that the code works; if they wait until the last responsible moment, they may not be in a position to get the work done on time.
Either way, this is a modern-day game of chicken for many teams, and the only perceived way of ensuring that things will all work together (front to back) is through an extensive regime of testing across all systems. This is often supported by a centralized traffic controller effectively orchestrating the whole thing.
Depending on how eager the upstream teams are, they may have committed their code months ago, and who knows how robust their tests are from a downstream perspective. They often lack a way of knowing whether their changes break anything downstream until they test across multiple systems, so things could break and they may not even know why. These late-breaking issues are often due to a lack of comprehensive downstream impact testing. Every time an issue is identified late in the game, it is akin to finding a needle in a haystack of needles.
To be fair to many people in the industry, this is a complex effort where it is hard to have deterministic predictability between each step in the value stream, so it isn’t as if you can offer a quick fix, or the individuals involved would have tried it already (or at least they tried 5–10 years ago, gave up and left).
What are the options?
In general, I try to stay away from solutioning problems, especially when the problem domain is complex, but we know that there are practices and patterns that can be leveraged because other similar organizations with the same regulatory commitments have been able to solve things differently.
I often refer to two analogies when I think about this problem space. The first is based on single-piece flow and easy to understand, but not sufficiently complex for people to appreciate the permutations and variables of things that could happen. The Pachinko machines in Japan are closer, but they don’t produce the same predictable output each time, so I’m not sure the analogy holds either. Either way, imagine needing to coordinate an effort for either of these two examples to work in a deterministic way across 30+ teams, all whilst those teams are working on multiple other activities and priorities.
1. The Rube Goldberg machine
2. The Pachinko machines from Japanese arcades
That’s great, how does that help?
Now that we have an appreciation for the complexity involved, we can start thinking about potential practices and new ways of dealing with these problems to support the incremental delivery of value.
I do not envy anyone who has such a complex environment to coordinate across, especially when there are competing priorities and the perceived value is limited at a product level. There is still a need to coordinate such activities and to ensure we are gradually paying down our risks and incrementally delivering on the underlying needs. That doesn’t stop the work from being incrementally delivered, and concepts like continuous integration and delivery are modern ways of addressing these challenges, even if it means having a separate environment you run daily for the people who are looking to test their changes. If the downstream interface needs can be determined and agreed in a contractual sense, upstream teams can work to that interface even if the system behind it isn’t ready yet. This type of abstraction allows people to get on with the work in a more independent way. We know that companies that create hardware, firmware, drivers and software take this approach: Tesla is the latest example, but this was famously done at HP with their printer firmware decades ago (also described in Lean Enterprise).
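As a minimal sketch of working to an agreed interface before the real system is ready (all names here, such as TradeFeed and InMemoryTradeFeed, are hypothetical and purely illustrative), the upstream producer and downstream consumer agree on the contract and each builds and tests against it independently:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

# All names below (Trade, TradeFeed, InMemoryTradeFeed) are hypothetical.
# The point is the pattern: agree the contract first, then let the upstream
# producer and the downstream consumer build and test against it independently.

@dataclass(frozen=True)
class Trade:
    trade_id: str
    notional: float
    currency: str

class TradeFeed(ABC):
    """The agreed contract between upstream producer and downstream consumer."""

    @abstractmethod
    def publish(self, trade: Trade) -> None: ...

    @abstractmethod
    def pending(self) -> list[Trade]: ...

class InMemoryTradeFeed(TradeFeed):
    """A stand-in the downstream team can test against today,
    long before the real upstream feed is released."""

    def __init__(self) -> None:
        self._trades: list[Trade] = []

    def publish(self, trade: Trade) -> None:
        self._trades.append(trade)

    def pending(self) -> list[Trade]:
        return list(self._trades)
```

Consumer-driven contract testing tools such as Pact formalise the same idea, but even a shared interface plus an in-memory stand-in buys the teams a great deal of independence.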
That is easy to say and hard to do. When you have a series of monolithic systems in place, e.g. many transactional systems, financial ledgers, risk calculators and systems that centrally own reference and static data (things like company codes, legal entities, cost centers, etc.), you end up with a terrifying number of things that could go wrong. That is where feature flags, bitemporality and parallel-running systems can be used so that you can get as close to running in production as possible, as early as possible. That still leaves test data management, but it becomes far more tractable if your reference and static data systems have versioned interfaces that support bitemporality and context. The interfaces should allow a calling system to define its context (e.g. a stress test, a what-if analysis, a production transaction, a test in a QA environment) so that it can call on data from the future as part of its testing criteria and continue to test other criteria in parallel.
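As a hedged sketch of what such a bitemporal, context-aware lookup might look like (the entity, fields and function are invented for illustration), the caller states which business date it cares about, what the system was known to contain at a given point, and which context it is operating in, and the reference data interface resolves the right version:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical bitemporal, context-aware reference data lookup.
# valid_from / valid_to: when the fact is true in the business world.
# recorded_at: when the system learned about it.
# context: which world the caller is operating in, so future-dated or test
# data can be served by the same interface as production requests.

@dataclass(frozen=True)
class CostCentreVersion:
    code: str
    name: str
    valid_from: date
    valid_to: date
    recorded_at: date
    context: str  # e.g. "production", "stress_test", "what_if", "qa"

def lookup(versions: list[CostCentreVersion], code: str, valid_on: date,
           as_known_at: date, context: str) -> Optional[CostCentreVersion]:
    """Return the version valid on `valid_on`, as known at `as_known_at`,
    for the given context; None if no version qualifies."""
    candidates = [
        v for v in versions
        if v.code == code
        and v.context == context
        and v.valid_from <= valid_on <= v.valid_to
        and v.recorded_at <= as_known_at
    ]
    # If several recordings qualify, the most recently recorded one wins.
    return max(candidates, key=lambda v: v.recorded_at, default=None)
```

A "what_if" context with a future valid_on date is how a system could call on data from the future for testing, while production requests remain untouched.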
I am not an expert in these topics, but there is bound to be plenty of information on them and on how they can be pulled together into a maintainable approach. With the context covered in your interfaces and your mocks / stubs, you can test with higher levels of predictability and speed, supporting continuous integration and ideally continuous delivery.
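For example (again with invented names), a downstream calculation can take a context-aware reference data client as a dependency; in a test it receives a stub pinned to a fixed context and fixed answers, so the result never depends on the state of a shared UAT / QA environment:

```python
from datetime import date

# Hypothetical unit test: the calculation under test receives a stubbed,
# context-aware reference data client, so its result never depends on the
# state of a shared UAT / QA environment.

class StubReferenceData:
    def fx_rate(self, pair: str, valid_on: date, context: str) -> float:
        # Fixed, exactly representable answers for the test context only.
        return {"EURUSD": 1.25}[pair]

def convert_notional(notional: float, pair: str, refdata,
                     valid_on: date, context: str) -> float:
    return notional * refdata.fx_rate(pair, valid_on, context)

def test_convert_notional_is_deterministic():
    result = convert_notional(1_000_000, "EURUSD", StubReferenceData(),
                              date(2024, 1, 31), context="qa")
    assert result == 1_250_000
```

In this sketch the same calculation code would run unchanged in production; only the reference data implementation and the context differ.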
The topic of bitemporality is often scary if your systems don’t need to know about such things, but if you can abstract some of this complexity it is possible to continue to support the needs and account for the given context where bitemporal awareness is needed (downstream). Anyone used to testing complex time-dependent / context-specific work will know that it is unavoidable, and often a way out of the tar pit if you want to keep things manageable.
It should be possible to abstract some of this complexity if you start treating each step in your system’s process as independent and abstracted. That also allows you to work and test independently of your static / reference data systems and to test much more quickly as you are building the code.
If you must wait for external systems to be ready and available, the entire process becomes painful and slow; that is ultimately where you end up needing high levels of coordination in place of the versioned alternatives described above. If you need the assurance of testing between systems, we live in a cloud world where you can start up and shut down an instance as needed, so why not start with a daily run, where whoever needs to test against an upstream or downstream system can do so on demand, without an ‘environment manager’ having to be there to align the stars?
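A rough sketch of that idea, assuming the systems involved can be brought up from container images via a compose file (the file name, test path and wiring below are placeholders): a small script, run on a daily schedule or on demand from a CI job, brings the environment up, runs the cross-system checks and tears it down again, with no environment manager in the loop.

```python
import subprocess

# Hedged sketch: bring up a disposable front-to-back test environment from a
# hypothetical compose file, run the cross-system checks, then tear it down.
# The compose file name and test path are placeholders; in practice this would
# sit behind a scheduled or manually triggered CI job.

COMPOSE_FILE = "front-to-back-test.compose.yml"  # placeholder

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def main() -> None:
    try:
        run(["docker", "compose", "-f", COMPOSE_FILE, "up", "--detach", "--wait"])
        run(["pytest", "tests/front_to_back"])  # placeholder for the cross-system checks
    finally:
        run(["docker", "compose", "-f", COMPOSE_FILE, "down", "--volumes"])

if __name__ == "__main__":
    main()
```

Wire the same script to a cron-style schedule or a manual trigger in whatever CI system is in use, and the need for someone to manually align the stars largely disappears.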
Summary
How does this blog tie back to the title, I hear you asking? I was asking myself the same as I got to the summary. There is value in coordinating complex work, but the nature of that coordination needs to change if it is to enable the fast flow of value. Chasing teams for status updates and running rigid ‘front to back’ testing processes that require people to make others available on scheduled dates isn’t sustainable or healthy. To create a more sustainable approach, take a value stream view of the work and prioritize along the value stream so that the entire system benefits. That means investing effort in capabilities that allow the rest of the value stream to benefit, rather than simply making changes to be compliant with a regulation.
Programs often end up with minimal budget for the work, as it has to cover many teams and systems, so they aren’t interested in paying for large changes and improvements. They also have to justify and explain their budgets and project plans to regulators, so they are typically walking a fine line between being seen to spend enough money on the regulation to be taking it seriously, but not so much that the money looks like it is being used to gold-plate the entire organization. This is a Goldilocks-zone scenario.
Regulators in different countries have different levels of experience with incremental delivery, but taking an incremental approach to delivering value, and moving to a more deterministic approach to testing within boundaries, is a way of decoupling the work and reducing risk in the process. That may mean choosing to invest in changes that benefit the whole value stream rather than a single system. Having a clear view of the value stream and its underlying components will allow you to optimize for flow, rather than simply handing out cash to many teams who are often too busy to understand the work at the time of estimation and are making commitments far too early to know what the actual work entails.
Consider having a separate value stream for the regulatory work if the effort is long-lived, or at least multi-year, to ensure the investments in the regulation are optimized based on value rather than slotted into a backlog that the product manager may keep pushing back to the last minute because they don’t see the value of the changes for their product. This is a classic case of local vs global optimization. Quite often both are needed and both add value. Identifying the impediments and bottlenecks in the value stream will, however, allow everyone to benefit, not just a local product / application.
Program managers and business analysts often have massive amounts of insight when it comes to a front-to-back regulatory commitment. Make use of their knowledge and skills to build a clear view of the value stream and treat it as a product in its own right. The shape and interaction of the work should move away from chasing and towards enabling; you cannot do that unless you optimize for the flow of value. Whilst an engineering culture is important in terms of practices and how things get built, those engineers don’t necessarily want to look up at the wider view. This is where optimizing for both local and global can be useful, and even then you may have multiple granularities to focus on, e.g. system, region, customer segment, value stream, business, organization.
I’ve recently heard of the idea of a value stream architect (links below), and that is an idea worth considering for the future. Optimize the flow of value in the value stream and good things can happen! If you have a complex system and only optimize locally, forgetting about the global system, you may create gaps (think of the Rube Goldberg machine).
Think about the complexity in your systems / value streams, determine whether it is essential, accidental, or incidental, and plan / adjust accordingly!