CI/CD with Robert Erez

Overview

This episode is a practical tour through modern CI/CD with Rob Erez, who has spent more than a decade working on deployment systems and joined Octopus Deploy early on. The conversation moves from the basics of continuous integration, delivery, and deployment into the harder parts teams hit in practice: progressive delivery, GitOps, feature flags, roll-forward strategies, and how AI may change release pipelines.

Key Takeaways

A clear distinction runs through the episode: continuous integration means merging code and testing it often, continuous delivery means the software is always in a deployable state, and continuous deployment means it actually goes to production automatically. Rob’s point is that teams do not all need the last step. Regulation, staffing, and operational risk can make full continuous deployment a bad fit, even when strong delivery practices still make sense.

Progressive delivery comes up as the next step after getting a decent pipeline in place. Instead of releasing to everyone at once, teams release in controlled slices, watch the system, and expand only if the signals look good. Rob uses canary releases as the classic example, including the old Skype habit of treating New Zealand as the first real-world test group.
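
As a rough illustration of "release in slices, watch, then expand", here is a minimal Python sketch. The step schedule, the signal fields, and the thresholds are all assumptions made for the example; none of them come from the episode or from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Health signals for the canary slice (hypothetical fields)."""
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile latency in milliseconds


# Hypothetical schedule: percentage of traffic sent to the new version.
STEPS = [1, 5, 25, 50, 100]


def is_healthy(s: Signals, max_error_rate: float = 0.01,
               max_latency_ms: float = 500.0) -> bool:
    """Assumed health rule: both signals must stay under their thresholds."""
    return s.error_rate <= max_error_rate and s.p95_latency_ms <= max_latency_ms


def next_percentage(current: int, signals: Signals) -> int:
    """Expand to the next step if healthy, otherwise pull traffic back to 0."""
    if not is_healthy(signals):
        return 0
    for step in STEPS:
        if step > current:
            return step
    return 100  # already fully rolled out
```

The point of the sketch is the dependency on signals: without trustworthy observability, `is_healthy` has nothing to decide on and the partial rollout adds little.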

One of the more useful arguments in the episode is that feature flags are often better than deployment techniques like blue-green or canary on their own. A deployment moves code; a feature flag controls exposure. That separation gives teams a safer way to release changes gradually and turn them off fast when something goes wrong.
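
A small Python sketch of that separation follows. It is a hand-rolled flag, not the OpenFeature SDK mentioned later in the transcript, and the class name, the `checkout` function, and the hashing scheme are assumptions chosen for illustration.

```python
import hashlib


class FeatureFlag:
    """Hypothetical in-memory flag: a kill switch plus a percentage rollout."""

    def __init__(self, name: str, enabled: bool = False, rollout_percent: int = 0):
        self.name = name
        self.enabled = enabled                  # kill switch: False turns it off for everyone
        self.rollout_percent = rollout_percent  # 0 to 100

    def is_on_for(self, user_id: str) -> bool:
        if not self.enabled:
            return False
        # Hash flag name and user id so each user lands in a stable bucket 0..99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent


def checkout(user_id: str, flag: FeatureFlag) -> str:
    """Both code paths are deployed; the flag decides which one a user sees."""
    return "new-checkout" if flag.is_on_for(user_id) else "old-checkout"
```

Setting `enabled = False` is the "reach for the toggle" move: exposure stops without another deployment, and the new code stays in place while the problem is investigated.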

Rob is also skeptical of the sloppy way people talk about GitOps. His take is that GitOps is less about Git itself and more about declarative state, versioning, pull-based updates, and continuous reconciliation. The name pushes teams toward an "everything in Git" mindset, but that breaks down with secrets and other cases where Git is a poor fit.
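
To make "declarative state plus continuous reconciliation" concrete, here is a minimal Python sketch. It models state as a plain mapping of workload name to replica count, which is an assumption for the example and far simpler than what real GitOps agents or Kubernetes controllers handle.

```python
from typing import Dict, List, Tuple

State = Dict[str, int]  # workload name -> replica count (assumed toy model)


def plan(desired: State, actual: State) -> List[Tuple[str, str, int]]:
    """Compute the actions needed to move actual state toward desired state."""
    actions: List[Tuple[str, str, int]] = []
    for name, replicas in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name, replicas))
        elif actual[name] != replicas:
            actions.append(("scale", name, replicas))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name, 0))
    return actions


def reconcile(desired: State, actual: State) -> State:
    """Apply the plan in place; an agent would repeat this on a timer."""
    for action, name, replicas in plan(desired, actual):
        if action == "delete":
            actual.pop(name)
        else:
            actual[name] = replicas
    return actual
```

Nothing in the sketch depends on Git. The desired state only has to come from somewhere versioned and immutable, which is the point Rob makes about the pillars.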

Another strong point is the advice to stop centering rollback as the default recovery plan. Rollbacks sound clean until schema changes and stateful systems get involved. In those cases, rolling forward with a fix is often safer and more realistic than trying to restore the past.

Practical Steps

Get to continuous delivery before chasing continuous deployment. Make sure builds, tests, and deployment steps run reliably enough that production release is a choice, not a scramble.
Start progressive delivery with one feature flag on one meaningful feature. Use it to separate deployment from release and get used to controlling exposure in production.
Add ownership and expiry to every feature flag. Rob says his team tracks which team owns each toggle and when it should be removed, then uses CI checks or notifications to clean them up.
Build observability before leaning on canaries. If you cannot tell whether the new version is healthy, a partial rollout does not buy you much.
Plan for roll-forward recovery. Test how you will patch a bad release quickly, especially when database changes are involved.
Treat GitOps as a model for desired state, not a rule that every operational detail belongs in Git. Keep secrets and other sensitive runtime concerns in systems designed for them.
Consider ephemeral environments for branch-level testing. They give engineers and reviewers a short-lived, shareable environment without the contention of a single shared test box.
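
The flag ownership and expiry step above could be enforced with a small CI check along these lines. This is only a sketch: the `FlagRecord` fields and the idea of a flag registry are assumptions for illustration, not a description of how Rob's team implements it.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class FlagRecord:
    """One entry in a hypothetical flag registry."""
    name: str
    owner: str     # owning team; empty string means unowned
    expires: date  # date by which the flag should be removed


def find_violations(flags: List[FlagRecord], today: date) -> List[str]:
    """Return one message per flag that is unowned or past its expiry date."""
    problems: List[str] = []
    for flag in flags:
        if not flag.owner:
            problems.append(f"{flag.name}: no owner")
        if flag.expires < today:
            problems.append(
                f"{flag.name}: expired on {flag.expires.isoformat()} "
                f"(owner: {flag.owner or 'none'})"
            )
    return problems
```

A CI job could fail the build, or just notify the owning team, whenever `find_violations` returns a non-empty list.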

Notable Quotes

"The reality is it doesn't really suit every company." - Rob Erez, on continuous deployment
"Nothing in any of these pillars actually talks about Git." - Rob Erez, on GitOps
"If I've got a failure in version two, my rollback isn't to go to version one, it's to go to version three." - Rob Erez

"If you ship something and there's a problem and you can reach immediately for the toggle and switch it back off, you've stemmed the bleeding and can calmly understand what's wrong." - From the episode

Full Transcript

Source: openai 1h 14m runtime

CI/CD remains one of the hardest things to get right in software engineering. But why? Rob Erez is a CI/CD expert, having worked in this field for more than a decade. In the early 2010s, we were teammates on the Skype for Web team, and then Rob joined Octopus Deploy as one of the first engineers 10 years ago. In today's episode, we cover progressive delivery in practice, canary deployments, blue-green, and why feature toggles are often still better. What is GitOps, and why it's not about Git, and where that everything-in-Git mindset breaks down. How to prioritize rollbacks less, and focus on roll forwards. And many more. If you want hard-earned lessons about CI/CD, progressive delivery, and what's coming as AI changes how much code we ship to production, then this episode is for you. This episode is presented by Antithesis. Verify your system's correctness without human review or traditional integration tests, and avoid bugs or outages. Today's episode will be about CI/CD. CI/CD at scale is one of the hardest infrastructure problems to get right, and the teams who nail it know that the details very much matter. This is where I need to mention our season sponsor, WorkOS. WorkOS brings the same rigor as many of us use with CI/CD at scale to Enterprise Auth. SSO, SCIM, RBAC, production-ready, battle-tested, and built to handle real load and real compliance requirements. To add Enterprise Auth without the infrastructure project, visit workos.com. Rob, it's awesome to have you here on the podcast. Hello, Gugge. It's good to be here. I'm loving Amsterdam. Yeah, it's been like, what, 11, 12 years since we worked together? Yeah, yeah, I think 2015, 2014, 2015, I think I left the UK. Yeah, that's a while. And Skype, when there was still Skype. Our team somehow inherited the Outlook.com plugin, which had like 400 million users per month or something like that. Yeah, it was crazy the amount of usage. It was massive scale. So this was an interesting job. 
Deployments were very much a case of, you know, you ship once a week and you have to go to a CAB, you know, a change advisory board, and you have to get sign off and approval. And I always found that really weird, right? Like we're building this piece of software. It runs on the web. We can ship it whenever we want. It was running on Azure at the time. And so, you know, we've got full access to push whenever we want. And we make these changes through the week, but we'd kind of have to hold them back. I guess to Adela, our manager, both of our managers at the time, we kind of, I guess, worked around the system. When the code was ready, we'd build it and ship it through the week. And I was really sort of impressed and proud of this process that the whole team had kind of put together, right? Where we'd commit the code, the tests would run, several kind of layers of testing, it would go to staging, et cetera, and then it would get shipped to production. So we're kind of, I guess, executing a form of, I guess, you know, continuous delivery at the time. And we would then ship ourselves, you know, once a week. I kind of always like to tell this story that at the time, you know, when we'd have a build ready to go, you know, we'd do a form of canary deployments. And so this is where you kind of roll out to a small percentage of your customer base. And we always found that the customer base that would be our test subject was New Zealand. So New Zealand was always our canary. A bunch of reasons for that, you know, they're in the, you know, the first country to kind of reach this, you know, new date. So they're always the first ones to kind of roll out into a time. When it comes like, you know, like midnight passes, it's like 1 a.m. First country is New Zealand. Bang, exactly. So the first country that's of, you know, significant size. They speak English, so if there's any bugs or issues or reports, it's kind of easy to understand. 
But, you know, to be honest, New Zealand is small enough that no one really cared if we shipped a bug and had to fix it quickly. So sorry to all the New Zealanders listening. But yeah, I think that's kind of this good example of using a continuous delivery technique to ship the code faster than what we otherwise could have if we had these kind of big bang releases. And this whole process, I guess, opened my eyes to, you know, what progressive delivery, what good CI, CD could be. And yeah, I guess from there, I spent a few years there at Skype. And eventually wife and I decided it was sort of time to come home to Australia. And then back in Australia, you went to start to work at Octopus Deploy? Yeah, eventually I came back and actually worked at a place with a friend of mine just for a little while, just to kind of get back on the feet. And I remember they were using Octopus Deploy there. And so Octopus Deploy, for those who don't know, is a deployment tool that was built and developed originally in Brisbane. So there was a strong kind of Brisbane attachment to it. Yeah, that's right. And then when I found out that they were hiring, I thought, okay, why not? I'll give it a go. I like CI, CD. I like this space. I think there's, you know, a lot of interesting problems in this space. So I applied and joined. And at the time I was employee, I think employee number eight or nine or something like that. So it was very much still a bit of a startup culture. Definitely not startup in the sense of, you know, Silicon Valley, wild parties and, you know, ridiculous spending, but startup in the sense that everyone who I worked with was an engineer. Paul Stavel, the CEO, he's an engineer. This is kind of where it started from. And so we'd all be working on code together. I mean, if someone had an idea, you'd have a bit of a chat about it and ship it. So we were the marketing, we were support, we were kind of a bit of everything. 
And yeah, obviously the company has grown a lot since then. The company was focused, Octopus Deploy from the start, they were focused on deployments, right? Can we talk a little bit on whenever I think about deployment, I always say CI, CD, Continuous Integration, Continuous Delivery. Why was there a focus on deployments? And is that the same as continuous delivery? Yeah, interesting. So you're right. Like quite often people talk about CI and CD as this kind of interchangeable. They're either interchangeable or the word is CI, CD. It's just attached to itself. It's hard for me to imagine a CD without a CI, Continuous Integration. That's right. And I guess the way to look at it is, you know, you've got sort of multiple stages of maturity of software teams as they kind of move on their way from, you know, initially CI, which Continuous Integration, this is the idea that... Well, initially it's YOLO. Initially it's YOLO. Yeah, that's right. YOLO and launch, you just deploy to prod or SSH to prod. We've all worked in places where we've done that. And that's the starting point. So you're right. YOLO is the first stage. The second stage is, you know, Continuous Integration. And so this is this idea where you want to keep integrating, merging your code changes into a single branch and you want to be continually running tests against it. Now, Continuous Delivery is kind of the next stage where, you know, we talk about testing our code and there's, you know, unit tests and integration tests, et cetera. But what you also really need to test is your deployment process itself. Right? So Continuous Delivery is this idea, okay, you want to make sure that at any point in time when I click the button to deploy, I want it to go to production once we kind of get to this place. The next stage beyond that, which, you know, not all companies necessarily reach is Continuous Deployment. Right? 
So this is the idea that not only are your changes being merged and merged together at the same time and ready to go, but they're also being shipped to production, essentially. So the stages we have is first YOLO, then Continuous Integration, then Continuous Delivery, and Continuous Deployment. That's right. What is the difference between Continuous Delivery and Continuous Deployment? The big difference, I guess, is the question of do your changes go out to production automated? Does it kind of flow through without any intervention, I guess? And then for Continuous Delivery, they go out but not necessarily to production, right? That's right. And so that's why you'll have environments like, you know, Dev environment or testing or staging or whatever. Now, it's possible that, you know, some parts of that process may also still be manual. You know, maybe you only update the test environment once a week so the testers can play around with it again. But the key principle is that you could, you can kind of push it through sort of automatically the whole way through if you want. And what teams would not want to do Continuous Deployment, right? Because it seems to me Continuous Delivery, you kind of want to get to because then you just get more and more feedback, right? But then it is a kind of a good question, like, should it go out immediately? This is the question, you know, everyone always asks. Like, it's almost ready to go out. Why can't we just push it to production? As engineers, you want to push it out as soon as possible. It's ready, right? The reality is it doesn't really suit every company, right? So, you know, it may be the case that, you know, some companies really do still have, you know, review boards where you need to validate, is this good to go out? Particularly if you're in an industry that has a lot of regulation and compliance problems, problems? Compliance requirements. 
And they need to make sure that when it does go out to production, it's sort of done at the right time, with the right people available, et cetera, et cetera. It's not necessarily true to say that everyone should be going to Continuous Deployment because that's, you know, sometimes just not viable for various reasons. But if you at least got to that point where you're sort of continually seeing your changes go through all the testing, you know, you're promoting it through the different environments, which is, you know, you're therefore testing the process itself. If you can only click that button to go to production once a week or whatever, okay, that's fine. You know, you've done a lot of that hard work. You've mitigated risk, which is what a lot of this process is about, right? Is feel the pain as soon as possible and de-risk anything that could go wrong right up until that last point. So I know you're deep into CI, CD, or Continuous Integration, Continuous Delivery, Continuous Deployment. You've been doing this for like, what, 10 plus years now? But I was pretty surprised to see that when I checked Octopus Deploy, it said Deployment, it says Continuous Deployment, Continuous Delivery, but it also says Kubernetes. How has Kubernetes kind of arrived in the topic of CI, CD, and in general infrastructure? What happened there? Yeah, Kubernetes is the platform of the moment. If we take a bit of a step back, Kubernetes came out of, you know, Google. I guess they originally had Borg, you know, they were using it to host and run their infrastructure. They ended up releasing Kubernetes partly, I'm not going to, you know, pretend I can read their minds and know exactly why, but partly as a way of helping to level the playing field between them and some of the other cloud vendors. Yeah, so like, Under Steve's desk. But even when we talk about, you know, small computers, some of our customers have Kubernetes clusters basically in their point of sale systems. 
So they have hundreds and hundreds of stores and they have little Kubernetes clusters essentially running them and each one's independent. And they, you know, run into their own problems with that because particularly at scale, when you've got, you know, thousands and thousands of clusters and you know, these, these customers are following various GitOps practices, et cetera, where they're pulling the actual state from a Git repository. So the Git repository itself becomes the bottleneck or they start getting throttled. And so they have to resort to other mechanics to try to sort of mitigate and work around that. I was talking to another one of our customers actually just the other day at KubeCon there, who they are deploying, they've got Kubernetes clusters running on research vessels and those research vessels... As in boats. As in ships. That's right. I'm not going to pretend to know exactly what they're doing on those ships. We didn't quite get into that detail. But they've got Kubernetes clusters out in the open sea, which is apt given Kubernetes' name. The problems they run into though are a little bit different, right? So for them, you know, those boats might be out at sea for, I don't know, weeks, months at a time, whatever that might be. So when you want to do a deployment, the ship's not available. So when that ship comes back into port, it needs to get the update, right? So we'd be talking through how you would achieve this and how that process would work. This is super interesting. And I love how you kind of get a peek into so many different types of teams through the fact that, you know, like you're talking with them with how they do the deployments, but you probably see some other things that they're doing or things they're struggling with. What are some trends you're seeing across the industry in terms of this wide range of companies you work with, from startups to like finance companies to like these research vessels? 
Yeah, I guess one of the big trends these days is a lot of focus on GitOps. So GitOps is this. What is GitOps? What is GitOps? That's a good question, Greg. Let's take a step back for a minute. So, you know, we mentioned, we talked about Kubernetes earlier, and we talked about the fact that it's kind of got this internal continuous reconciliation process where you say to the cluster, please spin up, you know, five pods and it takes that desired state and ensures it always sort of is true in the world. And so there was a lot of products around that that were doing similar thing. You know, Terraform does that for infrastructure, et cetera. And a bunch of people started wondering, why can't we sort of take that process and pull it back further so that not only is Kubernetes just dealing with desired state, but we can pull it sort of directly out of Git. And so, you know, I can, as an engineer, make changes to that, that Git definition, that desired state. And I'd have some process that essentially pushes that to the cluster and ensures that it remains in line with what I'm asking, what I'm expecting. And so the term GitOps was coined by, by WeaveWorks in, I think it was 2017 or so. And as a, as a general practice, it sort of started picking up steam, particularly in tandem with Kubernetes, because at its core, Kubernetes is very declarative, right? Later on, sort of in the early 2020s, it was kind of formalized a bit more and there were sort of four key pillars of GitOps. The first being essentially declare, you want your state to be declarative. So this is the idea that you want to define what you want the state of your infrastructure to look like. This is to basically make things a lot, I guess, simpler to, to understand what the state of the world is going to be when a deployment takes place. So if you think about deployments that are a bit more imperative, that has sort of a process, the end result is sort of the result of multiple steps. 
But when you're wanting to just update some infrastructure, that desired state kind of works really well, particularly in the Kubernetes space. And then in GitOps, the desired state will be just describing like how many nodes I want or like how many, I don't know, replicas do I want on a database or how many web servers or like load balancer, how to be connected, that kind of stuff. That's right. Yeah. So it's, it's basically a way of being able to say, I want my infrastructure to have whatever state it is. And then the, the GitOps agents, GitOps products basically ensure that remains the case. So they'll keep applying it to Kubernetes. So you've kind of got this, this situation where Kubernetes keeps its internal status in sync with reality. And now you've got these GitOps tools that keep the declarative configuration in sync with what Kubernetes is expecting. So they will take the whatever I put in Git and whatever format I use, and they kind of translate it into something that makes sense for Kubernetes and now Kubernetes can apply it. Yeah. I mean, ideally you want it as close as possible to what sort of, I guess, Kubernetes is expecting because... Exactly. That's right. And so what you're describing there, I guess is the continuous reconciliation. And so this is the idea that these, these GitOps apps will essentially, as we said, take that state and apply it. And if there's any drift from Kubernetes side, so for example, someone, you know, runs kubectl, you know, delete pod or a delete deployment or whatever the case might be, because your desired state is now stored in Git in this case, that will kind of self-repair. The second pillar of GitOps is that that desired state that you've sort of defined should be stored somewhere that's immutable and versioned. And so this is the idea that once I say that I want to have this state, I want to have sort of something I can point to, a pointer, and that might be a tag or a commit SHA or whatever. 
And I want to basically use that to define what that actual state should be. And I don't want that to be able to change, right? Because otherwise that kind of defeats half the point. By having it versioned and immutable, it also makes things like auditing a lot simpler, right? You can see the transition of that, that desired state over time. What's interesting though, is a lot of people will point to that and go, yes, version and immutable. I know what that is. That's Git. I was about to say that because Git gives you definitely gives you versioning or it gives you commit history. I'm not sure if it gives you versioning and immutable in the sense that, I mean, the past can not be changed. That's, that's right. Actually, can it? Because you can rewrite the history. You're right. So you, depending on how you sort of configure your GitOps agent, you know, you certainly can rewrite history. If you have it pointing at a tag, for example, you can change tags. And so I suppose, you know, best practices around that, I guess kind of, you know, wiggle the finger a bit if you're using tags to manage that sort of state. But what's interesting though, is really nothing in the, in these pillars, and very quickly, the third one being pull versus push. And so this is the idea that your, your GitOps agent will pull the state from GitHub and put, or Git, I should say, and put it into the cluster. And the fourth being continuous reconciliation. But nothing in any of these sort of pillars actually talks about Git. And I think that the naming of GitOps is kind of kind of gets people to already have this expectation that everything has to be in Git. I mean, why would you not have that expectation? That's what I assumed. That's right. I think the problem though is, not everything should be in Git, right? So you've got this constant kind of conversation within that community about, you know, where do you put secrets, for example? So no one would say, That's right. Do not put it in Git. 
And so that's the thing. So, you know, there's been all these solutions to try to put it in Git. So there's things like sealed secrets where you encrypt it and put it in Git. Sounds like a terrible idea. But I guess what's really, this is highlighting is the reality that some things don't need to be in Git, right? As long as you can have this sort of control over the versioning or immutability of it, then that's, that's completely fine. And then the trend around GitOps is, is what you're seeing that a lot more infra teams are moving from, okay, a few years ago, they might have just like made definitions for Kubernetes and now they're moving over to GitOps. So saying, okay, we'd like to control infra in, in, in a tool in a way that's, that's described, that's in version control. Is that the trend or what is the trend around GitOps? Um, I, I guess it's more just the trend of, of the growth in general of GitOps in, in enterprises, right? So not every company out there is using Kubernetes today. And as they sort of approach Kubernetes and they're looking at, well, how do I, how do I, you know, perform the deployments? How do I manage that process? GitOps becomes the sort of de facto process. And to some extent it is giving rise to this idea of using it to manage other things outside of Kubernetes. And there are a few examples of projects and experiments that will use things like Terraform and there's a continuous And test them and testing is hard, so you can't stop. Unit testable, so it's now a different job. I remember when I was on earlier teams where, typically on a mobile team, you have a mobile team of five people and one of them, one of us had to kind of specialize in Jenkins configurations because Jenkins is oftentimes the or used to be the mobile CI/CD. And it's kind of like half a person dedicated to that. And it was more like, you know, like we had to draw a stick on who's going to do it because we wanted to build stuff. You want to write code, right? 
You just want to focus on writing code. And so if you're spending a bunch of your time sort of managing infrastructure and pipelines and things, you know, that's no fun for anyone. And so platform teams have come about as a new way of solving that problem where it's different to kind of, you know, this idea of a DevOps team or Ops team that kind of own the whole process. They more sort of define best practices and they provide ideally a self-service mechanism where application teams can essentially use, often what's called as an IDP, an internal development portal, and they'll be able to essentially self-service. And, you know, maybe they want to spin up a new project and they're able to use a template that the platform team have generated. And so the platform team are able to sort of create these standards throughout the, throughout the company and they can be responsible for sort of, I guess, the definitions of those processes and the best practices and how to achieve that. But the ownership of the actual running operational sort of element is still within the teams, right? So they still get those benefits of, you know, DevOps being close to the, close to the real code and feeling the pain if there's a problem and et cetera, et cetera, et cetera. But they don't need to spend all that time becoming experts in, you know, all the different ways that you can deploy the software they've got. And so this has become really common now where particularly as you sort of get to a larger size, platform teams are a great way of solving that problem. Now, that's not to say that every company everywhere should have a platform team. I mean, if you're a smaller company, sometimes it's, you've just got the app team and they sort of are doing, you know, quote unquote DevOps. 
But this is certainly something that as, as you sort of start seeing larger organizations with multiple teams and multiple projects, these platform teams are a way of basically bringing some, some sanity and control and focus, I guess, to, to the whole space. One trend across the industry, of course, is AI. Everyone's, I, it's hard, it's hard to see any teams where devs are not using AI agents specifically to code, you know, product managers will, will, will be using these things. And of course we have a lot more code produced as a result. When it comes to CI, CD systems, what are you seeing changing there because of AI? And this is the, this is the, the elephant in the room, right? The, how is AI affecting sort of this, but the reality is, I think, to be honest, it's, it's still very early. I think what will happen is the impacts on CI, CD are really tightly coupled to how development teams end up using AI. So there's going to be some sort of like a, I guess, a lagging process there. But we're finding a lot of people, a lot of teams are starting to use AI in their development process. And so we're starting this process of going out and looking and talking to customers and learning, what's the way that they're handling AI in their, in their, in their teams, in their application teams. And then how we can best leverage sort of the CI side to, um, to support that. But then in addition to that, use AI within the, the pipeline itself, again, in the right place. So one of the things we've been, um, I think pretty, pretty, um, keen on at Octopus is this idea that, um, you know, at KubeCon, we were probably one of the few companies there that didn't have, you know, AI plastered all over it. Like we, we tried to be very, you know, that's, that's what gets the sales, right? That's what gets the sales in. That's what's, you stand out now these days. That's right. By not having AI. 
Um, I mean, we, we've got AI in Octopus, but what we've been trying to do is think about, well, how do we actually use it in a way that's actually useful for, for, for our customers, right. For, for engineers, et cetera. Um, and so we've been slowly adding capabilities within Octopus to provide, um, you know, AI support, whether it's an MCP server, uh, whether it's a, a recovery agent that can review logs and tasks and all that sort of thing. But that's within the product itself. Some of the bigger changes will depend on, like I said, how, how actual application teams use, use AI. What I think, you know, we're talking about, we'll find is there's going to be a lot more velocity. I think that's one of the big, big changes, right? There's, there's just gonna be a lot more code coming through. I think one of the questions is, okay, what does that mean for your pipeline? Um, one of the things you often talk about when, you know, human, there's a human element to the pipeline is speeding up the cycle to get that feedback quicker. You know, if you've got engineers sitting there waiting for their code to run tests, they can get back to it and fix it. The, the shorter and shorter you can make that feedback loop, the, the better it becomes because they don't need to context switch, et cetera. I think in a world where the majority of your code is being developed by AI, that becomes perhaps less important. You know, if you can, um, kick out your, your build and test process, um, and it takes 30 minutes versus 20 minutes, does it really matter if the engineers are already long gone, moved on to the next problem, and the, the actual AI agent themselves, itself can kind of babysit the process and review the problem that came up and issue a new fix. I guess that there'll be a de-emphasis, I think, on some of the speed of the pipeline itself and more on increasing sort of, or decreasing risk, right? The risk that comes from having AI engines generate code. 
And so exactly what that process looks like, I guess, remains to be seen. I think what we'll see a lot more use of is things like progressive delivery. And I think particularly feature toggles, um, are going to be a really common tool in, in the tool belt of application teams, partly because it allows you to ship that code as, as fast as you can or as fast as you want, but manage the rollout of the actual feature set or changes sort of independent of the deployment. So it decouples your deployment from, from your release. And so in a world where, you know, we've got a lot more AI agents generating code and being involved in perhaps part of the build process, those agents themselves being able to use toggles to react to it quickly, I think then become a lot more important than perhaps what we see today. Can we talk about progressive delivery? What it is and what are the most common ways to, you know, like to, to de-risk getting your code or your software out there? The progressive delivery is the next evolution beyond continuous delivery. Um, so, you know, with continuous delivery, it's this idea that, you know, I've made a change to the system and I want to ship it to, um, dev or staging or typically, you know, if it gets to production sort of in one hit, right? With progressive delivery, um, you're, what you're trying to do is basically release those changes in a little bit more of a controlled way, typically through things like a Canary deployment. So this is where you might deploy to some subset of, of your instances that are out there. So what is a Canary? What is a Canary? Um, Canary deployment is, this is New Zealand, basically. New Zealand's our Canary. So this is, as we said before, this idea where, um, you select some subset of your, your customer base or, or whatever that might be, and you would typically route traffic to a new instance. So you'd ship the, you know, you've got version one running and you want to release version two. 
You essentially ship version two side by side, and you might use, most commonly, some sort of network traffic manager to route some percentage of your traffic to that new instance. And you gradually roll that up. Typically, to do this process properly, you should have fairly mature observability mechanisms in place so you can see whether to roll up or roll down.

And I guess this whole thing comes from a canary in a coal mine, right?

That's right. Yeah. So the idea being that in the old days, when you'd be in a coal mine digging away, it would release all sorts of toxic fumes and things like that. Canaries were a lot more sensitive to it. So they'd have a little canary in a cage, and if that canary sort of died, I guess, got knocked down. I think the canaries, as I understand, they were like chir [...]

We all know that in a standard process you want to go dev, staging, prod, and maybe you've got approval processes and it slows down, et cetera. But if you've got a significant bug that you need to, quote unquote, roll back, sometimes the safest thing to do is actually make a hotfix to that version and push it out as quickly as possible. Your bottleneck might be the build pipeline or whatever, but depending on your appetite for risk there, you can resolve that a lot quicker. Now, obviously, if the failure itself is just from some mechanism in the deployment process itself, or somewhere further down that chain, then your time to recover is going to be a lot quicker. But it's this idea that if I've got a failure in version two, my rollback isn't to go to version one, it's to go to version three and make sure I've got that fix in version three. It's the sort of thing where, when we talk to customers, some of them go, yeah, yeah, we roll back all the time if there's a problem.
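As a rough illustration of the canary rollout described in this part of the conversation, the sketch below routes a percentage of requests to the new version and moves that percentage up or down based on an observed error rate. Everything here is an assumption made for illustration: the function names `route` and `next_canary_percent`, the 1% error threshold, and the 10-point step are hypothetical and do not describe how Octopus Deploy or any particular traffic manager works.

```python
import random


def route(canary_percent: float, rng: random.Random) -> str:
    """Pick which version serves a single request.

    canary_percent is the share of traffic (0-100) sent to the new version.
    """
    return "v2" if rng.random() * 100 < canary_percent else "v1"


def next_canary_percent(
    current: float,
    error_rate: float,
    max_error_rate: float = 0.01,  # hypothetical health threshold
    step: float = 10.0,            # hypothetical ramp size
) -> float:
    """Roll up while the canary looks healthy, roll down to zero otherwise."""
    if error_rate > max_error_rate:
        return 0.0  # stop sending traffic to the new version
    return min(100.0, current + step)


if __name__ == "__main__":
    rng = random.Random(42)
    percent = 10.0
    served_by_v2 = sum(route(percent, rng) == "v2" for _ in range(10_000))
    print(f"canary share at {percent}%: {served_by_v2 / 10_000:.3f}")
    print("next step when healthy:", next_canary_percent(percent, error_rate=0.0))
    print("next step when failing:", next_canary_percent(percent, error_rate=0.2))
```

In practice the error rate would come from whatever observability tooling is in place, which is why the conversation stresses having that in place before attempting a gradual roll-up.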
And then when you ask them, what do you do if you've got a schema change, they kind of stop and realize that it's just sheer luck that they've never run into that.

Is it fair to say that you want to roll forward if it involves business logic or something that is not stateless? Because if it is stateless, or if it's application logic, you have code that says if this, else that, and you realize there's a bug there, you can just revert it as long as it doesn't touch the schema or the data.

Yeah, I mean, in an ideal world, your reverting is through a feature flag that you click, and you're essentially reverting by changing the code path. And this is why I always say feature flags are a nice tool to use for doing progressive delivery, because just as easily as you roll out that feature, you can typically roll it back. Now, you're still going to have some of those problems with schema issues. If you're making a change and you've got parts of your code path that expect one and not the other, you're going to need to account for that.

But you can even account for that inside the feature flag.

That's right. Yeah, so that's the way you ideally manage that, so that regardless of which path you go down in the feature flag, it's self-consistent with whatever version of the database schema is out there.

So I guess the more feature flags you use, the fewer surprises you might have, but it's a bit of extra work both to build and to remove. You get stuck with stale feature flags all across your code base once you start using them a lot. I saw this at Uber.

Yes, yes, a hundred times yes. So when you're adding a feature toggle to your app itself, at Octopus we obviously use feature toggles in our code quite a lot, and we use OpenFeature as the framework, the SDK to interact with it.
But we essentially have built a wrapper around it where, for the toggle itself within the code, we provide some details about which team owns it, and that team sets an expiry on it. Now, when that expiry time passes, nothing bad will happen. But through parts of the CI process, if that time has passed, we can send a notification to that team and say, hey, it looks like this toggle is no longer used. So the specific mechanics don't matter as much. It's more a matter of making sure that, if you're adding feature toggles, you don't lose track of them. It's really easy to forget about a toggle, because you start rolling it out and you want to keep it in there for a while in case you need to roll it back. Having the ability to understand how long the toggle has been there is a key part of helping to maintain that hygiene. Now, the reality is even at Octopus, we've got a bunch in there. I know I've got a bunch in there that, I'm sure if I was to log in, I'd probably get a bunch of notifications to remove. You know when we use that gardening metaphor in code? This is one of those sort of operations. This is weeding. You need to just keep on top of it. There are some mechanisms around, even without the AI side. Ideally, if you're using feature toggles, you've probably got a bunch of observability and metrics and logging around them. And there are some tools out there that will allow you to keep track of the last time a toggle was evaluated, and that gives you that signal. Similarly, you might remove it from the code, because typically when you want to remove a feature toggle, you want to remove it from the code first before you touch your actual toggle system. And once you remove it from the code, it might take two weeks before it makes it all the way out into production.
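The owner-and-expiry idea described here can be sketched in a few lines. This is not Octopus's wrapper and it does not use the OpenFeature SDK; the `Toggle` class, the `expired_toggles` and `notifications` helpers, and the sample toggle keys and team names are all hypothetical, chosen only to show how a CI step might flag toggles whose expiry has passed without changing runtime behaviour.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Toggle:
    """Metadata declared next to a toggle in code (hypothetical shape)."""
    key: str
    owner: str     # team responsible for cleaning the toggle up
    expires: date  # a reminder date, not a kill date


def expired_toggles(toggles: list[Toggle], today: date) -> list[Toggle]:
    """Return toggles whose expiry has passed. Nothing is switched off."""
    return [t for t in toggles if t.expires < today]


def notifications(toggles: list[Toggle], today: date) -> list[str]:
    """Build one reminder message per expired toggle for its owning team."""
    return [
        f"{t.owner}: toggle '{t.key}' passed its expiry on "
        f"{t.expires.isoformat()}; consider removing it"
        for t in expired_toggles(toggles, today)
    ]


if __name__ == "__main__":
    registry = [
        Toggle("new-checkout", "payments-team", date(2024, 1, 31)),
        Toggle("dark-mode", "web-team", date(2030, 1, 1)),
    ]
    for message in notifications(registry, today=date(2024, 6, 1)):
        print(message)
```

Keeping the check in CI rather than at runtime matches the point made above: an expired toggle should produce a nudge to the owning team, not a behaviour change in production.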
So you don't want to delete it before then. By that time, you've kind of forgotten about the fact you removed it.

Oh yeah.

And so having mechanisms that will keep track of that change going through the system, and when it reaches the environment, production, where it's actually being used, can go: okay, that code's gone out, the toggle has been removed there, so it's fine and safe to actually remove the configuration. Because you've got that feature toggle information in two places: you've got it in the code and you've got it in your platform.

Can we talk about how development environments evolve? We talked about CI/CD, but I'm interested more in how you go from having one environment to, later, maybe having staging or something. What evolution have you seen across all the teams that you work with, all these hundreds or thousands of teams?

Yeah, I'm not sure if there is one particular pattern there. I think most common is dev, test, prod. So these three different environments. And even that, I think, is probably a gross simplification of all the different mechanisms.

And dev meaning my local machine?

Dev in the case of CD is often the first point of integration. So it's kind of test. Often customers will keep test reasonably in sync with, let's say, production or some sort of sanitized data source, so that whether it's the QA testers or the product team or whoever, they can go and review the code. Dev is almost like the first point of integration: is the deployment process, at its core, actually working, or is there anything fundamentally broken? I think more and more now we're finding that dev is less useful in that respect. And what we're seeing is the growth of things like ephemeral environments.
And so this is the idea that I, as an engineer, am writing some sort of feature on a feature branch, and I want to evaluate that it's actually doing what we're expecting it to do. But not only that, I want the rest of my team to be able to see it working. And if I've got it running on my machine, it's not exactly easy to give other people access. And then I may want to completely context switch and move on to something completely different. So ephemeral environments is this idea that from my branch, pre-merge, I want to spin up a whole environment essentially from scratch, ideally with whatever dependencies are required to run this particular component that I've been building. And I want to deploy my app into that as if it was a normal, full-fledged environment. Once that's available, I want to have access to it: if it's a web app, maybe it gives me the URL and I can poke around it and hand it around, and other people can evaluate it. And then the moment I merge that PR, tear it down again. It's quite common to have multiple test environments because, you know, I've got a lot of stuff going through my pipeline and I've got three testers, so let's have three environments so they can each have one at once. Or often you'll see a single test environment and a bunch of testers, and they all need to collaborate to see who's got access to the system at the moment, et cetera. Whereas with ephemeral environments, and it doesn't roll off the tongue, you can essentially have a full-fledged deployment per feature. And so again, that's about speeding up that feedback process. All of these processes are about speeding up that feedback process to catch those failures or issues or bugs sooner.
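The lifecycle described above (spin up an environment per branch before merge, hand out a URL, tear it down on merge) can be modelled with a small in-memory sketch. It only tracks names and URLs; a real setup would provision infrastructure and dependencies at these points. The `environment_name` helper, the `EphemeralEnvironments` class, the `pr-` prefix, and the `preview.example.test` domain are all hypothetical.

```python
import re


def environment_name(branch: str, max_length: int = 40) -> str:
    """Derive a DNS-friendly environment name from a branch name."""
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    return f"pr-{slug}"[:max_length].rstrip("-")


class EphemeralEnvironments:
    """Tracks which per-branch environments currently exist."""

    def __init__(self, base_domain: str) -> None:
        self.base_domain = base_domain
        self.active: dict[str, str] = {}  # environment name -> URL

    def on_pr_opened(self, branch: str) -> str:
        """Register an environment for the branch and return its URL."""
        name = environment_name(branch)
        url = f"https://{name}.{self.base_domain}"
        self.active[name] = url  # real provisioning would happen here
        return url

    def on_pr_merged(self, branch: str) -> bool:
        """Tear the environment down; True if one existed."""
        return self.active.pop(environment_name(branch), None) is not None


if __name__ == "__main__":
    envs = EphemeralEnvironments("preview.example.test")
    print(envs.on_pr_opened("feature/Add-Login"))
    print(envs.on_pr_merged("feature/Add-Login"))
```

Deriving the name from the branch keeps the mapping one environment per feature, which is the property that removes the need for testers to take turns on a shared test environment.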
There was a time a few years ago when cloud development environments were really talked about a lot, which was the idea that, as a developer, you have an environment spun up in the cloud. Let's say your Visual Studio Code connects to it, or maybe you just log in online, and it spins up all the dependencies, oftentimes done with containers, which this reminds me of as well. And there are also preview environments, but somehow it feels that both that discussion and this [...]

[...] has come from fully embracing feature toggles as part of that process. I think we're getting a little bit braver in terms of removing capabilities that perhaps older customers may miss, but I don't think that in the long term self-hosted will go away. This is one of those things again where I think it's really common to hear, you know, everything's in the cloud, we're all in the cloud. Again, the reality is there are a lot of companies out there for whom it just doesn't make sense, or it's not viable, or it doesn't meet compliance requirements, or whatever the case may be.

Also, it's kind of a reminder, I think, that you actually might have a lot less competition if you build infrastructure software that also runs on-prem, because it sounds like there's demand where companies are like, we want to give you money in order for us to run on-prem. And I'm sure some of them would do SaaS if there's no other alternative, but SaaS is easier to build anyway, so there'll be more competition. So if you're an entrepreneur, or a software engineer thinking of starting a business, it might give you an edge.

Yeah, that's right. And this is the thing. I remember when I worked in a previous job that used Octopus, or for any of us who have any other sort of software running, potentially running locally: if it just works, why touch it, I guess.
And so it's kind of the bane of our existence, because it annoys us: we want to ship the features and give them all these great new things. But on the flip side, particularly for something as critical as their deployment system, a lot of customers, once they've got it running, kind of step away and go, okay, let's just let it be.

And it keeps happening with AI as well, in the sense that, for example, I just read that Cursor's latest coding model is updated, I think, every five hours, which is amazing. It just keeps getting better. However, there are customers who, once they have an LLM and it works for them, they've tuned it, they have the instructions. Great. But oftentimes what happens is a new version of a model comes out, or a major version, and it stops working. And I assume that there will be more teams, companies, businesses who are like, look, it would be worth money to me to pin this thing or to run it on my own infra and just have it stay as is. And then I will decide when I want to change it. If it ain't broken, don't fix it.

That's right. And I think, to Octopus's credit, we have a really good history of helping customers even when they're on those older versions, sometimes to the extent of wanting to say to the support team, they're on an old version, just tell them to upgrade to get the fix. But the support team are, I think, second to none in terms of their willingness to help. And as you said, if they're willing to pay us, who am I to say no?

Yeah. I mean, it's a business strategy, but I think it's just a nice reminder that there is not just one size. And even though I think SaaS is eating the world, and we're hearing that and we're seeing it, it's nice to see that it's not just that.
As closing, if I'm a software engineer and I would like to move beyond continuous delivery and continuous deployment and go into progressive delivery, what pointers can you give me?

Yeah, I guess just start with something. Start with adding one feature toggle. It may be scary at first to go, oh, it's in production, if I toggle this, I'm going to break something in production. It's nice and comfortable to know that you're well to the left of the running systems, and if you ship code, everything will be caught by the tests. But if I toggle it, what will happen? It's kind of like a drug: once you start doing it, you don't want to stop. And that's why we've got this hygiene problem for things like feature toggles. It's really easy to add them and actually end up with the opposite problem of how do you control yourself, how do you stop? So I'd say just start doing it: add one, and keep an eye on it as you roll it out and look at the results from it. And the reality is, I've shipped features behind feature toggles where I've shipped a bug. It's one thing to ship something internal and feature and go, okay, cool, customers have it. It's a very different thing when you do the opposite. If you ship something and there's a problem, you can reach immediately for the toggle and switch it back off. You know, the number of times in the past you've had this panic of, oh no, I've shipped something, I don't know what's going wrong. And particularly when you're in that state, maybe you've been called up at 2 a.m. because you're on call, and you don't know what the next step is. You've got a panicked mind: should I build a new thing, or do I somehow force a redeployment?
So having the capability of being able to flick that switch just allows you to calm right down and go, okay, I've stemmed the bleeding. Now come back and reanalyze it and understand what's wrong. And once you experience that capability and realize the value of not just rolling things out but, I guess, rolling that individual feature back off, yeah, you'll want to use it for everything.

What's one or two books you would recommend, and why?

I'll give two, I guess, technical ones and one more fun one. The Phoenix Project is still for me a good one. This is one that.

By Gene Kim.

Yeah. And I can see you kind of remember that we got that in Skype. This was one that Abdella gave to everyone.

Yeah, our manager gave it to everyone.

Yeah. And parts of it are maybe a little bit outdated, and some of the practices have changed a little bit, but at its core, this idea of, as an engineer, being involved in that whole operations side of what you're shipping, and the value that gives not just to the company but to you, is amazing. So I think that book is one of those core foundational ones that sets the context for everything we talked about today. The other one, from a more, I guess, organizational and communication side of things, is Radical Candor by Kim Scott. It allows you to communicate more efficiently and with more compassion with your peers and other people around you. It's really common, you know. I'm an engineer, so I know sometimes you kind of look back on what you said and you feel like, okay, maybe I can be a little bit blunt.
Whereas Radical Candor teaches us to think about how you want to have those communications in a way that both shows you're caring and empathetic and is also direct, and the benefits of that, and kind of the inverse of that, where, like I said, you're perhaps very blunt: you're being honest about it, but you're missing that empathy. So I've found that book really useful and interesting, I guess not even just as an engineer, but as a person working with other people. From the more fun side, basically anything by Greg Egan. He's an Australian sci-fi author. He writes pretty crazy and mind-bending hard sci-fi. So if you're really into that, I'd say read Diaspora or Schild's Ladder. They're the sort of books that actually took a second read to get through, and he's a mathematician as well, so he's got a whole bunch of background and mathematics on why a certain part of his story goes the way it does. He wrote an entire story on the premise of, what if the speed of light wasn't absolute, or something like that. This one premise, and it breaks out into: and then this is what happens to energy, and therefore molecules work like this, and da, da, da, da. And I'm a tech nerd, so that sort of science stuff really appeals.

Same, same. When there's some science involved in sci-fi, I actually find it way more fun. Rob, thanks very much.

Thank you, Gergely. This was great.

What an interesting conversation. I hope you enjoyed having someone like Rob who has been building and thinking about CI/CD at scale