Designing Data-intensive Applications with Martin Kleppmann

Overview

This episode is a wide-ranging conversation with Martin Kleppmann, author of Designing Data-Intensive Applications, about how his experience building startups, then working on Kafka-related infrastructure at LinkedIn, shaped the first edition of the book and why a second edition was necessary nearly a decade later. The discussion also explores what has changed in backend systems—especially the rise of cloud-managed services—and what fundamentals still matter, including distributed systems trade-offs, formal methods, and decentralized or “local-first” software.

Key Takeaways

A central theme is that although cloud services abstract away much of the operational burden, engineers still benefit from understanding system internals. Kleppmann argues that knowledge of fundamentals—such as storage engines, replication, partitioning, and consistency—gives developers intuition for debugging performance issues and making informed trade-offs about cost, resilience, and latency. The goal is not that every engineer builds databases from scratch, but that they can reason effectively about the systems they depend on.

The second edition of the book reflects a major industry shift: many systems are no longer built around machines with local disks and self-managed infrastructure. Cloud primitives like object storage and serverless platforms have changed architecture patterns, especially by making it easier to scale down as well as up. Kleppmann highlights that “scalability” should not just mean handling huge growth; it also means running tiny workloads efficiently and cheaply.

Another notable insight is that some problems have become less common not because they disappeared, but because hardware improved or abstractions got better. For example, sharding across machines is still important at very large scale, but many workloads now fit comfortably on powerful single machines. Replication, however, remains essential because it is about fault tolerance, not just scale.

The conversation also turns toward the future. Kleppmann sees formal verification becoming more practical and more necessary in an AI-assisted development world. If AI generates more code than humans can realistically review, then stronger automated correctness guarantees—beyond testing alone—may become increasingly valuable, particularly for security-sensitive or correctness-critical systems.

Finally, he makes a strong case for research into decentralized and local-first software. These systems are harder to build because they avoid relying on a single central authority, but solving those harder problems can create meaningful alternatives to centralized platforms and expand long-term options for users and organizations.

Practical Steps

Learn internals selectively. If you use managed databases, queues, or analytics systems, spend time understanding the basics of how they store, index, and replicate data. Focus on concepts like B-trees vs. LSM trees, quorum writes, partitioning, and columnar vs. row storage.
Use architecture trade-offs explicitly in planning. When choosing cloud services or multi-region setups, document the trade-offs among availability, consistency, operational complexity, and cost rather than treating them as purely technical defaults.
Re-evaluate what “scalable” means in your systems. Check whether your architecture can scale down efficiently, not just survive peak load. Serverless or highly elastic infrastructure may be a better fit for low-traffic services.
If you work on critical systems, explore formal methods gradually. Start with model checking tools like TLA+ rather than jumping straight into full proof assistants. Use them for subtle workflows, distributed protocols, or security-sensitive logic.
Consider broader risks, not just technical ones. Include geopolitical, regulatory, and societal risks in architecture decisions—such as cloud dependence, data protection, and harmful unintended consequences.

Notable Quotes

“If you know just a little bit about how the storage engine works internally, you’ll be in a much better place to use it in a way that gives you good performance.” — Martin Kleppmann

“Generally, you just want the cost and the computing capacity to be roughly proportional to the load that you have.” — Martin Kleppmann

“If you want to change the world, then thinking about the impacts that your technologies have on the world is part of your job.” — Martin Kleppmann

Full Transcript

Source: openai 1h 25m runtime

Designing data-intensive applications has been the go-to book for anyone building large backend systems. 9 years after publishing this book, the second edition is here. Martin Klotman is the author of this generational book. I sat down with him and today we cover how working on Kafka at LinkedIn directly shaped the ideas that became the first edition of the book, what's new in the second edition and why things like MapReduce got removed from this updated version, formal methods, local first software, decentralized access and many more. If you care about how large systems work, where they're heading and what the fundamentals are that don't change, this episode is for you. This episode was presented by StatSec, the Unified Platform for Flags, Analytics Experiments and more. This episode is brought to you by Sonar. Sonar, the makers of SonarQube, understands that code quality is about more than just avoiding syntax errors. It's about long-term maintainability by protecting the structural integrity of the system. As agents generate code at massive scale, they often ignore your system's structural integrity. This creates tangles, duplicated code and other maintainability issues. These issues turn a modular design into a big ball of mud, making it increasingly difficult to extend. But here's something that's really helpful, SonarQube's architecture management. It moves architectural governance out of static wikis and into your automated workflow. It allows you to visualize your current architecture, define architectural boundaries and manage architectural issues in real time. Whether it's a human or an AI agent at the keyboard, Sonar acts as a circuit breaker for structural decay. It ensures every commit respects the system's blueprint, protecting the long-term health of your most complex applications. Head to sonarsource.com slash pragmatic to find out more. So Martin, welcome to the podcast. Hi, it's great to be here. It's amazing to have you here. I don't think you need introduction to many software interns, including myself. You're the author of this iconic book that I've had on my bookshelf for probably about 10 years, not much longer after it came out. Before we get into this book, which we're going to talk about, how did you get into the technology field? Yes, well, I did undergraduate computer science like many others. And then after that, I wasn't quite sure what to do with my life, but I thought, well, starting a startup seems like an interesting thing to try. So I started a startup having no clue what I was going to actually do and then spent the first while searching around for things that might be interesting. The first startup didn't work out that well, but through that, I met some others who then became my co-founders for the second startup, which worked better. And we sold that one to LinkedIn. And then after that, I started being interested in like teaching these distributed systems concepts. So that's when I got into writing the book. And then during the writing of the book, I also switched over from industry back to academia. Can we talk a little bit about your first and second startup? Yeah, go test it. So this was like 2008 or something like that. It was the age where people were having really difficulties getting their JavaScript working cross-browser. Internet Explorer was still pretty big at the time. Chrome had just come out. All the browsers were incompatible with each other. And so go test it was a cross-browser automated testing service for websites. It was based on Selenium, an open source project that still exists. And the idea is you would write like test scripts that automate the user clicking through the various interactions with a website and then just check that the right behavior happens. And so, yeah, it was based on Selenium, but just as it provided as a hosted service. So people wouldn't have to run various VMs with various operating systems themselves. It worked technically, but I found it really hard to actually get adoption for it. A lot of people building websites, like in theory said, oh, yeah, this is great. We need to test cross-browser and in practice, actually, it was really difficult to get them to integrate it into their workflow and just get in the habit of using it and investing in writing the test scripts. So that ended up not really going anywhere. So it's like there wasn't like a business to be done or like revenue to be generated in a meaningful sense. Yeah, well, there's at least one other, maybe two other companies from that same era that did manage to make a business. Source Labs is one that managed to actually succeed. But even for them, it was a pretty slow running business. I think it was not an easy business to be in. And for the startup, were you in the UK building it? I was in the UK at the time. Was it bootstrapped? Did you raise some kind of funding? How big was the team? How can we imagine this? It was mostly bootstrapped. So I did a bunch of consulting in order to fund hiring some people and then hired some like friends on the cheap to help contribute to actually building the product. So it was done all very cheaply. I had a very small amount of angel money in there, but mostly bootstrapped. And then when you decided to not go forward with this, how did the next startup come? Reportive, right? Yeah, the second one was reportive. That went a lot better. So that was putting social media inside Gmail, basically. So the idea was that if you get an email from someone you don't know, we had a little browser extension which manipulated the Gmail web interface so that on the side next to the email, we'd show you a summary social profile with like a profile picture and like a job title pulled from LinkedIn and recent tweets pulled from Twitter and maybe recent Facebook posts or things like that. Just whatever we could find about that person and put that as a social summary next to the email. We started in 2010 or something like that. It was then pretty quickly became quite popular. And so on the back of that, we were then able to raise some money from Y Combinator, which was still fairly young at the time. That was very young. You must have been one of the very early batches. Yeah, I can't remember exactly when they started, but it was certainly in the early years. I think Y Combinator had already built up quite a good reputation at the time, but it was still fairly small. And then as part of Y Combinator, did you have to fly from the UK to San Francisco to attend that 10-week program, if I remember? Exactly. Yes. So we initially came for the three months or whatever it was of the Y Combinator, but then we were able to get US work visas for ourselves and set up permanently in San Francisco. How was that shift from the UK, where you spent going to university, your first startup, the first part of this, to coming to San Francisco? It was very exciting because, you know, it felt like, you know, going to the center of where it was all happening, really. And we, at the start of that, not knowing anybody at all, we knew like one or two people in the entire Bay Area, but we like contacted them and they introduced us to more people and they introduced us to more people. And so we were able to pretty quickly actually build up a network. And that's something that I really appreciated, that it was actually so open to outsiders like us who could just basically turn up with an idea and an early stage startup. And we managed to raise some money and managed to actually become somewhat established in the Bay Area. And can you tell me how the company grew and at what point did the LinkedIn acquisition offer come in? And how can we imagine you were a founder of this company? It was about in 2012 that we sold it and we were five people at the time. So it's all still pretty small, not vast amounts of money involved, but it was a success, I would say, for everybody involved. The acquisition process itself was fine. As always with these kinds of transactions, there was like twists and turns and moments where we thought it would all fall apart. And then we were almost running out of money and hadn't really succeeded in raising another round. So we kind of had to sell or shut down, so we were under quite a bit of pressure. We couldn't reduce our own salaries because to do so would have violated the conditions of our visas. So we were in a slightly stuck situation. Given our lack of leverage in that situation, actually, I'm pretty happy how it all turned out. Yeah, it's nice that, you know, like from 10 plus years, we can talk about this, honestly, because oftentimes you see an acquisition by LinkedIn and of course, you might ask the founders and they would say like, this was either our dream or our goal, or we will do so many things together. But some things that you don't often hear is, well, that there was a pressure involved as well. So did you go into this wanting to sell the company because you saw that things were getting a little tricky? Either you need to raise a new round or you sell to someone and then you found LinkedIn to be the best stuff or the only or the best option to go into? We tried a little bit to see like what revenue generating options we had and hadn't really managed to make that work. So we were just burning money and our user growth was OK, but not really enough to go and raise a big round. So we were like a little bit stuck there and selling the company seemed like the least bad option there in a way. And I'm pretty happy how it turned out because, you know, LinkedIn was great, actually. They were very good to us. They allowed us to operate as essentially like an independent team within the company. So your team stayed together? Our team stayed together. We continued working on the product that we wanted to make. You got to keep working on Reportive? Yes. Well, actually, so Reportive, the Gmail browser extension, sort of got put on live support, but we were working on a new product at the time, which did eventually get released under the name LinkedIn Intro. It kind of got a slightly weird reception at the time and it ended up getting shut down shortly after we released it. There's kind of longer background story there, but I'm still really happy with LinkedIn, like how they gave us the freedom to do this and allowed us to launch this product. And even though it didn't succeed, they were very good to us throughout that process. And then after that got shut down, then our team got disbanded. But we had a good run within LinkedIn building this product. What tech stack did you work at the time, what did you use? Reportive was fairly unexciting. It was a Rails app with a Postgres database, basically, and some Redis and some similar things like that mixed in. So actually, you know, nothing particularly revolutionary. We essentially built a graph database on top of Postgres, so there was a little bit of technical interest in there, but, you know, nothing particularly outrageous. And then you spent time after LinkedIn Intro, you still work inside LinkedIn. As I understand, you worked on data infrastructure, right? Yes, data infrastructure. After our team got disbanded, I switched over to the stream processing team. So Kafka had just been developed at LinkedIn and had just been open-sourced at the time. Oh, it was just being open-sourced. Yeah, I think it had just been open-sourced. And then I got to work on SAMSA, which was a stream processing framework on top of Kafka. I always wanted to ask this question. So this comes here. Why did LinkedIn? Then in the dark where we were having performance problems with our database and we had no idea what to do basically because we were totally lacking the foundations to actually understand what was going on and how to diagnose the issues. And so I felt that, well, if I had had a bit more background on how these data systems actually work internally, then I could have had an intuition about how to debug these kinds of performance issues. And then after a while, after I'd learned more about how data systems work, I thought, well, okay, it's time to write this down so that others don't have to learn it the hard way, but can hopefully just get a better idea of how these systems work and thus be better at managing their own data systems. To start with, how did you learn about, for example, how databases work? Because again, from your story at Reportive, you built systems, you've had some performance issues at a smaller scale, to be fair, compared to LinkedIn. Then you worked at LinkedIn and you saw a little bit of how the sausage was made. But I know a lot of software engineers who have been in this path and they still don't really know how the fundamental systems work. They just know, okay, we have a platform team inside our company and they build it. I could read the RFCs, but it's a lot of work or the planning docs. I could look at the source code. It feels to me that even at that point, you just went down and tried to dig in. What resources did you use? How did you find out those basics which you later put into the book? A lot of it was just kind of being curious and talking to people actually and just asking them lots of questions. At LinkedIn, there were like a bunch of senior data systems engineers who understood this stuff very well, but hadn't maybe necessarily written it down. And so I just talked to a bunch of them and quizzed them and that way started building an image in my own mind of how this stuff works. And then once I sort of got the basics from these conversations, then I was able to go and read research papers, for example. They go into much more detail of exactly how and why things are designed in such a way. But you know, it is time consuming to read those things. So then what I tried to do was like pull out what are really the essential ideas. I just read a ton of blog posts as well. And so the reason why you see so many references at the end of each chapter in the book is, well, that is actually the material that I myself used in order to understand what was going on. And then I thought, well, okay, well, if I found these things useful, then I'll also cite them in the book as a way for anyone, any reader who wants to go beyond the basics covered in the book. Here are some good sources to further reading. The structure of the book, this first book at least, it's foundation of data systems, distributed data, and derived data, if I understand, these are three big parts. Did you already have a structure in mind when you started writing the book or did it shape as you went? This three-part structure is not that critical in the design of the book, really. That's sort of more after the fact. I thought, oh, well, it seems like we can group the chapters into roughly this sort of structure. But the topics of the chapters were more or less what I had envisaged. So I knew that I wanted to talk about, like, what a transaction actually is. I knew that I wanted to talk about replication. I knew that I wanted to talk about sharding or partitioning. I knew that I wanted to talk about, like, consistency and consensus. So there's sort of high level topics, I think, were clear from, like, my initial book proposal to the publisher. The details within each chapter, that is something that I often figured out once I got to that chapter. So I wrote one chapter at a time and started each chapter work with just a lot of background research to actually get up to speed on the topic myself. And it's often only then that, say, for then replication, I decided, okay, well, it seems like the three major ways of doing this are single leader, multi-leader or leaderless. Okay, I would decide on that structure essentially when I started writing each chapter and then try to fit the various points I wanted to make into this narrative structure. As a fellow author who also wrote a book, one thing I've noticed, there's a lot of parallels between estimating a book and estimating a software project and that you come in with an estimate and if you've never done it before, you tend to be wildly off. How was this in your journey? And in addition, you also had a publisher and publishers are a little bit like project managers. They, you know, they like to have a schedule. They like to try to keep you on track. They like to ask, when is it done? How did you manage that part as well? And in the end, how long did you estimate it would take when you started and how long did it actually take? As always, it takes vastly longer than expected. It's the same for software projects as it is for writing, I think. So I think it took me about four years to write the first edition. And that was not four years of full-time, maybe like two and a half years of full-time equivalent or something like that, but written over the course of about four years. So it definitely took a long time. The publisher deadline I missed by a ludicrous margin. I think I missed it by about two and a half years or something like that. But fortunately, O'Reilly were pretty laid back with the first edition and were happy for me to just take my time and make it good. When it came to the second edition, then actually O'Reilly got a bit more aggressive and pushy about uh sticking to deadlines. I guess by that point, the book had been established and people were waiting eagerly for the second edition. So I kind of understand the desire to want to accelerate it. But at the same time, I really appreciated the freedom that I had for the first edition to work on my own schedule. And I had a bit less of that with the second. The tagline for the first edition, which I believe is the same as the second edition, the big ideas behind reliable, scalable, and maintainable systems. Reliable, scalable, and maintainable. What do these objectives mean to you? Yes. So they're all slightly vaguely defined, right? So there's not a formal definition of those things. But for me, reliability means fault tolerance primarily. So meaning that a system should on the whole continue working even if, like, a network link is interrupted or a node crashes or something like that. So a lot of the book is about techniques that support fault tolerance, like replication, for example. So that's reliability. Scalability is one of those terms that get thrown around a lot. And it's sort of so much. And it's like fashionable and cool to make things scalable, you know, because it's the jet success and millions of users. And so that's, of course, everyone wants things to be scalable because everyone wants success. For this book, he had tried to take a bit more dispassionate kind of approach and said scalability is just like what mechanisms we have for dealing with changes in load. If load increases, how can we add computing capacity to a system, for example, so that the system still continues working? And then the techniques that you use to achieve scalability, well, they are like sharding, for example. And but in this case, scalability, your definition, do I understand that you're mostly referring to horizontal scalability, so that you cannot compute up or down pretty much? Yeah, I guess because that's the more interesting one. Like, yes, you can always buy a bigger machine. And what's interesting about that. And exactly. There's just, there's not that much to be said about it. I mean, there are details of how you scale even on a single machine. But I think like part of what is become interesting about, like, modern cloud services and just back-end services in general is like how they've introduced this idea of horizontal scalability and shared nothing systems. So we can build systems that, you know, are able to cope with very high load, even if the individual components are just fairly cheap commodity machines. But maybe sort of part of the scalability story, which I wasn't thinking about as much at the time, but started thinking about more recently is not just scaling up, but scaling down as well. So actually, how do you run a service in such a way that if it has a very small amount of load, it's really cheap to run it. That's sort of it. In a way, the same question as how do you continue running a service if it has very high load? Generally, like, you just want the cost and the computing capacity to be roughly proportional to the load that you have. And at the low end, that means actually being able to scale down to something that is extremely cheap to run. And that's like not so necessarily given. That's something that is hard with on-premises software, for example, because like, if you've got a machine, a physical machine, that's like a unit of deployment. And yes, you could carve it up into two dozen virtual machines and make those small virtual machines. But it still requires like some sort of resource allocation. So part of what's interesting about some serverless systems, for example, is actually their ability to scale down and say, like, okay, if you're going to handle just three requests per day, that's just fine as well. Can you tell me about the second edition? When did the idea come about? Yeah, it'd been clear for a couple of years that a second edition was needed just because the Let's get back to Martin and the trade-offs that come with using cloud services. And so those people will have to then specialise even more in actually the details of how you engineer those cloud services, how you make them reliable, how you operate them, and so on. The skills are still there, it's just a bit of specialisation happening that some people can worry about the higher level things without having to concern themselves with the lower level things, some people focus on the lower level things and treat the higher level aspects as their customers. Interesting, so it sounds to me that if you're an engineer who is utilising a lot of these services, you might not need to know how they exactly work. Yes, and I would say like the underlying philosophy of the entire book is to give people insights into just the sort of essence of how the systems work internally, so that if, for example, they start having weird performance behaviour, you can have a bit of intuition for why it's doing that and how you might solve it. So for example, say the storage engine chapter tells you about how B-trees work and how log structures, LSM trees, storage engines work. And the book is not intended for people who are going to actually build their own databases and implement their own storage engines. If you want to do that, you have to go much more, much greater depth than this book covers, but the idea is that as an app developer, if you know just a little bit about how the storage engine works internally, you'll be in a much better place to use it in a way that gives you good performance, for example, and to diagnose any issues. That philosophy we've kept also in the context of cloud services where, yes, like cloud service hides some of the operational details that app developers don't need to think about anymore, but they should still know a bit about how they work internally just so that they can use them effectively. I guess I argue about the trade-offs deciding on which service to use, which characteristics to look out for, for your use case, right? Exactly. And you know, there are huge differences of say if you're doing analytics, whether you're using row-oriented storage or column-oriented storage. That's a bit of a technical distinction and takes a little bit of background reading to even understand what that means, but it has a massive performance implication in terms of the final behaviour of the system. And so those are those places where I feel like knowing a bit about the internals is actually like a superpower. Yeah, and I guess engineers, the one thing that we always need to argue about or should need to argue about is at the very least, cost versus performance and by performance, I mean latency to the user and of course resilience of if something happens, you know, like a region goes, like a zone goes down, a machine goes down, zone goes down, region goes down, how our product is affected and what's acceptable. The basic idea there seems to be like how much availability risk are you willing to take on versus the like the overheads in terms of the system itself, like the computational overheads, but also the human overheads actually designing and operating the system and the cost overhead. Yeah, exactly. And so yes, you can have a system that is more able to tolerate various types of faults, but which is more expensive to design and operate versus a simpler system that, you know, might go down a bit more often, but which is cheaper and there's no right and wrong with that. It's everyone needs to figure out where they sit on that trade-off space themselves. And I would say that like multi-region is like pushing in the direction of like higher availability because it means you could tolerate the outage of an entire region, but then it has implications on the consistency model that you can get across different regions, for example. So that's a trade-off that The book tries to make very explicit to help people reason that through of like what is the right choice for them. In terms of multi-cloud, for example, one thing that I've been concerned about just in the last month, really, is European dependence on US cloud services. Yes. So what if geopolitics was to go horribly wrong and tensions escalate and Europe finds itself suddenly locked out of US cloud services? I hope that doesn't happen. I still think it's fairly unlikely, but it's no longer unthinkable. And as a result, I coming sort of from this European perspective have been thinking a fair bit about how can we engineer systems to be resilient against that sort of thing? And that's, you know, not just like a regional outage, but it's like a business risk, essentially. And multi-cloud setup could help mitigate against that sort of risk. So that at least, for example, if one company locks you out, then you could still have systems on another company. Again, that that's very much towards the expensive, but higher availability risk reduction end of the spectrum. But for the people who have, you know, really critical workloads where they think this sort of geopolitical risk is a significant enough risk, I think it's seriously worth considering that kind of setup. I'm thinking that as engineers, we do have the responsibility because who else will do this? Yes, totally. But I totally agree with you as well that this understanding what the risks are and communicating what the trade-offs are, I think is going to be a core part of our role as engineers moving forward as well. Maybe as AI writes more and more of our code, it's less about like the details of how you express logic in a particular programming language and much more about those kinds of high level trade-offs. How has the definition of scale changed in this book? Because as we talk with cloud, before cloud building a scalable system, it sounded pretty involved because building a horizontally scalable system, it's complicated. All the pieces you need to put in it. In the first book, you detail a lot of this with cloud. A lot of the services actually, they do define how they allow horizontal scaling, what the trade-offs are. Do you feel that it's made a lot easier to reason about scale, scalability when you are using these primitives? So I think achieving really high scale is still challenging because even though we have cloud services like object storage, for example, which provide you this very elastic storage model, at least you don't have to worry about capacity planning on your disks anymore and running out of disk space because those kinds of operational things, they're taken care of. But if you need sharding, for example, that's something that actually does reflect on the application code as well. You can't really make that entirely transparent. And so you're at a sufficiently large scale. The sharding is required because a single machine is not powerful enough to process your workload. Then I think even with cloud systems, you still have to do quite a bit of engineering thinking of how to realize that. where I think the cloud has helped quite a bit is actually at the lower end of scaling down. If you want to have a very lightweight service that processes only a small number of requests, what we've got with serverless systems being able to very quickly spin up and spin down an instance very lightweight. That's quite a good innovation that has enabled those very low scale services. And that's something that would be much harder to do without cloud services because you would have to statically allocate a certain amount of memory and certain CPU resources to a particular virtual machine. I love serverless. I have a small website that runs on serverless and on my bill is like 13 cents per month because it has very little load. Absolutely. It's just making more efficient use of computational resources. Let's talk about sharding. And in the first book, and when you wrote the first book, when I was working at Uber, we talked a lot about sharding and there was a lot of internal implementations. Our interviews involved asking about sharding because we were designing systems that were sharding. I did sense that over time, again, as cloud systems start to become available that give you turnkey solutions that act more like platforms. You send the data and it takes care of these things. Fewer engineers have to actually implement sharding. With cloud native systems in your research, what have you seen? What are the cases where putting sharding in place is still important and where are the places where it might have just disappeared as a concern? I mean, it's still nice to know, but you might not have to implement it. I think it's probably less of an effect of cloud and more of just hardware getting more powerful. That actually, like a big machine nowadays, can do a lot on a big machine. And that means that more and more workloads you can just run on a single machine and that is sufficient actually to achieve quite significant scale already. There are still concerns of like how do you actually efficiently make use of hundreds of CPU cores that you have on a single machine. So there's still parallelism is still a required thing to think about there and sharding is one way of achieving parallelism. But at least this sort of sharding across multiple machines has maybe become less of a pressing issue just because more and more workloads can just run on a single machine. Some people still have very large scale workloads that do have to be sharded across multiple machines. So it's not going away entirely. And replication is still relevant even at smaller scales because that's for fault tolerance. That's not for scalability. You have a chapter called the troubles with distributed systems, which goes through a lot of things that can go wrong without going through the whole chapter. Can you recall some of the things that are memorable to you or some of the things that you feel are important to remember? Yeah, the whole idea of this chapter is that in distributed systems theory, there are certain things that we tend to assume. Like, for example, we just assume that there's no upper bound on how long it might take for a message to go over the network. So you send a message, it might arrive within a hundred microseconds or it might take 10 years. It might be that the products are very much geared towards essentially data harvesting, collecting behavioral data, because that's what can be monetized in the form of advertising. And there seemed to be just very little reflection on what was good and bad about these sort of things. So I really just wanted to encourage a bit of thinking there. Not really wanting to prescribe too much, like a particular approach there, but at least to point out, you know, there is this thing such as data protection legislation now, which we do have to think about in the architecture of our data systems. And there is an ethical responsibility. You know, people say that you get into tech in order to change the world. If you want to change the world, then thinking about the impacts that your technologies have on the world is part of your job. It's a really essential part, really. And something that engineers are often prone to ignoring as we focus just on the technology and less on the effects that that technology will have out in the real world. And so this chapter is really just an attempt to get people thinking about it a bit. And it's sort of a reflection of my own process as well, because as I started working on these systems, I didn't really think about ethical things particularly either. So I felt like I had to put that section in there for myself, as well as for the readers, because it was my own way of grappling with these questions a bit. Is it fair to say that as engineers building these systems that will have an impact on on a wide range of things, potentially societal wide impact, we are just in such a good position to directly influence and maybe even change course. So do I understand that this section is a bit of a reminder that by building it, we have a huge opportunity to shape these? We probably have a lot stronger voices, maybe as strong voices as later on the regulator might have years down the road, right? Exactly. I think engineers have a very strong voice there. And like we talked about earlier, engineers need to articulate trade-offs in such a way that business leaders can then make educated decisions about how to address those trade-offs. And part of those trade-offs is pointing out risks and risks include not just technical risks, like the data might get corrupted, but they include societal risks as well. For example, like what's negative effects, what harms might arise from this technology, what sort of unintended consequences possibly, or what like a risk for reputational damage if it turns out that a technology has some harmful effects, you know, that can reflect badly on the company that made it. And that has to be part of the trade-off discussion. And I just want people to make intentional and deliberate decisions about those kinds of things and not just sweep it under the carpet. One of the hot topics these days is, of course, AI. And you've written a very interesting post about this, just in December, about formal verification, how your conviction that formal verification might be more important with AI. Can we talk for those of us engineers who have heard formal verification, can we talk about what this is and how you envision this becoming more important? Yeah, so there's a whole range of formal methods. One approach is to, for example, use a specification language like ISB or TLA plus or something like that to describe the expected behavior of a system at a high level and then use a model checker, which is essentially like a randomized test case generator to just play through a lot of scenarios and see whether the system has those desired behaviors in all the different scenarios. That's like the sort of intro level formal verification, I would say. The more advanced level is to use actual formal proof. And in that case, you can write a specification of some system in a formal language is usually using mathematical notation, and then make a mathematical proof that a certain algorithm or a certain implementation always satisfies that specification. And the distinction to testing there is that, well, in testing, you just try through a couple of examples, give the algorithms some example inputs and check whether you get the expected output in those particular examples. But a proof can reason about potentially infinite state spaces. So it can tell you things about like every possible thing that could possibly happen in the entire universe, show that, for example, a certain safety property is always given in those. Formal verification is a lot of work. I never used it in my time in industry because it's just too time consuming, basically. I only got into formal verification when I was in academia and I could afford to take the time to spend a few months proving an algorithm correct. But there I've started finding this very useful, especially if I was working on very subtle algorithms where it's very hard to tell just from reading the implementation whether this actually is always correct under all possible cases. But if it's an important algorithm where, for example, it will corrupt data if there's a mistake in it, or it will have a security vulnerability if there's a mistake in it, then when it's high stakes, things like that, then I feel it's worthwhile to have formal verification and to really make sure that the code really is correct. And so I've done some formal proofs using the Isabel proof assistant, for example. There are a couple of others as well, like Rock and Lean and so on. These proofs are really hard to write. It takes a long time to learn the language of writing those proofs. And then even once you know the language, it's just really laborious in order to actually write the individual proof steps. And when you say it's hard to write, just as someone, I know how to code on so many different languages. Can you just explain what it means to hard to write? Is it, does it feel like a strict programming language with all sorts of rules or lots of math formulas? What makes it hard for you to learn it and get good at it? Yeah, so you're trying to make a proof that a certain piece of code always satisfies a certain property. In some cases, that property might be quite easy to specify. Let's say as a really simple example, you have two lists and you want to concatenate them and then you want to prove that the length of the concatenated list equals the sum of the two individual lists. You know, very, very simple property. How would you prove something like this? Well, you would have a function that concatenates two lists and then you would probably do a proof by induction over one of the lists that shows that, okay, well, if you have one list of length i and another list of length zero, well, then the sum of the two is i. If you have a list of length i appended with a list of length one, well, then it's i plus one and so on. And then by using a proof by induction, you can then show that the length of the concatenated list is i plus j, where i and j are the lengths of the two input lists for every possible value of i and j. And this is something that, you know, in a test case, you would, in tests, you would maybe test it for the cases of j equals zero, j equals one, and j equals five, and then you're done. And j equals integer max. Yes. Like the edge cases, that's what we do. That's how I write my unit tests. Exactly. And so this is a trivial example, like list concatenation. You can easily just read the code and convince yourself that it's correct. But if it's a much more complex algorithm, then your brains just can't like grok the algorithm well enough to really convince ourselves that it's correct if you don't prove it. And that's where these proofs then become handy. If I'm an engineer and I would be interested in getting started with formal verification, for example, because I have the notion that it will be more important with AI, of course, it will be easier to write these things. Where would you point engineers to get started or how did you get started in this field? I would suggest starting with model checking. So something like TLA plus or FisBR are much friendlier to getting started with compared to proof assistants like Isabel, Rock and Lean. These proof assistants just require a whole lot of additional knowledge and the resources for learning about writing these formal proofs on, to be honest, not particularly good. I haven't really found really great books on it as well. The way I learned it was by working with some colleagues in my lab who had learned it through years of prior experience. And I just sat down with them and paired with them at a desk where like I described the thing I was trying to prove and they showed me how to prove it step by step, how to break it down. I'm interested to see if you're thinking will be correct, which is this thing will go more mainstream. And hopefully we'll have better books and resources for it as well. Yes, I do hope so. So the reason I think that, I believe that this formal verification could become more important in the future is there are kind of several aspects to it. One is that the LLMs are getting increasingly good at writing these proofs. And if we don't have to write the proofs by hand as humans, it just becomes feasible to do them in situations where previously it would have not been economical. But also LLMs increase the need for these formal proofs because, you know, we're vibe coding a bunch of stuff. If we have to manually review all of that code, then that will become the bottleneck. So we can't really have humans reviewing all of the generated code either if we really want to get the benefits of AI. So we need some automated way of checking whether the code is correct. And writing lots of tests is a very good starting point. But the thing that proof can do that tests can't, is to consider absolutely every possible thing that could happen. And that's really important in a security context, for example, where it just takes one little bug to create a vulnerability that destroys the security of the whole system. And so I feel In a decentralized setting where we don't have just a single server that can make that decision, in a centralized setting, you know, you just have one server. It decides, did the edit to the document come first or did the replication come first? And that one server makes that decision. But if you have multiple servers, they might make different decisions. So then you could have a consensus protocol, but then consensus is messy because it requires, like, some quorum votes and requires nodes to be online. And so we've been trying to do the whole thing without doing consensus, but while preserving high availability, while preserving the ability for users to work offline, preserving the ability to synchronize peer-to-peer without any servers, for example. That just makes the engineering challenge a lot harder. And it's solvable, and we're close to solving it for AutoMerge, which is the CRDT library that I work on. But it's just much less straightforward than it is in the centralized case. But that's a nice example of where interesting engineering challenges arise from this desire to get away from centralized services. And then we were just talking about clocks earlier, but an obvious thing that came to mind is, well, if all of them had the same clock exactly to the microsecond, you could just use a clock. You could use this timestamp. But as you said, in distributed systems, we cannot always trust the clocks are always synchronized. So I assume you just have these, a lot of the things that you have been researching and writing about are just coming back to— Absolutely. And in this particular setting of, like, a user getting their edit permissions revoked, if a revoked user still wants to, say, vandalize a document, they can just backdate their edits, give it an earlier timestamp. So relying on clocks is absolutely useless here because people can forge the timestamps from those clocks and thereby then potentially undermine the access control mechanism. So in this kind of system, we have to worry about potentially maliciously generated actions as well when the actions come from end-user devices. This is fascinating because it feels to me that you're solving a hard or maybe even harder engineering challenge than some startups would do because the startups would go the easy route. They would take on a constraint, in this case a centralized server, which makes business sense, makes revenue sense. But because you are not doing this, you now need to look for a solution for a harder problem. And if you solve this harder problem, you can give a building block that can move the industry forward. Just give an option for either a business or an individual or an institution to, you know, like have an option not just to use centralized, but use this decentralized, local-first approach. And then, of course, reason about the trade-off and decide whichever makes sense. Exactly. And that's what I mean with this long-term thinking. This is an example of it where, because it's research, we can afford to take this idealistic, principled stance and say, yes, we're going to solve this harder engineering problem because we think decentralization is a valuable feature. And we know perfectly well that most startups are not going to solve this problem because they will just do the easy, pragmatic thing, which is the right thing for startups to do. But we have a different set of incentives and we can afford to put in the time to try and solve those hard problems. And as you said, if we can solve them, then it creates more optionality for anyone and users of this technology. They can, if they want to, choose to use this decentralized tech. And there are still trade-offs around it, but at least if they're not having to invent it from scratch, it'll be a lot easier to adopt this kind of decentralized tech for those who want to use it. So inside in academia, you're also teaching. What courses do you teach? At the moment, I have a concurrent and distributed systems course for the undergraduates and a cryptographic protocol engineering course for the master's students. And then additionally, this year, I have a seminar course on security and I'm teaching also the undergraduate operating systems course. So I've got quite a lot of teaching this year. The distributed systems course, it's available on YouTube. Can you summarize what people who would go through this course, which again, is freely available. Thank you for you and the university for making it available. What would they learn throughout those courses? Yeah, so that distributed systems course, it's a bit more theoretical than what is in the book. So it's more focused on algorithms and sort of how we convince ourselves that the algorithms behave correctly under the assumptions of distributed systems that we talked about, of like nodes may crash, communication might be unreliable, clocks might be wrong, et cetera. So that's really it. It's not a very long course. It's just eight lectures worth of material, but it's it goes into substantially more detail on the algorithms than the book. So, for example, one of the lectures goes through the entire raft consensus algorithm, which is pretty complex. But I really wanted to show the students exactly how it works, because it's just such a nice illustration of the challenges of distributed systems and the various measures we need to take in order to handle the various types of edge cases and failures that can happen. And showing that those problems can be overcome. It's not easy and the algorithms are very subtle and it's very easy to have bugs in them, but it is possible to solve consensus in a way that works pretty well. And so that's really the sort of message I'm trying to get across with this course. And you mentioned that when you're when you're writing the book together with Chris, you brought a lot in the industry inside and being up to date. And you brought your experience of teaching and what works. I don't think I have a particularly like unique teaching style. I just in lectures, I will go through slides. I like to annotate the slides by hand during the lectures. So just draw on an iPad to make it a little bit more interactive. But other than that, it is fairly theoretical. That's partly the way the Cambridge system works. It kind of favors theoretical and pen and paper courses over, say, implementation, practical courses. I think it would be possible to certainly to do a practical course on this. And I may incorporate a bit more practical exercise in the future. But right now it's mostly a theoretical pen and paper course. And that is fine. The cryptography course that I do is that's much more hands on. So that's about actually getting the students to like implement some elliptic curves from scratch, for example. And how have you seen it in your time in academia, which has been it's now a longer time period. How have you seen computer science education changing? How do you think it might change further in the future, especially as we're seeing AI be part of industry and probably the world as well? Yeah, I mean, prior to AI explosion happening, actually, rate of change is very slow in in computer science teaching. Partly that might be Cambridge, you know, Cambridge is over 800 years old. Like everyone thinks on longer timescales. People don't tend to rush into the latest fads and instead try to focus on the fundamentals and the ideas that a lot of the fundamentals of computer science were developed in the 1930s already and are still true today. And, you know, Lambda calculus and those types of things, for example. And so we have quite a bit of a focus on those sort of fundamentals rather than chasing the latest fashionable thing. That said, AI has totally changed the way we can assess coursework, for example, because, of course, now we we can try banning AI, but it's impossible to actually enforce such a ban. And also, it's kind of counterproductive because we do want students to engage with new technologies and figure out how to use them productively for themselves. But we want to somehow do that in a way that supports their own learning and doesn't undermine it. So how do we get the students to use AI in a responsible way, in a way that's mature? And we can't necessarily rely on the students being mature enough to know for themselves what is a helpful use of AI and what is a form of use of AI that undermines their own learning? Because some of them are quite mature and able to decide that for themselves, but many are not. And so we need to provide some guardrails for them. And we do need to make sure that when we have assessed work, for example, it's fair and it's perceived as fair by the students. And if the students feel that some of their co-students are getting really good marks without doing any work, that undermines the trust in the entire system. And so we have to be very careful with how we approach this. And to be honest, we don't really have good answers yet. So we do now, for example, have a boot camp right at the start of the first year for the new students to expose them to basic software engineering skills, which is like this is version control. This is unit testing. This is generative AI. And the sort of basics that really everyone should be familiar with. And then the hope is that they will use that throughout their degree in order to just improve the work that they do. But how exactly we handle things for assessment, for example, we're still in the process of figuring out. So it sounds like the pace of change is going to be fast in the industry and also in academia will probably adopt it and we'll see what comes after. Yes, there's a difference, though, which is in the desired outcomes. I think with industry generally, the desired outcome is like a working product, for example. In academia, the actual artifacts that the students produce, like an essay that a student's write, that's not really the point. We don't ask the students to write essays because we love reading their amazing essays. We asked them to write essays because we want them to go through a thought process which helps them learn something. And it's that Yeah, my feeling is that they're not really that mutually exclusive, or rather, some of the best PhD students I've worked with, for example, actually have a few years of industry experience. So they might have done an undergraduate, maybe done a master's, then spent a few years in industry developing like actual doing real software engineering, learning about the real world, and then maybe at some point got bored and thought, Oh, actually, you know, I want to work on maybe more idealistic things or have more freedom to choose their own research topics and then start getting interested in doing a PhD. And that I find is quite a healthy route. You do get people who go, you know, straight from their undergraduate degree and master's into doing a PhD. But sometimes those people can just lack a bit of the breadth of perspective. And so I think having seen a bit of just real world engineering is actually really helpful for people, even if they then want to stay in research. But in the opposite direction, I think it can work very well too, because in research and academia, we just get to think things through a lot more carefully than people often do in industry. Often people in industry, I feel like sort of have short circuit reasoning, like don't maybe don't quite reason something through from first principles, but just like, oh, I heard this from a conference talk. I'm just going to go with that. And what academia can teach is the sort of nuanced and critical thinking to really reason through trade offs, for example, and to really like justify why something is true. And so I think it's really good, actually, if people can weave in and out of industry and academia a bit and not regard it as like two totally mutually exclusive career paths, but actually have a bit of switching between the two. Well, Martin, thank you very much. I expected us to talk a lot more about your book, which we did, but I have a newfound curiosity and respect for all the important and interesting academic work that you and everyone else is doing. So thank you so much for this. Thank you for the great interview. This was really interesting. I hope you enjoyed this rare conversation with Martin Kleppmann. I found it interesting to learn that the first edition of the book assumed that you have machines with local disks, but actually today this is not how most engineers build systems anymore. Cloud native primitives like S3 change how you build systems. And this is why this book just needed a refresh. I also appreciated Martin's take on whether engineers still need to understand system internals when they're using managed services. If you're building business logic on top of these services, you probably don't need to know every detail, but it can become useful to be able to look deeper, especially when you need to debug your system. By the end of our conversation, I gained a lot of appreciation for the academic research that Martin is doing. The local-first software work, the access control problem in decentralized systems, using cryptography to verify supply chain emissions. A lot of these are hard engineering problems that few startups would take on. It was nice to understand how academia is in a good position to do work that has a long-term focus. Do check out the show notes below for related the Pragmatic Engineer deep dives. If you've enjoyed this podcast, please do subscribe on your favorite podcast platform and on YouTube. A special thank you if you also leave a rating on the show. Thanks, and see you in the next one.