How Braintrust uses AI agents, evals, and CI to ship better software

Overview

This episode is a technical discussion about where coding agents actually help senior engineers: not on toy tasks, but on ugly, expensive problems like query performance, infrastructure changes, schema migrations, and AI product evaluation. Claire Vo and Ankur Goyal argue that the right way to work with agents is to define the outcome and tests clearly, then let the model run many experiments that a human would never have time to do by hand.

A second thread runs through the whole conversation: evals are not a side topic for AI teams. Ankur’s view is that evals are the modern version of a PRD - a way to define what success looks like in examples and measurable checks, rather than trying to spell out every implementation detail up front.

Key Takeaways

Ankur’s main point is that strong models change engineering work from specifying the exact method to specifying the target and the checks. In his case, that means taking slow real-world query patterns, reproducing them, and using coding agents to test combinations of indexing strategies, column-store formats, and execution engines. The value comes from scale and persistence. An agent can run benchmarks for days, compare many options, and keep digging where a human would run out of time or patience.
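
The "matrix" Ankur describes can be pictured with a small, hedged sketch. Everything in it is an assumption for illustration: the two layouts and two scan strategies are toy stand-ins for real column-store formats and execution engines, and the (latency_ms, status) schema is made up. The point is only the shape of the experiment: enumerate every combination, check that each one returns the right answer, then compare timings.

```python
import itertools
import statistics
import time
from typing import Callable, Dict, Iterable, List, Tuple

Row = Tuple[int, str]  # (latency_ms, status); a made-up trace schema


def make_rows(n: int) -> List[Row]:
    """Generate synthetic rows; every tenth row is an error."""
    return [(i % 250, "error" if i % 10 == 0 else "ok") for i in range(n)]


# Storage layouts (toy stand-ins for real column-store formats).
def build_row_layout(rows: List[Row]) -> List[Row]:
    return list(rows)


def build_column_layout(rows: List[Row]) -> Tuple[List[int], List[str]]:
    return [r[0] for r in rows], [r[1] for r in rows]


# Each format pairs a builder with a reader that yields (latency, status).
FORMATS: Dict[str, Tuple[Callable, Callable]] = {
    "row": (build_row_layout, lambda data: iter(data)),
    "column": (build_column_layout, lambda data: zip(data[0], data[1])),
}


# Execution strategies (toy stand-ins for real execution engines).
def engine_loop(pairs: Iterable[Row]) -> int:
    total = 0
    for latency, status in pairs:
        if status == "error":
            total += latency
    return total


def engine_genexpr(pairs: Iterable[Row]) -> int:
    return sum(latency for latency, status in pairs if status == "error")


ENGINES: Dict[str, Callable[[Iterable[Row]], int]] = {
    "loop": engine_loop,
    "genexpr": engine_genexpr,
}


def run_matrix(rows: List[Row], repeats: int = 3) -> Dict[Tuple[str, str], float]:
    """Time every format x engine cell and return median seconds per cell."""
    expected = engine_loop(iter(rows))  # reference answer for correctness
    results: Dict[Tuple[str, str], float] = {}
    for (fmt, (build, read)), (eng, run) in itertools.product(
        FORMATS.items(), ENGINES.items()
    ):
        data = build(rows)  # build cost is excluded from the timing
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            answer = run(read(data))
            timings.append(time.perf_counter() - start)
            if answer != expected:
                raise ValueError(f"{fmt} x {eng} returned a wrong answer")
        results[(fmt, eng)] = statistics.median(timings)
    return results


if __name__ == "__main__":
    matrix = run_matrix(make_rows(100_000))
    for (fmt, eng), seconds in sorted(matrix.items(), key=lambda kv: kv[1]):
        print(f"{fmt:>6} x {eng:<8} {seconds * 1000:8.2f} ms")
```

An agent working on the real problem would swap in actual formats, engines, and reproduced slow queries. The correctness check before each timing is the part that lets the loop run unattended.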

Both speakers push back on the idea that AI falls apart on hard engineering work. Claire says Codex paired with GPT models is the only setup she has used that can handle long-tail technical problems, provided the goal is precise and the evaluation loop is strong. Ankur goes further: he says no staff engineer is manually running as many benchmarks, trying as many algorithms, or checking as many ideas as someone using agents well.

They also draw a useful line around autonomy. Claire describes an "agent line": if the information discussed in a meeting could be handed to an agent and it would reach the same decision or produce the same result, that work should probably move below the line. In her view, that line keeps rising.

On evals, Ankur gives the clearest framing in the episode. Evals are a way to encode "what good looks like." They are not just for model research. They belong in product development, coding workflows, and internal tools. He compares them to PRDs with examples attached and scored. Once those success criteria exist, models have room to search for better solutions.
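
To make "a PRD with examples attached and scored" concrete, here is a minimal sketch. It is not Braintrust's SDK or Ankur's code. The EvalCase shape, the keyword scorer, the 0.8 threshold, and toy_task are all hypothetical choices made for illustration. The examples state what a good answer must contain, the scorer turns that into a number, and the threshold says when the result is good enough.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One scored example: a question plus terms a good answer must mention."""
    question: str
    must_mention: List[str]


def keyword_score(answer: str, case: EvalCase) -> float:
    """Fraction of required terms found in the answer (case-insensitive)."""
    if not case.must_mention:
        return 1.0
    text = answer.lower()
    hits = sum(1 for term in case.must_mention if term.lower() in text)
    return hits / len(case.must_mention)


def run_eval(
    task: Callable[[str], str], cases: List[EvalCase], threshold: float = 0.8
) -> Dict[str, object]:
    """Run the task on every case and compare the mean score to a threshold."""
    scores = [keyword_score(task(case.question), case) for case in cases]
    mean = sum(scores) / len(scores)
    return {"scores": scores, "mean": mean, "passed": mean >= threshold}


# Hypothetical examples; a real suite would come from real user questions.
CASES = [
    EvalCase("How do I log a trace?", ["trace", "log"]),
    EvalCase("What is a scorer?", ["score"]),
]


def toy_task(question: str) -> str:
    """Placeholder for a real model or agent call."""
    if "trace" in question.lower():
        return "Call the logging helper to log a trace."
    return "It returns a number."


report = run_eval(toy_task, CASES)
print(report)
```

The task function is the only part a model or agent is free to change. The cases and the scorer are the "what", and they stay fixed while the "how" is searched.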

There is also a cautionary point: speed creates clutter. Ankur says product building now looks more like carving than constructing. Since it is easy to add features and code, teams need more discipline about removing confusing functionality and investing in CI so faster output does not turn into more broken output.

Practical Steps

Pick one painful technical problem with a measurable outcome, such as query latency, migration accuracy, or cost per request.
Build an eval or benchmark first. Define success with tests, examples, and thresholds before asking an agent to generate solutions.
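Use production or production-like data when safe. Ankur says that with the right engineering in place, agents can run on a subset of production data, and that this can be safer than human testing because no person is looking at the data.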
Run several agent sessions in parallel, but stay within your own review capacity. Claire says her practical limit is about four.
Move heavy experiments to cloud or remote development environments if local machines become the bottleneck.
Invest in CI before chasing more AI throughput. Ankur’s view is that strong CI is what earns a team the right to move faster.
When an agent keeps failing, stop restarting the conversation endlessly. Close the session, improve the eval, and start again.
Remove confusing product behavior instead of layering on more controls. Use complaints to simplify the product.
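
The steps about defining thresholds first and investing in CI can be combined into one small gate. This is a sketch under stated assumptions: the metric names and limits are hypothetical, and a real pipeline would read measured values from its own benchmark or eval run. The check fails when any metric is missing or outside its limit.

```python
from typing import Dict, List, Tuple

# Hypothetical limits: "max" means the value must not exceed the limit,
# "min" means the value must not fall below it.
THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "p95_latency_ms": ("max", 500.0),
    "migration_accuracy": ("min", 0.99),
    "cost_per_request_usd": ("max", 0.002),
}


def check_metrics(
    metrics: Dict[str, float],
    thresholds: Dict[str, Tuple[str, float]] = THRESHOLDS,
) -> List[str]:
    """Return one human-readable failure per missing or out-of-range metric."""
    failures: List[str] = []
    for name, (kind, limit) in thresholds.items():
        if name not in metrics:
            failures.append(f"{name}: missing from results")
            continue
        value = metrics[name]
        if kind == "max" and value > limit:
            failures.append(f"{name}: {value} exceeds limit {limit}")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} is below floor {limit}")
    return failures


def main(metrics: Dict[str, float]) -> int:
    """Print failures and return a process exit code (0 = pass, 1 = fail)."""
    failures = check_metrics(metrics)
    for line in failures:
        print("FAIL", line)
    return 1 if failures else 0


# Made-up numbers standing in for one agent-produced candidate.
candidate = {
    "p95_latency_ms": 420.0,
    "migration_accuracy": 0.995,
    "cost_per_request_usd": 0.001,
}
print("exit code:", main(candidate))
```

In a CI job the return value of main would be passed to sys.exit so that a failing candidate blocks the merge. Treating a missing metric as a failure keeps a broken benchmark from passing silently.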

Notable Quotes

"If you create the right tests and success criteria for a model, then it can be really creative." - Ankur Goyal
"Evals are actually the modern version of a PRD." - Ankur Goyal
"Product building and code writing now looks like carving rather than constructing." - Ankur Goyal

"Evals are a methodology for you to say, this is what success looks like, and then you let a model figure out the how while you stay focused on the what." - From the episode

Full Transcript

Source: openai 40m runtime

I'm still in, as I say, the year of our Claude 2026. I still talk to engineers that say, AI on our most complicated things cannot do a good job. I so viscerally disagree with that. There's no staff engineer who is running as many rigorous benchmarks and trying out different algorithms and analyzing ideas manually as someone who's using an agent. Everyone should take a hard look in the mirror and reevaluate how they spend their time. There's a lot of interactions that you have or direction that you're giving or decisions that you're making. And I think like many of these things, to me, fit below the agent line. And to me, the agent line is like, if I or whoever would be at the meeting or whatever, if we equivalently took the information that we're discussing and we just gave it to an agent, would it solve the same problem? And I think the agent line keeps going up. Why do you think this concept is so important to understand and how can you demystify it for folks who are a little intimidated by it? Now that models are so good at actually writing code, one of the best things that we can do is create really hard evals. And if you create the right tests and success criteria for a model, then it can be really creative and it can work on this stuff in the background and actually try to improve a bunch of things. I have a lot of people saying, wow, if I go so far as to turn my own taste or my own skills or my own expertise into a system, I'm functionally just building my own replacement. We're able to have David's palette applied to more things. I think the quality bar that we're able to hit is higher because we're able to get more things to that bar. Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive here on a mission to help you build better with these new tools. Today I have Ankur Goyal, the CEO of Braintrust. And this is a technical one. 
So if you're a senior or staff engineer or a VP of engineering or CTO, this is one you're really going to want to pay attention to. And we're going to talk about how coding agents can help you bite off really technical architecture and infrastructure work in a way that no other human engineer could before. We're also going to demystify evals for folks and just show you exactly how you can use them to make your AI products better without having to touch a thing. Let's get to it. This episode is brought to you by Guru, the AI layer of truth for your company's knowledge. Here's the problem. Your AI is only as good as the information you feed it. Most companies are getting confident but wrong answers from AI because their underlying knowledge is outdated, incomplete, or just plain incorrect. Bad information doesn't just slow you down. It costs you money. It puts you at risk. Guru solves this by adding a verification layer between your company's knowledge and AI tools. Instead of just hoping your AI gets it right, Guru automatically scores content for accuracy, flags outdated information, and ensures your team gets trustworthy answers every time. It works with the tools you already use, so you don't have to change how you work. Thousands of companies trust Guru to keep their AI accurate and compliant. Ready to stop playing Russian roulette with your company's knowledge? Visit getguru.com to learn more. Welcome to How I AI. I'm excited to have you here. I'm super excited to be here. Thanks for having me. So I'm going to make you laugh, but I recently did an episode about the recent GPT 5.5 model release. And I know you and I use Codex. And one of the funniest comments in that post was, Claire, can you do an entire episode about tech debt? And we were talking before we got on the recording, you're like, how technical and how nerdy is this audience? And I'm like, bring it on. 
So we are going to talk a little bit about how you approach engineering and then how you use AI to do things like optimize slow queries. So let's let's hop in. Tell me, tell me about your approach to to software engineering in the age of AI. You know, I spend a lot of time working on software for doing evals and observability, and that's kind of shaped my own perspective about software engineering. Like now, now that models are so good at actually writing code, one of the best things that we can do is create really hard evals. And not I'm not talking about like AI evals. I mean, things like, why is this query so slow? And if you create the right tests and success criteria for a model, then it can be really creative and it can work on this stuff in the background and actually try to improve a bunch of things. So one of the things that I spend a lot of time on right now is making the queries that people run in our product faster. And people can just write arbitrary queries. Like, you know, they can, there's an example, someone is trying to find like needle in a haystack of some specific kind of interaction someone had in their product. And they're looking at like billions and billions of traces and they want to find like the 5,000 or something that match. And this is over like a 90 day period or something, like a lot, a lot of data. And that's one example of a query and like, okay, there are all these things that you can do in database literature, like different indexes you can build and different ways you can prefetch data and blah, blah, blah, all this stuff. But how do you try all those things? And how do you run all the experiments required to actually do something like this? So what we do and what I've personally spent a lot of time working on is trying to figure out, you know, manually is fine, but automatically is even better. Like, what are the patterns of queries that people are running that are slow? 
And then we will reproduce those things and use a coding agent to try out a bunch of ideas from database literature. So like download a bunch of data locally and then maybe try different, in this case, right now, I'm trying out different column store formats. So we use an index underneath the scenes called Tantivy, which has a built-in column store, but it's not that great. Like the thing overall is great, but their column store is not like that great. And so what we're doing right now is like exhaustively trying every open source column store format out there and then exhaustively trying every column store execution engine out there and sort of computing the matrix of this. And it's, you know, it's like, it's amazing. I completely agree. As somebody who has led engineering organizations for a really long time, when you're trying to make infrastructure platform core component changes in your application, because of both the cost of implementing those being very high and then the unknown unknowns being quite risky, teams are actually pretty risk averse in terms of making big platform shifts or changes to their core implementation. It's like the thing that you shipped is the thing that you get stuck with, certainly on the engineering side. And what I love about AI right now and these coding agents in particular and then Codex in particular, is it has been the only setup, Codex plus these GPT models has been the only setup where I have been able to set up a very similar process, which is the outcome I want is XYZ. We need to programmatically test against pretty long tail data structures to figure out which of these potential solutions are going to get us closer to the outcome we want. In your instance, it's database query speed and latency. In my instance, I was doing a very, you can appreciate this, very complex data migration of stored, structured and unstructured data generated by AI. So it was all like messed up to begin with. 
And then I had to migrate it to a schema. And so it was like schema to schema migration, millions and millions and millions and millions of rows and lots of edge cases. And doing that as a human takes forever. You know, you can script it and you can like bang some systems against it, but then your human ability to manage those cycles and say, yes, that's right. Or no, that's wrong. Or this gives us indication that we should go left or that gives us indication we should go right. And so I do feel like this combination of like a very precise outcome and an agent that's smart enough to bang its head against a really, really long tail of problems with a guided sense of the technical space, it does really well. And I have not heard this on the kind of like data store side. It's really interesting, but I just think, hey, engineering leaders out there, I've had, I've been in so many debates about what we're using for our data store, how we optimize performance, what technologies we should bring into the stack versus not. And you can run those like very, very iterative loops on, I'm presuming you're using production-like data or real representative queries to test that. Is that right? You can actually use production data too, but for some subset of things and with the right engineering in place, you can just run on production data. Yeah. And in many ways, it's a lot safer than having humans test on the production data because no one, no one's looking at it. Yeah. And this is where I have so many staff engineers be really, really cynical about does AI have a place in their, their, their coding tools? I'm still in, as I say, the year of our Claude 2026. We still, I still talk to engineers that say AI on our most complicated things cannot do a good job. Oh, I, I so viscerally disagree with that. Same. Tell me why you disagree. Well, I, I mean, I think, so I've been working on databases for almost two decades. 
There's not many things that staff, whatever, It is, I don't judge them for it. Like, I don't know what else they would do. But it's kind of crazy. And then I also have remote ones. So here's one where I'm working on trying to improve our column store performance. And this is running on not real data, but close to real data. And it's running remotely. And it's, you know, it's running like much more scale and many... And I mean, if I ran this on my computer, it would probably die from just how much compute it's using. But I'm able to, in this case, test, like, what's the real latency between EC2 and S3 if I'm trying to do like 4000 concurrent reads? Is it enough? Is it not enough for this workload? Can I interleave things whenever properly? And I've been running this experiment for several days just trying to figure out, like, what's the best? You know, right now, I'm talking to it about what the indexing lifecycle should be because I think we figured out how to make the queries fast enough. Some people are going to be listening to this being like, oh my gosh, this is so technical. I don't have these problems. Let me take a step back for folks and tell you what I think I'm seeing here. Which is, one, you're using Codex, right? Yeah. Codex for hard problems, people. I'm telling you. Just that... I think it's currently the only model that will disagree with you regularly. And I think if you're working on hard problems, it's very important. And then for you, what I'm also hearing is you're using foreground agents. You basically have a personal concurrency limit of, like, let's call it four, which is about what I can do as well. So I think people ask me all the time, how do you handle all this context? I'm like, I don't do more than I think I can do at any one time. And I also... 
I have more trivial problems than you, so I think you're right in that the current sort of commercial background agents, I would call them, that you can buy off the shelf, work very well for web, like standard web apps. I'm very happy with them. If you are not using one of them as an engineering organization, maybe it's like doing classic SaaS, highly, highly, highly recommend. But I am hearing more and more from teams two things that you called out. I am hearing more and more people are just building their own background agents. So it's happening. It's happening in teams, very, very big and very, very small. I think the primitives are there to start experimenting with it. And so I don't think it's going to be as surprising to us to hear about people building their own internal coding background agents, even if, like, core infrastructure is something from the big model providers. I think the second thing that I'm hearing a lot, and we heard this from the Stripe team, is investment in cloud development environments and remote computing. Again, because if you were to run some of this stuff, especially the data heavy stuff on your computer, it starts to sound like an airplane taking off. It's no good. And then the last thing I heard you say, which is like ports. I joke with everybody. I say, worktrees everywhere. Ports 3000 through 3009 accounted for, like, I am just like everything. And I have to call out Chris Tate at Vercel released a thing called Portless, which just makes managing multiple ports localhost ports on your local machine a little nicer. So for simple things, I would go look that up. We'll link it in the GitHub show notes. But, you know, common problems that I think people have running concurrent engineering processes on their own machine. And then the like meta thing, which is just like make time to code. You need it. Yeah. Everyone. I also don't take meetings after one. 
Sometimes I'll do podcasts in the early afternoon for folks, but all afternoon, I'm just like in my real state, which is hoodie on. Bad posture. I think that I'm sure you feel this too, but like when I was handwriting most of my code, I would enter this sort of like euphoric flow state where I, you know, I just completely focused on a problem. And then when I started doing a lot of agent coding, I lost that for a little bit. But now when I'm writing code, you know, Lane 8 just released a new album yesterday. You should listen to it. Put on your hoodie and your headphones. I'm like way. I'm like totally back in that state now just doing a different workflow. Yeah. And I'll give folks the sort of, you know, AI mom of the Internet that I try to be, which is I do feel like a lot of people are, they kind of go into two camps. They are having more fun than they've ever had before. And they're back in the flow state of like what got them into software engineering or building or technology or whatever. Or they're approaching like cloud anxiety, burnout breakdown because they feel this like productivity anxiety. And they're not, I think what I see is that people feel like if they're in a meeting and they're not kicking off agents, they're doing something wrong. Or if they're talking to somebody and they're not kicking off agents, they're doing something. And I just say, like, I like the idea of chunking your time with AI a little bit more. I think it just narrows you on the more productive pieces of it. And it's also just a more enjoyable way to get stuff done. Yeah, I had a phase, which I think I'm over. You know my wife, Alana, where I, we would have, we have dinner together usually like pretty much every night. And so I had a phase where my laptop was not at the table, but open and on the couch. And I think I've progressed beyond that phase now. So now the laptop is closed. And I think it's an important, it's an important thing. I agree. 
When I was first using OpenClaw, I installed it on an old MacBook and it would like stay open on our kitchen island, which is where all our plugs are. And it would like hover over us at dinner and hover over us at, at breakfast. And if it got moved, I was like, where is Polly? Is she alive? Is she open? Is she closed? So yes, close your laptop, people. Close your laptop. All right, so, you know, we covered the first half of this episode, which I think is very interesting for technical folks. How to have kind of like long running or just really diligent agents run against technical problems to give you real benchmarks about performance on changing things. I love that. Second thing is just your core workflow on how you do coding, both how you dedicate time and then technically just what your workflow looks like. Let's talk about evals, because I feel like this is something that's very intimidating to a lot of people. And obviously you built a product that supports this. But taking a step back, why do you think this concept is so important to understand? And how can you demystify it for folks who are a little intimidated by it? Machine learning specifically shifts the task of programming from being about the how to being about the what. And this is true, like forget about LLMs. Like, you know, it's true with, let's say, like you're back in like middle school, you're doing like, remember statistical regression? You're not defining the, you're computing what the slope and the y-intercept should be. You're not defining it, but you give it all the points, which are the, you know, the what, not the how, which is the slope and the y-intercept. And I think that, you know, the cool innovation around like transformers and the next token prediction task, which lets you, you know, ablate tokens and do all this cool stuff. It's all about saying like, okay, here's like the compute substrate and here's the what, which is the outcome. It's predicting the next token. 
Can you go and use a lot of GPUs and figure out how to achieve that? And I think that if you take that as inspiration for anything you do with AI, then you're able to be more productive. And I think that applies to traditional programming, like what we just talked about. I'm not dictating exactly the implementation or even the set of algorithms that we're using to solve problems. I'm just trying to define very succinctly what the problem is and why it is a problem and how to assess the solutions to the problem. It also applies to building AI software. And that's what evals are all about. Evals are a methodology for you to say, this is what success looks like. In my opinion, evals are actually the modern version of a PRD. So a PRD, you would say, hey, in prose, this is what success looks like. Evals are also often written in prose, but you supplement that with examples. So, you know, the best PRDs, they have good examples. Like they, maybe someone's made a demo or written out like a user story or something. It's the same thing. It's just the difference with evals is you encode those user stories in a way that can be quantified to some extent. And then you let a model or whatever figure out the how and you are really focused on the what. Give me an example of how you use this in product development, just to make it a little bit more tangible for folks. Yeah, let's start with something that I think is quite straightforward. And then we can venture into the less straightforward stuff as we go. So this is our UI and like I'm working on a very simple task here, which is I'm trying to create a prompt that will be part of an agent that is good at answering questions about Braintrust documentation. So or my own expertise into a system, whether that system is like the David eval, the David in a loop judge, or something else. I'm functionally just building my own replacement. And I am presuming, because I do and it sounds like you do too, you value David more in this system. 
Oh, yeah, yeah, yeah. We're able to have David’s palette applied to more things. I think the quality bar that we're able to hit is higher because we are able to get more things to that bar. I love it. Okay, so this has been a powerhouse episode, one of my favorites. We've talked a lot about, you know, solving really technical problems with AI. We've demystified evals a little bit for folks and shown how, in a safe space, you can actually let AI, I think that's one of the meta themes of this, is in a safe space, you can let AI run with a lot of autonomy and you'll, you know, throw a lot of data at it and you can get higher quality outcomes much more so than if you were to manually fix things or even manually evaluate things. I'm gonna do a quick lightning round, and then we'll get you back to, I mean, it's almost noon, so back to coding. It's time to code. Time to code. One, I have a question. When you say there is no excuse, there's no excuse for bugs, there's no excuse for little design nits, there's no excuse for that. How do you feel like you practically, I maybe have two questions that you can answer, there'll be our two lightning round. How do you practically manage the velocity to customers, which is, do you ever get customers being like, wait, what's this? Wait, what's that? Like, too much features, just consumed as a customer? And then two, how do you technically manage the throughput into the system? Product building and code writing now looks like carving rather than constructing. So it's very fast to create something that has too many features and too many buttons and too much code, and you need to spend a lot of time removing stuff. And so we actually, I would say, 90% of the time someone complains about something, we remove the thing that was causing confusion and just make the system work better. 
Because we understand now that the person complained their point of view and we're able to build a product that doesn't even need the complexity that led them to the confusion in the first place. I'll give you an example. If you load a trace and you imagine hitting Command-F, you might in your brain think that that's just searching what's on the page. But what's on the page might be hundreds of megabytes of text and it's virtualized and then there's, it's across spans and there's also a table. So we had a very powerful search implementation that would search across the spans and rank everything and, you know, blah, blah, blah, all this cool stuff. And then a lot of people complained and they were just like, why is this, you know, I just hit Command-F, I just want it to show the thing. And we've just, we've really simplified it over time. So I think, I think we try to carve. And then in terms of technically managing it, we spend a lot more time working on CI than we used to. And so I think that a lot of platform effort has shifted so that if we are really good at CI, then we are able to move faster. And if we feel like we're constrained, then instead of shipping a bunch of crappy stuff, we're like, okay, let's pause and improve CI so that we earn the ability to move faster. Okay, again, for the VP of engineering in the back, invest in CI. I've told everybody, they're like, how do I accelerate my engineering velocity with AI? I was like, fix your CI. Yeah, yeah. I mean, I think every engineer is now building a platform and upon the platform, agents are doing the work that the engineers were doing manually, right? And I think that applies to evals. Like if you're an engineering team and you're building an AI product, the number one job for you is to build a feedback loop, meaning you have a pipeline that allows you to summon from the ether of real world data and turn that into evals. And as an engineering team, that is your number one job. 
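
[Editor's sketch] The feedback loop described here, a pipeline that turns real-world data into evals, might look roughly like the following. This is an illustration under assumptions, not Braintrust's pipeline: the JSON log schema (input, output, feedback, correction) and the sample records are made up. It keeps interactions that users marked as bad, drops duplicate inputs, and emits candidate eval cases.

```python
import json
from typing import Dict, Iterable, List, Optional


def logs_to_eval_cases(
    log_lines: Iterable[str], max_cases: int = 100
) -> List[Dict[str, Optional[str]]]:
    """Convert JSON-lines logs with negative feedback into eval cases."""
    cases: List[Dict[str, Optional[str]]] = []
    seen = set()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the log
        record = json.loads(line)
        if record.get("feedback") != "thumbs_down":
            continue  # only failures become new eval cases in this sketch
        key = record["input"].strip().lower()
        if key in seen:
            continue  # one case per distinct input
        seen.add(key)
        cases.append(
            {
                "input": record["input"],
                "bad_output": record["output"],
                # A human-supplied correction, when the log has one.
                "expected": record.get("correction"),
            }
        )
        if len(cases) >= max_cases:
            break
    return cases


# Made-up sample records in the assumed schema.
SAMPLE_LOGS = [
    '{"input": "What is the refund window?", "output": "There is none.", '
    '"feedback": "thumbs_down", "correction": "Thirty days."}',
    '{"input": "Where is my invoice?", "output": "Under Billing.", '
    '"feedback": "thumbs_up"}',
    '{"input": "what is the refund window? ", "output": "Unknown.", '
    '"feedback": "thumbs_down"}',
]

eval_cases = logs_to_eval_cases(SAMPLE_LOGS)
print(json.dumps(eval_cases, indent=2))
```

Cases without a correction still record what went wrong and can be labeled later. The value is in running this continuously so the eval set tracks what users actually do.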
It is not prompt engineering. It's not picking an agent framework. It's not rewriting your database, whatever. It's creating that pipeline. And the same is true. CI is that same idea, but applied to software engineering. Well, and I'll give one other tip, which is you think that those evals, people are always like, oh yeah, for my AI product, I need that. I have seen, again, I think the Intercom team has run a bunch of evals on their internal use of Claude Code to figure out where engineers are hitting pain points, where people are giving up, where the agents are asking for permissions that have to be escalated. And I think that sort of analysis on your team is very, very important and ultimately gets you to these better outcomes. Okay, last question. You seem like a very reasoned person, so I'm presuming I'm going to get a very reasonable answer, but I ask everybody. When AI, when one of your four tabs is not doing what you want, when the evals are failing the David test, what is in your back pocket prompting strategy that you rely on? Do you yell? Do you bribe? Close the session. And then I improve the evals and then I try from scratch again. This is a man who is on message. Yeah, I mean, I'll give you like an example. We have this open source use case, I'm sorry, a use case where we run open source models and we're running like millions of tokens per second. It's very, very high scale. So every cent matters and every bit of optimization matters. We are trying to change right now from model A to model B. And I, again, I am someone who builds software to write evals. I vibe coded an eval script and it went, it just was getting stuck. And then I read the code and it's like 3000 lines of complete trash. And it had like all these scoring functions and all this crap and it was getting confused. And so I, on Saturday, I hand wrote like no, no co-pilot, no autocomplete. I just, partly to improve my own understanding of the problem. 
I hand wrote the eval and then by the end of Sunday, the problem was solved. So you shut the session and you do it yourself. Yeah, just for the evals, just for the eval. Great. This has been so great. Where can we find you and how can we be helpful? If you are interested in evals or you're trying to solve AI observability problems inside your company, please check out Braintrust. We're at braintrust.dev, at Braintrust on X, or I'm at A-N-K-R-G-Y-L. I'm very happy to chat. We're also hiring if you like working on these problems and you like maybe pushing the boundaries of rigor and stuff and you found this kind of stuff interesting, we'd love to work with you. Well, thank you so much for joining. This was great. It was a lot of fun. Thanks so much for watching. If you enjoyed the show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howIAIpod.com. See you next time.