Archived December 18, 2025
The Lead — Dec 18
JUST NOW POSSIBLE · TERESA TORRES

Automating the Full Customer Support Iceberg: How Gradient Labs Built a Multi-Agent Platform

1h 01m / December 18, 2025 / business, technology, product / Transcript sourced from openai
Overview

This episode features Teresa Torres in conversation with Jack (product engineer) and Ibrahim (AI engineer) from Gradient Labs, an AI-native startup building agents to automate customer support in fintech. They unpack what “agents” mean in practice: a coordinated system that handles inbound support, back-office actions, and outbound follow-ups across long-running customer workflows.

Rather than treating agents as standalone chatbots, Gradient Labs describes an agent platform with shared architecture—procedures, skills, orchestration, and guardrails—that can be specialized per channel (including a newly shipped voice agent) and per customer.

Key Takeaways

  • Customer support is more than “answering tickets.” Gradient frames inbound support as “the tip of the iceberg,” with the larger opportunity in back-office tasks (e.g., dispute operations, fraud investigations) that consume significant human time and require action across internal systems.
  • Three-agent model enables end-to-end automation. Their vision ties together: inbound (handles the initial user request), back office (executes internal processes/tools), and outbound (proactively collects missing information and updates users).
  • “Procedures” translate human SOPs into agent behavior. A central design choice is letting non-technical subject matter experts write natural-language procedures (like internal docs) that the agent follows, including embedded tool calls. This reduces translation loss versus engineers coding business logic from scratch.
  • Outbound is uniquely hard because “done” is ambiguous. Unlike inbound (where users signal completion), outbound requires explicit success criteria and handling cases where the procedure ends but the goal wasn’t achieved (e.g., user didn’t provide required answers).
  • Architecture is deliberately constrained to reduce LLM chaos. A deterministic orchestrator (“state machine”) manages conversation state and triggers “turns.” On each turn, only a scoped set of skills is available, limiting unsafe or irrelevant actions.
  • Guardrails are treated as evaluated classifiers. Guardrails run as LLM-based classification checks on both user input and agent drafts (e.g., vulnerability, complaints, financial promises), tuned with precision/recall tradeoffs based on risk and monitored for drift via flag-rate spikes.

Practical Steps

  • Document your support work as procedures first. Start with existing SOPs and convert them into step-by-step natural language instructions that specify: intent, required data, decision branches, and when to escalate.
  • Define “done” explicitly for outbound workflows. For each outbound procedure, write measurable completion criteria (e.g., “3 answers captured,” “KYC status updated”) and include fallback paths when users don’t comply or respond.
  • Separate orchestration from reasoning. Use a deterministic workflow/state machine to manage turns (customer message, tool result, silence) and invoke reasoning modules only when needed.
  • Scope tool/skill access by context. Limit what the agent can do on a given turn (e.g., greeting-only on outbound start) to improve safety, predictability, and debuggability.
  • Implement human-in-the-loop as a “tool.” Where APIs don’t exist or approvals are required, route actions to humans with a structured summary and approve/reject flow (e.g., Slack/web queue) to still automate most of the work.
  • Build eval pipelines for safety-critical checks. Treat guardrails as classifiers: collect labeled examples, measure recall/precision, prioritize high-recall for high-risk categories, and use post-conversation auto-eval to surface cases for manual review.
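To make the "guardrails as classifiers" step concrete, here is a minimal sketch of measuring precision and recall on labeled examples and picking a threshold that favors recall for a high-risk category. It assumes the guardrail emits a numeric score per message; the episode does not say how Gradient Labs' guardrails expose their output, and the `LabeledExample`, `precision_recall`, and `pick_threshold` names and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    text: str
    is_violation: bool  # human label
    score: float        # hypothetical guardrail score in [0, 1]


def precision_recall(examples, threshold):
    """Precision and recall of 'score >= threshold' against the human labels."""
    tp = sum(1 for e in examples if e.score >= threshold and e.is_violation)
    fp = sum(1 for e in examples if e.score >= threshold and not e.is_violation)
    fn = sum(1 for e in examples if e.score < threshold and e.is_violation)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


def pick_threshold(examples, min_recall):
    """Highest threshold that still meets min_recall, i.e. the fewest false flags."""
    for threshold in sorted({e.score for e in examples}, reverse=True):
        _, recall = precision_recall(examples, threshold)
        if recall >= min_recall:
            return threshold
    return 0.0


# Illustrative data only; not taken from the episode.
EXAMPLES = [
    LabeledExample("I want to make a complaint", True, 0.95),
    LabeledExample("This is unacceptable, who do I escalate to?", True, 0.70),
    LabeledExample("I'm not thrilled but fine", False, 0.60),
    LabeledExample("Thanks, that worked", False, 0.05),
    LabeledExample("I've lost my job and can't pay", True, 0.40),
]
```

On this toy set, requiring full recall pushes the threshold down to 0.40, which catches the quietly worded vulnerability message at the cost of one false flag. That is the tradeoff the takeaway describes for high-risk categories.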

Notable Quotes

  • Ibrahim: “The inbound part is the tip of the iceberg and we’re trying to now start addressing the big chunky stuff that sits underneath that.”
  • Ibrahim: “Procedures…just look like a Notion document…telling the agent, what are the steps that you need to follow to resolve a particular type of problem.”
  • Ibrahim: “With Outbound…deciding when you’re done is a bit trickier because you can’t rely on the customer telling you when they’re done.”

Full Transcript

Source: openai 1h 01m runtime

Welcome to Just Now Possible with Teresa Torres. I'm Jack. I'm a product engineer with Gradient Labs. I have been working here for coming up to a year and a half, and I work on building out the web app that our customers use to manage their agent. My name is Ibrahim. I'm an AI engineer at Gradient Labs. I've been here for around about a year now, and my role as AI engineer is to basically work on building out our agent and its logic and reasoning capabilities to be able to handle the customer support conversations that we have. Excellent. Two engineers. Do you not have a product manager that you work with? We don't currently. So we have this kind of blend, which I think is becoming more common in tech of a product engineer. And so I think that role encompasses the engineering side, but also the product thinking. But to be honest, most of the team thinks about product. That's the prerequisite for working in a small team is that everybody is very hands-on with product. We didn't want that bottleneck of a product manager to be the person that makes these decisions. I think we're very collaborative in the way we work. Yeah, I love this. I'm a big fan of product managers. I don't mean that as a dig on product managers, but I think everybody needs to have a product mindset, and it really is about how do we, as a team of humans, regardless of our role, build products together. So that's great to see. Tell me a little bit about Gradient Labs. What does Gradient Labs do? So Gradient Labs, we are an AI agent company, effectively trying to automate all customer support, particularly focused in the fintech space. The main thing that we're trying to do is not just handle the inbound, frontline customer support. Our CEO, Dimitri, sometimes refers to this as the tip of the iceberg, which is what's most visible.
But there's this whole other world below the water, like below the iceberg, which is actually a lot of the customer support tasks that need to be done in companies, especially fintechs, behind the scenes. So back office tasks, things like fraud dispute management, fraud investigations. And these often take up more time from human support agents than the frontline stuff, which can be a lot simpler, like question answering. And so the way we're trying to operate is we've got these almost three pillars of customer support. We've got the frontline inbound customer support, which we also handle, simple question answering. We've got this concept of the back office agent, which can handle some of these back office tasks that typically sit in the company's internal systems. And then we've also got this outbound agent, which is able to proactively reach out to the customer and ask them questions, maybe when they haven't gotten in touch with us. And so when you tie these three things together, you see how we can automate the full suite of customer support, because you might have a customer get in touch and say, I don't like this transaction. I ordered blah, blah, blah, and I didn't arrive. I need you to dispute it for me. That triggers a back office task, which maybe spins up a dispute and sends it to the merchant. You now need to wait a couple of days for the merchant to get back to you. And when they do, you might need to then go back to the customer and ask them some more questions. Now you've got the inbound agent handling the initial query, going to the back office agent, which starts the dispute. And that might then need to reach out to the customer using the outbound agent to get some more information further down the line. And that's where you're going to see the full power of AI automation and customer support, we think. Yeah, I really like this. 
Okay, it's funny when you first introduced your company as an AI agent company, well, the skeptical part of you was like, everybody calls themselves the agent company. We're going to have to get into what does this actually mean? But I like that you've already described it as you have these three distinct agents that are actually coordinating often on a task. It's almost like I can imagine it as if you had a team of people that when a customer reaches out, you have a team that is responding to whatever that request is. I know I got introduced to your team from Lawrence at Incident.io, which was a previous episode. And they also, I think, are truly an agent company where they have a team of agent SREs that are working together to respond to an incident. And I think this is a new world for a lot of product teams, right? We hear 2025 is the year of agents, maybe if you're on the bleeding edge, maybe 2026 is going to be the year of agents for the rest of us. So I want to get into, tell me about, before we get into the products themselves, give me a sense for Gradient Labs. Is this a startup? How big are you? How long have you been around? Did you start out AI native? I joined about a year and a half ago and we were still seed stage then. There were the three founders, so Dimitri, Danai, and Neil, and a couple of founding engineers at that point. And so yeah, fairly early stage. We raised our series A earlier this year, but we are very much AI native. So I think the kind of driving force for starting the company was seeing how, I think it was ChatGPT that Dimitri saw and saw how it could actually reason and answer questions and just thought that this is going to change the world. And it already has, and I think it will continue to change the world. So yeah, I think that was the driving force for deciding to do something in this space. And I think we've been very much like AI native from the start. Okay, so there was no old product that had to be evolved to AI agent. 
This very much was AI, let's say AI company. We can get into whether it was an AI agent company from the start. And then did the two of you work on all three of those agents or do you focus on one in particular? So the way we operate is we spin up these small teams or strike teams, I guess you could call them. I think I've heard that terminology used a few times recently. So whatever is like the highest priority in the company right now, we'll put whoever is needed on that project. And so we're quite flexible about moving people around. So I think Jack probably has worked across maybe all three pillars now because he's one of only two product engineers at the company. So he probably does get to move around a few different places and work on a few of these different projects. I myself have worked on mainly the inbound agent and a little bit on the outbound agent so far and a little bit of the back office, but not too much yet. Okay. And then tell me a little bit about how, like maybe let's pick one of these agents to dig into. And let's start with, I wanna get a sense for like, how did you identify a customer problem? How did you get into, we actually think an agent can contribute to solving this problem. Give me a little bit of the backstory for one of these agents. Yeah, I think maybe a good one to talk about would be the outbound agent, which I think initially we started off addressing the inbound side because I think that was the easiest thing to pitch to customers and the thing that would kind of give the quickest results. And I think we thought that the next logical step would be back office because we thought we could handle inbound customer queries and then sell to our existing customers that we can also handle these back office issues as well. But we, as part of that, realized there was this kind of third missing piece of outbound conversations.
We actually got a lot more signal from customers on that as something that would be a kind of easier step to then being able to handle the back office stuff. So like, I think Ibrahim gave a really good example about dispute management where we can do the kind of full flow, but there's also things like if you need, for example, customers to update their personal information, something like an outbound conversation is a really good way of kind of handling those like mass reach out campaigns that lots of fintechs especially need to do. So we had quite a strong signal from customers that was like something that was missing and that we're not sure competitors in our space are addressing. So yeah, I think we decided to focus on that primarily. So yeah, I think there we're building that out and finding like how customers want to work with it now. So we've had a couple episodes where we've gotten into customer service agents. I did an episode with this company, Neeple. They do kind of company agents. They started with the Neeple would respond to a support ticket. They had a workflow of a ticket comes in, we got to grab a bunch of stuff. They're going to suggest a response. Eventually they were able to automate responses. And that use case seems very clear to me. Like I know Intercom with Fin has done a lot in that space. A lot of support tickets follow a pretty standard almost workflow. Like you don't even, maybe you need an agent for what pipeline to send it to. But a lot of the like, how do you respond to this is pretty, let's say defined for lack of a better word. But I know in having those conversations very quickly the agent gets stuck because there's an action that needs to happen in order to respond to the support ticket. It sounds like that's something that could be part of your back office agent where the back office agent can go and take that action.
And then maybe if the action that's required is go get more information from the customer, that's where your outbound agent would come in. Is that correct? Okay, missing some nods. Yeah, yeah. So that's exactly right. Like customers get in touch and it's relatively straightforward to handle kind of question answering with the existing tools that are there. But when customers actually want to action something, so say a customer gets in touch, their card's been stolen as an example we often use and they need to freeze their card. So our agent can actually then hit an API endpoint if the customer wants them to do that and freeze that customer's card. So it can interact with our customer's backend systems and actually take actions on behalf of the customer. But yeah, the real power comes when it's, say a customer reports like a fraudulent transaction, then our agent can say, okay, great. Like firstly, let's freeze your card. So it will go and hit the API endpoint to freeze that customer's card, but then it will kick off a potential like an outbound conversation or a back office process to do whatever needs to be done to register that fraudulent transaction and potentially open the dispute with the vendor or whatever it is. So yeah, and this is when I mentioned the iceberg. It's like the inbound part is the tip of the iceberg and we're trying to now start addressing the kind of big chunky stuff that sits underneath that. Yeah, I really like that iceberg analogy. Were you able to uncover these use cases from the inbound? Like you started with the inbound agent and you saw that there was cases that like the inbound agent couldn't fully satisfy because it required some of this outbound communication? I think the first instance was realizing that even just the inbound agent had certain limitations. Like you mentioned tool calls there and being able to take actions on the customer's account. Sometimes that would happen on inbound as well, purely. 
A customer might get in touch and say, my card has been stolen. That's quite a common one for banks. And usually a human, if you're talking to a human agent, they'd be able to like freeze your card, look at your address, confirm that your address is still the correct one and then order you a replacement card. But if you're talking to an AI agent that's able to just do question answering, then all they will be able to do is maybe read some articles somewhere and say, can you freeze your card? And can you order a card replacement? If the customer says my phone's been stolen as well, I can't log into the app. Then you run into a bit of a pickle because you can't actually help them there. So I think one of the core bits that's quite central to our platform and our agent across outbound, inbound and back office as well is this concept of procedures, which are basically like natural language instructions. So when you look at them in the web app, they just look like a Notion document. So anyone can write them out. And they're basically just telling the agent, what are the steps that you need to follow to resolve a particular type of problem for the customer? And the key thing with procedures is that's where we can get our customers or allow our customers to add in tool calls for tool executions. So they might have a procedure for handling a card replacement, where the first step is figure out why the customer needs a card replacement. Is their card lost, stolen, expired? Then based on that, step two might be if it's stolen, you need to freeze their card. Use this tool to freeze their card. And then step three will be, okay, order them the card replacement. But giving our customers the ability to write these instructions in natural language helps speed them up massively because one thing we found very early on was the knowledge for knowing what the agent should do in these cases sits in the head of subject matter experts who are often not technical.
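As an illustration of the card replacement procedure just described, the sketch below models a procedure as ordered natural-language steps, some of which name a tool. This is only one way such a structure might look, not Gradient Labs' actual format; the `Step` and `Procedure` classes, the tool names, and the `missing_tools` check are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    instruction: str            # plain language, as a subject matter expert would write it
    tool: Optional[str] = None  # name of a tool the agent may call on this step


@dataclass
class Procedure:
    name: str
    steps: list[Step] = field(default_factory=list)

    def tools_used(self) -> set[str]:
        return {s.tool for s in self.steps if s.tool}


# Hypothetical tools standing in for calls to the customer's backend systems.
def freeze_card(customer_id: str) -> str:
    return f"card frozen for {customer_id}"


def order_replacement_card(customer_id: str) -> str:
    return f"replacement ordered for {customer_id}"


TOOLS: dict[str, Callable[[str], str]] = {
    "freeze_card": freeze_card,
    "order_replacement_card": order_replacement_card,
}

CARD_REPLACEMENT = Procedure(
    name="card_replacement",
    steps=[
        Step("Find out why the customer needs a replacement: lost, stolen, or expired."),
        Step("If the card was stolen, freeze it.", tool="freeze_card"),
        Step("Order the replacement card.", tool="order_replacement_card"),
    ],
)


def missing_tools(procedure: Procedure, registry: dict) -> set[str]:
    """Tools a procedure references that are not registered (a simple publish-time check)."""
    return procedure.tools_used() - set(registry)
```

Keeping the instructions as plain text and the tools as named references means the expert writes the steps while engineers only need to register the tools.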
And so if you create a lot of barriers for them to be able to transfer that knowledge into a written procedure form for the agent to follow, then you're gonna create more roadblocks to getting that into production, but then also you're gonna lose some of that knowledge along the way through translation errors if someone's having to pass that along to an engineer and then the engineer is having to codify that in the agent somehow. So that was quite an early decision that we made, quite central to all three components that we've talked about so far. Yeah, I love this. This is actually the second time this is coming up, this idea of like your customers have their own business logic and their own workflows that in order for them to use your product, they have to be able to define, but they're not technical, right? And so as technical people, we're used to using n8n or other workflow kind of design products, but our average person who's not a knowledge worker or not necessarily a knowledge worker or is a knowledge worker, but not technical enough to do the workflow product can just describe their process like they would to another human. I think that's really powerful. Okay. Like most customers already had these procedures defined in documentation and actually translating them into a product was quite straightforward because of that. So yeah, it felt like a very natural path to be like, this is how you train humans to do work. So this is now how you train our agent to do the same work. Okay, yeah, I love that. Okay, and then with Outbound in particular, I feel like I can imagine, and again, push back if I'm getting this wrong, because I'm gonna make some guesses about finance companies.
I'm gonna guess there's a lot of know your customer requirements and especially with onboarding a new customer or when something changes with a customer, whether they move or whatnot, or you detect fraud and you're trying to verify a transaction, that there's actually a lot of things that happen where it didn't start with the customer reached out to you or maybe they did reach out to you, but then some process has to happen and that triggers a step where you have to now reach out. Tell me a little bit about, if you're an agent, where does it get triggered? What's the beginning of the process for the Outbound agent? So the triggers can be defined by our customers. Again, so they'll know what, for example, needs to trigger maybe a KYC check or prompting the customer to update their details. Then they'll maybe have their own sync where they need to do that prompt once every six months, every one year, like regulatory checks. And so we allow them to define that in specific to their own use case. And then from there, the concept is quite similar to what we just described where you would have a procedure for an Outbound conversation. So if what you need to do is ask the customer to update their details, there'll be a procedure for getting the customer to update their details. And again, the company or some subject matter expert can hopefully, like Jack mentioned, from preexisting documents, paste or create this procedure for the agent to be able to follow to deal with that particular scenario. So when that procedure is triggered, we're then able to go away and get the agent to follow those steps. One of the main differences between an Outbound procedure and an Inbound procedure that we ran into, or one of our AI engineers who was working on this quite heavily ran into was for an Inbound conversation, it's quite easy to figure out once you've reached a natural conclusion point because the customer is the one that's gotten in touch with you. 
So once they're happy and satisfied with the outcome, usually they'll say something like, thank you very much, or they'll just sometimes not even reply. So they'll ghost you. And in those cases, you can generally assume that you've closed the conversation ticket off or they don't need any further help. But with Outbound, it's the company or our customer who is reaching out to their end user. And so deciding when you're done is a bit trickier because you can't rely on the customer telling you when they're done. It's on you to figure out when you've reached a suitable point for you to consider that procedure to have been completed successfully. And then you run into a few different scenarios as well, where maybe you've completed the procedure successfully, but you haven't reached a successful outcome. For example, you might get to the end of a procedure, but the customer hasn't done the thing that we wanted them to do. The aim of the procedure has not been achieved. So all these kind of like additional things that we need to think about with Outbound, which weren't there with the Inbound conversations. Okay, I love this. And it raises some questions right away around, I know one of the challenges with agents in the true agentic, I'm in a loop, I'm making tool calls, I get to judge when I'm done with my task, is if it's not always clear, if the first challenge was how do we define done? I can imagine this was particularly hard for an agent to also decide when it's done. So tell me a little bit about how did you uncover this problem? How do you iterate to identifying what done even looked like in these cases? Yeah, I think we realized it just from testing out the agent very quickly, because we got to a point where the agent would prematurely conclude the conversation and say, I'm finished, I'm done. And obviously from our manual testing, we could quite clearly see that it was in fact not done. There was still things that needed to do. 
And so that kind of led us down this path of, okay, we almost need to think about what is the, at least for Outbound, what is the criteria for being done? And then the way we architect our agent as well is that we modularize different components of it. So we talked about the procedure there, and the part of the agent that follows the procedure is only one component. There are other components that run outside of it as well. So one of those components, for example, is guardrails, which make sure that the agent doesn't say anything that would breach potentially regulatory kind of guidelines, give financial advice or make financial promises. There's a number of other, what we refer to as skills in our agent. And you can have a skill, for example, to check if the conversation, or in this case, a procedure has truly reached or achieved its goal. And through those, we have the ability to override what the procedure agent has said. So if the procedure agent has said, I'm done, but then a different part or a check comes in and says, actually, you're not done, we have the ability to go straight and go back into the procedure agent or follow a different path that kind of tells it to go back and retry the conversation. Yeah, this is great. Okay, I wanna make sure I understand this. So you have a primary orchestrator agent that's trying to run through the procedure. And then you have other processes. Are they independent agents? Or are they skills that orchestrator agent can call? Tell me a little bit about what's the back and forth there. Yeah, we refer to them internally as skills. So it's central across all of our, I guess if you wanna call them the three separate agents, the inbound, back office and outbound. So they all operate in a very similar way. And the orchestrator is actually something we refer to as the state machine. And usually it's one kind of long running workflow. And it's not responsible for doing any AI or agentic stuff.
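The override described above, where a separate check can overrule the procedure agent's claim that it is done, could be sketched roughly as follows. The check here is a deterministic stand-in (all required answers present); in the system discussed it is a skill whose internals the episode does not detail. `OutboundState`, `goal_achieved`, and `next_action` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class OutboundState:
    required_answers: tuple[str, ...]  # questions the procedure must get answered
    answers: dict[str, str]            # what the customer has provided so far
    procedure_agent_says_done: bool    # the procedure-following skill's own verdict


def goal_achieved(state: OutboundState) -> bool:
    """Stand-in completion check: every required question has a non-empty answer."""
    return all(state.answers.get(q, "").strip() for q in state.required_answers)


def next_action(state: OutboundState) -> str:
    """Let the completion check override a premature 'done' from the procedure agent."""
    if state.procedure_agent_says_done and not goal_achieved(state):
        return "resume_procedure"
    if state.procedure_agent_says_done:
        return "close_conversation"
    return "continue_procedure"
```

For example, if the procedure agent reports done after only two of three dispute questions were answered, `next_action` returns `"resume_procedure"` rather than closing the conversation.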
Its job is purely to orchestrate the whole conversation and manage the state and history of the conversation. And what that orchestrator will do is it will trigger, again, what we refer to internally as turns. Okay. And so turns are usually units of work where we need to do something. So we need to decide what to do next to progress this conversation forward. And turns are usually triggered by either a customer message. So them coming back and saying something to us, a tool call returning result, which again means, okay, we need to go away and do something with that result. Or the third one is customer silence. So if they've ghosted us, we trigger another turn as well and try and nudge the customer forward. And inside those turns, that's where we run the main logic of our agent. And that's composed of these individual skills that I referred to. So one of those skills is the procedure following agent. Another skill would be the guardrails. But then we have a number of these other kind of skills that are building up the logic and reasoning capabilities of these agents ultimately to figure out what do we need to do next? And what do we need to do next could be send a message to the customer, execute a tool, or potentially, depending on one of these steps, returning a particular result, we might need to hand the conversation off to a human because the agent is not capable of dealing with it. Okay, I wanna make sure I'm visualizing this correctly. So you have this, you described it as a workflow. So you have a orchestrator workflow that's just orchestrating turns, it sounds like. A turn could be a number of things. But I think the way you described it, a turn could be an agent call? A turn usually is invoking the agent because the agent is what's needed to decide what happens next. So within a turn, that's where the agent logic will get executed. So we will run through these different skills, which are all sub-workflows as well. 
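A rough sketch of the split described here: a deterministic orchestrator that only records state and starts turns, with the agent logic plugged in as a callable. The three triggers and three outcomes come from the conversation; the class names, the `stub_agent` keyword rules, and the history format are assumptions for illustration.

```python
from enum import Enum
from typing import Callable


class Trigger(Enum):
    CUSTOMER_MESSAGE = "customer_message"
    TOOL_RESULT = "tool_result"
    CUSTOMER_SILENCE = "customer_silence"


class Action(Enum):
    SEND_MESSAGE = "send_message"
    EXECUTE_TOOL = "execute_tool"
    HAND_OFF = "hand_off_to_human"


# The agent is whatever decides the next action on a turn (skills, LLM calls, ...).
Agent = Callable[[Trigger, str, list], Action]


class Orchestrator:
    """Deterministic: keeps history and triggers turns, does no reasoning itself."""

    def __init__(self, agent: Agent):
        self.agent = agent
        self.history: list[tuple[str, str]] = []

    def on_event(self, trigger: Trigger, payload: str = "") -> Action:
        self.history.append((trigger.value, payload))
        action = self.agent(trigger, payload, self.history)  # one turn
        self.history.append(("action", action.value))
        return action


def stub_agent(trigger: Trigger, payload: str, history: list) -> Action:
    """Keyword stand-in for the real skills, for illustration only."""
    text = payload.lower()
    if trigger is Trigger.CUSTOMER_MESSAGE and "complaint" in text:
        return Action.HAND_OFF
    if trigger is Trigger.CUSTOMER_MESSAGE and "freeze" in text:
        return Action.EXECUTE_TOOL
    return Action.SEND_MESSAGE  # reply, relay a tool result, or nudge after silence
```

Because the orchestrator holds the history and the agent is stateless per turn, a conversation can pause for days between events without any reasoning component staying alive.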
That enables us to call things in parallel and sequence and just speed things up. Not having everything run in sequence means that we can, for example, I mean, in the case of guardrails, we've got maybe a number of guardrails that need to run on the customer's message. They don't rely on one another. We can just trigger them all in parallel and then wait for their result when they come back. Okay, maybe let's talk through an example. Like what's something that would trigger the beginning of an outbound agent task? So the beginning of an outbound task, that would be triggered by a customer-defined trigger being hit. So in the case of outbound, that would then trigger a turn for the agent. Usually in this case, with something like this is a new outbound conversation being started, so conversation start trigger. In outbound, that's a special case because the agent then actually needs to greet the customer and tell them what is this conversation about because we're the ones reaching out to them. So in this case, there's a special trigger only for outbound and the agent will come up with some sort of greeting and usually the first step of the procedure explain to them why we're reaching out to the customer and what we need them to do. So that's an example of one turn that might get triggered in outbound. But then when the customer replies or hopefully replies to that message, that will then trigger another turn in the agent, this time with a different trigger, the trigger being customer message that now tells the agent to follow a different path to figure out how to respond and what the next steps are to respond to that customer message. Ah, one thing I'm realizing is this hack, because it's email, this can happen over a pretty long period of time. So it's not, I have this mental model in my head of agent running in its loop waiting for the next thing. And I think about that like in a conversation turn-based, but this is really spread out over time. 
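The point made earlier in this exchange about running independent guardrails in parallel can be sketched with Python's `asyncio.gather`. The individual checks below are keyword stand-ins for what would be LLM classification calls; their names and the `GUARDRAILS` registry are hypothetical.

```python
import asyncio


async def check_prompt_injection(message: str) -> bool:
    await asyncio.sleep(0)  # stands in for a network call to a classifier
    return "ignore previous instructions" in message.lower()


async def check_complaint(message: str) -> bool:
    await asyncio.sleep(0)
    return "complaint" in message.lower()


async def check_vulnerability(message: str) -> bool:
    await asyncio.sleep(0)
    return "lost my job" in message.lower()


GUARDRAILS = {
    "prompt_injection": check_prompt_injection,
    "complaint": check_complaint,
    "vulnerability": check_vulnerability,
}


async def run_guardrails(message: str) -> dict[str, bool]:
    """Start every guardrail at once, then wait for all of the results."""
    names = list(GUARDRAILS)
    results = await asyncio.gather(*(GUARDRAILS[n](message) for n in names))
    return dict(zip(names, results))


def flagged(message: str) -> list[str]:
    """Synchronous helper: names of the guardrails that fired, in registry order."""
    results = asyncio.run(run_guardrails(message))
    return [name for name, hit in results.items() if hit]
```

Since the checks do not depend on one another, total latency is roughly that of the slowest check rather than the sum of all of them.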
And I can start to see why you have this orchestrator, you talked about it as you're maintaining state. So you have this like longer arc of, here's the outbound task or whatever the right language you're using for that. And these are the turns we've taken, and this is our current state. And we're trying to evaluate what do we do to get to a state where we've hit that success goal. I can imagine, I can imagine one of the hard parts about this is, especially for your outbound agent, you're starting with a customer procedure. And what are you doing to help your customer? I can imagine that procedure has to have a clear definition. to onboard them and help write those procedures. And as we learn like what makes a good procedure and like what kind of stumbling blocks customers hit along the way, we try and address those and productize that so that the next time a customer comes along, we can be a little bit more hands off with them if we need to be, when they can go and do these things themselves. So yeah, I think from a high level product perspective, that's the kind of approach we take to guide our users to getting to that point. And more specifically around how do we define what like done is in an outbound procedure. Ibrahim, maybe you're a better place to answer that. Yeah, I think it depends on the customer themselves as well, understanding what the goal of that outbound procedure is. So again, to take an example, if you need an outbound procedure to get the customer to update their KYC details, the ultimate end goal of that outbound procedure is to actually have the customer go into the app and put in some new information, or either that would just tick the box saying, yeah, these details are still up to date. So sometimes that information is quite deterministic. It might sit on the company side, but they can measure it in some way using a field somewhere in their own databases. Sometimes it can be a bit more vague. 
So it might be, we are going to request some information from a customer and they need to provide the full answers to it. So if we go back to the dispute example, where maybe we've started an outbound conversation because we need answers to three very specific questions to help progress forward a merchant dispute that the customer has opened. The customer can reply, but their replies might not actually answer the question. And so the goal of the procedure is to actually get the answers that we need to progress this dispute forward. And so it is working with our customers, like Jack said, the AI delivery team that we've got, they're quite close or sit quite close to working with our customers, and they'll figure out what the definition of done is for a procedure, what the goal is of the procedure to start with, and then how we can codify that into the procedure, a definition of we have reached our end state. And one thing to touch on as well, I think you mentioned that difficulty of getting this up off the ground. Another thing we leverage sometimes with some of our customers is historic conversations. So if they've got examples of previous outbound conversations of this particular type that their human agents have triggered and completed end-to-end with customers, we can learn from that as well. So we have a process where we can feed those historic conversations and bootstrap some procedural instructions for the customer. And oftentimes that does lead to a bit of an iterative loop where you get some bootstrapped instructions, and then you make some tweaks with a human expert or subject matter expert, but it's like it prevents that blank page problem where you're staring at an empty page and you don't know where to start. So if you're able to leverage those historical conversations to get you away from that blank page, that can often really help accelerate getting that procedure into production. 
Yeah, I can see, first of all, I can see the benefit of focusing on fintech. You have like use cases that are probably common across many customers, disputes, KYC requirements, whatever else, fraud. And so you almost can templatize or suggest to your customers, these are the things that you would use the outbound agent for. And then I also love that you're using their own kind of conversation history to scaffold that procedure. I imagine too with time, like you're going to learn across all your customers what makes a good dispute procedure, what makes a good KYC procedure, and you can bake that into the product. Okay, let's take this example of, let's say you have the outbound agent gathering information for a dispute. And Ibrahim, like you said, there's three questions that need to be answered. So we kick off this orchestrator agent. It starts with this greeting message. To further pursue your dispute, we need to get answers to three questions. When the agent is generating that message, it sounds like even just from step one, there's some guardrails. I know finance, there's a lot of regulations around customer communication. Tell me a little bit about like in a turn, what are the types of things that are happening? And maybe let's do it in the context of this first. You have to reach out to a customer to get dispute information. Yeah, I think it would take a long time to go through all the stuff that's happening in a particular turn. I think the main headlines are, I think the guardrails that you mentioned there, and we can split those apart into two types of guardrails mainly. One is guardrailing on the customer's messages. So what they've said. So the most typical one you'll see when people try and jailbreak an agent is like the classic prompt injection. So making sure that the customer is not saying something that would potentially break the agent. But then also in fintech, you have a number of other scenarios that you have to be mindful of. 
So things like financial difficulties, customer vulnerability, complaints, which are regulated here in the UK, if you're a financial institution. So these are really important as well. And they can pop up at any time. So you could be asking a customer in an outbound procedure for some information on that merchant dispute. And then they come back with an answer saying, I'm really unhappy with how you've handled this. And I want to make a complaint. You need to be able to handle that appropriately in the middle of an outbound procedure that is potentially about something very different. So that's where those guardrails come in. They're really important. And the other type of guardrail is on the agent's answer. So before the agent actually delivers this answer, we get this concept of a draft, like it wants to say this to the customer. But then there are also a number of regulatory, but then also sometimes just from a company's perspective, there are things they don't want the agent to say. So they don't want the agent to ever make things like unsubstantiated financial promises. So in an outbound procedure, saying something like, oh, once you provide this information, we'll 100% make sure that you get refunded your money. That's a very dangerous thing to say. So those guardrails prevent the agent from saying something like that. So it seems like there's two parts to this. There's whenever the agent says something to the customer, you have to make sure that what the agent is saying follows company policy, follows regulatory policy. And then also when the customer says something back, there's this check of, can the agent proceed? Has the customer said something back that needs to trigger a different workflow, or maybe even get pushed off to a human? I can see why you use the language turn, because each of these is really a, you have an agent turn, you have a human turn, you're evaluating, where do we go based on that turn? 
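The two checks summarized above, one on the customer's message and one on the agent's draft, could be sketched as a single turn function. This is an illustrative sketch under stated assumptions: `run_turn`, `TurnResult`, and the action names are hypothetical, and the keyword checks are toy stand-ins for classifiers that, per the conversation, are actually built as LLM prompts.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A check returns True when the text trips the guardrail.
Check = Callable[[str], bool]


@dataclass
class TurnResult:
    action: str                   # "send", "reroute", or "handoff"
    message: Optional[str]        # only set when action == "send"
    reason: Optional[str] = None  # name of the guardrail that fired


def run_turn(
    customer_message: str,
    draft_reply: Callable[[str], str],
    input_checks: dict[str, Check],
    output_checks: dict[str, Check],
) -> TurnResult:
    # 1. Guardrails on what the customer said (complaints, vulnerability, ...).
    for name, check in input_checks.items():
        if check(customer_message):
            return TurnResult("reroute", None, name)
    # 2. The agent produces a draft; nothing has been sent yet.
    draft = draft_reply(customer_message)
    # 3. Guardrails on the draft before it reaches the customer.
    for name, check in output_checks.items():
        if check(draft):
            return TurnResult("handoff", None, name)
    return TurnResult("send", draft)


# Toy keyword stand-ins, for illustration only.
input_checks = {"complaint": lambda t: "complaint" in t.lower()}
output_checks = {
    "unsubstantiated_financial_promise": lambda t: "100%" in t or "guarantee" in t.lower()
}

result = run_turn(
    "Here are the details you asked for.",
    lambda _msg: "Thanks, once we have this we'll 100% refund you.",
    input_checks,
    output_checks,
)
print(result.action, result.reason)
```

Because the draft is checked before delivery, a risky sentence can be withheld and the conversation passed to a human rather than corrected after the fact.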
Ibrahim, you said that a lot happens in a turn. I could dramatically oversimplify this and imagine, okay, I need to go get answers to these three questions in a dispute. Why can't I just send a message to the customer? We've learned a little bit about we have to have guardrails on what you say to the customer in that reach out. What else is happening in that turn? What else makes it complex? I think anytime you're dealing with natural language input from the customer in a conversation, they can choose to take that conversation in any direction that they want. And it might not be the direction that you or your procedure very cleanly lays out. So you could be talking about, yeah, we need answers to these three questions from a merchant to be able to handle your dispute. And they might answer all three of them and then say, oh, by the way, my card also expired. Can I get help sorting out a replacement? And now all of a sudden, that's not the job of the outbound agent. That's more of a, okay, you should get in touch through a different channel and contact customer support and they'll deal with that there. So you have to be able to handle these kinds of scenarios. And again, with natural language as well, you might get answers from customers, even in response to the procedure that are unclear or need clarification. So you have all these different potential paths that the agent can go down, which are not necessarily just let's carry on executing this procedure because that's the happy path. That's the path you would like the agent to go down, but it's not always possible. We've also seen cases where a lot of our customers, they have a very diverse user base. So they speak different languages as well. And sometimes they get in touch and they start speaking in a different language. So you need to be able to handle that as well. 
What do you do if a customer responds to your outbound conversation in a different language that the company is not configured to support? So there are all these different paths you can end up going down, which are not the happy path. And so in one of those agent turns, we're reasoning about what has the customer said and what is the next best thing for us to do. And sometimes that next best thing can't just be to go back into the procedure. Sometimes it might need to be, we need to clarify with the customer what they've said. We might need to inform them that we only support these languages in customer support. And if they can't speak one of those languages, we'll have to transfer them to a human. Okay. And this raises so many questions about, okay, you have an agent who's following a procedure and I've seen lots of funny stories about agents that are so goal-directed. They just want to reach their goal. And now you're telling me a customer can just throw in a wrench at any time and kind of skew the conversation in a different direction. How are you preventing the original agent from just forcing the issue on the procedure? So that comes back again to those individual turns. The procedure execution agent is one of those skills. So it doesn't encompass like the whole agent. Okay. So there is a step before that where the agent can reason about, is procedure execution the thing that I should be doing here? And if the answer is no, then we will never get to the procedure execution, if that makes sense. The procedure execution won't have that chance to make that, like you said, that greedy decision of I need to just move this forward and get to the end at all costs. Yeah. I can imagine this gets complex really quickly because there's a hierarchy of what, like how, what do you have in place to help the agent decide what is the priority given this turn? So again, that comes back to our different skills as well. 
So we have, there's a couple of different ways of doing this. So there's the not very cost-efficient way, which is you run them all in parallel and then reason about which one is the correct one to do. And then you basically got the results all sitting there ready and waiting for you. That's the, I guess, not very cost-efficient, but latency optimized version of running this. And then the other alternative is you try and reason through step-by-step. So you think, okay, I think the best thing to do on this particular turn is not to go into the procedure agent, but actually clarify the customer's latest message with them, for example. And then maybe you go into that particular path of logic in the agent, get to a certain point and the agent says, actually, I shouldn't be clarifying here either. And that's like a deeper part of the agent, a deeper skill, which has more context on when it should and shouldn't be clarifying. And so we have the ability for the agent to go into and back out of these different paths. So it can go into the clarify path, but then if it realizes, actually, I shouldn't be here, it has the ability to zoom back out and then go, okay, what was the next thing that I should check for doing? And the next thing might be the procedure agent. I, in my head, I'm just imagining like so much complexity here of you encounter a use case that something didn't work very well. And you're like, okay, we got to add a skill. And you add the skill. And I imagine every skill you add adds like things that helped with and things that didn't help with things that made worse. So tell me a little bit about, and if I understand right, all three of your agents use the same architecture. So are there skills specific to each agent? Are they sharing skills? Because I can imagine if they're sharing skills, now you also have to think across all three use cases. Like how are you managing that complexity? 
Yeah, I think the answer is some skills are shared, but then some are quite specifically scoped to a particular type of agent. So the most obvious example we've run into recently is with our voice agent, where a lot of the skills were basically functionally quite similar. We needed to achieve the same outcomes. But the latency requirements were just vastly different. And the way we needed to architect those prompts and the way we execute those prompts was just very different. And so we create a different version of those same skills specifically optimized for the voice agent and for latency. And it's the same for the other types of procedural agents as well. If there is a particular skill that is only relevant for the outbound agent, for example, to give you that example, the skill that determines whether a procedure or an outbound procedure has been completed is only relevant for outbound procedures. Because like I mentioned before, for inbound conversations, we get that signal from the customer. So that skill is almost irrelevant there. But then there are some things that we can share equally. So for example, for text conversations, guardrails, we can mostly share because the input is broadly similar. We're looking for the same patterns. And so we are able to leverage and not have code duplication there. But again, it's just a case of being able to understand when we can use shared skills across different components of the agent. And maybe we've just gotten used to it. But I think our code base is architected quite well in that it doesn't feel like it's too overwhelming. It's quite nicely structured and where you're easily able to find and spot the relevant skills that we might need to edit or modify or add to over time. I imagine a lot of it has to do with your actual skill design and making sure that each skill is distinct and encapsulated. Is that true? 
Tell me a little bit about how do you think about a new skill versus augmenting a skill versus skill overlap? I can imagine a lot of conversation goes into that. I wouldn't say too much conversation. I think most of the time we kind of bias towards moving faster. So I think generally speaking, we have a fairly good sense now across the team of when to add to a skill, when to create a new skill. Usually, at least my kind of mental model is the skill will have a fairly well-defined name. And if the thing that you're trying to add to it doesn't really fit, it's like a function in traditional programming as well. If you can encapsulate the thing you're trying to add as being part of that function, you can add it to that function. But then if it's going to add new functionality or do something very different, then it's probably best to create a new separate function and put that logic in there. I think the same thing is true for skills as well, at least in our code base. Is the agent deciding what skills to use or is it a little more deterministic where on a turn you're always running certain skills or is it a mix of both? It's somewhat deterministic in that the agent will have access to a particular set of skills on a particular turn based on the context that it's running under. So again, an outbound procedure running on the first turn will have access to a very specific set of skills, usually probably only a greeting skill because that's all you can do on the first turn in an outbound procedure. But then when you're running an outbound procedure on the next turn where the trigger is a customer message, it'll be a different set of skills that in this tree-like path will get traversed and the agent will go through. Okay, so you are limiting the skills available based on what you know about the turn? Correct, yeah, and the agent as well. 
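To make the scoping idea concrete, here is a minimal sketch of skills selected deterministically from the agent type and the turn trigger, with each skill able to back out so the next one gets a chance. Everything here is assumed for illustration: the `SKILLS_BY_CONTEXT` table, the skill names, the word-count rule inside `clarify`, and the final handoff fallback are hypothetical, and in a real system the decision inside each skill would be made by a model rather than a simple rule.

```python
from typing import Callable, Optional

# A skill returns a reply, or None to signal "this is not my turn, back out".
Skill = Callable[[str], Optional[str]]


def greeting(_message: str) -> Optional[str]:
    return "Hi, we need a few details to progress your dispute."


def clarify(message: str) -> Optional[str]:
    # Stand-in rule: very short replies are treated as unclear.
    if len(message.split()) < 3:
        return "Could you tell us a bit more about that?"
    return None  # nothing to clarify, zoom back out


def execute_procedure(_message: str) -> Optional[str]:
    return "Thanks, moving on to the next step."


# The code, not the model, decides which skills exist in a given context.
SKILLS_BY_CONTEXT: dict[tuple[str, str], list[Skill]] = {
    ("outbound", "first_turn"): [greeting],
    ("outbound", "customer_message"): [clarify, execute_procedure],
}


def run_skills(agent: str, trigger: str, message: str) -> tuple[str, str]:
    """Walk the allowed skills in order; the first one that answers wins."""
    for skill in SKILLS_BY_CONTEXT.get((agent, trigger), []):
        reply = skill(message)
        if reply is not None:
            return skill.__name__, reply
    # Assumed fallback when no scoped skill applies.
    return "handoff", "Let me transfer you to a colleague."


print(run_skills("outbound", "first_turn", ""))
print(run_skills("outbound", "customer_message", "ok"))
print(run_skills("outbound", "customer_message", "The merchant never shipped my order"))
```

Since the procedure skill is just one entry in the list, it only runs when the earlier skills decline, which is one way to stop it from pushing the procedure forward regardless of what the customer said.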
Okay, because I could imagine if you're just adding skills whenever you need them, you could very quickly overwhelm that agent with just options. Tell me a little bit, is that purely deterministic if the turn has these attributes to get these skills or is there an agent decision in there? What does that look like? No, it's purely deterministic. So we make that kind of call in the code. We know which skills should be available to an agent at any particular given context. And that helps make it a little bit more deterministic as well in the world of LLM non-determinism. Yeah. And also a little bit safer as well because, for example, there are certain skills where it might not make sense for the agent to have access to them in a particular context. It might not be safe for the agent to have access to them under a specific context. And so by scoping very specifically what the agent's path can look like in a particular turn, we at least limit a little bit the potential routes it can take. And it can't just go off into its own crazy direction in 0.01% of these edge cases where it's just randomly decided to go down and use a skill that it was never meant to use. Okay. And then I am really curious about, we talked about guardrails. I can imagine you have guardrails that are universal across all your customers. Maybe they're regulatory, maybe they're company policy. But I also can imagine guardrails that are specific to your customers. Maybe that's part of their procedure definition. I'm not sure. Tell me a little bit about, especially with customer support, you have to have a knowledge of a customer's policy and what they can respond with. How does that part work? So we have a few different levers that customers can pull. So some of that is written into the procedure itself. You can give the contextual information at the start of the procedure and say, if the customer fits this kind of scenario, then follow these steps, but otherwise you need to do something else. 
Then we are able to actually toggle guardrails on and off for specific customers if we want to. Certain customers are less interested. Fintechs who maybe aren't in banking or other more regulated spaces don't need to worry so much about some of the stricter guardrails that we have. So we can actually switch those off and that will improve the performance and make the agent a little bit less restrictive in what it can say. And the other thing we have is we call it tone of voice, but you can actually write some instructions on how you want your agent to sound and your agent will then adopt that tone of voice throughout all conversations. So I guess that's less related to procedural outcomes, but it's more related to how your agent will actually deal with inquiries and respond to them. Okay. So it sounds like a lot of your end customer's business logic rules that the agent would have to follow is encapsulated in that procedure itself. Yeah. I think that's what we try to encourage. So we also have this concept of resources, which get sent along with the conversation. So as an example, if you reach out to customer support through an app for a product that you use, that app will know who you are because you're logged into the app and it will have other information about you, about your account and stuff. And that information gets passed to us and we make that available to the agent to inform its decision making as well. So that helps the agent kind of spot potential issues or even small things like we can see that a customer, if they're reporting their card has been stolen, we can see that their card isn't currently frozen and then we can make the tool call to freeze it. And then we can check that resource again and see that it is frozen and that kind of helps to inform the agent as well. Yeah. Okay. 
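The stolen-card example follows a read, act, re-read pattern: check the resource, make the tool call, then check the resource again instead of trusting the call. A minimal sketch of that pattern, assuming an in-memory stand-in for the customer's API (`FakeCardApi`, `CardResource`, and the return values are all hypothetical):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CardResource:
    card_id: str
    frozen: bool


class FakeCardApi:
    """In-memory stand-in for a customer's card API (hypothetical)."""

    def __init__(self) -> None:
        self._cards = {"card_1": CardResource("card_1", frozen=False)}

    def get_card(self, card_id: str) -> CardResource:
        return self._cards[card_id]

    def freeze_card(self, card_id: str) -> None:
        self._cards[card_id] = replace(self._cards[card_id], frozen=True)


def handle_stolen_card(api: FakeCardApi, card_id: str) -> str:
    # 1. Read the resource that arrived with the conversation.
    if api.get_card(card_id).frozen:
        return "already_frozen"
    # 2. Make the tool call.
    api.freeze_card(card_id)
    # 3. Re-read the resource to confirm the action actually took effect.
    if api.get_card(card_id).frozen:
        return "frozen"
    return "needs_human"


api = FakeCardApi()
print(handle_stolen_card(api, "card_1"))
print(handle_stolen_card(api, "card_1"))
```

The second read is what lets the agent tell the customer the card is frozen based on observed state rather than on the assumption that the tool call worked.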
So this brings up another challenge of it sounds like your agent is interacting with your customer's tools, whether that's freezing a card, whether that's looking at the status of an account. Is that something that like your customers are FinTech companies, so they clearly have some technology. Do they already have APIs for that? Are they, is this part of their onboarding? Are they having to build out those tools? Yeah, a good question. So this has been something we've been thinking about how we can smooth that process with customers. So lots of customers do have those APIs already because for internal support teams that were previously following these kind of internal procedures that they'd written, they would have a kind of back office or somewhere that they'd go and actually make those changes. So when humans were handling these queries, there were still tools available for them to go and act on the inquiry. For those customers that had those things already, it wasn't a huge shift to just expose those APIs to us. But we've also, we try and keep the barrier to entry as low as possible. So if you come in and write a procedure in our app, you can tell it to do whatever you want it to do and you don't need to have those APIs ready. So you can basically use like a placeholder tool that says at some point we'd like it to do this, but as we don't have that API right now, we'll put this placeholder in and then you can test your agent. You can test the procedure, chat through, talk through edge cases, hit it with some weird queries or unexpected things and see how it handles them. And you can iterate on your procedure until you're happy with it without actually needing, without being blocked by not having access to an API that does the thing you want it to do. So yeah, we've tried to keep that barrier to entry as low as possible. 
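One way to picture a placeholder tool is a tool definition whose handler is simply missing, so a procedure can reference it and be tested before any API exists. This is a sketch under assumptions: the `Tool` class, its fields, and the shape of the placeholder response are hypothetical and not taken from the product.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tool:
    name: str
    description: str
    # None means the customer has not connected a real API yet.
    handler: Optional[Callable[[dict], dict]] = None

    def call(self, args: dict) -> dict:
        if self.handler is None:
            # Placeholder: record the intent so the procedure can still be tested.
            return {"status": "placeholder", "tool": self.name, "args": args}
        return self.handler(args)


# While iterating on the procedure, no API is needed.
replace_card = Tool("replace_card", "At some point, order a replacement card.")
print(replace_card.call({"customer_id": "c_42"}))

# Later, the same tool name can be backed by a real handler.
replace_card.handler = lambda args: {"status": "ok", "ordered_for": args["customer_id"]}
print(replace_card.call({"customer_id": "c_42"}))
```

Because the procedure refers to the tool by name, swapping the placeholder for a real handler does not require rewriting the procedure text.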
But ultimately, one other thing we've done in that space as well, that's quite interesting, and actually Jack can talk about in a lot more detail as well, is this concept of a tool we've built called Ask a Human. And it was essentially built and designed for exactly the scenario you mentioned where maybe customers don't have APIs and they don't have the engineering time to go and build those APIs to enable the agent to call a tool to do X, Y, and Z. I think there's a couple of use cases. So one is to, again, lower that barrier to entry. And the other one is for things where you actually need some kind of human authorization. So it may be like a refund, something else that you want a human to check that everything looks right and click the button. So you can now add this tool to a procedure and it will basically get in touch with the company. So there'll be a Slack channel or you can log into our web app and see these tasks. And you can go in and you get a summary of the conversation and you see what the human is requesting, what the customer is requesting, and then you can approve or reject it. And we found that, yeah, for those two use cases, that's enabled another huge chunk of automation or automating the real time-consuming parts of handling those queries. There were some figures that we saw from customers. I can't remember them off the top of my head, but it's saving huge amounts of cost and time from their support teams just to have the small kind of snippet that you approve or reject rather than having to handle the whole conversation. You know what I love about this is, Jack, as you were describing this before Ibrahim jumped in, and you talked about placeholder tools, in my head, I almost was thinking about it as like it's a tool call to a human. And so I love that's where this went, right? 
Agents calling tools is such a nice interface from an agent to code, but it's also a really nice interface from an agent to human in the loop, which is, I don't know what that says about what our role is going to be in the future if we're just going to be agent tools, but very nice. I love that's how it ended up. Okay. I can imagine, especially in finance, we've talked about your guardrails. I want to get into how your guardrails work, because I can imagine you have some absolute lines you can't cross in the way that you communicate with customers. And this seems a guardrail that has to be rock solid. So I want to explore a little bit about how do those guardrails work? What are you doing? What's the technology behind them? Yeah, it's a good question. The guardrails are essentially, if you break it down, a classification type problem, but using natural language. So you have a scenario. So if we take the example of unsubstantiated financial promise, you have a conversation history, some things that have been said, and now you've got a next draft answer that your agent wants to send. And you basically are trying to guardrail against is this answer that the agent is about to say going to violate a particular guardrail. And in this case, the guardrail is unsubstantiated financial promise. So the way we structure it is as LLM prompts, and we edge cases. So you'll have to generate them manually. But if you've got those labels and you can run these evals, then it becomes not easy, but a lot easier and similar to how we used to evaluate ML classifiers back in the day. Because then it becomes a problem of, okay, run your change or your new prompt against some historic examples. And then you can compute your metrics like recall, precision, and flag rate. And then like you mentioned, for the guardrails where we have very low tolerance and we have to be absolutely rock solid, then we aim and shoot for very high recall and maybe accept slightly lower precision there. 
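The three metrics named above can be computed directly from human labels and guardrail decisions on the same historic examples. A minimal sketch, with a hypothetical function name and toy data; `labels` are the human ground truth and `flags` are what the guardrail prompt decided:

```python
def guardrail_metrics(labels: list[bool], flags: list[bool]) -> dict[str, float]:
    """labels[i]: a human says example i violates the guardrail.
    flags[i]: the guardrail flagged example i."""
    if len(labels) != len(flags):
        raise ValueError("labels and flags must have the same length")
    tp = sum(1 for l, f in zip(labels, flags) if l and f)
    fp = sum(1 for l, f in zip(labels, flags) if not l and f)
    fn = sum(1 for l, f in zip(labels, flags) if l and not f)
    return {
        # Of everything flagged, how much was a real violation?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of all real violations, how many were caught?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # How often does the guardrail fire at all?
        "flag_rate": sum(flags) / len(flags) if flags else 0.0,
    }


# Toy labeled set: two real violations, both caught, plus one false alarm.
labels = [True, True, False, False, False]
flags = [True, True, True, False, False]
print(guardrail_metrics(labels, flags))
```

On this toy set recall is 1.0 while precision is 2/3, which is the shape of trade-off described for low-tolerance guardrails: every violation is caught at the cost of some unnecessary flags.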
And again, we can make that trade off based on the particular guardrail as well. And how bad would it be if the agent violated that? Okay. So it sounds like, I love this example because you're hitting on a lot of things here that I think teams are still learning how to do. So we have, it sounds like a guardrail that's really an LLM as judge, but it's a binary classifier, yes or no. And you're evaluating that judge based on a dataset where you know the right answers. I think one of the challenges teams have with this approach is how are you curating that dataset and how are you keeping it current? Yeah, that's a good question. So to start with almost always it's best to leverage domain expertise. And so for us, that was some of our early customers who would label some examples for us, either test conversations or synthetic conversations. And they would give those labels, that data to us. And then over time, because some of us, our background is from that kind of FinTech world as well, we were able to leverage a little bit of our own knowledge of what is bad, what is not bad to generate some more of those examples. And then it becomes a tricky one, like you said, to keep them up to date. The good thing with some of these guardrails is that there isn't necessarily too much of a drift in the data that you see, because the definition of what is a financial promise is not going to change dramatically over time. And the same goes for many of the other guardrails, even the regulatory ones, the regulations don't change very often. When they do, then it becomes a bit more of a problem because you have to map whatever changes they've made to the wording of the regulations to your guardrails. But that doesn't tend to happen as often. The trickier part of it is getting a sample size of labels that you're happy with. And that takes time and potentially a little bit of what we refer to as an auto-evaluation process as well. 
So I guess it's like another LLM as a judge, where sometimes what we'll do is we'll run an LLM over a finished conversation, we refer to as our auto-eval agent, and we will try and get it to flag certain conversations for manual review that we think would be of interest for quality assurance purposes. So it's another way of us building into our platform that QA capability, but then it also feeds back into the labeling because usually those conversations that have been flagged by the auto-eval are more likely to give you a true positive label than just randomly sampling from thousands of conversations. There's so much I want to dig into here. One of the challenges I have with LLM as judge evals is they end up being really expensive over time, right? They're only as good as the data set you're testing against, or they're only as good as the human labels you're aligning them with. In your product, your guardrails don't feel optional. They feel like they need to work, and they need to work really well. And I can imagine in my head the complexity here could scale pretty quickly. Are we talking about four or five guardrails? Are we talking about dozens of guardrails? Because in my brain, if you have dozens of guardrails, how in the world are you managing this data set, these data sets? So let me pause there for that question, and then I want to get into your auto-eval. Yeah, it's an interesting one. I guess first thing quickly to call out as well is the guardrails, I think in a way you can maybe describe them as LLM as a judge, but I think the key distinction there is we don't take the output of a guardrail to be the ground truth. So it's a result that it's output, but we don't store it as a label. We use it in the conversation. So for example, if a guardrail output says like this answer from the agent is going to violate a guardrail, we might transfer that to a human and avoid sending that message to avoid potentially a bad situation. 
We log that in the database, but we don't consider that as a ground truth label. So where we get the ground truth labels are always currently at least from some sort of human manual sampling and manual review exercise. So I might go and review the true positives the guardrail has flagged in production and find that actually only 10% of them are truly true positives because sometimes it was a bit too heavy handed in wanting to hand a conversation off. So that helps us feed back and tune those guardrails over time as well. I think your second question was around how do you manage multiple guardrails as well, like the scale? And the answer to that is it depends on the guardrails. Some guardrails are just by default a lot simpler because they very rarely flag. So we don't worry about them as much and some of them are slightly lower stakes as well. But then there are a handful of guardrails which are quite critical. And like you said, there'll be some that are enabled across all customers and some that only, like Jack was talking about earlier on, certain customers will toggle on and off. So the core set of guardrails is like the main bulk of our label set that we kind of use to maintain as well. But then we have those decisions from production across all the guardrails and we can use those to manually review and then feed back and store those labels. And so for each of our guardrails, we do have data sets in our data warehouse, which we can use if we ever wanted to go back and fine tune them. Okay. Yeah. I think what you're reminding me of is I think when I first learned about evals, my brain immediately turned it into, wow, you have to do all this across your whole product and there's all this complexity. And then you get into it and reality sets in and you're like, actually, we care about these four metrics. We don't have to do a hundred things. We can focus on these four things. So there's this priority of you might have a lot of guardrails. 
Some of them are clearly more important than others. And that's where you're putting a lot of time and effort into curating those data sets, making sure they're staying rock solid. Monitoring is quite important there as well. If you've got a guardrail that has been pretty solid at flagging, let's say 0.1% of conversations and then suddenly one week you see it spike up to 1%, that's usually a good signal as well as, okay, something weird is happening there. We need to go and look at that. So we have that kind of tracking and internal monitoring as well, traditional metrics that track the health of these different guardrails. So we can see when something is deviating from the norm, if something looks off, that usually means we need to go and investigate it. So that's a good barometer as well for us to be able to figure out how we're going to dedicate our time. We're not going to go and look at every single guardrail every single week, but if one of them is spiking, then we will probably take a look at that. Yeah, that's a great point. Okay. So, and then now, and then it sounds like you have this auto eval that runs, and this was over the course of the whole conversation. So not on a turn, is that correct? Correct. Yeah. Okay. And then what is that auto eval looking for? Again, it's similar in a way to guardrails. There are certain failure patterns or customer experience, detrimental customer experience patterns that we know about, and we kind of want to flag for those conversations. And we do this once the conversation is finished. So we've got the full from hello all the way to goodbye, the full transcript or from hello to please transfer me to a human if it's gone that badly. So what we're able to then do is reason about a number of different things. So have we missed a particular guardrail? So one of those important guardrails that we really care about. 
So we almost get like a double check here of once the conversation is finished, let's scan it again and run it through to check for that guardrail. Things like excessive repetition on the agent's side, or if the customer is displaying negative sentiment, any sort of signals or indicators that the conversation has not gone well, and then that will trigger a review task. Again, in the web app, someone can go and review those conversations that have been flagged, and then provide more granular labels into, was this a false positive from the auto eval system, and actually this conversation is fine? Or actually, was a particular guardrail violated and not flagged? Or was this a bad customer experience because the agent was repeating itself too much? So those are just some examples, but it's always the manual review that gives us the true ground truth of was this good or bad. It's almost like your auto eval is sampling what needs human review for your then next round of error analysis of what needs improvement. Exactly. That's great. Okay, we are coming up on time. I want to make sure I have time to ask you, what's next for your multi-agent system? What I love about it is it sounds like you're really building a common agent architecture, and then you've got these three instances of types of agents. What's next up for you? Yeah, so I think exactly that. We focused on laying the groundwork, and recently as well, we've just shipped our voice agent, which we mentioned earlier. So now we can operate across basically any channel, like email, live chat, phone calls as well. And we also have the foundational pieces for inbound conversations, back office processing, and then outbound. And we see this world where we connect those three concepts and are basically able to automate the entire customer experience. So yeah, a customer reaches out because they've had a problem, it kicks up some sort of back office process, which triggers an outbound conversation to somebody else. 
That gets resolved. It goes back into the back office process, which says, okay, great, that's sorted. Go back to the customer and say, yeah, that's all done. And we can handle those really complex kind of multi-layered queries completely automatically without human review. So that's where we're headed. And I think we are focusing on tying all those bits together for our customers. All right, this has been amazing. I actually really love that you took a very product approach to agents. You didn't build an outbound agent and then an inbound agent and then a back office agent. You actually built an agent platform that is driving all three of those agents, which I feel like it was only possible by being a new company after this technology became available, which is pretty cool. And I'm excited to see where it goes. Thanks for spending some of your time with me today. I appreciate it. Yeah. Awesome. Thanks for having us. Yeah. Thanks so much. If you enjoyed this conversation, please subscribe in your favorite podcast app and give us a rating as it helps others find the show. Thanks. I appreciate it.