When Trust Is Everything: Building AI for Physicians at Healio

Overview

This episode features the Healio team: Jen (SVP Product Development), Casey Utley (Senior UX Designer), and Matt (VP Technology). They walk through the creation of Healio AI, the company’s first AI product. Healio AI is designed to help physicians handle information overload by providing fast, trustworthy, source-backed answers drawn from Healio’s clinical content and vetted third-party medical literature.

A central theme is building clinician trust: not merely generating correct-sounding answers, but making the system transparent, medically credible, and usable under real-world time pressure.

Key Takeaways

Trust is the product, not a feature. Physicians need answers they can verify quickly. Healio AI emphasizes citations, credible sources, and transparency to address skepticism about hallucinations and low-quality references.
User discovery reshaped core assumptions. In beta testing, clinicians asked fewer “diagnosis/treatment” prompts than expected and more “patient communication” prompts (e.g., explaining diagnoses empathetically). This pushed the team to adjust response tone and design for bedside communication needs.
RAG quality depends on curation, not just technology. Healio filters sources (including weeding out lower-quality journals) rather than “dumping everything” into retrieval and hoping the model sorts it out, which is especially important in medical contexts.
Hybrid retrieval is necessary for clinical usefulness. The team found that lexical search, vector/semantic search, and recency-weighting all matter—particularly for questions like “latest treatments,” where freshness is part of relevance.
Evals are directionally useful, but human review remains essential. Healio uses physician feedback heavily and is experimenting with “LLM-as-judge” across dimensions like safety, faithfulness, completeness, and clarity, without blindly trusting automated scoring.
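
The hybrid retrieval takeaway above can be made concrete with a small sketch. This is not Healio's implementation, which the episode does not detail; it is a toy score blend under stated assumptions. The `Doc` class, the weights, the two-year half-life, the `RECENCY_CUES` word list, and the hand-written two-dimensional "embeddings" are all hypothetical stand-ins. The point is only to show how a lexical score, a vector similarity, and a recency score might be combined, with recency weighted more heavily when the prompt implies "latest."

```python
import math
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical cue words suggesting the user wants fresh material.
RECENCY_CUES = {"latest", "newest", "recent", "new", "current"}


@dataclass
class Doc:
    doc_id: str
    text: str
    published: date
    embedding: list[float]  # toy vector; a real system would call an embedding model


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_score(query: str, doc: Doc) -> float:
    """Fraction of query terms that appear in the document text."""
    q = tokenize(query)
    return len(q & tokenize(doc.text)) / len(q) if q else 0.0


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recency_score(doc: Doc, today: date, half_life_years: float = 2.0) -> float:
    """Exponential decay: the score halves every `half_life_years`."""
    age_years = max((today - doc.published).days, 0) / 365.25
    return 0.5 ** (age_years / half_life_years)


def hybrid_rank(query: str, query_vec: list[float], docs: list[Doc], today: date) -> list[str]:
    """Return doc ids ordered by a weighted blend of the three scores."""
    wants_recent = bool(tokenize(query) & RECENCY_CUES)
    # Assumed weights: shift weight toward recency when the prompt asks for it.
    w_lex, w_vec, w_rec = (0.25, 0.25, 0.5) if wants_recent else (0.45, 0.45, 0.1)
    scored = [
        (
            w_lex * lexical_score(query, d)
            + w_vec * cosine(query_vec, d.embedding)
            + w_rec * recency_score(d, today),
            d.doc_id,
        )
        for d in docs
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored]


# Made-up documents for illustration only.
DOCS = [
    Doc("old", "Lung cancer treatments: a review of chemotherapy", date(2015, 1, 1), [0.9, 0.1]),
    Doc("new", "Immunotherapy results in lung cancer", date(2024, 6, 1), [0.8, 0.3]),
]
TODAY = date(2025, 1, 1)

print(hybrid_rank("lung cancer treatments", [1.0, 0.0], DOCS, TODAY))
print(hybrid_rank("latest lung cancer treatments", [1.0, 0.0], DOCS, TODAY))
```

With these made-up inputs, the older document wins on term and vector match alone, while adding "latest" to the query pushes the newer document to the top. A production system would more likely tune these weights against physician feedback than fix them by hand.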

Practical Steps

Start with attitude and workflow research before building. Run a survey or interviews to learn where users would apply AI (point-of-care, prep time, email follow-ups), then validate with moderated usability testing.
Prototype for both UX and model behavior early. Use a working prototype (not just mockups) so users can enter real prompts and you can observe unexpected use cases, preferred response formats, and trust requirements.
Design responses for “scan first, verify fast.” Provide bullet summaries and tables for speed, but include inline citations (e.g., numbered subscripts) plus an easy way to open full sources.
Implement safety and privacy guardrails upfront. Add PHI detection/masking before prompts reach downstream systems; block inappropriate requests and enforce HIPAA-aligned handling.
Build a continuous feedback loop. Add thumbs up/down on every answer, capture prompt + response + reason codes, review negative feedback weekly, and maintain an engaged clinician advisory group to validate improvements.
Tune retrieval for recency and document type. Adjust ranking when prompts imply “latest” information, and preserve verbatim text for guidelines where exact phrasing matters.
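
The "scan first, verify fast" step relies on numbered inline citations. In the transcript, Matt describes fixing the source list before calling the model, so the model only has to emit a number that the front end can wire to a hover or click. A minimal sketch of that idea follows. The `Source` class, the prompt wording, the `[n]` marker format, and the example URLs are illustrative assumptions, not Healio's actual code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    title: str
    url: str
    excerpt: str


def build_prompt(question: str, sources: list[Source]) -> str:
    """Number the retrieved sources up front so the model can cite them as [n]."""
    numbered = "\n".join(
        f"[{i}] {s.title}\n{s.excerpt}" for i, s in enumerate(sources, start=1)
    )
    return (
        "Answer using only the numbered sources below. "
        "After each claim, cite its source as [n].\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )


def resolve_citations(
    answer: str, sources: list[Source]
) -> tuple[dict[int, Source], list[int]]:
    """Map each [n] marker in the answer back to its source.

    Returns the cited sources keyed by number, plus any numbers that do not
    correspond to a supplied source (a sign the model invented a citation).
    """
    cited: dict[int, Source] = {}
    unknown: list[int] = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if 1 <= n <= len(sources):
            cited[n] = sources[n - 1]
        elif n not in unknown:
            unknown.append(n)
    return cited, unknown


# Made-up sources and answer for illustration only.
SOURCES = [
    Source("Guideline A", "https://example.org/a", "First-line therapy is ..."),
    Source("Trial B", "https://example.org/b", "In a randomized trial ..."),
    Source("Review C", "https://example.org/c", "Monitoring is advised ..."),
]
ANSWER = "Start with first-line therapy [1]. Monitor during treatment [3]. See also [7]."

cited, unknown = resolve_citations(ANSWER, SOURCES)
print(sorted(cited), unknown)
```

Because the numbering is decided before generation, the front end can render each valid marker as a subscript that opens the matching source, and any unmatched number can be dropped or flagged for review instead of being shown to a physician.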

Notable Quotes

Casey Utley: “We’ve really learned to not just treat accuracy… as a finish line, but trust… is built through transparency, tone adjustment and just the respect for our users’ time.”
Matt: “We do use a combination of [lexical, vector, and semantic search]… and… modify the search results to make sure that you’re pulling more recent articles.”
Jen: “We are just continuing to work on improvements… when that feedback is negative, what was the response, what was the query, and really digging into that.”

Full Transcript

Source: openai 49m runtime

Welcome to Just Now Possible with Teresa Torres. Hi, my name is Jen. I am Senior Vice President of Product Development at Helio. So I was the lead from a product management perspective on the product we're going to talk about today. I've been at Helio for over 20 years, but I've been in a product development role for just over five. Hi, I'm Casey Utley. I'm a Senior UX Designer at Helio. Along with Jen, I've helped lead the UX work at Helio, which includes creating prototypes, usability testing, and then just making sure we're solving real problems for our users. And I've been at Helio for about four years now. I'm our VP of Technology at Helio. I've been here for 15 years, straight out of college, straight to Helio. I've been here ever since, been loving it. For the Helio AI product, I was the primary developer. So despite my managerial title, I'm pretty hands-on with the work that we do here. Helio must be a great place to work with people with such long tenures. I love it. We are not the only ones. All right, well, tell me a little bit about what does Helio do? Yeah, so Helio is our website platform. We provide daily news update, information, education to healthcare providers across more than 20 medical specialties. So we have a news channel, there's a CME channel for continuing education, a product called Clinical Guidance offering reference information, and we also have an online community. Yeah, I remember, Jen, when you reached out, we went back and forth a little bit. What really resonated with me about what you were doing is just, I think in the medical space, and this is probably going to be a good segue into your AI product, it's so hard for practitioners to just keep up on what's current and what's happening. And I know as a patient, it can be really frustrating to be like, what do you mean you don't know about this? So I love just your problem space really resonated with me a lot. 
OK, let's talk a little bit about the AI product we're going to talk through. Give me some context and definitely give me the high level of like how you're using AI. So Helio AI is, we try to address the problem like you just mentioned, Teresa, where physicians are constantly having information overload while working under extreme time pressure. So they need to keep up with those guidelines and those new studies that come out daily and having those critical, making those critical decisions while providing quality patient care. So that's where Helio AI is coming in and that we are a one-stop shop and an AI tool where physicians can ask those questions and have that up-to-date information. And like I looked at your website, it looks like Helio has been around for quite some time. Obviously, from your intros, we've learned that. As a new site, as a CME provider, for people that don't know, CME is Continuing Medical Education. Tell me a little bit about, is this the first AI product for Helio? Is this just one of what's in your portfolio? Give me a sense of like where does this product fit in your company's portfolio? Yeah, it is the first AI product. Helio, the company's been around for 125 years. So we've been, we're always looking for ways to help get information out to healthcare professionals. What's the best way? And to Casey's point, what problems can we help them solve? We'll dig into problems and the problem space, I'm sure, in a little bit. But I think this was a case where there was this emerging technology. We were seeing these pain points for our healthcare professionals. And it just made sense to move in this direction with Helio AI. And were you already seeing clinicians using your services for this problem? Or was this an adjacent problem that you realized AI now makes possible? I think they're definitely, we're already using our services. 
And I think healthcare as a whole, just like all of us, are trying to figure out how AI can help us in our jobs and in our day to day. And our website is a place where physicians come to find out the latest information on clinical guidelines, FDA alerts, the latest news. And you mentioned education and other reference information. But they were already coming there and we know that they're looking for information. And we want to make it easier for them to find it. And if we think about when providers at the point of care, so when they're with a patient, they have a question, they need the answer quickly. But they need to make sure it's coming from a trustable, credible source. So we really felt like we could provide that for them with this product. Yeah, I guess what I'm trying to distinguish between is if I'm a physician, I probably have activities that I'm doing on an ongoing basis, like separate from point of care to stay up to date on the latest research, keep my CME credits current, whatever the case may be. What I think is really unique about what you're doing is really looking at that point of care moment. So taking into account the details of that patient, what's the research or knowledge that the physician needs in that moment? Is that a problem space you had experienced with as a company or was that a new problem space? It was definitely a problem space we had seen. I think time is something that we have heard is an issue for a lot of these health care professionals. They just don't have the time. If you think about the clinical trials that are coming out and all the research data, it would be a lot to comb through if you're going to look at every medical journal article that's come out and been published. It is something we've been trying to address with a lot of different products. Trust was an overarching theme when we were talking about AI specifically to physicians. 
From the very beginning, we surveyed over 300 health care professionals to understand their attitudes toward AI. When would they be using a product like this? Where are their pain points? And like you said, it's the point of care where they need to be focused on a patient, for example. But we had these continuous touch points with physicians from that survey to moderated usability testing and then testing a product live. And we know that they really value the trusted sources, trusted medical sources, and the transparency of tools like this. And they're willing to use that at point of care when it is transparent for them. Yeah, this is something I really want to dig into. I know, so I have a little bit of experience in the medical literature realm because my first job out of college, I worked at a company that helped to bring journals online. So back in the 90s. And I actually worked on a CME product, which is why I'm familiar with CME. I can imagine, so for people that aren't familiar with this world, journals are definitely tiered. So there's like people aren't going to trust a study from a journal they've never heard of. There's certainly like credibility markers. So this idea of trust, like you can't just tell me the answer. I need to know where it's coming from. I need to know. And I'm assuming most physicians are fairly familiar with the sources you're drawing from and they can make those judgments. Although I also know over the last 20 years, we have so many journals and maybe that's getting harder. But the thing that's fascinating to me about your problem space is the moment of point of care. We all hear these stories about clinicians that have to see 40 patients a day. They've got five minutes with you. We're digging into a really hard problem. Are they using your product literally in the moment at the point of care? I will say that we did test this. 
We had beta testers for this product and we actually had them enter in prompts without any guardrails or rules around like them entering the prompts. And they started to enter in prompts about patient communication. And that was surprising and challenged our assumptions as a product team because we thought they would be asking a lot more about diagnostic problems and treatment problems. But instead they were asking, how do I communicate or explain this diagnosis to my patient? How can I be a little bit more empathetic to my patients? And that actually shifted a direction too with our product and that we had to adjust the tone of our product's response to be a little bit more empathetic. Because yes, it was clinically accurate, but physicians in that moment of care were actually looking for a more empathetic response from our product. Ah, that's so fascinating. Okay, help me picture this. Are we talking about a doctor sitting with a patient in the room when they're using your product? Is it the day before when they're preparing for their appointments? Give me the like, if you were to like storyboard how your customer is using your product, what are the highlights in that storyboard? After speaking with physicians, engaging how they were using Helio AI or how they would use Helio AI, I would say that they're using it in preparation for those moments. So not necessarily right there next to the patient, they want to be engaged in that moment with the patients. And so we're respecting that time with this product experience, but we're using it as a way for them. We know that they're using it as a way to prepare for those moments. So it might be like the day before they're preparing for tomorrow's appointments or lunch break, they're preparing for their afternoon appointments, however they do it. It's just how do I very quickly learn what I need to learn to be my best in these upcoming appointments? That's exactly right. Okay. Yeah. 
In our survey data, we found about 75% of those 300 respondents that Casey had mentioned earlier report using AI for patient interactions and patient care. And then there was even a certain percentage that shared that they have shown their patients the AI-generated content to help explain something to them. Yeah, I was going to say a lot of patient-doctor interactions too have moved to email. And I could imagine this being really helpful in that context as well, of actually literally can plan out what I'm going to say. I wanted to bring up a scenario. So I was actually at the doctor's office yesterday, and I'm asking my doctor about different things. And she didn't know what I was going to bring up, but I can tell that she's looking over at the computer and kind of half reading, half replying to my concerns. And I feel like that's going to be a real problem space that we can help solve where a doctor can come in and quickly type a question. And within a few seconds, it gives you those bullet points of this is what the patient is asking about. Here are the things that this patient needs to know right now at the point of care. Yeah, I will share. I had a pretty bad ankle break this year and had to have surgery. And I've had a lot of doctor's appointments this year. And I've had like my surgeon fail to tell me pretty critical things, like he handed me a brace and was like, you're ready for this. Whereas what he put in my notes to my PT was wear the boot for another month and then transition into the brace. And I was like, oh, I thought I was ready for the brace. And it's just that's a busy doctor who just missed a bullet point. So I definitely have empathy for that busy doctor. First of all, I can't imagine doing their job. A hundred percent. I often forget things when I'm talking to people as well. 
But it's nice if you just have something that you can refer back to and be like, OK, these are the different things that I need to bring up during this medical visit. And these are actually things that doctors have told us that they want to see in our product. And then we can do some prompt engineering to update our system prompts to make sure that if you're in a clinical setting, if you're at the point of care, make sure that you have bullet points that will show exactly what you need to be telling the patient, what you need to be asking the patient, what the patient should be expecting. All those different things are what our product tries to do. OK, so I've already heard a number of things that I really like that you're creating. Here's your communication to-do list, which is great. Casey, you hit on this idea of tone and just empathy. And how do I communicate like the human parts of clinician interactions, which I know plenty of doctors that are good at that, like medicine part of human interactions and sometimes forget about the human part. But I also understand from talking with you, Jen, that you guys also pull in a lot of information from third-party sources. You're using maybe, maybe not, like personally identifying patient information. I'd love to learn about how you're doing this, but like you're pulling in data to augment what you're recommending to the doctor. Give me a sense of what does that look like? Yeah. So we know that trust is the most important part of our product. So actually, two years ago, after the ChatGPT moment, we were at Healio. We were talking, OK, what are we going to do at Healio? And ultimately, we did all the things that Jen talked about. But I was one of the people that was like, it's already solved. You can just go ahead and ask. What's the problem? Why can't we just use that? But it really is that trust factor. How are you getting your information? When it first launched, it was laughably bad, actually, with constant hallucination. 
So it truly could not be trusted. So we need to use a RAG system in order to not only ingest all of our clinical content from all the news that we have within our site, all the continuing medical education that we have in our site. But we also pull from trusted journals in PubMed. And you had mentioned earlier that not all journals are equal in the sense of the quality of them. So we do have a mechanism that weeds out some of the lower quality journals and pulls in only the most respected journals to use during inference to ultimately be used for our users. Yeah, this is great. So let's get a little bit into, I want to go back to the beginning of this AI product. So ChatGPT came out. You're starting to ask this question of what are we going to use AI for? It sounds like you were already aware of there is this key moment at point of care where a physician needs information. Did you start with a prototype? How did you evaluate if AI was going to be good for this? What did those early days look like? We started the discovery with that survey that we referenced earlier with just asking healthcare professionals, what are your attitudes toward AI? How are you using it? And then we were staying close with our users throughout the process. So yes, we did use prototypes. We used Figma. We showed rough low fidelity prototypes to our users in moderated usability testing. And then Matt's team also built a working prototype as well so that we could then see how physicians react to entering prompts and their feedback on the responses as well. So one thing that I think has come up quite a bit on this podcast is, especially for when your audience isn't technical. Now, I know a lot of physicians are technical, but I'm going to say broadly, they don't work in tech. One of the challenges with AI is if we just give them this chat box, do they know what to type in? So what's interesting to me is your prototype was, was it a chat box? You get to put in whatever prompt you want. 
Tell me a little bit about what you learned from that. Yeah, it was a working prototype so that the users can enter real prompts and get real answers. And that's where our assumptions really got challenged. So like the physician, they weren't just asking diagnostic or treatment questions. Like I said, they were asking more about patient communication. So that helped us adjust the tone. Another insight was that we know that they're pressed on time, but they really do appreciate bulleted, summarized responses from our LLM. But they also appreciate going into a deep dive into the sources as well. So that watching that behavior and then also just seeing the content and how they naturally would enter in those responses really helped shape the product. And what was that early prototype like? A ChatGPT wrapper? Were you already pulling in trusted sources? What did that first prototype look like? That has an interesting story in itself, that Casey and team had already created a prototype within Figma. But because we on the IT development side, we're also using AI tools like Cursor. I know there was a week when we were talking about we want to give this prototype out to our physician beta testers. And we wanted to do that within a week. And we had a Figma prototype. And I said to Jen, I was like, oh, I could just make the product in a weekend. And she thought I was crazy, but she's like, OK, go ahead and do it. So, you know, with these tools, you can make the product. It didn't have all the bells and whistles that it has today, but it was at least close to the look and feel that currently exists. So you're in the right setting. You're in the right mindset when you're using our product. Then we added a little feedback form in the bottom that was tied to the user's question. So we were collecting what was the user's prompt? What was the response? And then there was a rating scale. Jen and Casey can talk about what was in the rating scale. But we're collecting feedback that way. 
Yeah, it was almost like we had two things we needed to test, right? So we had to test the LLM itself and the responses that the physicians were getting. And then we also had to test the usability of the design. And are people going to be able to use it the way that they want to and find what they need? We did. We just we made it work, worked together and were able to do both pretty quickly. Okay. And it sounds like from this prototype, you learned a lot about how to build trust, how physicians want to actually interact with it. What happened next? We continuously tested the product with physicians. We didn't just stop with, okay, now we have empathy. We have the right link of the product. But we wanted to continuously explore and validate the product with our physicians. So we continuously checked in with physicians from multiple specialties. And then once we started to see that there was a noticeable shift or impact in confidence and comfort using the tool, then we knew it was ready to be handed off and finalized with Matt's team. And is this like this prototype? Was there like Matt, you mentioned RAG earlier. In your prototype, were you already drawing from third party sources? Was that already part of that initial prototype? Yeah, it was. So that was RAG connected to the content that existed on Helio as well as PubMed data. But we've since then have added in additional data sources to help enhance our answers. Okay. So I want to get into that. But before we do, I want to make sure I understand this initial prototype. So we talked about clinicians, like physicians are using this to prepare for their appointments. So are they seeing pulled in data bullet points tied to a specific patient appointment? I think this is probably a good time to bring it up that the tool is HIPAA compliant. So that was one thing that was really important for us. But yeah, it really, it all depends on the prompt and how questions being asked. 
The physician does have the ability to enter in certain information if they're trying to prepare for a specific patient, maybe for something that they maybe haven't seen a case like that in a while. So I would just say a lot of it is really dependent on what that physician is looking for at that moment for that particular patient. I see. So it's not tied, like it's not like they see a schedule of patients and on each one they're seeing like this augmented information. They have a chat box where they can start to ask questions and get help. Correct. And that's their interface to your whole system of content and research and material they have access to. And then you augmented that with some third party sources. Yes. I gotcha. Okay. Okay. So Matt, I want to go back to your first weekend of building a prototype. I know LLMs are new. RAG is new. It draws a lot on our history with search. Tell me a little bit about your experience with this technology before that weekend. And how were you able to accomplish a prototype in one weekend? So we actually did work with an outside company initially. So that was fairly early this year. They were the ones that stood up the initial version of the RAG system for us. So we had that benefit, but then we certainly made improvements to it after we got to beta testing. So at that point, at beta testing weekend, I hadn't done too many additional improvements to it. But I think what we had learned from our beta testing, one, the initial response was, look, there's a lot of text here. So one of the changes that we made was, can we add more bullets? Can we add more tables? So it's a little bit more digestible for our end users. Casey talked about the tone of the response. It was just too clinical. And then the other thing, which I think we knew all along, but this was confirmation for us, was our first iteration was actually pretty slow in terms of how quickly it responded. 
So all of those were excellent feedback that it was great to hear early before our product actually made it out into the wild. But we knew that we had to make some changes, and we ultimately did. Yeah. Okay. So let's get into what does your architecture look like today? It sounds like a clinician has a box that can type in a prompt. What's happening next? And the other part to keep in mind about our product is, we do serve ads, and we saw an opportunity as the LLM is crunching on the user's response. That's a perfect opportunity to serve up a contextual ad. So during that few seconds when RAG's doing its thing, LLM's doing its thing, perfect opportunity to serve up a relevant ad, it's a little bit of a trick so that the user doesn't feel like they're actually waiting the whole time for the streamed response. They can actually see an ad that's relevant to them. And so that's a way that we can monetize the product as well. You know, what's interesting about that is everybody talks about how advertising is coming to LLMs, and so it's fascinating you're already doing it. And I actually love the positioning of it's a way to make them feel like they're not waiting. And you have a lot of intent in their prompt, so you have a lot of content for what's relevant to them in this moment, which is really great. Exactly. We try to associate the ad with keywords that might exist within the prompt. So for instance, if a user is asking about lung cancer, then perhaps there's a lung cancer ad that we're able to then surface while the physician is waiting for their response. Okay, so physician enters a prompt, you're matching an ad so that they don't feel like they're just sitting there watching it, whatever the funny words Claude uses. And then what's happening behind the scenes? I can tell you high level where we're making sure we run through that RAG system process. 
And we did find that traditional lexical search or vector search or semantic search all independently was not going to be sufficient. So we do use a combination of all of that. We also found that it was important that if users are asking you about things like what are the latest treatments, right? There's this intent that you want something more recent. So you have to modify the search results to make sure that you're pulling more recent articles. So there's definitely a lot of gotchas than just saying, okay, this query, these results look better, or even do iterative searches to surface the most relevant information. We don't- That's been an observation of mine in the entire AI space that I think maybe a year ago, all of the KPIs were based upon speed. How quickly can the LLM serve back a response? We were optimizing for that as well as keeping it trustworthy. But trustworthy actually also brings with it a little bit less speed because you have to make sure that the answer is accurate and you actually are going to the best sources. However, since then, products like Google's Deep Research, and then everybody else came up with a Deep Research version. Because each of those products came out, I think there's now also an expectation that it's okay if it takes longer, as long as we can explain back to the user, why are you taking longer? Explain that with reasoning today. You see the different things that the LLM is crunching on before it gives you back the final answer. I can foresee a world where much like the frontier LLMs, we have both a fast mode and a slow mode, depending upon if this is a clinical setting and you just need an answer back immediately, or if this is more clinical or a research setting, where it's okay if you take a minute, as long as the answer is really in-depth and helps explain an issue. Yeah. Okay. So it sounds like the heart of your system is this RAG step. You've got a bunch of inputs. 
It looks like you have news on your site, you're searching PubMed. What are some of the other inputs that are part of this RAG search that you can share? I think what we're trying to do is make sure that the content that's going into the system is credible and trustworthy. So we're really picking and choosing what we think is going to really again help that end user. So what's going to make those answers better? What's going to make those responses better? How are we going to help that physician better treat their patient at the end of the day? So our approach to the content that goes in has really been focused on trusted sources, again, high-impact content and content that's just really going to help the quality of the answer that the physician is getting. Yeah. That makes a lot of sense. So I can imagine you could take this approach of let's dump everything into RAG and let the agent try to decide what's quality, and that would be really tough. So it sounds like there's already this filter of let's start with what we know are the most credible, trustworthy sources so that when the agent is pulling stuff back, I realize it may not even be an agent, that's an assumption I'm making there, that there's less risk. You're not going to return something that's low quality. Yeah. I can speak broadly about what the process is there. Even PubMed itself, there's five different ways that you can get that information. You have the choice of going directly to PubMed, they have that information in FTP, or they also offer an API, or that data is also in BigQuery, it's also in S3 bucket. So you have to vet out all of these different potential ways to get that information, and ultimately what will give you the best quality information, what will get you the information the fastest. So we did and have all those different prototypes of, these are the different ways that you can get that information, and they each come with their own different challenges. Yeah. This is interesting. 
When we're talking about building SaaS software and we're doing integrations, it really was like, okay, what's your API? Great. But I think when we're talking about search, and we're talking about data sources, and we're talking about getting data in the right format, it can be a lot messier and require a little more experimentation. There are even some partners that we're working with where the answer is, okay, you can go ahead and crawl our page. Crawling pages comes with its own unique challenges as well. Even within the same site, maybe the structure isn't the same. It can be tedious to make sure that you're capturing all of the information that you intend to for the system. Yeah. I can imagine with web crawling, first of all, web crawling itself is a very brittle process, but then also just unstructured data. And so you mentioned for your RAG, you're using a number of different technologies. You're not just using embeddings. You're not just using keyword search. Is that to support different input types, or are you actually taking the same types of inputs and making them searchable in different ways? I'm not sure what you mean by that. Like, on a previous episode, I talked to a company where like for product catalogs, very structured data, they found that keyword search worked better. But for web crawled data, very unstructured data, they played with chunking strategies and did more embeddings and semantic search. And so for their RAG step, it wasn't that all of their data was in embeddings and all of their data was available via keyword search. They first started to look at what type of document are we trying to retrieve? And then they sent the query to the appropriate search strategy. Yeah, just speaking generically about this, for instance, we also host guidelines on Helio. And we want to, if there's a guideline that needs to be returned, we want to return back that guideline as close to how it was said on our site as possible. 
So that's a scenario where you can't just let the LLM run wild and do its thing. You have to be able to keep that text, that sentence, that paragraph completely intact to return back to the user. And that's not always easy with LLMs. No, and so that's why just speaking broadly about this, maybe you have to have a part of that process to understand, okay, I have this paragraph of text. And I know that eventually we will be returning a response, but is there somewhere during that process I can inject the text verbatim so that we can keep the integrity of the answer? Yeah, okay. There's a couple of things that, Matt, there's something you're touching on now that I want to come back to in a minute. But first I want to go back to something Casey talked about, which is this idea of trust. We might provide bullet points because they want a concise summary, but they want to dig and go deeper. Tell me a little bit about the interface elements. Like, how are you supporting, I almost think about this as progressive disclosure. Like, here's a nice summary, but I want to dig into the areas that I want to dig in more. Tell me a little bit about what you're doing there and what you've learned. Yes, when we were testing with users, we knew that they needed quick, easy access to the sources. And also going back to building a little bit more trust within the sources, we did test that with users. And with our beta users, they were able to see the list of sources. And also when we were doing the low fidelity testing, we had some locked sources as well. We said, all right, now, do you trust this source? What would you do with this source? How would you interact with this source? And we were able to see that users might actually go in and click into a source, or maybe they just want to see where it's coming from in surface level. But we were able to gauge that behavior from the user. And is this like, I've seen sources done a lot of different ways. 
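A minimal sketch of the verbatim-injection idea Matt raises above, assuming the model is asked to emit a placeholder instead of restating a guideline, and the stored text is substituted after generation. The placeholder syntax, the function name, and the sample guideline text are all hypothetical; the speakers do not describe how Helio implements this.

```python
import re

# Hypothetical store of guideline passages keyed by ID (illustrative text only).
GUIDELINES = {
    "g1": "Example guideline sentence, stored exactly as published.",
}

# Assumed marker format the model is instructed to emit: {{guideline:ID}}
PLACEHOLDER = re.compile(r"\{\{guideline:([a-z0-9]+)\}\}")


def inject_verbatim(draft: str, guidelines: dict[str, str]) -> str:
    """Replace each {{guideline:ID}} marker with the stored text, unmodified."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # Leave unknown markers visible rather than guessing at content.
        return guidelines.get(key, match.group(0))

    return PLACEHOLDER.sub(substitute, draft)


# The draft stands in for model output; no real LLM call is made here.
draft = "The guideline states: {{guideline:g1}}"
print(inject_verbatim(draft, GUIDELINES))
```

Because the substitution happens after the model has finished, the guideline text never passes through the LLM and cannot be paraphrased by it.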
Like here's a paragraph, we aggregated this from five sources. I've seen other people do it like, here's a list of claims and I'm going to give you a citation for literally every single claim. Give me a sense of where you fall in that range, or is that even the right range? I'll say that users, they have access to subscripts where they can click on the number. I think Matt can talk a little bit more about how to surface those and how often to surface and reference those. But from a UX perspective, we just wanted to make sure that the user can say, okay, I've read this paragraph, I read this line, I know exactly where it's coming from. I can click the subscript and I can go directly and verify the source. In the first iteration, users were reading the response and then they could tab over to all the references. And I think we had a physician say they would rather see the reference as they're reading. So I think we had an iteration where we added a hover state, so you could see what the source was in addition to going over and clicking. So I think we did have some iterations there based on some feedback. So I can imagine some part of your pipeline must be looking at, okay, we think we want to return this information. Now I've got to tie it to sources or maybe you're building it up from sources. Matt, is there something you can share there about claims with sources? I know this is tricky with LLMs. Yeah. So the trick is to understand what your sources are going to be before you feed them to the LLM, and then your LLM can reference those sources. Okay. So then as it's going through the process, it can be like, okay, I know that's gonna be reference number three, right? And then all the LLM has to do is put the number three there and then on the front end, we can wire that up so that when a user hovers over it or clicks on it, we can pop up the correct citation. Okay. So it sounds like you're pulling from a lot of sources.
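The numbering approach Matt describes (fix the reference numbers before prompting, let the model emit only the number, resolve it on the front end) could be sketched as follows. The bracket marker format, the function names, and the sample sources are assumptions for illustration, and the answer string stands in for model output.

```python
import re


def build_context(sources: list[dict]) -> tuple[str, dict[int, dict]]:
    """Number sources before prompting so the model can cite them as [n]."""
    numbered = {i: src for i, src in enumerate(sources, start=1)}
    lines = [f"[{i}] {src['title']}: {src['text']}" for i, src in numbered.items()]
    return "\n".join(lines), numbered


def resolve_citations(answer: str, numbered: dict[int, dict]) -> list[dict]:
    """Map each [n] marker in the answer back to its source, in order of first use."""
    seen: list[int] = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        # Ignore numbers the model invented that were never in the context.
        if n in numbered and n not in seen:
            seen.append(n)
    return [{"number": n, **numbered[n]} for n in seen]


# Hypothetical retrieved sources.
sources = [
    {"title": "Source A", "text": "First passage."},
    {"title": "Source B", "text": "Second passage."},
]
context, numbered = build_context(sources)

# Stand-in for an LLM answer that cites by number.
answer = "Claim one [2]. Claim two [1]. Unsupported claim [9]."
for citation in resolve_citations(answer, numbered):
    print(citation["number"], citation["title"])
```

The front end only needs the returned list to wire each number to a hover or click target; a marker with no matching source, like `[9]` here, is dropped and could be logged for review.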
Clearly there's an LLM step that's trying to synthesize that and figure out what to show to the user. As part of that, there's what came from what sources so we can connect up to these citations. I know in your prototype, you mentioned you were collecting feedback at the end from the clinician. Do you still have a sort of human feedback loop on your responses? We do. We actually have a group called Helio Innovation Partners where we're constantly having touchpoints with our users. We can ask questions. We can say, hey, go check out Helio AI and give us a little bit of feedback about X, Y, Z. So we're constantly having those touchpoints with our users even now that it's launched. Feedback right from the answer too. So if someone isn't happy, there's a thumbs up, thumbs down really quickly. If someone is unhappy with the answer, there's an option to click on some pre-selected categories of why or they can type in explanation if they choose to. And the other thing I'll say is the one thing we found working with a lot of physicians is that they wanna be involved in the development of tools like this. So we are lucky enough to be building up an advisory board. We do have some physicians who have been really helpful with the development of this and have been continuing to give us feedback on what they're seeing. So that's been really helpful as well. Yeah, that's always a really clear sign that you're solving a problem that matters to them when they wanna get that involved. Okay, I do wanna get into guardrails and evals because I can imagine it's pretty critical that you get this right. It seems like in your context, it's really critical that you have confidence that what you're telling the clinician is the right thing. So tell me a little bit about what are you putting in place to make sure that's true? Whether that's production guardrails, after the fact, what does that look like? 
Yeah, so we use guardrails upfront to make sure that the user is, first of all, asking appropriate questions. And we also have a process that masks any kind of personal health information and that's important for our HIPAA compliance. So we make sure from the very beginning that, and it would be unusual for a physician to do this, but we wanna make sure that they're not entering a patient's name or something like that, or somehow entering their social security number. Those are things that are clear and obvious. We wanna mask that. We wanna make sure that information never makes it into any server. It's pretty much just lost in thin air as it goes through the system. On the backend in terms of evals, because trust was so important to us early on, we didn't want to rely, at least initially, on LLMs to tell us if the answer was good or not. I think that we've evolved into that and we're actively working on an LLM-as-judge system, but I think we were so focused on, we want physician feedback. The physicians know what they're talking about. We don't want an LLM that's already flawed to come back and tell us if the answer is right or wrong. We can't really trust them, but we can trust the dozens of physicians who are going through our responses or helping us beta test to tell us where we need to improve or where something might not be accurate. To make a long answer longer, yes, we do currently use LLM-as-judge today, but at the moment we're just using it to collect that information and understand, is there something that we could do with this data? Do we need to run this by another physician to make sure that it's accurate? Because trust, again, is the most important thing and we don't want to just blindly trust an LLM. Yeah, I do love having the customer be the ultimate judge of quality. And when I first learned about evals, really the first thing that jumped into my mind was how do we push as much of this to the customer as possible?
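A minimal sketch of the upfront masking step described above, assuming simple pattern matching applied to the query before it is stored or sent anywhere. The patterns and labels are illustrative assumptions, not Helio's actual guardrail, and regexes alone would not catch patient names, which typically need a dedicated entity-recognition step.

```python
import re

# Assumed patterns for two obvious identifier formats (US-style).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def mask_phi(text: str) -> str:
    """Replace each matched identifier with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(mask_phi("Patient SSN 123-45-6789, callback 555-123-4567."))
# Patient SSN [SSN], callback [PHONE].
```

Running the mask before any logging or model call is what keeps the raw value from reaching a server, which is the property Matt emphasizes.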
Because if it's not, even our evals might say it looks great, but if it doesn't work for the customer, it doesn't work. Jen, you mentioned that on every response, the physician can mark thumbs up, thumbs down. Tell me what's happening with that feedback. What do you guys do with that data? Yeah, we have been live for, gosh, about four to six weeks, I guess. So we are constantly looking at the feedback data that comes in and we're a small team, so we are very focused and taking a look at when that feedback is negative, what was the response, what was the query, and really digging into that. We have a team that meets weekly and we are just continuing to work on improvements and prioritize what's important. And if there's changes that we need to make on the backend or tweaks that we need to make, we're prioritizing those. So it's something we look really carefully at. I would even say, unfortunately, there hasn't been much negative feedback, but I'm curious about for users that don't think it's good, why not? Because we want to improve it, but there's been more positive feedback than negative feedback. Yeah, which is a good problem to have, but I understand the sentiment there. And then Matt, tell me a little bit about, you mentioned you're starting to play with LLM as judges, just trying to look at where they might be useful in your system. Is there anything you can share about, like at what points you're using them, how you're setting them up, anything that other people could learn from? Yeah, and right now we have eight judges. This is highly experimental. So who knows where this will go? Will it be used? But we check for safety, medical accuracy, faithfulness, relevancy, completeness, reasoning, clarity, and then one final judge to take everything into account and give us an answer about how good was this answer. So we're messing around with it right now. There are some tools out there that help make that easier. I've also made my own that make that easier. 
Right now it's just data collection, seeing where we're at, seeing where we can improve. But I think there will be ongoing conversations about how much weight we put into this versus our own beta testers and our physician panel. And I think ultimately it's gonna be a mix of the two, but we're searching for more feedback and we have plenty of access to LLMs. So that is one benefit, that we can just run our evals whenever we want to get at least some feedback about the quality of our responses. Yeah, some of those things you rattled off, they sound like a lot of what you get from an eval tool off the shelf. Are you doing a mix of off-the-shelf evals versus your own? I can imagine with medical, what's helpful in a medical context is different from a generic definition of helpful. Tell me a little bit about that: are you mixing and matching? Are you using only one or the other? Honestly, we're trying everything at this point, right? The one I ran just yesterday was using ChatGPT 5.2. Because, assuming that's the best model out there, it has the best understanding of whether this is a quality response or not. But I've also hooked it up to all of the LLMs because again, I think more feedback is better than less feedback. I think we'll see which one we ultimately settle on, but we wanna make sure that we're not just changing things because the LLM told us to. These might be things that we bring back to our physician panel and say, hey, this is some information that we got from our LLM-judge evals. Do you think this is right? Maybe you can help ask a few questions to see if you agree that we should change this, but it always has to come back to the physicians to make sure that they agree that what we're doing is the right thing, because this is a product for the physician. It's not a product for the LLM. Yeah, I know a lot of teams are using these evals directionally, right? It's just, can we surface whether there's a problem? Now let's go investigate.
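The judge setup Matt lists (seven dimension judges plus a final overall judge, used for data collection and to decide what goes back to physicians) might be sketched like this. The 0-to-1 score scale, the review threshold, and the `judge` callable are assumptions; in practice the callable would wrap an LLM call, and here a plain mean stands in for the eighth, overall judge.

```python
from typing import Callable

# The seven dimensions named in the conversation.
DIMENSIONS = [
    "safety", "medical_accuracy", "faithfulness", "relevancy",
    "completeness", "reasoning", "clarity",
]

# Hypothetical judge signature: (dimension, question, answer) -> score in [0, 1].
Judge = Callable[[str, str, str], float]


def run_judges(question: str, answer: str, judge: Judge,
               threshold: float = 0.7) -> dict:
    """Score an answer on each dimension and flag low scores for human review."""
    scores = {dim: judge(dim, question, answer) for dim in DIMENSIONS}
    # A simple mean replaces the LLM-based overall judge in this sketch.
    overall = sum(scores.values()) / len(scores)
    flagged = sorted(dim for dim, score in scores.items() if score < threshold)
    return {
        "scores": scores,
        "overall": overall,
        "flagged": flagged,
        # Scores are directional only: a flag routes the case to a physician.
        "needs_physician_review": bool(flagged),
    }


def stub_judge(dimension: str, question: str, answer: str) -> float:
    """Stand-in for an LLM judge; returns fixed scores for demonstration."""
    return 0.4 if dimension == "completeness" else 0.9


result = run_judges("example question", "example answer", stub_judge)
print(result["flagged"], result["needs_physician_review"])
```

Keeping the output as a flag rather than an automatic change matches the stance in the conversation: the judges surface candidates, and the physician panel decides.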
To me, it's very analogous to our behavioral analytics. They can tell us something odd is happening over here. Now go talk to your customer. It's not gonna tell you the whole story. So it sounds like that's a really nice mindset. Teresa, I had heard you say on a previous podcast that over time your evals might shift. And I was curious about how that might happen or why that might happen. Are people experiencing that in the real world? I would say I personally have gone through some of that. Like when I first learned about evals, it felt like you need to have evals on everything. And then as I started to build, I realized, no, actually I just need evals on the most critical things. And then on this podcast, talking to teams about how they do evals, there's so many strategies. Like some people are literally doing evals on every single step in their pipeline. Some people are doing evals at key decision points. Some people are waiting for things to go wrong and then doing an eval around that. Some people are reporting success with off-the-shelf evals. I think that tends to be most common with consumer, general audiences. I think most people are still building custom evals and custom eval tooling. But I also know plenty of teams that are starting to get a ton of value out of the eval tools. And so I think it's all over the place. I think this stuff is new enough that everybody's experimenting. I don't know that there's a right answer yet. Yeah, I know. That's what's so exciting about that. Pretty much since the ChatGPT moment three years ago, that was the starting gate for many teams to be like, okay, go build your AI product. And here's a million different ways that you can approach the problem. And at the same time, I'll, go ahead. I think we're seeing this across all the elements of AI products.
So whether that's, we talked about RAG and the evolution of RAG and how it started simple and it's getting more complex and we have tiers and we have different search technologies. I think I'm seeing it with agent orchestration and multi-agent and whether or not your main agent loop is a loop or whether it's a more fixed pipeline. Like literally I'm hearing anything and everything and people are having success with every strategy. So I started to think about, based on these interviews, are there recommendations for new teams? And the only recommendation I can take away from the conversations so far is you really have to experiment to figure out what works in your context because this is so new. There's so many ways to do things and the way that's gonna work is gonna be very context dependent. Yeah, time will tell. Yeah, it's fun though. It's a lot more fun than just building CRUD apps, right? Oh, a hundred percent. I'm tired of it. No, I think that's the biggest change: instead of writing C# and JavaScript and SQL apps, I'm now more of an architect. I'll wake up at six in the morning after a wonderful night of sleep and I'm like, I just had this great idea. Crack open Cursor, perfectly describe exactly what I wanna build and then I can just watch it do its thing. So I think those 15 years of development experience, it hasn't been wasted. It just helps amplify what I'm doing now with AI. Yeah, I'm not gonna lie. I have Claude Code building two features while we talk. I know, that's impressive multitasking. We'll see if it does a good job, but it is working in the background. Okay, let me ask you this. Is there anything you wish I had asked you or is there anything else you wanna share about your product and where you're at? I would just share that I think we're really just focused on that responsible adoption of AI.
I think Teresa, to your point, it's really important that we're getting feedback directly from our users, directly from these physicians and a lot of our focus is there and we wanna make sure that they're comfortable using a tool like Helio AI so that they can preserve that trust with their patients and really just looking for ways to continue to improve that overall experience. Yeah, and to bounce off of what Jen just said, AI, designing for AI, it doesn't reduce the need for discovery at all. I think it's actually like raising the stakes for us. The more responsibility a product has, especially in the healthcare space, the more deeply we have to listen to our users, such as like our physicians. And with that, and by doing that, we've really learned to not just treat accuracy, response accuracy as a finish line, but trust in a product like Helio AI is built through transparency, tone adjustment and just the respect for our users' time. Yeah, I really like that. I saw somebody on LinkedIn post something like, how do we build trust with our customers with this AI product that's totally unreliable? And when I read that, I was like, your AI product shouldn't be totally unreliable. That's the first problem with that statement, right? You're not building the right AI product if you're not building in reliability. And then, but it does acknowledge, like I think across the board, we're seeing trust is one of the most critical factors. It's not about like getting the right answer is step one. And then there's a whole bunch of downstream. Can we get people to act on it? Matt, anything you wanna add? I just wanted to echo what Casey just said a moment ago, because I think it's so true in the times of AI that it's really easy to build anything, but the design aspect of our applications has never been more important, that we need to make sure that our apps are intuitive and it matches what we're trying to, what our physicians want to do. 
So how can we make that as easy as possible? That's the objective. And the beauty is, once we figure that out, we have these AI tools that are able to help us build much faster. Yeah, awesome. All right, tell me what's next for Helio AI. Looking ahead, the UX team, we are doubling down on our discovery process. We are working with a trusted group of physicians. I mentioned it earlier, Helio Innovation Partners. We are having multiple touchpoints to validate what's on our roadmap for 2026. There's a lot more to come, but we just want to keep the same approach, which is to stay close to our physicians, validate early, and then just make sure it aligns with the feedback that the physicians are providing. This has been really fun to learn about your team. I will share, like I said at the beginning, I think your problem space is absolutely fascinating. I think we can be doing a lot more to help very busy physicians stay on top of what's probably an information fire hose. So I think you're doing great work. Thanks for taking the time to share your work with me. Thanks so much, this was awesome. If you enjoyed this conversation, please subscribe in your favorite podcast app and give us a rating as it helps others find the show. Thanks, I appreciate it.