GPT-5.5 just did what no other model could

Overview

Clara Voe gives an early read on GPT-5.5 and GPT-5.5 Pro after testing them for a couple of weeks, mostly in Codex and a bit in ChatGPT. Her main point is simple: this model is expensive, but for advanced coding work it can clear problems that other models struggled with, and it does so with less hand-holding.

She is less convinced about the fit for the average ChatGPT user. Her testing suggests the model’s extra intelligence shows up most clearly when the task is technically hard, spread across multiple subproblems, or tied to a real codebase with a backlog of messy issues.

Key Takeaways

The biggest theme is ROI through ambition, not just speed. Clara says most AI tools already help people move faster, but GPT-5.5 Pro changed what she felt able to attempt. In her telling, the gain was not only shorter turnaround time; it was the model’s ability to solve problems that had been sitting untouched for months because they were too annoying or complex to tackle.

She also says efficiency matters in a practical way. The model kept context better, worked more autonomously, and reduced the need for babysitting. That let her run several tasks in parallel in Codex and still get usable output. For a developer, that changes the workflow from “prompt, wait, repair” to something closer to dispatching work.

Her test in ChatGPT was less convincing. She asked GPT-5.5 to build an app to teach her child two-digit and three-digit subtraction, and while the result was acceptable, the model thought for 17 minutes and 27 seconds before producing it. Her point is not that the app was bad. It is that many consumer and business tasks may not need that much intelligence or tolerate that much latency. She sees an “intelligence overhang”: more raw capability than many non-technical use cases can currently make good use of.

The strongest evidence came from Codex. Clara describes running a security scan on her company’s codebase, downloading the resulting low-severity findings as a CSV, and asking the model to review the issues architecturally, group related items, propose changes, and implement them. She says it handled that broad, loosely grouped task well, and the changes held up under human review and code review. She ties that to an annual penetration test shortly afterward that came back very clean, which she presents as validation that the model’s output was solid.

Practical Steps

If you are a developer or engineering lead, try GPT-5.5 Pro on grouped backlog work rather than one-off prompts. Good candidates from Clara’s examples:

Security remediation lists
Technical debt with many linked subissues
Flaky tests
Front-end cleanup
Backlogs that have sat untouched because they are boring or messy

Use a structured handoff. Clara’s approach can be turned into a simple playbook:

Export your issues into a file, such as a CSV.
Ask the model to cluster related items first.
Have it propose an architectural plan before changing code.
Let it implement the grouped fixes.
Run human review and code review before merging.
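
The handoff steps above could be prepared with a small script. What follows is a minimal sketch under stated assumptions, not Clara's actual process: the transcript does not describe the CSV layout, so the column names (id, severity, category, description), the sample rows, and the prompt wording are all hypothetical. It parses an exported findings file, does a rough pre-grouping so a human can sanity-check the model's clustering later, and assembles a prompt that mirrors the playbook order.

```python
import csv
import io
from collections import defaultdict


def load_findings(csv_text):
    """Parse CSV text into a list of dict rows, one per finding."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def group_by(rows, column):
    """Group rows by a column, normalizing case and surrounding whitespace."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column].strip().lower()].append(row)
    return dict(groups)


def build_handoff_prompt(rows):
    """Assemble a prompt that follows the playbook: cluster, plan, implement."""
    lines = [
        "Review the findings below.",
        "1. Group related items by theme.",
        "2. Propose an architectural plan before changing code.",
        "3. Implement the grouped fixes.",
        "",
        "Findings:",
    ]
    for row in rows:
        lines.append(f"- [{row['id']}] ({row['severity']}) {row['description']}")
    return "\n".join(lines)


# Invented sample data for illustration only; a real export will differ.
SAMPLE = """id,severity,category,description
F-1,low,headers,Missing security header on static assets
F-2,low,deps,Outdated dependency in build tooling
F-3,low,Headers,Cookie lacks SameSite attribute
"""

if __name__ == "__main__":
    findings = load_findings(SAMPLE)
    for theme, items in group_by(findings, "category").items():
        print(theme, [item["id"] for item in items])
    print(build_handoff_prompt(findings))
```

The local grouping is only a cross-check. In Clara's account the model itself decides which issues are thematically related, and human review plus code review still come before merging.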

For ChatGPT use, be selective. If the task is a simple app or a low-stakes request, the extra thinking time may not be worth it. Save this model for work where higher reasoning and autonomy can pay for the cost.
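
To judge whether a task can "pay for the cost," a rough estimate helps. The sketch below uses the per-million-token prices Clara quotes in the transcript ($5 input and $30 output for GPT-5.5; $30 input and $180 output for GPT-5.5 Pro). The dictionary keys are plain labels rather than confirmed API model identifiers (the transcript says the models are not yet in the API), and the token counts in the usage lines are hypothetical.

```python
# Prices in USD per one million tokens, as quoted in the transcript.
# Keys are illustrative labels, not confirmed API model names.
PRICES_PER_MILLION = {
    "gpt-5.5": (5.00, 30.00),       # (input, output)
    "gpt-5.5-pro": (30.00, 180.00),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Return the estimated cost in USD for one task."""
    in_price, out_price = PRICES_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical task size: 200k input tokens, 50k output tokens.
    for model in PRICES_PER_MILLION:
        print(model, estimate_cost(model, 200_000, 50_000))
```

At those hypothetical token counts the estimate is $2.50 for GPT-5.5 and $15.00 for GPT-5.5 Pro, a sixfold gap that follows directly from the quoted prices. Real usage will vary, especially since Clara notes the model thinks at length.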

Notable Quotes

“It is a powerhouse and I’ve been able to do things with this model, especially around advanced coding, that I haven’t been able to do before with any other model on the market.” - Clara Voe
“I’m going to pay the intelligence tax.” - Clara Voe
“If you have a list, a triage list of technical debt, if you have a triage list of security issues... you can throw that list at GPT-5.5 and it will get that list done.” - Clara Voe

Full Transcript

Source: openai 23m runtime

Welcome back to How I AI. I'm Clara Voe, product leader and AI obsessive here on a mission to help you build better with these new tools. Today I have a very special episode for you where I'm going to tell you everything I think about the new GPT-5.5 model, which I've been able to test for the past couple weeks. Spoiler alert, it is a powerhouse and I've been able to do things with this model, especially around advanced coding that I haven't been able to do before with any other model on the market. And I'm going to show you how it breaks my personal high-tech eval hacking into this little computer. Let's get to it. So before I tell you what I built with GPT-5.5, let me tell you a little bit about the model itself. So today, OpenAI is releasing GPT-5.5 and GPT-5.5 Pro into Codex and ChatGPT, not available in the API quite yet. And this model I've been testing for the past couple weeks, and I will tell you what OpenAI is saying is true. They're saying that it has a higher capacity for complex work. It is more efficient, including being more token efficient, getting that work done. And so the whole idea with this model is it's smarter and it's more efficient, so you're going to get more done. And that has really been my experience. Now, I'm glad it's more efficient because it is expensive. GPT-5.5 is $5 per million input tokens and $30 for output tokens. And GPT-5.5 Pro, which has powered all this work that I've been doing, is $30 for a million input tokens and $180 for output tokens. So this is a pricey one, but when I reflect on what I was able to achieve with this model in early testing, I'm going to pay. I'm going to pay the intelligence tax because I think what I was able to achieve is really important. And this is one of the things that I think about a lot when I'm testing these new models or testing these new tools. You know, everything has an ROI and there can be an ROI in terms of speed. So can I get the things done that I want to get done faster? 
And that's certainly been an accelerant from an AI tooling perspective and something we've all experienced for the past couple of years. But where GPT-5.5 really helps me is ambition. It has been able to do things that literally I have not been able to do before for a couple of reasons. One, just intelligence higher has solved problems that other models and other harnesses other than Codex have really had a hard time with. The second thing I've experienced is because the efficiency is higher, I'm able to do more faster without losing context of what I'm working on because it's happening really quickly or it's being more autonomous. So I don't have to babysit as much. So again, I'm getting more done. So I do believe that what OpenAI is telling us is true, but that's coming out of my own experience spending hours and hours and hours with this model, throwing problems at it that other models have really had a hard time with, including GPT-5.5. So let's talk about what I built. And folks, for the less technical here, one of the things I'm going to say about the model, and I tested it a little bit in ChatGPT, but not a lot, is that I don't know what to do with all this intelligence if you don't have complex problems to solve. So while I've tested it in ChatGPT in my personal account, which is what I got access to, I don't have complex, high-intelligence problems to solve in my personal account. And so it was really hard for me to think of where I would use 5.5 or 5.5 Pro in ChatGPT simply because the problems I'm solving there aren't that hard. But I did try to solve problems there. So let's just talk about quickly how I used 5.5 in ChatGPT and what it gave me. And it'll just give you an indication of what I'm going to show you a little bit later. But again, I think what the consumer or even the everyday enterprise business user is going to struggle with using ChatGPT with this model is how many problems do you have that require super intelligence. 
So again, I think this is going to be a model that developers and software engineers really love. And I'm really excited to see what OpenAI does in terms of unleashing and boxing this intelligence in use cases that then the quote-unquote everyday person can use. So that's a little bit of my lecture on how much we have an intelligence overhang, basically. So what did I ask GPT-5.5 to do in ChatGPT? Really simple thing. I'm teaching my second grader two-digit and three-digit subtraction. He's actually in first grade, but, you know, San Francisco, I'm trying to push him ahead. And so one of the ways that I've been able to teach him is build these little apps that help him understand subtraction with two digits and three digits and learn some kind of tactics to do that well. And so I asked it to build an app for me to teach my second grader more advanced subtraction concepts. I haven't been super pleased with some of the vibe coding tools or Claude Code on this. Nothing's really built this exactly how I wanted, so I wanted to give 5.5 a shot at it. And first out the gate, it's a thinker. So you can see here it thought for 17 minutes, 27 seconds about this. You are going to have this experience with this model. This is going to be a theme of this mini-episode. This thing will think. And it planned an app for advanced subtraction, built the code, all this kind of stuff. Now, here's my question. Do we need 17 minutes of hyperintelligence thinking to build this app? Probably not. If I wasn't testing for the purpose of this podcast, would I have waited 18 minutes for this app? Probably not. So again, what are we going to do with all this intelligence? Is this the right form factor for a non-technical software engineer to access it? Not 100% sure. And it built me an app here. You can see it includes many lessons, word problems, read aloud. It's fine. It's fine. It's fine. It has different modules in it. 
The design leaves something to be desired, but again, I'm not really going to the GPT models for front end. I really want them to solve my hardest technical problems. And so I would just say in ChatGPT, I'm unsure yet, only because I'm not sure what the average ChatGPT user is really trying to achieve and how much intelligence is required, even on the coding side. And so I just wanted to start there by saying, if you're in ChatGPT, you're using 5.5, let me know your hard intelligence problems so I can test them. I think the basic "vibe code me a little simple app," it's fine. It's not great. It's not any more in particular impressive than other things on the market, but it does a reasonable job. And then just the sniff of 5.5 is it's going to think a lot and it's going to give you this chain of thought reasoning here to let you know how it's thinking and managing its own process. Okay, so I'm going to put away ChatGPT. It's fine. Let's talk about using 5.5 Pro in Codex. And you all, I love, I love her. I do. My initial reaction when I first started testing GPT-5.5 in Codex is I am cooking. And what I mean by that is I was kicking off tons of tasks in parallel because the feedback loop was fast, the efficiency you felt right away. I was knocking off very long-standing tasks with tons of subtasks underneath them. And I'll give an example of what those are. And I was able to bite off a tech debt technical problem in the ChatPurdy code base that I have wanted to take care of for truly months. It has been plaguing me and GPT-5.5 blasted through it. So I want to show you a couple of those examples so you can understand what kind of tasks GPT-5.5 plus Codex is really good at and why I think its intelligence is higher and the way it's configured to work autonomously and efficiently is really beneficial for the software engineer. 
So the first thing that I did, which I'm not going to show you for what will become very obvious reasons, is we used OpenAI's Codex security product to run a threat assessment and security scan on the ChatPurdy code base. And it was pretty good. We're pretty secure, but it did come up with some low priority or low severity issues that we needed to remediate. And instead of taking those one by one, what I did is I downloaded the CSV of those issues, uploaded it to Codex and just said, can you please architecturally review these issues, group them if they're thematic, and then propose a change and then make those changes. And I will say it just did it. It did it very well. We did human review on that. We did code review on that. And we were just really happy with the quality of execution, but also the fact that I could give it a list of generally associated, but not single project tasks. And it could execute on those well. And the real validation of the quality of that output came when we had very quickly after that, our annual penetration test and our pen test came back super clean. And so I would just say, if you have a list, a triage list of technical debt, if you have a triage list of security issues, even maybe front-end debt, flaky tests, engineers, pay attention. You can throw that list at GPT-5.5 and it will get that list done. So that's use case one that I thought was really