> These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities.
It is only surprising to those who refuse to understand how LLMs work and continue to anthropomorphise them. There is no being “truthful” here, the model has no concept of right or wrong, true or false. It’s not “lying” to you, it’s spitting out text. It just so happens that sometimes that non-deterministic text aligns with reality, but you don’t really know when and neither does the model.
Precisely. These tools often hallucinate, including about the instructions higher up, before your portion of the prompt, and about the behind-the-scenes content not shown to the user during reasoning.
You see binary failures all the time when doing function calls or JSON outputs.
That is… “please call this function” … does not call function
“calling JSON endpoint”… does not emit JSON
So, per the article, the model hallucinates that it has used external tools when that usage was entirely fictitious; it does not know the tool usage was fictitious, and then it sticks to its guns.
The workaround is to add verification steps and throw away "bad" answers. Instead of expecting one true output, expect a stream of results with a certain yield (in the agricultural sense): say 95% work and 5% are garbage. Never consider the results truly accurate, just "accurate enough", and always verify.
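To make "verify always" concrete, here is a minimal sketch of the kind of generate-and-verify loop I mean; the `generate` stub and the 95% figure are made up for illustration, not tied to any particular model:

    import json
    import random

    def generate(prompt: str) -> str:
        # Stand-in for a real LLM call: most of the time we get valid JSON,
        # occasionally we get garbage.
        if random.random() < 0.95:
            return '{"answer": 42}'
        return 'Sure! Here is the JSON you asked for: {answer: 42'

    def generate_verified(prompt: str, attempts: int = 5) -> dict:
        """Treat the model as a noisy generator: sample until an output
        passes verification, or give up."""
        for _ in range(attempts):
            raw = generate(prompt)
            try:
                parsed = json.loads(raw)      # check 1: is it JSON at all?
            except json.JSONDecodeError:
                continue                      # throw away the "bad" answer
            if "answer" in parsed:            # check 2: does it have the fields we need?
                return parsed
        raise RuntimeError(f"no verifiable output after {attempts} attempts")

    print(generate_verified("Return the answer as JSON."))

The point is to treat each output as a sample from a noisy process: parse it, check the invariants you care about, and only accept what passes.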
As an electrical engineer, it is absolutely amazing how much LLMs suck at describing electrical circuits. They are somewhat OK with natural language, which works for the simplest circuits. For more complex stuff, ChatGPT (regardless of model) seems to default to absolutely nonsensical ASCII circuit diagrams; you can ask it to list each part with each terminal and describe the connections to other parts and terminals, and it will fail spectacularly with missing parts, missing terminals, parts no one ever heard of, short circuits, dangling nodes with no use...
If you ask it to draw a schematic, things somehow get even worse.
But what it is good at is proposing ideas. So if you want to do a thing that could be solved by using a Gilbert cell, the chances it might mention a Gilbert Cell are realistically there.
But I am already having students coming by with LLM-slop circuits asking why they don't work.
Makes sense. It's not trained on complex electrical circuits, it's trained on natural language. And code, sure. And other stuff it comes across while training on those, no doubt including simple circuitry, but ultimately, all it does is produce plausible conversations, plausible responses, stuff that looks and sounds good. Whether it's actually correct, whether it works, I don't think that's even a concept in these systems. If it gets it correct by accident, that's mostly because correct responses also look plausible.
It claims to have run code on a Macbook because that's a plausible response from a human in this situation. It's basically trying to beat the Turing Test, but if you know it's a computer, it's obvious it's lying to you.
> Whether it's actually correct, whether it works, I don't think that's even a concept in these systems.
I'm not an expert, but it is a concept in these systems. Check out some videos on Deepseek's R1 paper. In particular there's a lot they did to incentivize the chain-of-thought reasoning process towards correct answers in "coding, mathematics, science, and logic reasoning" during reinforcement learning. I presume basically all the state of the art CoT reasoning models have some similar "correct and useful reasoning" portion in their RL tuning. This explains why models are getting better at math and code, but not as much at creative writing. As I understand it, everybody is pretty data limited, but it's much easier to generate synthetic training data where there is a right answer than it is to make good synthetic creative writing. It's also much easier to check that the model is answering those problems correctly during training, rather than waiting for human feedback via RLHF.
It seems that OpenAI forgot to make sure their critic model punished o3 for being wrong when it claimed it had a laptop, lol.
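To be concrete about what that RL signal tends to look like: the "correct answers in coding, mathematics, science, and logic reasoning" part usually comes down to programmatic reward checks. This is not DeepSeek's actual code, just a toy sketch of what a rule-based reward for verifiable answers might look like:

    import re

    def math_reward(model_output: str, ground_truth: str) -> float:
        """Toy rule-based reward: 1.0 if the last number in the output matches
        the known answer, else 0.0. Real pipelines are fussier about format."""
        numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
        return 1.0 if numbers and numbers[-1] == ground_truth else 0.0

    def format_reward(model_output: str) -> float:
        """Small bonus for wrapping the reasoning in <think>...</think> tags,
        roughly in the spirit of what the R1 paper describes."""
        return 0.2 if "<think>" in model_output and "</think>" in model_output else 0.0

    sample = "<think>3 * 4 = 12, plus 5 is 17</think> The answer is 17"
    print(math_reward(sample, "17") + format_reward(sample))  # 1.2

Because a checker like this only exists where answers can be verified automatically, the training signal is much stronger for math and code than for creative writing, which matches what we see.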
One of the blog post authors here! I think this finding is pretty surprising at the purely behavioral level, without needing to anthropomorphize the models. Two specific things I think are surprising:
- This appears to be a regression relative to the GPT-series models which is specific to the o-series models. GPT-series models do not fabricate answers as often, and when they do they rarely double down in the way o3 does. This suggests there's something specific in the way the o-series models are being trained that produces this behavior. By default I would have expected a newer model to fabricate actions less often rather than more!
- We found instances where the chain-of-thought summary and output response contradict each other: in the reasoning summary, o3 states the truth that e.g. "I don't have a real laptop since I'm an AI ... I need to be clear that I'm just simulating this setup", but in the actual response, o3 does not acknowledge this at all and instead fabricates a specific laptop model (with e.g. a "14-inch chassis" and "32 GB unified memory"). This suggests that the model does have the capability of recognizing that the statement is not true, and still generates it anyway. (See https://x.com/TransluceAI/status/1912617944619839710 and https://chatgpt.com/share/6800134b-1758-8012-9d8f-63736268b0... for details.)
You're still using language that includes words like "recognize", which strongly suggests you haven't got the parent poster's point.
The model emits text. What it's emitted before is part of the input to the next text-generation pass. Since the training data don't usually include much text saying one thing and then afterwards saying "that was super stupid, actually it's this other way", the model is also unlikely to generate new tokens saying the last ones were irrational.
If you wanted to train a model to predict that the next sentence will be a contradiction of the previous one, you could do that. "True" and "correct" and "recognize" are not in the picture.
LLMs can recognize errors in their own output. That's why thinking models generally perform much better than the non-thinking ones.
No, a block of text that begins "please improve on the following text:" is likely to continue after the included block with some text that sounds like a correction or refinement.
Nothing is "recognized", nor is anything "an error". Nothing is "thinking" any more than it would be if the LLM just printed whether the next letter were more likely to be a vowel or consonant. Just because it's doing a better job modeling text doesn't magically make it be doing something that's not a text prediction function.
You're using the same words again. It looks like reasoning, but it's a simulation.
The LLM merchants are driving it, though, by reusing pre-existing words for things that are not what those words actually describe.
It's amazing what they can do, but an LLM cannot know if what it outputs is true or correct, just statistically likely.
> It just so happens that sometimes that non-deterministic text aligns with reality, but you don’t really know when and neither does the model.
This is overly simplistic and demonstrably false. There are plenty of scenarios where a model will tell you something false on purpose (e.g. when joking) and, if you ask afterwards, will correctly report with high probability whether it was false or not.
However you want to frame it, there's clearly a better-than-chance evaluation of truthfulness going on.
I don't see how A follows from B. Being able to lie on purpose doesn't, in my mind, mean that it's also able to tell when a statement is true or false. The first one is just telling a tale, which they are good at.
But it is able to tell if a statement is true or false, as in it can predict whether it is true or false with much above 50% accuracy.
The model has only a linguistic representation of what is "true" or "false"; you don't. That is a limitation of LLMs: human minds have more to them than NLP.
LLMs are also more than NLP. They're deep learning models.
What? Yes the modelling technique falls under "deep learning" but it still very much processes language and language only, making it NLP.
Yes yes, language modelling ends up being surprisingly powerful at scale, but that doesn't make it not language modelling.
A couple of years of this LLM AI hype train has blinded people to what was actually surprising about LLMs. The surprise wasn't that you could make a language model and it wasn't that a language model could generate text. Those are both rather pedestrian observations, and their implementations are trivial. The surprise of LLMs was that contemporary hardware could scale this far and that an un-curated training set turns out to contain a statistically significant amount of truth. Deep learning was interesting because we didn't expect that amount of computation to be feasible at this time in human history, not because nobody had ever thought of it before.
The surprise of the LLM AI was that they were somewhat truthful at all.
The AI revolution has mostly been a hardware revolution. I studied AI in the 1990s, so I knew about neural networks and backpropagation, so when suddenly everybody was talking about "Deep Learning", I wanted to know what was different about it. Turns out: not much. It's mostly just plain old backpropagation on a much larger scale because we have more powerful hardware.
Of course there have still been plenty of meaningful innovations, like the transformer/attention mechanism, but it's mostly the fact that affordable graphics cards offer massively parallel floating-point calculations, which turns out to be exactly what we need to scale this up. That and the sheer amount of data that's become available in the age of the Internet.
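For anyone who hasn't seen it spelled out, the core loop really is the same thing from the 90s textbooks, just with vastly more parameters and data. A tiny numpy sketch of "plain old backpropagation" on a two-layer network (XOR is the classic toy problem; nothing here is specific to modern LLMs):

    import numpy as np

    rng = np.random.default_rng(0)

    # Classic toy problem: learn XOR with one hidden layer.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1, b1 = rng.normal(size=(2, 8)), np.zeros((1, 8))
    W2, b2 = rng.normal(size=(8, 1)), np.zeros((1, 1))
    lr = 0.5

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for step in range(5000):
        # Forward pass
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # Backward pass: just the chain rule on squared error
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0, keepdims=True)
        W1 -= lr * (X.T @ d_h);  b1 -= lr * d_h.sum(axis=0, keepdims=True)

    print(np.round(out, 2))  # usually ends up close to [[0], [1], [1], [0]]

Swap the 8 hidden units for billions of parameters, the XOR table for trillions of tokens, and numpy for a GPU cluster, and the outline of the recipe is unchanged.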
>The AI revolution has mostly been a hardware revolution.
It's certainly important but this reads as overly simplistic to me. All the hardware we have today won't make an SVM or a random forest scale the way transformers do.
Neural networks and backpropagation were known in the 90s too.
We don't need to anthropomorphise them, that was already done by the training data. It consumed text where humans with egos say things to defend what they said before (even if illogical or untrue). All the LLM is doing is mimicking the pattern.
LLMs are deterministic. It's just the creators often add pseudo-random seeds to produce a variety of outputs.
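More precisely: the forward pass is a pure function of the weights and the input; the variety comes from how the next token is sampled afterwards, and even that is reproducible if you fix the seed. A rough numpy sketch of the distinction (the logits and tokens are made up):

    import numpy as np

    # Pretend these are the logits a model assigned to four candidate next tokens.
    logits = np.array([2.1, 1.9, 0.3, -1.0])
    tokens = ["cat", "dog", "tree", "xylophone"]

    def softmax(x, temperature=1.0):
        z = x / temperature
        z = z - z.max()              # for numerical stability
        e = np.exp(z)
        return e / e.sum()

    # Greedy decoding: a pure function of the logits, same answer every time.
    print(tokens[int(np.argmax(logits))])             # always "cat"

    # Temperature sampling: varied outputs, but reproducible with a fixed seed.
    rng = np.random.default_rng(seed=42)
    probs = softmax(logits, temperature=0.8)
    print(tokens[rng.choice(len(tokens), p=probs)])   # same pick on every run with seed=42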
Actually that thread has an interesting theory:
"... o-series models are often prompted with previous messages without having access to the relevant reasoning. When asked questions that rely on their internal reasoning for previous steps, they must then come up with a plausible explanation for their behavior."
The fact is that humans do this all the time too -- their subconscious prompts them to do something, which they then do without reflecting or analyzing what their motivation might be. When challenged on it, they come up with a rationalization, not an actual reflected explanation.
The movie "Memento" is basically about how humans do this -- use faulty memories to rationalize stories for ourselves. At some point, a secondary character asks the main character, "And this fancy suit you're wearing, this car, where did they come from?" The main character (who is unable to form any long-term memory) says, "I'm an insurance agent; my wife had insurance and I used the money from the payout to buy them." To which the secondary character says, "And in your grief, you went out and bought a Jaguar."
Not to give any spoilers, but that's not where the Jaguar came from, and the secondary character knows that.
This just isn’t true - one interesting paper on the topic: https://arxiv.org/abs/2212.03827
That paper doesn't contradict the parent. It's just pointing out that you can extract knowledge from the LLM with good accuracy by
"... finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values"
The LLM itself still has no idea of the truth or falsity of what it spits out. But you can more accurately retrieve yes/no answers to knowledge encoded in the model by using this specific trick - it's a validation step you can impose - making it less likely that the yes/no answer is wrong.
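To unpack what that looks like in practice: the paper (CCS) trains a small probe on hidden activations with no truth labels at all, using only a consistency term (a statement and its negation should get probabilities summing to roughly 1) plus a confidence term (to rule out the degenerate probe that outputs 0.5 for everything). A rough sketch of that objective, assuming you already have activation pairs extracted from the model; this is not the authors' code:

    import torch

    def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
        # Consistency: "X" and "not X" should get probabilities summing to ~1.
        consistency = (p_pos - (1.0 - p_neg)) ** 2
        # Confidence: discourage the trivial p_pos = p_neg = 0.5 solution.
        confidence = torch.minimum(p_pos, p_neg) ** 2
        return (consistency + confidence).mean()

    # Toy stand-ins for hidden-state vectors of "X is true" / "X is false" prompts.
    acts_pos = torch.randn(128, 64)
    acts_neg = torch.randn(128, 64)

    probe = torch.nn.Sequential(torch.nn.Linear(64, 1), torch.nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)

    for _ in range(200):
        opt.zero_grad()
        loss = ccs_loss(probe(acts_pos).squeeze(-1), probe(acts_neg).squeeze(-1))
        loss.backward()
        opt.step()

So the "truth direction" is something you recover with an external procedure and then use as a check; the model's ordinary decoding never consults it.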
Can you say a bit more? Just reading the abstract, it's not clear to me how this contradicts the parent comment.
Anybody that doesn't acknowledge this as a base truth of these systems should not be listened to. It's not intelligence, it's statistics.
The AI doesn't reason in any real way. It's calculating the probability of the next word appearing in the training set conditioned on the context that came before, and in cases where there are multiple likely candidates it's picking one at random.
To the extent you want to claim intelligence from these systems, it's actually present in the training data. The intelligence is not emergent, it's encoded by humans in the training data. The weaker that signal is relative to the noise of random internet garbage, the more likely the AI will be to pick a random choice that's not True.
I'm arguing that this is too simple an explanation.
The Claude paper showed that it has some internal model when answering in different languages.
The process of learning can have effects within it that are more than statistics. If the training optimizes itself by building an internal model representation, then it's no longer just statistics.
It also sounds like humans are the origin of intelligence, but if humans do the same thing as LLMs, and the only difference is that we do not train LLMs from scratch (letting them discover the world, invent languages, etc., but instead priming them with our world), then our intelligence was emergent and the LLMs' is emergent by proxy.
Since the rise of LLMs, the thought has definitely occurred to me that perhaps our intelligence might also arise from language processing. It might be.
The big difference between us and LLMs, however, is that we grow up in the real world, where some things really are true, and others really are false, and where truths are really useful to convey information, and falsehoods usually aren't (except truths reported to others may be inconvenient and unwelcome, so we learn to recognize that and learn to lie). LLMs, however, know only text. Immense amounts of text, without any way to test or experience whether it's actually true or false, without any access to a real world to relate it to.
It's entirely possible that the only way to produce really human-level intelligent AI with a concept of truth, is to train them while having them grow up in the real world in a robot body over a period of 20 years. And that would really restrict the scalability of AI.
I just realized that kids (and adults) these days grow up more in virtual environments behind screens than in touch with the real world, and maybe that might have an impact on our ability to discern truth from lies. That would certainly explain a lot about the state of our world.
A few years back I saw a documentary about kids in a third-world country where it is normal to use plastic bags for drinking soda.
These kids couldn't understand that the plastic garbage in the nature around them is not part of nature.
Nonetheless, depending on what rules you mean, there are a lot of people who show that logic or 'truth' is not the same for everyone.
People believing in a god, ghosts, conspiracy theories, flat earth, etc.
I'm more curious whether the 'self' can only be trained if you have a clear line of control. We learn what the self is because there is a part which we can control and then there is a part which we can't control.
The only scientific way to prove intelligence is using statistics. If you can prove that a certain LLM is accurate enough in generalised benchmarks it is sufficient to call it intelligent.
I don't need to know how it works internally, why it works internally.
What you (and parent post) are suggesting is that it is not intelligent based on its working. This is not a scientific take on the subject.
This is in fact how it works for medicine. A drug works because it has been shown to work based on statistical evidence. Even if we don't know how it works internally.
Assuming the statistical analysis was sound. It is not always so. See the replication crisis for example
> It is only surprising to those who refuse to understand how LLMs work and continue to anthropomorphise them. There is no being “truthful” here, the model has no concept of right or wrong, true or false. It’s not “lying” to you, it’s spitting out text. It just so happens that sometimes that non-deterministic text aligns with reality, but you don’t really know when and neither does the model.
My problem with this attitude is that it's surprisingly accurate for humans, especially mentally disabled ones. While I agree that something is "missing" about how LLMs display their intelligence, I think it's wrong to say that LLMs are "just spitting out text, they're not intelligent". To me, it is very clear that LLM models do display intelligence, even if said intelligence is a bit deficient, and even if it weren't, it wouldn't be exactly the type of intelligence we see in people.
My point is, the phrase "AI" has been thrown around pointlessly for a while already. Marketing people would sell a simple 100-line program with a few branches as "AI", but all common people would say that this intelligence is indeed just a gimmick. But when ChatGPT got released, something flipped. Something feels different about talking to ChatGPT. Most people see that there is some intelligence in there, and it's just a few old men yelling at the clouds "It's not intelligence! It's just statistical token generation!" as though these two were mutually exclusive.
Finally, I'd like to point out you're not "alive". You're just a very complex chemical reaction/physical interaction. Your entire life can be explained using organic chemistry and a bit of basic physics. Yet for some reason, most people decide not to think of life in this way. They attribute complex personalities and emotions to living beings, even though it's mostly hormones and basic chemistry again. Why?
> Your entire life can be explained using organic chemistry and a bit of basic physics.
Your internal experience of your life cannot be, though (this may change in the future).
What about ChatGPT's internal experience?
Given that I can't even prove your internal experience, I'll have to demur on this topic ;)
> These behaviors are surprising
Really? LLMs are bullshit generators, by design. The surprising thing here is that people think that LLMs are "powerful at solving math tasks". (They're not.)
> The surprising thing here is that people think that LLMs are "powerful at solving math tasks".
That's not really surprising either. We have evolved to recognize ourselves in our environment. We recognize faces and emotions in power outlets and lawn chairs. Recognizing intelligence in the outputs of LLMs is less surprising than that. But the fact that we recognize intelligence in LLMs implies intelligence in them just about as much as your power outlet is happy or sad because it looks that way to you.
I enjoy watching newer-generation models exhibit symptoms that echo features of human cognition. This particular one is reminiscent of the confabulation seen in split-brain patients, e.g. https://www.edge.org/response-detail/11513
o3 has been the worst model of the new 3 for me.
Ask it to create a Typescript server side hello world.
It produces a JS example.
Telling it that's incorrect (but no more detail) results in it iterating all sorts of mistakes.
In 20 iterations it never once asked me what was incorrect.
In contrast, o4-mini asked me after 5, o4-mini-high asked me after 1, but narrowed the question to "is it incorrect due to choice of runtime?" rather than "what's incorrect?"
I told it to "ask the right question" based on my statement ("it is incorrect") and it correctly asked "what is wrong with it?" before I pointed out no Typescript types.
This is the critical thinking we need, not just reasoning (incorrectly).
> Ask it to create a Typescript server side hello world. It produces a JS example.
Well TS is a strict superset of JS so it’s technically correct (which is the best kind of correct) to produce JS when asked for a TS version. So you’re the one that’s wrong.
> Well TS is a strict superset of JS so it’s technically correct (which is the best kind of correct) to produce JS when asked for a TS version. So you’re the one that’s wrong.
Try that one at your next standup and see how it goes over with the team
He's not wrong. If the model doesn't give you what you want, it's a worthless model. If the model is like the genie from the lamp, and gives you a shitty but technically correct answer, it's really bad.
> If the model doesn't give you what you want, it's a worthless model.
Yeah, if you’re into playing stupid mind games while not even being right.
If you stick to just voicing your needs, it’s fine. And I don’t think the TS/JS story shows a lack of reasoning that would be relevant for other use cases.
> Yeah, if you’re into playing stupid mind games while not even being right.
If I ask questions outside of the things I already know about (probably pretty common, right?), it's not playing mind games. It's only a 'gotcha' question with the added context, otherwise it's just someone asking a question and getting back a Monkey's Paw answer: "aha! See, it's technically a subset of TS.."
You might as well give it equal credit for code that doesn't compile correctly, since the author didn't explicitly ask.
As I mentioned, TS/JS was only one issue (semantic vs. technical definition); the other is that it didn't know to question me, making its reasoning a waste of time. I could have asked something else ambiguous based on the context, not a TS/JS example, and it likely would still not have questioned me. In contrast, if you question a fact, not a solution, I find LLMs are more accurate and will attempt to take you down a notch if you try to prove the fact wrong.
Well yes, but still the name should give it away and you'll be shot during PRs if you submit JS as TS :D
The fact is the training data has confused JS with TS, so the LLM can't "get its head" around the semantic, as opposed to technical, difference.
Also, the secondary point wasn't just that it was "incorrect"; it's that its reasoning was worthless unless it knew who to ask and the right questions to ask.
If somebody tells you that something you know is right is actually wrong, the first thing you ask them is "why do you think that?", not "maybe I should think about this from a new angle, without evidence of what is wrong".
It illustrates lack of critical thinking, and also shows you missed the point of the question. :D
I'm confused - the post says "o3 does not have access to a coding tool".
However, OpenAI mentions a Python tool multiple times in the system card [1], e.g.: "OpenAI o3 and OpenAI o4-mini combine state-of-the-art reasoning with full tool capabilities—web browsing, Python, [...]"
"The models use tools in their chains of thought to augment their capabilities; for example, cropping or transforming images, searching the web, or using Python to analyze data during their thought process."
I interpreted this to mean o3 does have access to a tool that enables it to run code. Is my understanding wrong?
[1] https://openai.com/index/o3-o4-mini-system-card/
One of the blog post authors here! We evaluated o3 through the API, where the model does not have access to any specific built-in tools (although it does have the capability to use tools, and allows you to provide your own tools). This is different than when using o3 through the ChatGPT UI, where it does have a built-in tool to run code.
(Interestingly, even in the ChatGPT UI the o3 model will sometimes state that it ran code on its personal MacBook Pro M2! https://x.com/TransluceAI/status/1912617941725847841)
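For anyone wondering what "provide your own tools" looks like over the API, here is a rough sketch using the Python client's function-calling interface; the `run_python` tool and the model id are placeholders for illustration, not something the API provides by itself. Even when the model emits a tool call, nothing actually runs unless your code executes it and sends the result back, and a claim of having "run code" without any tool call is exactly the kind of fabrication we describe:

    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "run_python",  # hypothetical tool we expose ourselves
            "description": "Execute a Python snippet and return its stdout.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="o3",  # assumed model id, for illustration
        messages=[{"role": "user", "content": "Generate a 512-bit prime and tell me how you did it."}],
        tools=tools,
    )

    msg = resp.choices[0].message
    if msg.tool_calls:
        # The model *requested* a tool call; it still hasn't executed anything.
        print(msg.tool_calls[0].function.name, msg.tool_calls[0].function.arguments)
    else:
        # No tool call: any claim in msg.content of having "run code" is fabricated.
        print(msg.content)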
I see, thanks for the clarification!
I don't understand why the UIs don't make this obvious. When the model runs code, why can't the system just show us the code and its output, in a special UI widget that the model can't generate any other way?
Then if it says "I ran this code and it says X" we can easily verify. This is a big part of the reason I want LLMs to run code.
Weirdly I have seen Gemini write code and make claims about the output. I can see the code, the claims it makes about the output are correct. I do not think it could make these correct claims without running the code. But the UI doesn't show me this. To verify it, I have to run the code myself. This makes the whole feature way less valuable and I don't understand why!
Power user here, working with these models (the whole gamut) side-by-side on a large range of tasks has been my daily work since they came out.
I can vouch that this is extremely characteristic of o3-mini compared to competing models (Claude, Gemini) and previous OA models (3.5, 4o).
Compared to those, o3-mini clearly has less of the "the user is always right" training. This is almost certainly intentional. At times, this can be useful - it's more willing to call you out when you're wrong, and less likely to agree with something just because you suggested it. But this excessive stubbornness is the great downside, and it's been so prevalent that I stopped using o3-mini.
I haven't had enough time with o3 yet, but if it is indeed an evolution of o3-mini, it comes as no surprise that it's very bad for this as well.
Yes! I always ask these models a simple question that none of them has the right answer to:
"List of mayors of my City X".
ALL of them get it wrong: hallucinated names, wrong dates, etc. The list is on Wikipedia, and they surely trained on that data, but they are not able to answer properly.
o3-mini? It just says it doesn't know lol
Yeah, that's the big upside for sure: its baseline hallucination rate is lower. But when it does hallucinate, it's very assertive in gaslighting you that its hallucination is in fact the truth; it can't "fix" its own errors. I've found this tradeoff not to be worth it for general use.
Sounds like we're getting closer and closer to an AI that acts like a human ;-)
So people keep claiming that these things are like junior engineers, and, increasingly, it seems as if they are instead like the worst possible _stereotype_ of junior engineers.
Reasoning models are complete nonsense in the face of custom agents. I would love to be proven wrong here.
Um... wasn't this exactly what AI 2027 said was going to happen?
"In a few rigged demos, it even lies in more serious ways, like hiding evidence that it failed on a task, in order to get better ratings."
I wish there were benchmarks for these scenarios. Anyone who has used LLMs knows that they are very different from humans, and after a certain amount of context it becomes irritating to talk to them.
I don't want my LLM to excel at the IMO or Codeforces. I want it to understand my significantly easier but complex-to-state problem, think of solutions, understand its own issues and resolve them, rather than be passive-aggressive.
"Benchmarks" in AI are hilarious. These tools can't even solve problems which are moderately more difficult than something that has a geeks4geeks page, but according to these benchmarks they are all IOI gold medallists. What gives?
The benchmarks are created by humans. So are the training sets. It turns out the sorts of problems that humans like to benchmark with are also the sorts of problems humans like to discuss wherever that training set was scraped.
Well that and the whole field is filled with AI hypemen who "contribute" by asking ChatGPT about the quality and validity of some other GPT response.
LLMs can't think. They are not rational actors, they can only generate plausible-looking texts.
Maybe so, but they boost my coding productivity, so why not?
(Not the mentioned LLMs here though.)
I do the rational acting, and it does the rest.
You're being reductive. A system should be evaluated on its measurable properties more than anything else.
Being "reductive" is how we got where we are today. We try to form hypotheses about things so that we can reduce them to their simplest model. This understanding then leads to massive gains. We've been doing this ever since we started observing things like the behavior of animals so that we could hunt them more easily.
In the same way, it helps a lot to try to understand what the correct model of an AI is so that we can use it more productively. Certainly, based on its 'measurable properties', it does not behave like a reasonable human being. Some of the time it does, some of the time it goes completely off the rails. So there must be some other model that is more useful. "They are not rational actors, they can only generate plausible-looking texts" seems more useful to me. "They are rational actors" would be more like magical thinking, which is not what got us to where we are today.
Is it just me, or does it feel like a bit of a disappointment? I have been using it for some hours now, and it's needlessly convoluting the code.
It feels similar to Llama4 - rushed. Sonnet had been king for at least 6 months, then Gemini 2.5 Pro recently raised the bar. They felt they had to respond. Ghibli memes are great, but not at the cost of losing the whole enterprise market. Currently for B2C, there's almost no lock in. Users can switch to a better app/model at very little cost. With B2B it's different, a product built on Sonnet generally isn't just going to switch to an OA model overnight unless there's huge benefits. OA will want a piece of that lock-in pie, which they'd been losing at a very rapid pace. Whether their new models solve that remains to be seen. To me, actually building products on top of these models, I still don't see much reason to use any of their models. From all testing I've been doing over the last 2 days, they don't seem particularly competitive. Potentially 4.1 or o4-mini for certain tasks, but whether they beat e.g. Deepseek v3 currently isn't clear-cut.
Yeah. God knows. I was really surprised to see Fchollet's benchmark being aced months ago, but perhaps their internal QA was lacking. I was asking for some fairly simple code, in Python of all things, using scikit-learn, for which I presume there is a lot of training data. For some reason it changed the casing of the columns and didn't follow my instructions as given, because it kept rewriting the function to reduce bloat, along with other random things I didn't ask for.
Everyone games the benchmarks, but a lot is pointing towards both Meta and OpenAI going to even further lengths than the others.
I am, however, wondering whether this is o3-preview or o3. I have had wildly fluctuating experiences with preview models previously, especially the GPT-4 Turbo previews, though GPT-4 Turbo/V/o were a lot more stable.