There's a methodological flaw here—you asked the model to generate an example, and then simulate the results on that example. For all you know, that specific (very simple) example and its output are available on the internet somewhere. Try it with five strings of other random words and see how it does.
This is key. LLMs will cheat and just look up training data whenever they can. That needs to be actively countered if you want to make them "reason".
I did exactly that, and it was very close to the actual sklearn output.
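If anyone wants to rerun the check with fresh strings, here's a minimal sketch of how to generate the ground truth. The word pool and document sizes are arbitrary picks of mine, and it assumes scikit-learn's default TfidfVectorizer settings:

    # Minimal sketch: ground-truth TF-IDF values to compare against the model's
    # "simulated" output. The word pool and document sizes are arbitrary choices,
    # not anything from the original experiment.
    import random
    from sklearn.feature_extraction.text import TfidfVectorizer

    random.seed(0)
    word_pool = ["apple", "river", "quantum", "violin", "desert",
                 "lantern", "copper", "meadow", "signal", "harbor"]

    # Five documents of random words, as suggested upthread.
    docs = [" ".join(random.choices(word_pool, k=6)) for _ in range(5)]

    vectorizer = TfidfVectorizer()          # defaults: smoothed idf, l2 norm
    matrix = vectorizer.fit_transform(docs).toarray()
    terms = vectorizer.get_feature_names_out()

    for doc, row in zip(docs, matrix):
        print(doc)
        print({t: round(v, 4) for t, v in zip(terms, row) if v})

Paste the same five strings into the model, ask it to "simulate" TfidfVectorizer, and diff against what this prints.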
When I ran these prompts, I saw in the chain of thought:

> Hmm, I need to run some code. I'm thinking I can use Python, right? There’s this Python tool I can simulate in my environment since I can’t actually execute the code. I’ll run a TfidfVectorizer code snippet to compute some outcomes.

It is ambiguous, but this leads me to believe the model does have access to a Python tool. Also, my 'toy examples' were identical to yours, making me think they have been seen in the training data.

This gave me a thought on the future of consumer-facing LLMs, though. I was speaking to my nephew about his iPhone; he hadn't really considered that it was "just" a battery, a screen, some chips, a motor, etc., all in a nice casing. To him, it was a magic phone!
Technical users will understand LLMs are "just" next token predictors that can output structured content to interface with tools all wrapped in a nice UI. To most people they will become magic. (I already watched a video where someone tried to tell the LLM to "forget" some info...)
o3-mini doesn't have access to a Python tool. I've seen this kind of thing in the "reasoning" chains from other models too - they'll often say things like "I should search for X" despite not having a search tool. It's just a weird aspect of how the reasoning tokens get used.
Some of the paid models do have access to an interpreter, ever since the "Code Interpreter" feature was announced some time ago. Seems like it's already been a year or two.
It doesn't always use the tool, but it can: https://chatgpt.com/share/67bcb3cb-1024-800b-8b7e-31335c6347...
If you're using ChatGPT or the Assistants API w/ managed tools (I don't remember if this is even available for o3-mini), it has access to a Python execution tool.
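For reference, this is roughly how the managed tool gets enabled on the Assistants API side; the assistant name and model below are placeholders, and whether o3-mini accepts the tool there is exactly the open question:

    # Rough sketch: requesting the managed code_interpreter tool via the
    # OpenAI Assistants API (Python SDK). Name and model are placeholders.
    from openai import OpenAI

    client = OpenAI()
    assistant = client.beta.assistants.create(
        name="tfidf-checker",                    # hypothetical name
        model="gpt-4o",                          # placeholder model
        tools=[{"type": "code_interpreter"}],
    )
    print(assistant.id)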
In the ChatGPT chat, o3-mini does not have access to the code interpreter.
85 IQ: LLMs are magic
110 IQ: LLMs are "just" next token predictors that can output structured content to interface with tools all wrapped in a nice UI
140 IQ: LLMs are magic
The word "just" can trivialize so much. Rockets are "just" explosives pointed in one direction. Computers are "just" billions of transistors in a single package. Humans are "just" a protein shell for DNA.
There is a lot of this "LLMs are just (x)" going around, but it seems to me that is missing the point, in the extreme.
The “magic” is that yes, LLMs are “just” statistical next token predictors.
And as code alone, without training data, LLMs produce garbage.
When you feed them human cultural-linguistic data, they “magically” can communicate useful ideas, reason, maintain an internal world state, and use tools.
The LLM architecture is just a mechanism for imprinting and representing human cultural data. Human cultural data is the “magic”, somehow embodying the ability to reason, maintain state, use tools, and communicate.
Learning how to represent language data in vector-space allowed us to actually encode the meaning embedded in cultural data, since written language is just a shorthand.
Actually representing meaning allows us to run culture as code. Transformer boxes are a target for that code.
The magic is human culture.
Culture matters. We should be curating our culture.
140 IQ: LLMs are magic ... token predictors that can output structured content
Which is all we are. Next-token prediction isn't just all you need, it's all there is.
That's the most interesting part of what we're learning now, I think. So many people refused to accept that for any number of reasons -- religious, philosophical, metaphysical, personal -- and now they have no choice.
Well, the code in question is also written by the same LLM, so it could just output something it knows the answers to already. On its own, this result doesn't really seem to prove anything.
I tried with alternate values and got the same result: not precisely exact, but extremely close values.
Threw it into DeepSeek and it gets it right, albeit taking like 3-5 minutes checking its logic and math multiple times.
OK, this is wild. I just saw o3-mini (regular) precisely simulate (calculate?) the output of quite complicated computations. Well, complicated for a human, at least… and no, it didn’t use Code Interpreter.
How do you know it didn't use a code interpreter if they don't share the chain-of-thought?
When Code Interpreter is used on ChatGPT, OpenAI makes it very clear that it is being used, through UI hints.
I really hope they don't ever change that UI pattern, this stuff is hard enough to understand already.
If you really want to test this, you can take advantage of the fact that Code Interpreter runs in a persistent sandbox VM. In the o3-mini conversation, tell it to save a file, then switch to GPT-4o (which can use Code Interpreter for real) and have it run Python code to show whether that file exists or not.
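Something like this for the second step, where the filename is just an example and /mnt/data is the sandbox's usual working directory:

    # The check you'd ask GPT-4o (with real Code Interpreter) to run after
    # telling o3-mini to "save" a file. The filename is a made-up example.
    from pathlib import Path

    target = Path("/mnt/data/o3_mini_was_here.txt")
    print("exists:", target.exists())
    if target.exists():
        print(target.read_text())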
I was trying to solve this simple beam deflection problem and have been getting inconsistent results from various models (o1-mini and Gemini 2.0 Flash Thinking Experimental) between different runs. Do you get consistent deflection numbers?
> A 6061-T6 aluminum alloy hollow round beam, 2 in diameter with 0.125 in wall thickness and 120 in length, is simply supported at each end. A point load of 100 lb is applied at the middle. What is the deflection at the middle and 12 in from the ends?
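For a reference point, the textbook formulas are easy enough to run directly; a quick sketch, assuming E ≈ 10.0e6 psi for 6061-T6 and the standard simply-supported, midspan-point-load beam equations:

    # Hand calculation for the quoted beam problem, as a ground truth to
    # compare the models against. Assumes E = 10.0e6 psi for 6061-T6
    # (a typical handbook value) and ideal simply supported conditions.
    import math

    P = 100.0      # lb, point load at midspan
    L = 120.0      # in, span
    D_o = 2.0      # in, outer diameter
    t = 0.125      # in, wall thickness
    E = 10.0e6     # psi, assumed modulus for 6061-T6

    D_i = D_o - 2 * t
    I = math.pi * (D_o**4 - D_i**4) / 64       # hollow circular section

    def deflection(x):
        """Deflection at distance x from a support, valid for x <= L/2."""
        return P * x * (3 * L**2 - 4 * x**2) / (48 * E * I)

    print(f"I = {I:.4f} in^4")                                  # ~0.325 in^4
    print(f"midspan: {deflection(L / 2):.3f} in")               # ~1.11 in
    print(f"12 in from a support: {deflection(12.0):.3f} in")   # ~0.33 in

If the models' numbers bounce around between runs, comparing against something like this at least tells you which run landed in the right ballpark.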
Perhaps worth running a few more submissions to determine if it did use one or not.
The obvious next step here is to see how well this generalises to arbitrary inputs :)