The "spreadsheet" example video is kind of funny: guy talks about how it normally takes him 4 to 8 hours to put together complicated, data-heavy reports. Now he fires off an agent request, goes to walk his dog, and comes back to a downloadable spreadsheet of dense data, which he pulls up and says "I think it got 98% of the information correct... I just needed to copy / paste a few things. If it can do 90 - 95% of the time consuming work, that will save you a ton of time"
It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases. I mean, this is nothing new with LLMs, but as these use cases encourage users to input more complex tasks, that are more integrated with our personal data (and at times money, as hinted at by all the "do task X and buy me Y" examples), "almost right" seems like it has the potential to cause a lot of headaches. Especially when the 2% error is subtle and buried in step 3 of 46 of some complex agentic flow.
> how it normally takes him 4 to 8 hours to put together complicated, data-heavy reports. Now he fires off an agent request, goes to walk his dog, and comes back to a downloadable spreadsheet of dense data, which he pulls up and says "I think it got 98% of the information correct...
This is where the AI hype bites people.
A great use of AI in this situation would be to automate the collection and checking of data. Search all of the data sources and aggregate links to them in an easy place. Use AI to search the data sources again and compare against the spreadsheet, flagging any numbers that appear to disagree.
Yet the AI hype train takes this all the way to the extreme conclusion of having AI do all the work for them. The quip about 98% correct should be a red flag for anyone familiar with spreadsheets, because it’s rarely simple to identify which 2% is actually correct or incorrect without reviewing everything.
This same problem extends to code. People who use AI as a force multiplier to do the thing for them and review each step as they go, while also disengaging and working manually when it’s more appropriate have much better results. The people who YOLO it with prompting cycles until the code passes tests and then submit a PR are causing problems almost as fast as they’re developing new features in non-trivial codebases.
“The fallacy in these versions of the same idea is perhaps the most pervasive of all fallacies in philosophy. So common is it that one questions whether it might not be called the philosophical fallacy. It consists in the supposition that whatever is found true under certain conditions may forthwith be asserted universally or without limits and conditions. Because a thirsty man gets satisfaction in drinking water, bliss consists in being drowned. Because the success of any particular struggle is measured by reaching a point of frictionless action, therefore there is such a thing as an all-inclusive end of effortless smooth activity endlessly maintained.
It is forgotten that success is success of a specific effort, and satisfaction the fulfillment of a specific demand, so that success and satisfaction become meaningless when severed from the wants and struggles whose consummations they arc, or when taken universally.”
The proper use of these systems is to treat them like an intern or new grad hire. You can give them the work that none of the mid-tier or senior people want to do, thereby speeding up the team. But you will have to review their work thoroughly because there is a good chance they have no idea what they are actually doing. If you give them mission-critical work that demands accuracy or just let them have free rein without keeping an eye on them, there is a good chance you are going to regret it.
Isn't the point of an intern or new grad that you are training them to be useful in the future, acknowledging that for now they are a net drain on resources.
Yeah, people complaining about accuracy of AI-generated code should be examining their code review procedures. It shouldn’t matter if the code was generated by a senior employee, an intern, or an LLM wielded by either of them. If your review process isn’t catching mistakes, then the review process needs to be fixed.
This is especially true in open source where contributions aren’t limited to employees who passed a hiring screen.
This is taking what I said further than intended. I'm not saying the standard review process should catch the AI generated mistakes. I'm saying this work is at the level of someone who can and will make plenty of stupid mistakes. It therefore needs to be thoroughly reviewed by the person using before it is even up to the standard of a typical employee's work that the normal review process generally assumes.
Yep, in the case of open source contributions as an example, the bottleneck isn't contributors producing and proposing patches, it's a maintainer deciding if the proposal has merit, whipping (or asking contributors to whip) patches into shape, making sure it integrates, etc. If contributors use generative AI to increase the load on the bottleneck it is likely to cause a negative net effect.
This very much. Most of the time, it's not a code issue, it's a communication issue. Patches are generally small, it's the whole communication around it until both parties have a common understanding that takes so much time. If the contributor comes with no understanding of his patch, that breaks the whole premise of the conversation.
98% sure each commit doesn’t corrupt the database, regress a customer feature, open a security vulnerability. 50 commits later … (which is like, one day for an agentic workflow)
I would be embarrassed to be at OpenAI releasing this and pretending the last 9 months haven't happened... waxing poetically about "age of agents" - absolutely cringe and pathetic
Or as I would like to put it, LLM outputs are essentially the Library of Babel. Yes, it contains all of the correct answers, but might as well be entirely useless.
”The people who YOLO it with prompting cycles until the code passes tests and then submit a PR are causing problems almost as fast as they’re developing new features in non-trivial codebases.”
This might as well be the new definition of “script kiddie”, and it’s the kids that are literally going to be the ones birthed into this lifestyle. The “craft” of programming may not be carried by these coming generations and possibly will need to be rediscovered at some point in the future. The Lost Art of Programming is a book that’s going to need to be written soon.
It sounds like you’re saying that good tests are enough to ensure good code even when programmers are unskilled and just rewrite until they pass the tests. I’m very skeptical.
It may not be a provable take, but it’s also not absurd. This is the concept behind modern TDD (as seen in frameworks like cucumber):
Someone with product knowledge writes the tests in a DSL
Someone skilled writes the verbs to make the DSL function correctly
And from there, any amount of skill is irrelevant: either the tests pass, or they fail. One could hook up a markov chain to a javascript sourcebook and eventually get working code out.
> One could hook up a markov chain to a javascript sourcebook and eventually get working code out.
Can they? Either the dsl is so detailed and specific as to be just code with extra steps or there is a lot of ground not covered by the test cases with landmines that a million monkeys with typewriters could unwittingly step on.
The bugs that exist while the tests pass are often the most brutal - first to find and understand and secondly when they occasionally reveal that a fundamental assumption was wrong.
“The quip about 98% correct should be a red flag for anyone familiar with spreadsheets”
I disagree. Receiving a spreadsheet from a junior means I need to check it. If this gives me infinite additional juniors I’m good.
It’s this popular pattern of HN comments - expect AI to behave deterministically correct - while the whole world operates on stochastically correct all the time…
In my experience the value of junior contributors is that they will one day become senior contributors. Their work as juniors tends to require so much oversight and coaching from seniors that they are a net negative on forward progress in the short term, but the payoff is huge in the long term.
I don't see how this can be true when no one stays at a single job long enough for this to play out. You would simply be training junior employees to become senior employees for someone else.
So this has been a problem in the tech market for a while now. Nobody wants to hire juniors for tech because even at FAANGs the average career trajectory is what, 2-3 years? There's no incentive for companies to spend the time, money, and productivity hit to train juniors properly. When the current cohort ages out, a serious problem is going to occur, and it won't be pretty.
And it should go without saying that LLMs do not have the same investment/value tradeoff. Whether or not they contribute like a senior or junior seems entirely up to luck
Prompt skill is flaky and unreliable to ensure good output from LLMs
When my life was spreadsheets, we were expected to get to the point of being 99.99% right.
You went from “do it again” to “go check the newbies work”.
To get to that stage your degree of proficiency would be “can make out which font is wrong at a glance.”
You wouldn’t be looking at the sheet, you would be running the model in your head.
That stopped being a stochastic function, with the error rate dropping significantly - to the point that making a mistake had consequences tacked on to it.
The act of trying to make that 2% appear like "minimal, dismissable" is almost a mass psychosis in the AI world at times it seems like.
A few comparisons:
>Pressing the button: $1
>Knowing which button to press: $9,999
Those 2% copy-paste changes are the $9.999 and might take as long to find as rest of the work.
I also find that validating data can be much faster than calculating data. It's like when you're in algebra class and you're told to "solve for X". Once you find the value for X you plug it into the equation to see if it fits, and it's 10x faster than solving for X originally.
Regardless of if AI generates the spreadsheet or if I generate the spreadsheet, I'm still going to do the same validation steps before I share it with anyone. I might have a 2% error rate on a first draft.
Of course, Pareto principle is at work here. In an adjacent field, self-driving, they are working on the last "20%" for almost a decade now. It feels kind of odd that almost no one is talking about self-driving now, compared to how hot of a topic it used to be, with a lot of deep, moral, almost philosophical discussions.
> The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.
In my experience for enterprise software engineering, in this stage we are able to shrink the coding time with ~20%, depending on the kind of code/tests.
However CICD remains tricky. In fact when AI agents start building autonomous, merge trains become a necessity…
Ah, those pesky regulations that try to prevent road accidents...
If it's not a technological limitation, why aren't we seeing self-driving cars in countries with lax regulations? Mexico, Brazil, India, etc.
Tesla launched FSD in Mexico earlier this year, but you would think companies would be jumping at the opportunity to launch in markets with less regulation.
So this is largely a technological limitation. They have less driving data to train on, and the tech doesn't handle scenarios outside of the training dataset well.
Do we even know what % of Waymo rides in SF are completely autonomous? I would not be surprised if more of them are remotely piloted than they've let on...
Can you name any of the specific regulations that robot taxi companies are lobbying to get rid of? As long as robotaxis abide by the same rules of the road as humans do, what's the problem? Regulations like you're not allowed to have robotaxis unless you pay me, your local robotaxi commissioner $3/million/year, aren't going to be popular with the populus but unfortunately for them, they don't vote, so I'm sure we'll see holdouts and if multiple companies are in multiple markets and are complaining about the local taxi cab regulatory commision, but there's just so much of the world without robotaxis right now (summer 2025) that I doubt it's anything mure than the technology being brand spanking new.
But it seems the reason for that is that this is a new, immature technology. Every new technology goes through that cycle until someone figures out how to make it financially profitable.
This is a big moving of the goalposts. The optimists were saying Level 5 would be purchasable everywhere by ~2018. They aren’t purchasable today, just hail-able. And there’s a lot of remote human intervention.
Hell - SF doesn’t have motorcyclists or any vehicular traffic, driving on the wrong side of the road.
Or cows sharing the thoroughfares.
It should be obvious to all HNers that have lived or travelled to developing / global south regions - driving data is cultural data.
You may as well say that self driving will only happen in countries where the local norms and driving culture is suitable to the task.
A desperately anemic proposition compared to the science fiction ambition.
I’m quietly hoping I’m going to be proven wrong, but we’re better off building trains, than investing in level 5. It’s going to take a coordination architecture owned by a central government to overcome human behavior variance, and make full self driving a reality.
I'm in the Philippines now, and that's how I know this is the correct take. Especially this part:
"Driving data is cultural data."
The optimists underestimate a lot of things about self-driving cars.
The biggest one may be that in developing and global south regions, civil engineering, design, and planning are far, far away from being up to snuff to a level where Level 5 is even a slim possibility. Here on the island I'm on, the roads, storm water drainage (if it exists at all) and quality of the built environment in general is very poor.
Also, a lot of otherwise smart people think that the increment between Level 4 and Level 5 is the same as that between all six levels, when the jump from Level 4 to Level 5 automation is the biggest one and the hardest to successfully accomplish.
Most people live within a couple hours of a city though, and I think we'll see robot taxis in a majority of continents by 2035 though. The first couple cities and continents will take the longest, but after that it's just a money question, and rich people have a lot of money. The question then is: is the taxi cab consortium, which still holds a lot of power, despite Uber, in each city the in world, large enough to prevent Waymo from getting a hold, for every city in the world that Google has offices in.
Yeah where they have every inch of SF mapped, and then still have human interventions. We were promised no more human drivers like 5-7 years ago at this point.
High speed connectivity and off vehicle processing for some tasks.
Density of locations to "idle" at.
There are a lot of things that make all these services work that means they can NOT scale.
These are all solvable but we have a compute problem that needs to be addressed before we get there, and I haven't seen any clues that there is anything in the pipeline to help out.
The typical Lyft vehicle is a piece of junk worth less than $20k, while the typical Waymo vehicle is a pretend luxury car with $$$ of equipment tacked on.
Waymo needs to be proving 5-10x the number of daily rides as Lyft before we get excited
Well, if we say these systems are here, it still took 10+ years between prototype and operational system.
And as I understand it; These are systems, not individual cars that are intelligent and just decide how to drive from immediate input, These system still require some number of human wranglers and worst-case drivers, there's a lot of specific-purpose code rather nothing-but-neural-network etc.
Which to say "AI"/neural nets are important technology that can achieve things but they can give an illusion of doing everything instantly by magic but they generally don't do that.
It’s past the hype curve and into the trough of disillusionment. Over the next 5,10,15 years (who can say?) the tech will mature out of the trough into general adoption.
GenAI is the exciting new tech currently riding the initial hype spike. This will die down into the trough of disillusionment as well, probably sometime next year. Like self-driving, people will continue to innovate in the space and the tech will be developed towards general adoption.
We saw the same during crypto hype, though that could be construed as more of a snake oil type event.
The Gartner hype cycle assumes a single fundamental technical breakthrough, and describes the process of the market figuring out what it is and isn't good for. This isn't straightforwardly applicable to LLMs because the question of what they're good for is a moving target; the foundation models are actually getting more capable every few months, which wasn't true of cryptocurrency or self-driving cars. At least some people who overestimate what current LLMs can do won't have the chance to find out that they're wrong, because by the time they would have reached the trough of disillusionment, LLM capabilities will have caught up to their expectations.
If and when LLM scaling stalls out, then you'd expect a Gartner hype cycle to occur from there (because people won't realize right away that there won't be further capability gains), but that hasn't happened yet (or if it has, it's too recent to be visible yet) and I see no reason to be confident that it will happen at any particular time in the medium term.
If scaling doesn't stall out soon, then I honestly have no idea what to expect the visibility curve to look like. Is there any historical precedent for a technology's scope of potential applications expanding this much this fast?
> If scaling doesn't stall out soon, then I honestly have no idea what to expect the visibility curve to look like. Is there any historical precedent for a technology's scope of potential applications expanding this much this fast?
Lots of pre-internet technologies went through this curve. PCs during the clock speed race, aircraft before that during the aeronautics surge of the 50s, cars when Detroit was in its heydays. In fact, cloud computing was enabled by the breakthroughs in PCs which allowed commodity computing to be architected in a way to compete with mainframes and servers of the era. Even the original industrial revolution was actually a 200-year ish period where mechanization became better and better understood.
Personally I've always been a bit confused about the Gartner Hype Cycle and its usage by pundits in online comments. As you say it applies to point changes in technology but many technological revolutions have created academic, social, and economic conditions that lead to a flywheel of innovation up until some point on an envisioned sigmoid curve where the innovation flattens out. I've never understood how the hype cycle fits into that and why it's invoked so much in online discussions. I wonder if folks who have business school exposure can answer this question better.
> If scaling doesn't stall out soon, then I honestly have no idea what to expect the visibility curve to look like.
We are seeing diminishing returns on scaling already. LLMs released this year have been marginal improvements over their predecessors. Graphs on benchmarks[1] are hitting an asymptote.
The improvements we are seeing are related to engineering and value added services. This is why "agents" are the latest buzzword most marketing is clinging on. This is expected, and good, in a sense. The tech is starting to deliver actual value as it's maturing.
I reckon AI companies can still squeeze out a few years of good engineering around the current generation of tools. The question is what happens if there are no ML breakthroughs in that time. The industry desperately needs them for the promise of ASI, AI 2027, and the rest of the hyped predictions to become reality. Otherwise it will be a rough time when the bubble actually bursts.
The problem with LLMs and all other modern statistical large-data-driven solutions’ approach is that it tries to collapse the entire problem space of general problem solving to combinatorial search of the permutations of previously solved problems. Yes, this approach works well for many problems as we can see with the results with huge amount of data and processing utilized.
One implicit assumption is that all problems can be solved with some permutations of existing solutions. The other assumption is the approach can find those permutations and can do so efficiently.
Essentially, the true-believers want you to think that rearranging some bits in their cloud will find all the answers to the universe. I am sure Socrates would not find that a good place to stop the investigation.
Right. I do think that just the capability to find and generate interesting patterns from existing data can be very valuable. It has many applications in many fields, and can genuinely be transformative for society.
But, yeah, the question is whether that approach can be defined as intelligence, and whether it can be applicable to all problems and tasks. I'm highly skeptical of this, but it will be interesting to see how it plays out.
I'm more concerned about the problems and dangers of this tech today, than whatever some entrepreneurs are promising for the future.
> We are seeing diminishing returns on scaling already. LLMs released this year have been marginal improvements over their predecessors. Graphs on benchmarks[1] are hitting an asymptote.
This isnt just a software problem. IF you go look at the hardware side you see that same flat line (IPC is flat generation over generation). There are also power and heat problems that are going to require some rather exotic and creative solutions if companies are looking to hardware for gains.
The Gartner hype cycle is complete nonsense, it's just a completely fabricated way to view the world that helps sell Gartner's research products. It may, at times, make "intuitive sense", but so does astrology.
The hype cycle has no mathematical basis whatsoever. It's marketing gimmick. It's only value in my life has been to quickly identify people that don't really understand models or larger trends in technology.
I continue to be, but on introspection probably shouldn't be, surprised that people on HN treat is as some kind of gospel. The only people who should respected are other people in the research marketing space as the perfect example of how to dupe people into paying for your "insights".
Could you please expand on your point about expanding scopes? I am waiting earnestly for all the cheaper services that these expansions promise. You know cheaper white-collar-services like accounting, tax, and healthcare etc. The last reports saw accelerating service inflation. Someone is lying. Please tell me who.
Hence why I said potential applications. Each new generation of models is capable, according to evaluations, of doing things that previous models couldn't that prima facie have potential commercial applications (e.g., because they are similar to things that humans get paid to do today). Not all of them will necessarily work out commercially at that capability level; that's what the Gartner hype cycle is about. But because LLM capabilities are a moving target, it's hard to tell the difference between things that aren't commercialized yet because the foundation models can't handle all the requirements, vs. because commercializing things takes time (and the most knowledgeable AI researchers aren't working on it because they're too busy training the next generation of foundation models).
It sounds like people should just ignore those pesky ROI questions. In the long run, we are all dead so let’s just invest now and worry about the actual low level details of delivering on the economy-wide efficiency later.
As capital allocators, we can just keep threatening the worker class with replacing their jobs with LLMs to keep the wages low and have some fun playing monopoly in the meantime. Also, we get to hire these super smart AI researchers people (aka the smartest and most valuable minds in the world) and hold the greatest trophies. We win. End of story.
Back in my youthful days, educated and informed people chastised using the internet to self-diagnose and self-treat. I completely missed the memo on when it became a good idea to do so with LLMs.
Which model should I ask about this vague pain I have been having in my left hip? Will my insurance cover the model service subscription? Also, my inner thigh skin looks a bit bruised. Not sure what’s going on? Does the chat interface allow me to upload a picture of it? It won’t train on my photos right?
Silicon Valley, and VC money has a proven formula. Bet on founders and their ideas, deliver them and get rich. Everyone knows the game, we all get it.
Thats how things were going till recently. Then FB came in and threw money at people and they all jumped ship. Google did the same. These are two companies famous for throwing money at things (Oculus, metaverse, G+, quantum computing) and right and proper face planting with them.
Do you really think that any of these people believe deep down that they are going to have some big breath through? Or do you think they all see the writing on the wall and are taking the payday where they can get it?
Liquidity in search of the biggest holes in the ground. Whoever can dig the biggest holes wins. Why or what you get out of digging the holes? Who cares.
The critics of the current AI buzz certainly have been drawing comparisons to self driving cars as LLMs inch along with their logarithmic curve of improvement that's been clear since the GPT-2 days.
Whenever someone tells me how these models are going to make white collar professions obsolete in five years, I remind them that the people making these predictions 1) said we'd have self driving cars "in a few years" back in 2015 and 2) the predictions about white collar professions started in 2022 so five years from when?
> said we'd have self driving cars "in a few years" back in 2015
And they wouldn't have been too far off! Waymo became L4 self-driving in 2021, and has been transporting people in the SF Bay Area without human supervision ever since. There are still barriers — cost, policies, trust — but the technology certainly is here.
People were saying we would all be getting in our cars and taking a nap on our morning commute. We are clearly still a pretty long ways off from self-driving being as ubiquitous as it was claimed it would be.
There are always extremists with absurd timelines on any topic! (Didn't people think we'd be on Mars in 2020?) But this one? In the right cities, plenty of people take a Waymo morning commute every day. I'd say self-driving cars have been pretty successful at meeting people's expectations — or maybe you and I are thinking of different people.
Reminds me of electricity entering the market and the first DC power stations setup in New York to power a few buildings. It would have been impossible to replicate that model for everyone. AC solved the distance issue.
That's where we are at with self driving. It can only operate in one small area, you can't own one.
We're not even close to where we are with 3d printers today or the microwave in the 50s.
I think people don't realize how much models have to extrapolate still, which causes hallucinations. We are still not great at giving all the context in our brain to LLMs.
There's still a lot of tooling to be built before it can start completely replacing anyone.
Okay, but the experts saying self driving cars were 50 years out in 2015 were wrong too. Lots of people were there for those speeches, and yet, even the most cynical take on Waymo, Cruise and Zoox’s limitations would concede that the vehicles are autonomous most of the time in a technologically important way.
There’s more to this than “predictions are hard.” There are very powerful incentives to eliminate driving and bloated administrative workforces. This is why we don’t have flying cars: lack of demand. But for “not driving?” Nobody wants to drive!
This is the exact same issue that I've had trying to use LLMs for anything that needs to be precise such as multi-step data pipelines. The code it produces will look correct and produce a result that seems correct. But when you do quality checks on the end data, you'll notice that things are not adding up.
So then you have to dig into all this overly verbose code to identify the 3-4 subtle flaws with how it transformed/joined the data. And these flaws take as much time to identify and correct as just writing the whole pipeline yourself.
I'll get into hot water with this, but I still think LLMs do not think like humans do - as in the code is not a result of a trying to recreate a correct thought process in a programming language, but some sort of statistically most likely string that matches the input requirements,
I used to have a non-technical manager like this - he'd watch out for the words I (and other engineers) said and in what context, and would repeat them back mostly in accurate word contexts. He sounded remarkably like he knew what he was talking about, but would occasionally make a baffling mistake - like mixing up CDN and CSS.
LLMs are like this, I often see Cursor with Claude making the same kind of strange mistake, only to catch itself in the act, and fix the code (but what happens when it doesn't)
I think that if people say LLMs can never be made to think, that is bordering on a religious belief - it'd require humans to exceed the Turing computable (note also that saying they never can is very different from believing current architectures never will - it's entirely reasonable to believe it will take architectural advances to make it practically feasible).
But saying they aren't thinking yet or like humans is entirely uncontroversial.
Even most maximalists would agree at least with the latter, and the former largely depends on definitions.
As someone who uses Claude extensively, I think of it almost as a slightly dumb alien intelligence - it can speak like a human adult, but makes mistakes a human adult generally wouldn't, and that combinstion breaks the heuristics we use to judge competency,and often lead people to overestimate these models.
Claude writes about half of my code now, so I'm overall bullish on LLMs, but it saves me less than half of my time.
The savings improve as I learn how to better judge what it is competent at, and where it merely sounds competent and needs serious guardrails and oversight, but there's certainly a long way to go before it'd make sense to argue they think like humans.
Everyone has this impression that our internal monologue is what our brain is doing. It's not. We have all sorts of individual components that exist totally outside the realm of "token generation". E.g. the amygdala does its own thing in handling emotions/fear/survival, fires in response to anything that triggers emotion. We can modulate that with our conscious brain, but not directly - we have to basically hack the amygdala by thinking thoughts that deal with the response (don't worry about the exam, you've studied for it already)
LLMs don't have anything like that. Part of why they aren't great at some aspects of human behaviour. E.g. coding, choosing an appropriate level of abstraction - no fear of things becoming unmaintainable. Their approach is weird when doing agentic coding because they don't feel the fear of having to start over.
I don't think you'll get into hot water for that. Anthropomorphizing LLMs is an easy way to describe and think about them, but anyone serious about using LLMs for productivity is aware they don't actually think like people, and run into exactly the sort of things you're describing.
I just wrote a post on my site where the LLM had trouble with 1) clicking a button, 2) taking a screenshot, 3) repeat. The non-deterministic nature of LLMs is both a feature and a bug. That said, read/correct can sometimes be a preferable workflow to create/debug, especially if you don't know where to start with creating.
I think it's basically equivalent to giving that prompt to a low paid contractor coder and hoping their solution works out. At least the turnaround time is faster?
But normally you would want a more hands on back and forth to ensure the requirements actually capture everything, validation and etc that the results are good, layers of reviews right
It seems to be a mix between hiring an offshore/low level contractor and playing a slot machine. And by that I mean at least with the contractor you can pretty quickly understand their limitations and see a pattern in the mistakes they make. While an LLM is obviously faster, the mistakes are seemingly random so you have to examine the result much more than you would with a contractor (if you are working on something that needs to be exact).
the slot machine is apt. insert tokens, pull lever, ALMOST get a reward. Think: I can start over, manually, or pull the lever again. Maybe I'll get a prize if I pull it again...
and of course, you pay whether the slot machine gives a prize or not. Between the slot machine psychological effect and sunk cost fallacy I have a very hard time believing the anecdotes -- and my own experiences -- with paid LLMs.
Often I say, I'd be way more willing to use and trust and pay for these things if I got my money back for output that is false.
In my experience using small steps and a lot of automated tests work very well with CC. Don’t go for these huge prompts that have a complete feature in it.
Remember the title “attention is all you need”? Well you need to pay a lot of attention to CC during these small steps and have a solid mental model of what it is building.
"It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases."
This is the part you have wrong. People just won't do that. They'll save the 8 hours and just deal with 2% error in their work (which reduces as AI models get better). This doesn't work with something with a low error tolerance, but most people aren't building the next Golden Gate Bridge. They'll just fix any problems as they crop up.
Some of you will be screaming right now "THAT'S NOT WORTH IT", as if companies don't already do this to consumers constantly, like losing your luggage at the airport or getting your order wrong. Or just selling you something defective, all of that happens >2% of the time, because companies know customers will just deal-with-it.
I think the question then is what's the human error rate... We know we're not perfect... So if you're 100% rested and only have to find the edge case bug, maybe you'll usually find it vs you're burned out getting it 98% of the way there and fail to see the 2% of the time bugs... Wording here is tricky to explain but I think what we'll find is this helps us get that much closer... Of course when you spend your time building out 98% of the thing you have sometimes a deeper understanding of it so finding the 2% edge case is easier/faster but only time will tell
The problem with this spreadsheet task is that you don't know whether you got only 2% wrong (just rounded some numbers) or way more (e.g. did it get confused and mistook a 2023 PDF with one from 1993?), and checking things yourself is still quite tedious unless there's good support for this in the tool.
At least with humans you have things like reputation (has this person been reliable) or if you did things yourself, you have some good idea of how diligent you've been.
Right? Why are we giving grace to a damn computer as if it's human? How are people defending this? If it's a computer, I don't care how intelligent it is. 98% right is actually unacceptable.
Distinguishing whether a problem is 0.02 ^ n for error or 0.98 ^ n for accuracy is emerging as an important skill.
Might explain why some people grind up a billion tokens trying to make code work only to have it get worse while others pick apart the bits of truth and quickly fill in their blind spots. The skillsets separating wheat from chaff are things like honest appreciation for corroboration, differentiating subjective from objective problems, and recognizing truth-preserving relationships. If you can find the 0.02 ^ n sub-problems, you can grind them down with AI and they will rapidly converge, leaving the 0.98 ^ n problems to focus human touch on.
> "I think it got 98% of the information correct... I just needed to copy / paste a few things. If it can do 90 - 95% of the time consuming work, that will save you a ton of time"
"Hello, yes, I would like to pollute my entire data store" is an insane a sales pitch. Start backing up your data lakes on physical media, there is going to be an outrageous market for low-background data in the future.
semi-related: How many people are going to get killed because of this?
More work, without a doubt - any productivity gain immediately becomes the new normal. But now with an additional "2%" error rate compounded on all the tasks you're expected to do in parallel.
I do this kind of job and there is no way I am doing this job in 5-10 years.
I don't even think it is my company that is going to adapt to let me go but it is going to be an AI first competitor that puts the company I work for out of business completely.
There are all these massively inefficient dinosaur companies in the economy that are running digitized versions of paper shuffling and a huge number of white collar bullshit jobs built on top of digitized paper shuffling.
Wage inflation has been eating away at the bottom line on all these businesses since Covid and we are going to have a dinosaur company mass extinction event in the next recession.
IMO the category error being made is that LLMs are going to agentically do digitized paper shuffling and put digitized paper shufflers out of work. That is not the problem for my job. The issue is agentically from the ground up making the concept of digitized paper shuffling null and void. A relic of the past that can't compete in the economy.
I don't know why everyone is so confident that jobs will be lost. When we invented power tools did we fire everyone that builds stuff, or did we just build more stuff?
if you replace "power tools" with industrial automation it's easy to cherry pick extremes from either side. Manufacturing? a lot of jobs displaced, maybe not lost.
People say this, but in my experience it’s not true.
1) The cognitive burden is much lower when the AI can correctly do 90% of the work. Yes, the remaining 10% still takes effort, but your mind has more space for it.
2) For experts who have a clear mental model of the task requirements, it’s generally less effort to fix an almost-correct solution than to invent the entire thing from scratch. The “starting cost” in mental energy to go from a blank page/empty spreadsheet to something useful is significant. (I limit this to experts because I do think you have to have a strong mental framework you can immediately slot the AI output into, in order to be able to quickly spot errors.)
3) Even when the LLM gets it totally wrong, I’ve actually had experiences where a clearly flawed output was still a useful starting point, especially when I’m tired or busy. It nerd-snipes my brain from “I need another cup of coffee before I can even begin thinking about this” to “no you idiot, that’s not how it should be done at all, do this instead…”
>The cognitive burden is much lower when the AI can correctly do 90% of the work. Yes, the remaining 10% still takes effort, but your mind has more space for it.
I think their point is that 10%, 1%, whatever %, the type of problem is a huge headache. In something like a complicated spreadsheet it can quickly become hours of looking for needles in the haystack, a search that wouldn't be necessary if AI didn't get it almost right. In fact it's almost better if it just gets some big chunk wholesale wrong - at least you can quickly identify the issue and do that part yourself, which you would have had to in the first place anyway.
Getting something almost right, no matter how close, can often be worse than not doing it at all. Undoing/correcting mistakes can be more costly as well as labor intensive. "Measure twice cut once" and all that.
I think of how in video production (edits specifically) I can get you often 90% of the way there in about half the time it takes to get it 100%. Those last bits can be exponentially more time consuming (such as an intense color grade or audio repair). The thing is with a spreadsheet like that, you can't accept a B+ or A-. If something is broken, the whole thing is broken. It needs to work more or less 100%. Closing that gap can be a huge process.
I'll stop now as I can tell I'm running a bit in circles lol
I understand the idea. My position is that this is a largely speculative claim from people who have not spent much time seriously applying agents for spreadsheet or video editing work (since those agents didn’t even exist until now).
“Getting something almost right, no matter how close, can often be worse than not doing it at all” - true with human employees and with low quality agents, but not necessarily true with expert humans using high quality agents. The cost to throw a job at an agent and see what happens is so small that in actual practice, the experience is very different and most people don’t realize this yet.
By that definition, the ChatGPT app is now an AI agent. When you use ChatGPT nowadays, you can select different models and complement these models with tools like web search and image creation. It’s no longer a simple text-in / text-out interface. It looks like it is still that, but deep down, it is something new: it is agentic…
https://medium.com/thoughts-on-machine-learning/building-ai-...
I think this is my favorite part of the LLM hype train: the butterfly effect of dependence on an undependable stochastic system propagates errors up the chain until the whole system is worthless.
"I think it got 98% of the information correct..." how do you know how much is correct without doing the whole thing properly yourself?
The two options are:
- Do the whole thing yourself to validate
- Skim 40% of it, 'seems right to me', accept the slop and send it off to the next sucker to plug into his agent.
I think the funny part is that humans are not exempt from similar mistakes, but a human making those mistakes again and again would get fired. Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
This depends on the type of work being done. Sometimes the cost of verification is much lower than the cost of doing the work, sometimes it's about the same, and sometimes it's much more. Here's some recent discussion [0]
> I think the funny part is that humans are not exempt from similar mistakes, but a human making those mistakes again and again would get fired. Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
My rule is that if you submit code/whatever and it has problems you are responsible for them no matter how you "wrote" it. Put another way "The LLM made a mistake" is not a valid excuse nor is "That's what the LLM spit out" a valid response to "why did you write this code this way?".
LLMs are tools, tools used by humans. The human kicking off an agent, or rather submitting the final work, is still on the hook for what they submit.
"a human making those mistakes again and again would get fired"
You must be really desperate for anti-AI arguments if this is the one you're going with. Employees make mistakes all day every day and they don't get fired. Companies don't give a shit as long as the cost of the mistakes is less than the cost of hiring someone new.
> Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
Well yeah, because the agent is so much cheaper and faster than a human that you can eat the cost of the mistakes and everything that comes with them and still come out way ahead. No, of course that doesn't work in aircraft manufacturing or medicine or coding or many other scenarios that get tossed around on HN, but it does work in a lot of others.
Definitely would work in coding. Most software companies can only dream of a 2% defect rate. Reality is probably closer to 98%, which is why we have so much organisational overhead around finding and fixing human error in software.
I wonder if you can establish some kind of confidence interval by passing data through a model x number of times. I guess it mostly depends on subjective/objective correctness as well as correctness within a certain context that you may not know if the model knows about or not.
Either way sounds like more corporate drudgery.
Because it's a budget. Verifying them is _much_ cheaper than finding all the entries in a giant PDF in the first place.
> the butterfly effect of dependence on an undependable stochastic system
We're using stochastic systems for a long time. We know just fine how to deal with them.
> Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
There are very few tasks humans complete at a 98% success rate either. If you think "build spreadsheet from PDF" comes anywhere close to that, you've never done that task. We're barely able to recognize objects in their default orientation at a 98% success rate. (And in many cases, deep networks outperform humans at object recognition)
The task of engineering has always been to manage error rates and risk, not to achieve perfection. "butterfly effect" is a cheap rhetorical distraction, not a criticism.
There are in fact lots of tasks people complete immediately at 99.99% success rate at first iteration or 99.999% after self and peer checking work
Perhaps importantly checking is a continual process and errors are identified as they are made and corrected whilst in context instead of being identified later by someone completely devoid of any context a task humans are notably bad at.
Lastly it's important to note the difference between a overarching task containing many sub tasks and the sub tasks.
Something which fails at a sub task comprising 10 sub tasks 2% of the time per task has a miserable 18% failure rate at the overarching task. By 20 it's failed at 1 in 3 attempts worse a failing human knows they don't know the answer the failing AI produces not only wrong answers but convincing lies
Failure to distinguish between human failure and AI failure in nature or degree of errors is a failure of analysis.
I have a friend who's vibe-coding apps. He has a lot of them, like 15 or more, but most are only 60–90% complete (almost every feature is only 60-90% complete), which means almost nothing works properly. Last time he showed me something, it was sending the Supabase API key in the frontend with write permissions, so I could edit anything on his site just by inspecting the network tab in developer tools.
The amount of technical debt and security issues building up over the coming years is going to be massive.
Yes - and that is especially true for high-stakes processes in organizations. For example, accounting, HR benefits, taxation needs to be exactly right.
Yes. Any success I have had with LLMs has been by micromanaging them. Lots of very simple instructions, look at the results, correct them if necessary, then next step.
Honestly, though, there are far more use cases where 98% correct is equivalent to perfect than situations that require absolute correctness, both in business and for personal use.
> It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases.
The last '2%' (and in some benchmarks 20%) could cost as much as $100B+ more to make it perfect consistently without error.
This requirement does not apply to generating art. But for agentic tasks, errors at worst being 20% or at best being 2% for an agent may be unacceptable for mistakes.
As you said, if the agent makes an error in either of the steps in an agentic flow or task, the entire result would be incorrect and you would need to check over the entire work again to spot it.
Most will just throw it away and start over; wasting more tokens, money and time.
I am looking forward to learning why this is entirely unlike working with humans, who in my experience commit very silly and unpredictable errors all the time (in addition to predictable ones), but additionally are often proud and anxious and happy to deliberately obfuscate their errors.
I think there is a lot of confusion on this topic. Humans as employees have the same basic problem: You have to train them, and at some point they quit, and then all that experience is gone. Only: The teaching takes much longer. The retention, relative to the time it takes to teach, is probably not great (admittedly I have not done the math).
A model forgets "quicker" (in human time), but can also be taught on the spot, simply by pushing necessary stuff into the ever increasing context (see claude code and multiple claude.md on how that works at any level). Experience gaining is simply not necessary, because it can infer on the spot, given you provide enough context.
In both cases having good information/context is key. But here the difference is of course, that an AI is engineered to be competent and helpful as a worker, and will be consistently great and willing to ingest all of that, and a human will be a human and bring their individual human stuff and will not be very keen to tell you about all of their insecurities.
The security risks with this sound scary. Let's say you give it access to your email and calendar. Now it knows all of your deepest secrets. The linked article acknowledges that prompt injection is a risk for the agent:
> Prompt injections are attempts by third parties to manipulate its behavior through malicious instructions that ChatGPT agent may encounter on the web while completing a task. For example, a malicious prompt hidden in a webpage, such as in invisible elements or metadata, could trick the agent into taking unintended actions, like sharing private data from a connector with the attacker, or taking a harmful action on a site the user has logged into.
A malicious website could trick the agent into divulging your deepest secrets!
I am curious about one thing -- the article mentions the agent will ask for permission before doing consequential actions:
> Explicit user confirmation: ChatGPT is trained to explicitly ask for your permission before taking actions with real-world consequences, like making a purchase.
How does the agent know a task is consequential? Could it mistakenly make a purchase without first asking for permission? I assume it's AI all the way down, so I assume mistakes like this are possible.
The asking for permission thing is irrelevant. People are using this tool to get the friction in their life to near zero, I bet my job that everyone will just turn on auto accept and go for a walk with their dog.
There is almost guaranteed going to be an attack along the lines of prompt-injecting a calendar invite. Those things are millions of lines long already, with tones of auto-generated text that nobody reads. Embed your injection in the middle of boring text describing the meeting prerequisites and it's as good as written in a transparent font. Then enjoy exfiltrating your victim's entire calendar and who knows what else.
In the system I'm building the main agent doesn't have access to tools and must call scoped down subagents who have one or two tools at most and always in the same category (so no mixed fetch and calendar tools). They must also return structured data to the main agent.
I think that kind of isolation is necessary even though it's a bit more costly. However since the subagents have simple tasks I can use super cheap models.
What isolation is there? If a compromised sub agent returns data that gets inserted into the main agents context (structured or not) then the end result is the same as if the main agent was directly interacting with the compromising resource is it not?
Many of us have been partitioning our “computing” life into public and private segments, for example for social media, job search, or blogging. Maybe it’s time for another segment somewhere in the middle?
Something like lower risk private data, which could contain things like redacted calendar entries, de-identified, anonymized, or obfuscated email, or even low-risk thoughts, journals, and research.
I am Worried; I barely use ChatGPT for anything that could come back to hurt me later, like medical or psychological questions. I hear that lots of folks are finding utility here but I’m reticent.
>I barely use ChatGPT for anything that could come back to hurt me later, like medical or psychological questions
I use ollama with local LLMs for anything that could be considered sensitive, the generation is slower but results are generally quite reasonable. I've had decent success with gemma3 for general queries.
Create a burner account for email/calendar, that solves most of those problems. Nobody will care if the AI leaks that you have a dentist appointment on Tuesday.
Almost anyone can add something to people's calendars as well (of course people don't accept random invites but they can appear).
If this kind of agent becomes wide spread hackers would be silly not to send out phishing email invites that simply contain the prompts they want to inject.
"Agentic misalignment makes it possible for models to act similarly to an insider threat, behaving like a previously-trusted coworker or employee who suddenly begins to operate at odds with a company’s objectives."
I agree with the scariness etc. Just one possibly comforting point.
I assume (hope?) they use more traditional classifiers for determining importance (in addition to the model's judgment). Those are much more reliable than LLMs & they're much cheaper to run so I assume they run many of them
I'm not so optimistic as someone that works on agents for businesses and creating tools for it. The leap from low 90s to 99% is classic last mile problem for LLM agents. The more generic and spread an agent is (can-do-it-all) the more likely it will fail and disappoint.
Can't help but feel many are optimizing happy paths in their demos and hiding the true reality. Doesn't mean there isn't a place for agents but rather how we view them and their potential impact needs to be separated from those that benefit from hype.
In general most of the previous AI "breakthrough" in the last decade were backed by proper scientific research and ideas:
- AlphaGo/AlphaZero (MCTS)
- OpenAI Five (PPO)
- GPT 1/2/3 (Transformers)
- Dall-e 1/2, Stable Diffusion (CLIP, Diffusion)
- ChatGPT (RLHF)
- SORA (Diffusion Transformers)
"Agents" is a marketing term and isn't backed by anything. There is little data available, so it's hard to have generally capable agents in the sense that LLMs are generally capable
The technology for reasoning models is the ability to do RL on verifiable tasks, with the some (as-of-yet unpublished, but well-known) search over reasoning chains, with a (presumably neural) reasoning fragment proposal machine, and a (presumably neural) scoring machine for those reasoning fragments.
The technology for agents is effectively the same, with some currently-in-R&D way to scale the training architecture for longer-horizon tasks. ChatGPT agent or o3/o4-mini are likely the first published models that take advantage of this research.
It's fairly obvious that this is the direction that all the AI labs are going if you go to SF house parties or listen to AI insiders like Dwarkesh Patel.
My personal framing of "Agents" is that they're more like software robots than they are an atomic unit of technology. Composed of many individual breakthroughs, but ultimately a feat of design and engineering to make them useful for a particular task.
In the context of our conversation and what OP wrote, there has been no breakthrough since around 2018. What you're seeing is the harvesting of all low-hanging fruit from a tree that was discovered years ago. But fruit is almost gone. All top models perform at almost the same level. All the "agents" and "reasoning models" are just products of training data.
This "all breakthroughs are old" argument is very unsatisfying. It reminds me of when people would describe LLMs as being "just big math functions". It is technically correct, but it misses the point.
AI researchers spent years figuring out how to apply RL to LLMs without degrading their general capabilities. That's the breakthrough. Not the existence of RL, but making it work for LLMs specifically. Saying "it's just RL, we've known about that for ages" does not acknowledge the work that went into this.
Similarly, using the fact that new breakthroughs look like old research ideas is not particularly good evidence that we are going to head into a winter. First, what are the limits of RL, really? Will we just get models that are highly performant at narrow tasks? Or will the skills we train LLMs for generalise? What's the limit? This is still an open question. RL for narrow domains like Chess yielded superhuman results, and I am interested to see how far we will get with it for LLMs.
This also ignores active research that has been yielding great results, such as AlphaEvolve. This isn't a new idea either, but does that really matter? They figured out how to apply evolutionary algorithms with LLMs to improve code. So, there's another idea to add to your list of old ideas. What's to say there aren't more old ideas that will pop up when people figure out how to apply them?
Maybe we will add a search layer with MCTS on top of LLMs to allow progress on really large math problems by breaking them down into a graph of sub-problems. That wouldn't be a new idea either. Or we'll figure out how to train better reranking algorithms to sort our training data, to get better performance. That wouldn't be new either! Or we'll just develop more and better tools for LLMs to call. There's going to be a limit at some point, but I am not convinced by your argument that we have reached peak LLM.
I understand your argument. The recipe that finally let RLHF + SFT work without strip mining base knowledge was real R&D, and GPT 4 class models wouldn’t feel so "chatty but competent" without it. I just still see ceiling effects that make the whole effort look more like climbing a very tall tree than building a Saturn V.
GPT 4.1 is marketed as a "major improvement" but under the hood it’s still the KL-regularised PPO loop OpenAI first stabilized in 2022 only with a longer context window and a lot more GPUs for reward model inference.
They retired GPT 4.5 after five months and told developers to fall back to 4.1. The public story is "cost to serve” not breakthroughs left on the table.
When you sunset your latest flagship because the economics don’t close, that’s not a moon shot trajectory, it’s weight shaving on a treehouse.
Stanford’s 2025 AI-Index shows that model to model spreads on MMLU, HumanEval, and GSM8K have collapsed to low single digits, performance curves are flattening exactly where compute curves are exploding.
A fresh MIT-CSAIL paper modelling "Bayes slowdown" makes the same point mathematically: every extra order of magnitude of FLOPs is buying less accuracy than the one before.[1]
A survey published last week[2] catalogs the 2025 state of RLHF/RLAIF: reward hacking, preference data scarcity, and training instability remain open problems, just mitigated by ever heavier regularisation and bigger human in the loop funnels.
If our alignment patch still needs a small army of labelers and a KL muzzle to keep the model from self lobotomising calling it "solved" feels optimistic.
Scale, fancy sampling tricks, and patched up RL got us to the leafy top so chatbots that can code and debate decently. But the same reports above show the branches bending under compute cost, data saturation, and alignment tax. Until we swap out the propulsion system so new architectures, richer memory, or learning paradigms that add information instead of reweighting it we’re in danger of planting a flag on a treetop and mistaking it for Mare Tranquillitatis.
Happy to climb higher together friend but I’m still packing a parachute, not a space suit.
I mostly agree with this. The goal with AI companies is not to reach 99% or 100% human-level, it's >100% (do tasks better than an average human could, or eventually an expert).
But since you can't really do that with wedding planning or whatnot, the 100% ceiling means the AI can only compete on speed and cost. And the cost will be... whatever Nvidia feels like charging per chip.
> Can't help but feel many are optimizing happy paths in their demos and hiding the true reality.
Even with the best intentions, this feels similar to when a developer hands off code directly to the customer without any review, or QA, etc. We all know that what a developer considers "done" often differs significantly from what the customer expects.
Seen this happen many times with current agent implementations. With RL (and provided you have enough use case data) you can get to a high accuracy on many of these shortcomings. Most problems arise from the fact that prompting is not the most reliable mechanism and is brittle. Teaching a model on specific tasks help negate those issues, and overall results in a better automation outcome without devs having to make so much effort to go from 90% to 99%. Another way to do it is parallel generation and then identifying at runtime which one seems most correct (majority voting or llm as a judge).
I agree with you on the hype part. Unfortunately, that is the reality of current silicon valley. Hype gets you noticed, and gets you users. Hype propels companies forward, so that is about to stay.
Not even well-optimized. The demos in the related sit-down chat livestream video showed an every-baseball-park-trip planner report that drew a map with seemingly random lines that missed the east coast entirely, leapt into the Gulf of Mexico, and was generally complete nonsense. This was a pre-recorded demo being live-streamed with Sam Altman in the room, and that’s what they chose to show.
>The more generic and spread an agent is (can-do-it-all) the more likely it will fail and disappoint.
To your point - the most impressive AI tool (not an LLM but bear with me) I have used to date, and I loathe giving Adobe any credit, is Adobe's Audio Enhance tool. It has brought back audio that prior to it I would throw out or, if the client was lucky, would charge thousands of dollars and spend weeks working on to repair to get it half as good as that thing spits out in minutes. Not only is it good at salvaging terrible audio, it can make mediocre zoom audio sound almost like it was recorded in a proper studio. It is truly magic to me.
Warning: don't feed it music lol it tries to make the sounds into words. That being said, you can get some wild effects when you do it!
This solves a big issue for existing CLI agents, which is session persistence for users working from their own machines.
With claude code, you usually start it from your own local terminal. Then you have access to all the code bases and other context you need and can provide that to the AI.
But when you shut your laptop, or have network availability changes the show stops.
I've solved this somewhat on MacOS using the app Amphetamine which allows the machine to go about its business with the laptop fully closed. But there are a variety of problems with this, including heat and wasted battery when put away for travel.
Another option is to just spin up a cloud instance and pull the same repos to there and run claude from there. Then connect via tmux and let loose.
But there are (perhaps easy to overcome) ux issues with getting context up to that you just don't have if it is running locally.
The sandboxing maybe offers some sense of security--again something that can be possibly be handled by executing claude with a specially permissioned user role--which someone with John's use case in the video might want.
---
I think its interesting to see OpenAI trying to crack the Agent UX, possibly for a user type (non developer) that would appreciate its capabilities just as much but not need the ability to install any python package on the fly.
I've been using OpenAI operator for some time - but more and more websites are blocking it, such as LinkedIn and Amazon. That's two key use-cases gone (applying to jobs and online shopping).
Operator is pretty low-key, but once Agent starts getting popular, more sites will block it. They'll need to allow a proxy configuration or something like that.
THIS is the main problem. I was listening the whole time for them to announce a way to run it locally or at least proxy through your local devices. Alas the Deepseek R1 distillation experience they went through (a bit like when Steve Jobs was fuming at Google for getting Android to market so quickly) made them wary of showing to many intermediate results, tricks etc. Even in the very beginning Operator v1 was unable to access many sites that blocked data-center IPs and while I went through the effort of patching in a hacky proxy-setup to be able to actually test real world performance they later locked it down even further without improving performance at all. Even when its working, its basically useless and its not working now and only getting worse. Either they make some kinda deal with eastdakota(which he is probably too savvy to agree to)or they can basically forget about doing web browsing directly from their servers.Considering, that all non web applications of "computer use" greatly benefit from local files and software (which you already have the license for!)the whole concept appears to be on the road to failure. Having their remote computer use agent perform most stuff via CLI is actually really funny when you remember that computer use advocates used to claim the whole point was NOT to rely on "outdated" pre-gui interfaces.
In typical SV style, this is just to throw it out there and let second order effects build up. At some point I expect OpenAI to simply form a partnership with LinkedIn and Amazon.
In fact, I suspect LinkedIn might even create a new tier that you'd have to use if you want to use LinkedIn via OpenAI.
If people will actually pay for stuff (food, clothing, flights, whatever) through this agent or operator, I see no reason Amazon etc would continue to block them.
The AI isn't going notice the latest and greatest hot new deals that are slathered on every page. It's just going to put the thing you asked for in the shopping-cart.
I was buying plenty of stuff through Amazon before they blocked Operator. Now I sometimes buy through other sites that allow it.
The most useful for me was: "here's a picture of a thing I need a new one of, find the best deal and order it for me. Check coupon websites to make sure any relevant discounts are applied."
To be honest, if Amazon continues to block "Agent Mode" and Walmart or another competitor allows it, I will be canceling Prime and moving to that competitor.
Right but there were so few people using operator to buy stuff that it's easier to just block ~ all data center ip addresses. If this becomes a "thing" (remains to be seen, for sure) then that becomes a significant revenue stream you're giving up on. Companies don't block bots because they're Speciesist, it's bec usually bots cost them money - if that changes, I assume they'll allow known chatgpt-agent ip addrs
Possibly in part because bots will not fall for the same tricks as humans (recommended items, as well as other things which amazon does to try and get the most money possible)
Agents respecting robots.txt is clearly going to end soon. Users will be installing browser extensions or full browsers that run the actions on their local computer with the user's own cookie jar, IP address, etc.
I hope agents.txt becomes standard and websites actually start to build agent-specific interfaces (or just have API docs in their agent.txt). In my mind it's different from "robots" which is meant to apply rules to broad web-scraping tools.
I hope they don't build agent-specific interfaces. I want my agent to have the same interface I do. And even more importantly, I want to have the same interface my agent does. It would be a bad future if the capabilities of human and agent interfaces drift apart and certain things are only possible to do in the agent interface.
I wonder how many people will think they are being clever by using the Playwright MCP or browser extensions to bypass robots.txt on the sites blocking the direct use of ChatGPT Agent and will end up with their primary Google/LinkedIn/whatever accounts blocked for robotic activity.
I don't know how others are using it, but when I ask Claude to use playwright, it's for ad-hoc tasks which look nothing like old school scraping, and I don't see why it should bother anyone.
We have a similar tool that can get around any of this, we built a custom desktop that runs on residential proxies. You can also train the agents to get better at computer tasks https://www.agenttutor.com/
Finding, comparing, and ordering products -- I'd ask it to find 5 options on Amazon and create a structured table comparing key features I care about along with price. Then ask it to order one of them.
Why would they want an LLM to slurp their web site to help some analyst create a report about the cost of widgets? If they value the data they can pay for it. If not, they don't need to slurp it, right? This goes for training data too.
> Mid 2025: Stumbling Agents
The world sees its first glimpse of AI agents.
Advertisements for computer-using agents emphasize the term “personal assistant”: you can prompt them with tasks like “order me a burrito on DoorDash” or “open my budget spreadsheet and sum this month’s expenses.” They will check in with you as needed: for example, to ask you to confirm purchases. Though more advanced than previous iterations like Operator, they struggle to get widespread usage.
Especially when the author personally knows the engineers working on the features, and routinely goes to parties with them. And when you consider that Altman said last year that “2025 will be the agentic year”
The big crux of AI 2027 is the claims about exponential technological improvement. "Agents" are mostly a new frontend to the same technology openai has been selling for a while. Let's see if we're on track at the start of 2026
It was common knowledge that big corps were working on agent-type products when that report was written. Hardly much of a prediction, let alone any sort of technical revolution.
And I'm still waiting for the simple feature – the ability to edit documents in projects.
I use projects for working on different documents - articles, research, scripts, etc. And would absolutely love to write it paragraph after paragraph with the help of ChatGPT for phrasing and using the project knowledge. Or using voice mode - i.e. on a walk "Hey, where did we finish that document - let's continue. Read the last two paragraphs to me... Okay, I want to elaborate on ...".
I feel like AI agents for coding are advancing at a breakneck speed, but assistance in writing is still limited to copy-pasting.
Have you tried the Canvas feature for collaborative writing? Agreed on voice mode - would be great to be able to narrate while doing busywork round the house.
>I feel like AI agents for coding are advancing at a breakneck speed, but assistance in writing is still limited to copy-pasting.
Man I was talking about this with a colleague 30min ago. Half the time i can't be bothered to open chat gpt and do the copy/paste dance. I know that sounds ridiculous but roundtripping gets old and breaks my flow. Working in NLE's with plug-in's, VTT's, etc. has spoiled me.
It's crazy. Aider has been able to do this forever using free models but none of these companies will even let you pay for it in a phone/web app. I almost feel like I should start building my own service but I know any day now they'd offer it and I'd have wasted all that effort.
Whilst we have seen other implementations of this (providing a VPS to an LLM), this does have a distinct edge others in the way it presents itself. The UI shown, with the text overlay, readable mouse and tailored UI components looks very visually appealing and lends itself well to keeping users informed on what is happening and why at every stage. I have to tip my head to OpenAIs UI team here, this is a really great implementation and I always get rather fascinated whenever I see LLMs being implemented in a visually informative and distinctive manner that goes beyond established metaphors.
Comparing it to the Claude+XFCE solutions we have seen by some providers, I see little in the way of a functional edge OpenAI has at the moment, but the presentation is so well thought out that I can see this being more pleasant to use purely due to that. Many times with the mentioned implementations, I struggled with readability. Not afraid to admit that I may borrow some of their ideas for a personal project.
It's very hard for me to imagine the current level of agents serving a useful purpose in my personal life. If I ask this to plan a date night with my wife this weekend, it needs to consult my calendar to pick the best night, pick a bar and restaurant we like (how would it know?), book a babysitter (can it learn who we use and text them on my behalf?), etc. This is a lot of stuff it has to get right, and it requires a lot of trust!
I'm excited that this capability is getting close, but I think the current level of performance mostly makes for a good demo and isn't quite something I'm ready to adopt into daily life. Also, OpenAI faces a huge uphill battle with all the integrations required to make stuff like this useful. Apple and Microsoft are in much better spots to make a truly useful agent, if they can figure out the tech.
Maybe this is the "bitter lesson of agentic decisions": hard things in your life are hard because they involve deeply personal values and complex interpersonal dynamics, not because they are difficult in an operational sense. Calling a restaurant to make a reservation is trivial. Deciding what restaurant to take your wife to for your wedding anniversary is the hard part (Does ChatGPT know that your first date was at a burger-and-shake place? Does it know your wife got food poisoning the last time she ate sushi?). Even a highly paid human concierge couldn't do it for you. The Navier–Stokes smoothness problem will be solved before "plan a birthday party for my daughter."
Well, people do have personal assistants and concierges, so it can be done? but I think they need a lot of time and personal attention from you to get that useful right. they need to remember everything you've mentioned offhand or take little corrections consistently.
It seems to me like you have to reset the context window on LLMs way more often than would be practical for that
I think it's doable with the current context window we have, the issue is the LLM needs to listen passively to a lot of things in our lives, and we have to trust the providers with such an insane amount of data.
I think Google will excel at this because their ad targeting does this already, they just need to adapt to an llm can use that data as well.
I would even argue the hard parts of being human don't even need to be automated. Why are we all in a rush to automate everything, including what makes us human?
> hard things in your life are hard because they involve deeply personal values and complex interpersonal dynamics, not because they are difficult in an operational sense
I think what's interesting here is that it's a super cheap version of what many busy people already do -- hire a person to help do this. Why? Because the interface is easier and often less disruptive to our life. Instead of hopping from website to website, I'm just responding to a targeted imessage question from my human assistant "I think you should go with this <sitter,restaurant>, that work?" The next time I need to plan a date night, my assistant already knows what I like.
Replying "yes, book it" is way easier than clicking through a ton of UIs on disparate websites.
My opinion is that agents looking to "one-shot" tasks is the wrong UX. It's the async, single simple interface that is way easier to integrate into your life that's attractive IMO.
Yes! I’ve been thinking along similar lines: agents and LLMs are exposing the worst parts of the ergonomics of our current interfaces and tools (eg programming languages, frameworks).
I reckon there’s a lot to be said for fixing or tweaking the underlying UX of things, as opposed to brute forcing things with an expensive LLM.
> It's very hard for me to imagine the current level of agents serving a useful purpose in my personal life. If I ask this to plan a date night with my wife this weekend, it needs to consult my calendar to pick the best night, pick a bar and restaurant we like (how would it know?), book a babysitter (can it learn who we use and text them on my behalf?), etc. This is a lot of stuff it has to get right, and it requires a lot of trust!
This would be my ideal "vision" for agents, for personal use, and why I'm so disappointed in Apple's AI flop because this is basically what they promised at last year's WWDC. I even tried out a Pixel 9 pro for a while with Gemini and Google was no further ahead on this level of integration either.
But like you said, trust is definitely going to be a barrier to this level of agent behavior. LLMs still get too much wrong, and are too confident in their wrong answers. They are so frequently wrong to the point where even if it could, I wouldn't want it to take all of those actions autonomously out of fear for what it might actually say when it messages people, who it might add to the calendar invites, etc.
Agents are nothing more than the core chat model with a system prompt, and wrapper that parses responses and executes actions and puts the result into the prompt, and a system instruction that lets the model know what it can do.
Nothing is really that advanced yet with agents themselves - no real reasoning going on.
That being said, you can build your own agents fairly straightforward. The key is designing the wrapper and the system instructions. For example, you can have a guided chat on where it builds of the functionality of looking at your calendar, google location history, babysitter booking, and integrate all of that into automatic actions.
It has to earn that trust and that takes time. But there are a lot of personal use cases like yours that I can imagine.
For example, I suddenly need to reserve a dinner for 8 tomorrow night. That's a pain for me to do, but if I could give it some basic parameters, I'm good with an agent doing this. Let them make the maybe 10-15 calls or queries needed to find a restaurant that fits my constraints and get a reservation.
I see restaurant reservations as an example of an AI agent-appropriate task fairly often, but I feel like it's something that's neither difficult (two or three clicks on OpenTable and I see dozens of options I can book in one more click), nor especially compelling to outsource (if I'm booking something for a group, choosing the place is kind of personal and social—I'm taking everything I know about everybody in the group into account, and I'd likely spend more time downloading that nuance to the agent than I would just scrolling past a few places I know wouldn't work).
Similar to what was shown in the video when I make a large purchase like a home or car I usually obsess for a couple of years and make a huge spreadsheet to evaluate my decisions. Having an agent get all the spreadsheet data would be a big win. I had some success recently trying that with manus.
>it needs to consult my calendar to pick the best night, pick a bar and restaurant we like (how would it know?), book a babysitter (can it learn who we use and text them on my behalf?), etc
This (and not model quality) is why I’m betting on Google.
I am not sure I see most of this as a problem. For an agent you would want to write some longer instructions than just "book me an aniversery dinner with my wife".
You would want to write a couple paragraphs outlining what you were hoping to get (maybe the waterfront view was the important thing? Maybe the specific place?)
As for booking a babysitter - if you don't already have a specific person in mind (I don't have kids), then that is likely a separate search. If you do, then their availability is a limiting factor, in just the same way your calendar was and no one, not you, not an agent, not a secretary, can confirm the restaurant unless/until you hear back from them.
As an inspiration for the query, here is one I used with Chat GPT earlier:
>I live in <redacted>. I need a place to get a good quality haircut close to where I live. Its important that the place has opening hours outside my 8:00 to 16:00 mon-fri job and good reviews.
>
>I am not sensitive to the price. Go online and find places near my home. Find recent reviews and list the places, their names, a summary of the reviews and their opening hours.
>
>Thank you
The sane way to do this (if you wanted to) would be to give the AI a debit card with a small balance to work with. If funds get stolen, you know exactly what the maximum damage is. And if you can't afford that damage, then you wouldn't have been able to afford that card to begin with.
But since people can cancel transactions with a credit card, that's what people are going to do, and it will be a huge mess every time.
It's not like a credit card is all that different from a debit card in terms of cancellations. If this becomes a big enough problem, I would imagine that card issuers will simply stop accepting "my agent did it" as an excuse in chargeback requests.
One the one hand this is super cool and maybe very beneficial, something I definitely want to try out.
On the other, LLMs always make mistakes, and when it's this deeply integrated into other system I wonder how severe these mistakes will be, since they are bound to happen.
Recently I uploaded screenshot of movie show timing at a specific theatre and asked ChatGPT to find the optimal time for me to watch the movie based on my schedule.
It did confidently find the perfect time and even accounted for the factors such as movies in theatre start 20 mins late due to trailers and ads being shown before movie starts. The only problem: it grabbed the times from the screenshot totally incorrectly which messed up all its output and I tried and tried to get it to extract the time accurately but it didn’t and ultimately after getting frustrated I lost the trust in its ability. This keeps happening again and again with LLMs.
And this is actually a great use of Agents because they can go and use the movie theater's website to more reliably figure out when movies start. I don't think they're going to feed screenshots in to the LLM.
Honestly might be more indicative of how far behind vision is than anything.
Despite the fact that CV was the first real deep learning breakthrough VLMs have been really disappointing. I'm guessing it's in part due to basic interleaved web text+image next token prediction being a weak signal to develop good image reasoning.
Is there anyone trying to solve OCR, I often think of that annas-archive blog about how we basically just have to keep shadow libraries alive long enough until the conversion from pdf to plaintext is solved.
I hope one of these days one of these incredibly rich LLM companies accidentally solves this or something, would be infinitely more beneficial to mankind than the awful LLM products they are trying to make
I was searching on HuggingFace for the model which can fit on my system RAM + VRAM.
And the way HuggingFace shows the models - bunch of files, showing size for each file, but doesn't show the total.
I copy-pasted that page to LLM and asked to count the total. Some of LLMs counted correctly, and some - confidently gave me totally wrong number.
Im currently working on a way to basically make LLM spit out any data processing answer as code which is then automatically executed, and verified, with additional context. So things like hallucinations are reduced pretty much to zero, given that the wrapper will say that the model could not determine a real answer.
It's smart that they're pivoting to using the user's computer directly - managing passwords, access control and not getting blocked was the biggest issue with their operator release. Especially as the web becomes more and more locked down.
> ChatGPT agent's output is comparable to or better than that of humans in roughly half the cases across a range of task completion times, while significantly outperforming o3 and o4-mini.
Hard to know how this will perform in real life, but this could very well be a feel the AGI moment for the broader population.
We couldve easily build all these features a year ago, tools are nothing new. Its just barely useful.
Most applications now are more intuitive than our brain can think fast. I think telling an AI to find me a good flight is more work than to type in sk autocomplete for skyscanner having autocomplete for departure and for arrival allowing me to one way or return, having filters its all actually easier than to properly define the task. And we can start executing right away. Agent starts after texting so it will increase more latency. Often modern applications have problems solved that we didn’t even think about before.
Agent to me is another bullshit launch by OPENAI. They have to do something I understand but their releases are really grim to me.
Bad model, no real estate (browser, social media, OS).
For me the most interesting example on this page is the sticker gif halfway down the page.
Up until now, chatbots haven't really affected the real world for me†. This feels like one of the first moments where LLMs will start affecting the physical world. I type a prompt and something shows up at my doorstep. I wonder how much of the world economy will be driven by LLM-based orders in the next 10 years.
† yes I'm aware self driving cars and other ML related things are everywhere around us and that much of the architecture is shared, but I don't perceive these as LLMs.
It went viral more than a year ago, so maybe you've seen it. On the Ritual Industries instagram, Brian (the guy behind RI) posted a video where he gives voice instruction to his phone assistant, which put the text through chatgpt, which generated openscad code, which was fed to his bambu 3d printer, which successfully printed the object. Voice to Stuff.
I don't have ig anymore so I can't post the link, but it's easy to find if you do.
I just want to know what the insurance looks like behind this, lol. An agent mistakenly places an order for 500k instead of 500 stickers at some premium pricing tier above intended one. Sorry, read the fine print, and you're using at your own risk?
I haven't looked at OpenAI's ToS but try and track down a phrase called "indemnity clause". It's in some of Google's GCP ToS. TLDR it means "we (Google) will pay for ur lawsuit if something you do using our APIs get you sued"
>OpenAI’s indemnification obligations to API customers under the Agreement include any third party claim that Customer’s use or distribution of Output infringes a third party’s intellectual property right. This indemnity does not apply where: (i) Customer or Customer’s End Users knew or should have known the Output was infringing or likely to infringe, (ii) Customer or Customer’s End Users disabled, ignored, or did not use any relevant citation, filtering or safety features or restrictions provided by OpenAI, (iii) Output was modified, transformed, or used in combination with products or services not provided by or on behalf of OpenAI, (iv) Customer or its End Users did not have the right to use the Input or fine-tuning files to generate the allegedly infringing Output, (v) the claim alleges violation of trademark or related rights based on Customer’s or its End Users’ use of Output in trade or commerce, and (vi) the allegedly infringing Output is from content from a Third Party Offering.
I wonder if this can ever be as extensible/flexible as the local agent systems like Claude Code. Like can I send up my own tools (without some heavyweight "publish extension" thing)? Does it integrate with MCP?
I said apple does not do that. Apple invented the smartphone before samsung or anyone.
There is no such thing as "slow" in business. If you re slow you go out of business, you re no longer a business.
There is only one AI race. There is no second round. If you stay out of the race, you will be forever indebted to the AI winner, in the same way that we are entirely dependent on US internet technology currently (and this very forum)
*glances at AI, VR, mini phones, smart cars, multi-wireless charging, home automation, voice assistants, streaming services, set-top boxes, digital backup software, broadband routers, server hardware, server software and 12" laptops in rapid succession*
Correct me, but I don't think such alignment between Switzerland and the rest of the EEA on LLM/"AI" technology does currently exist (though there may and likely will be some in the future) and it cannot explain the inevitable EEA wide release that is going to follow in a few weeks, as always. The "EU/EEA/European regulations prevent company from offering software product here" shouts have always been loud, no matter how often we see it turn out to have been merely a delayed launch with no regulatory reasoning.
If this had been specific to countries that have adopted the "AI Act", I'd be more than willing to accept that this delay could be due them needing to ensure full compliance, but just like in the past when OpenAI delayed a launch across EU member states and the UK, this is unlikely. My personal, though 100% unsourced thesis, remains, that this staggered rollout is rooted in them wanting to manage the compute capacity they have. Taking both the Americas and all of Europe on at once may not be ideal.
The European livestyle isn't god given and has to be paid for. It's a luxury and I'm still puzzled that people don't get that we can't afford it without an economy.
We'll only be able to afford our lifestyles by letting OpenAI's bots make spreadsheets that aren't accurate or useful outside of tricking people into thinking you did your job?
Europe runs 3% deficits and gets universal healthcare, tuition free universities, 25+ days paid vacation, working trains, and no GoFundMe for surgeries.
The U.S. runs 6–8% deficits and gets vibes, weapons, and insulin at $300 a vial.
Who's on the unsustainable path and really overspending?
If the average interest rate on U.S. government debt rises to 14%, then 100% of all federal tax revenue (around $4.8 trillion/year) will be consumed just to pay interest on the $34 trillion national debt. As soon as the current Fed Chairman gets fired, practically a certainty by now, nobody will buy US bonds for less than 10 to 15% interest.
If predictions of AI optimists come true, it's going to be an economic nuclear bomb. If not, economic effects of AI will not necessarily be that important
Well, when all the US is going to be turbo-fascist and controlled by facial recognition and AI reading all your email and text messages to know what you're thinking of the Great Leader Trump, we'll be happy to have those regulations in Europe
It's not the Manhattan Project. I'm flagging your comment because it is insubstantial flamebait. We don't even know how valuable this tech is, you're jumping to conclusions.
(I am American, convince me my digression is wrong)
This feels a bit underwhelming to me - Perplexity Comet feels more immediately compelling as new paradigm of a natural way of using LLMs within a browser. But perhaps I'm being short-sighted
It's great to see at least one company creating real AI agents. The last six months have been agonising, reading article after article about people and companies claiming they've built and deployed AI agents, when in reality, they were just using OpenAI's API with a cron job or an event-driven system to orchestrate their GenAI scripts.
I opened up the app bundle of CC on macOS and CC is incredibly simple at its core! There’s about 14 tools (read, write, grep, bash, etc). The power is in the combination of the model, the tools and the system prompt/tool description prompts. It’s kind of mind blowing how well my cobbled together home brew version actually works. It doesn’t have the fancy CLI GUI but it is more or less performant as CC when running it through the Sonnet API.
Works less well on other models. I think Anthropic really nailed the combination of tool calling and general coding ability (or other abilities in your case). I’ve been adding some extra tools to my version for specific use cases and it’s pretty shocking how well it performs!
> It’s kind of mind blowing how well my cobbled together home brew version actually works. It doesn’t have the fancy CLI GUI but it is more or less performant as CC when running it through the Sonnet API.
I've been thinking of rolling up my own too. but i don't want to use sonnet api since that is pay per use. I currently use cc with a pro plan that puts me in timeout after a quota is met and resets the quota in 4 hrs. that gives me a lot of peace of mind and is much cheaper.
I think there will come a time when models will be good enough and SMALL enough to be localized that there will be some type of disintermediation from the big 3-4 models we have today.
Meanwhile, Siri can barely turn off my lights before bed.
Today I made like a 100 of merge request reviews, manually inspecting all the diffs, and approving those I evaluated as valid needed contributions. I wonder if agents can help with similar workflows. It requires deep kind of knowledge of project's goals, ability to respect all the constraints and planning. But I'm certain it's doable.
It’s like having a junior executive assistant that you know will always make mistakes, so you can’t trust their exact output and agenda. Seems unreliable .
It seems to me that the 2-20% of use cases where ChatGPT Agent isn't able to perform it might make sense to have a plug-in run that can either guide the agent through the complex workflow or perform a deterministic action (e.g. API call).
While they did talk about partial-mitigations to counter prompt-injection, highlighting the risks of cc numbers and other private information leaking, they did not address whether they would be handing all of that data over under the court-order to the NYT.
> These unified agentic capabilities significantly enhance ChatGPT’s usefulness in both everyday and professional contexts. At work, you can automate repetitive tasks, like converting screenshots or dashboards into presentations composed of editable vector elements, rearranging meetings, planning and booking offsites, and updating spreadsheets with new financial data while retaining the same formatting. In your personal life, you can use it to effortlessly plan and book travel itineraries, design and book entire dinner parties, or find specialists and schedule appointments.
None of this interests me but this tells me where it's going capability wise and it's really scary and really exciting at the same time.
That's because there are dozens of slightly (or significantly) different definitions floating around and everyone who uses the term likes to pretend that their definition is the only one out there and should be obvious to everyone else.
I collect agent definitions. I think the two most important at the moment are Anthropic's and OpenAI's.
An workflow is a collection of steps defined by someone, where the steps can be performed by an LLM call. (i.e. propose a topic -> search -> summarise each link -> gather the summaries -> produce a report)
The "agency" in this example is on the coder that came up with the workflow. It's murky because we used to call these "agents" in the previous gen frameworks.
An agent is a collection of steps defined by the LLM itself, where the steps can be performed by LLM calls (i.e. research topic x for me -> first I need to search (this is the LLM deciding the steps) -> then I need to xxx -> here's the report)
The difference is that sometimes you'll get a report resulting from search, or sometimes the LLM can hallucinate the whole thing without a single "tool call". It's more open ended, but also more chaotic from a programming perspective.
The gist is that the "agency" is now with the LLM driving the "main thread". It decides (based on training data, etc) what tools to use, what steps to take in order to "solve" the prompt it receives.
I think it's interesting that the industry decided that this is the milestone to which the term "agentic" should be attached to, because it requires this kind of explanation even for tech-minded people.
I think for the average consumer, AI will be "agentic" once it can appreciably minimize the amount of interaction needed to negotiate with the real world in areas where the provider of the desired services intentionally require negotiation - getting a refund, cancelling your newspaper subscription, scheduling the cable guy visit, fighting your parking ticket, securing a job interview. That's what an agent does.
It's just a ~~reduce~~ loop, with an API call to an LLM in the middle, and a data-structure to save the conversation messages and append them in next iterations of the loop. If you wanna get fancy, you can add other API calls, or access to your filesystem. Nothing to go crazy about...
Technically it's `scan`, not `reduce`, since every intermediate output is there too. But it's also kind of a trampoline (tail-call re-write for languages that don't support true tail calls), or it will be soon, since these things loose the plot and need to start over.
Giving an LLM access to the command line so it can bash and curl and and python and puppeteer and rm -rf / and send an email to the FBI and whatever it thinks you want it to do.
While it's common that coding agents have a way to execute commands and drive a web browser (usually via MCP) that's not what make it an agent. Agentic workflow just means that LLM has some tools it can ask agent to run, in return this allows LLM/agent to figure out multiple steps to complete a task.
Time to start the clock on a new class of prompt injection attacks on "AI agents" getting hacked or scammed during the road to an increase in 10% global unemployment by 2030 or 2035.
The "spreadsheet" example video is kind of funny: guy talks about how it normally takes him 4 to 8 hours to put together complicated, data-heavy reports. Now he fires off an agent request, goes to walk his dog, and comes back to a downloadable spreadsheet of dense data, which he pulls up and says "I think it got 98% of the information correct... I just needed to copy / paste a few things. If it can do 90 - 95% of the time consuming work, that will save you a ton of time"
It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases. I mean, this is nothing new with LLMs, but as these use cases encourage users to input more complex tasks, that are more integrated with our personal data (and at times money, as hinted at by all the "do task X and buy me Y" examples), "almost right" seems like it has the potential to cause a lot of headaches. Especially when the 2% error is subtle and buried in step 3 of 46 of some complex agentic flow.
> how it normally takes him 4 to 8 hours to put together complicated, data-heavy reports. Now he fires off an agent request, goes to walk his dog, and comes back to a downloadable spreadsheet of dense data, which he pulls up and says "I think it got 98% of the information correct...
This is where the AI hype bites people.
A great use of AI in this situation would be to automate the collection and checking of data. Search all of the data sources and aggregate links to them in an easy place. Use AI to search the data sources again and compare against the spreadsheet, flagging any numbers that appear to disagree.
Yet the AI hype train takes this all the way to the extreme conclusion of having AI do all the work for them. The quip about 98% correct should be a red flag for anyone familiar with spreadsheets, because it’s rarely simple to identify which 2% is actually correct or incorrect without reviewing everything.
This same problem extends to code. People who use AI as a force multiplier to do the thing for them and review each step as they go, while also disengaging and working manually when it’s more appropriate have much better results. The people who YOLO it with prompting cycles until the code passes tests and then submit a PR are causing problems almost as fast as they’re developing new features in non-trivial codebases.
From John Dewey's Human Nature and Conduct:
“The fallacy in these versions of the same idea is perhaps the most pervasive of all fallacies in philosophy. So common is it that one questions whether it might not be called the philosophical fallacy. It consists in the supposition that whatever is found true under certain conditions may forthwith be asserted universally or without limits and conditions. Because a thirsty man gets satisfaction in drinking water, bliss consists in being drowned. Because the success of any particular struggle is measured by reaching a point of frictionless action, therefore there is such a thing as an all-inclusive end of effortless smooth activity endlessly maintained.
It is forgotten that success is success of a specific effort, and satisfaction the fulfillment of a specific demand, so that success and satisfaction become meaningless when severed from the wants and struggles whose consummations they arc, or when taken universally.”
The proper use of these systems is to treat them like an intern or new grad hire. You can give them the work that none of the mid-tier or senior people want to do, thereby speeding up the team. But you will have to review their work thoroughly because there is a good chance they have no idea what they are actually doing. If you give them mission-critical work that demands accuracy or just let them have free rein without keeping an eye on them, there is a good chance you are going to regret it.
I’ve never experienced an intern who was remotely as mediocre and incapable of growth as an LLM.
What about a coach's ability for improving instruction?
The point of coaching a Junior is so they improve their skills for next time
What would be the point of coaching an LLM? You will just have to coach it again and again
What about it?
Isn't the point of an intern or new grad that you are training them to be useful in the future, acknowledging that for now they are a net drain on resources.
An overly eager intern with short term memory loss, sure.
And working with interns requires more work for final output compared do-it-yourself
For this example - Let’s replace the word “intern” with “initial-stage-experts” or something.
There’s a reason people invest their time with interns.
Yeah, people complaining about accuracy of AI-generated code should be examining their code review procedures. It shouldn’t matter if the code was generated by a senior employee, an intern, or an LLM wielded by either of them. If your review process isn’t catching mistakes, then the review process needs to be fixed.
This is especially true in open source where contributions aren’t limited to employees who passed a hiring screen.
This is taking what I said further than intended. I'm not saying the standard review process should catch the AI generated mistakes. I'm saying this work is at the level of someone who can and will make plenty of stupid mistakes. It therefore needs to be thoroughly reviewed by the person using before it is even up to the standard of a typical employee's work that the normal review process generally assumes.
Yep, in the case of open source contributions as an example, the bottleneck isn't contributors producing and proposing patches, it's a maintainer deciding if the proposal has merit, whipping (or asking contributors to whip) patches into shape, making sure it integrates, etc. If contributors use generative AI to increase the load on the bottleneck it is likely to cause a negative net effect.
This very much. Most of the time, it's not a code issue, it's a communication issue. Patches are generally small, it's the whole communication around it until both parties have a common understanding that takes so much time. If the contributor comes with no understanding of his patch, that breaks the whole premise of the conversation.
I can still complain about the added workload of inaccurate code.
If 10 times more code is being created, you need 10 times as many code reviewers..
Plus the overhead of coordinating the reviewers as well!
"Corporate says the review process needs to be relaxed because its preventing our AI agents from checking in their code"
98% sure each commit doesn’t corrupt the database, regress a customer feature, open a security vulnerability. 50 commits later … (which is like, one day for an agentic workflow)
It’s only a 64% chance of corruption after 50 such commits at a 98% success.
I would be embarrassed to be at OpenAI releasing this and pretending the last 9 months haven't happened... waxing poetically about "age of agents" - absolutely cringe and pathetic
Or as I would like to put it, LLM outputs are essentially the Library of Babel. Yes, it contains all of the correct answers, but might as well be entirely useless.
”The people who YOLO it with prompting cycles until the code passes tests and then submit a PR are causing problems almost as fast as they’re developing new features in non-trivial codebases.”
This might as well be the new definition of “script kiddie”, and it’s the kids that are literally going to be the ones birthed into this lifestyle. The “craft” of programming may not be carried by these coming generations and possibly will need to be rediscovered at some point in the future. The Lost Art of Programming is a book that’s going to need to be written soon.
Oh come on, people have been writing code with bad, incomplete, flaky, or absent tests since automated testing was invented (possibly before).
It's having a good, useful and reliable test suite that separates the sheep from the goats.*
Would you rather play whack-a-mole with regressions and Heisenbugs, or ship features?
* (Or you use some absurdly good programing language that is hard to get into knots with. I've been liking Elixir. Gleam looks even better...)
It sounds like you’re saying that good tests are enough to ensure good code even when programmers are unskilled and just rewrite until they pass the tests. I’m very skeptical.
It may not be a provable take, but it’s also not absurd. This is the concept behind modern TDD (as seen in frameworks like cucumber):
Someone with product knowledge writes the tests in a DSL
Someone skilled writes the verbs to make the DSL function correctly
And from there, any amount of skill is irrelevant: either the tests pass, or they fail. One could hook up a markov chain to a javascript sourcebook and eventually get working code out.
> One could hook up a markov chain to a javascript sourcebook and eventually get working code out.
Can they? Either the dsl is so detailed and specific as to be just code with extra steps or there is a lot of ground not covered by the test cases with landmines that a million monkeys with typewriters could unwittingly step on.
The bugs that exist while the tests pass are often the most brutal - first to find and understand and secondly when they occasionally reveal that a fundamental assumption was wrong.
Tests are just for the bugs you already know about
They're also there to prevent future bugs.
So is here to stay. If you’re unable to write good code with it. Doesn’t mean everyone is writing bad code with it.
“The quip about 98% correct should be a red flag for anyone familiar with spreadsheets”
I disagree. Receiving a spreadsheet from a junior means I need to check it. If this gives me infinite additional juniors I’m good.
It’s this popular pattern of HN comments - expect AI to behave deterministically correct - while the whole world operates on stochastically correct all the time…
In my experience the value of junior contributors is that they will one day become senior contributors. Their work as juniors tends to require so much oversight and coaching from seniors that they are a net negative on forward progress in the short term, but the payoff is huge in the long term.
I don't see how this can be true when no one stays at a single job long enough for this to play out. You would simply be training junior employees to become senior employees for someone else.
So this has been a problem in the tech market for a while now. Nobody wants to hire juniors for tech because even at FAANGs the average career trajectory is what, 2-3 years? There's no incentive for companies to spend the time, money, and productivity hit to train juniors properly. When the current cohort ages out, a serious problem is going to occur, and it won't be pretty.
Exactly this
And it should go without saying that LLMs do not have the same investment/value tradeoff. Whether or not they contribute like a senior or junior seems entirely up to luck
Prompt skill is flaky and unreliable to ensure good output from LLMs
When my life was spreadsheets, we were expected to get to the point of being 99.99% right.
You went from “do it again” to “go check the newbies work”.
To get to that stage your degree of proficiency would be “can make out which font is wrong at a glance.”
You wouldn’t be looking at the sheet, you would be running the model in your head.
That stopped being a stochastic function, with the error rate dropping significantly - to the point that making a mistake had consequences tacked on to it.
The act of trying to make that 2% appear like "minimal, dismissable" is almost a mass psychosis in the AI world at times it seems like.
A few comparisons:
>Pressing the button: $1 >Knowing which button to press: $9,999 Those 2% copy-paste changes are the $9.999 and might take as long to find as rest of the work.
Also: SCE to AUX.
I also find that validating data can be much faster than calculating data. It's like when you're in algebra class and you're told to "solve for X". Once you find the value for X you plug it into the equation to see if it fits, and it's 10x faster than solving for X originally.
Regardless of if AI generates the spreadsheet or if I generate the spreadsheet, I'm still going to do the same validation steps before I share it with anyone. I might have a 2% error rate on a first draft.
Of course, Pareto principle is at work here. In an adjacent field, self-driving, they are working on the last "20%" for almost a decade now. It feels kind of odd that almost no one is talking about self-driving now, compared to how hot of a topic it used to be, with a lot of deep, moral, almost philosophical discussions.
> The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.
— Tom Cargill, Bell Labs
https://en.wikipedia.org/wiki/Ninety%E2%80%93ninety_rule
In my experience for enterprise software engineering, in this stage we are able to shrink the coding time with ~20%, depending on the kind of code/tests.
However CICD remains tricky. In fact when AI agents start building autonomous, merge trains become a necessity…
> It feels kind of odd that almost no one is talking about self-driving now, compared to how hot of a topic it used to be
Probably because it's just here now? More people take Waymo than Lyft each day in SF.
It's "here" if you live in a handful of cities around the world, and travel within specific areas in those cities.
Getting this tech deployed globally will take another decade or two, optimistically speaking.
Given how well it seems to be going in those specific areas, it seems like it's more of a regulatory issue than a technological one.
Ah, those pesky regulations that try to prevent road accidents...
If it's not a technological limitation, why aren't we seeing self-driving cars in countries with lax regulations? Mexico, Brazil, India, etc.
Tesla launched FSD in Mexico earlier this year, but you would think companies would be jumping at the opportunity to launch in markets with less regulation.
So this is largely a technological limitation. They have less driving data to train on, and the tech doesn't handle scenarios outside of the training dataset well.
Do we even know what % of Waymo rides in SF are completely autonomous? I would not be surprised if more of them are remotely piloted than they've let on...
Can you name any of the specific regulations that robot taxi companies are lobbying to get rid of? As long as robotaxis abide by the same rules of the road as humans do, what's the problem? Regulations like you're not allowed to have robotaxis unless you pay me, your local robotaxi commissioner $3/million/year, aren't going to be popular with the populus but unfortunately for them, they don't vote, so I'm sure we'll see holdouts and if multiple companies are in multiple markets and are complaining about the local taxi cab regulatory commision, but there's just so much of the world without robotaxis right now (summer 2025) that I doubt it's anything mure than the technology being brand spanking new.
Maybe, but it's also going to be a financial issue eventually too
My city had Car2Go for a couple of years, but it's gone now. They had to pull out of the region because it wasn't making them enough money
I expect Waymo and any other sort of vehicle ridesharing thing will have the same problem in many places
But it seems the reason for that is that this is a new, immature technology. Every new technology goes through that cycle until someone figures out how to make it financially profitable.
This is a big moving of the goalposts. The optimists were saying Level 5 would be purchasable everywhere by ~2018. They aren’t purchasable today, just hail-able. And there’s a lot of remote human intervention.
And San Francisco doesn’t get snow.
Hell - SF doesn’t have motorcyclists or any vehicular traffic, driving on the wrong side of the road.
Or cows sharing the thoroughfares.
It should be obvious to all HNers that have lived or travelled to developing / global south regions - driving data is cultural data.
You may as well say that self driving will only happen in countries where the local norms and driving culture is suitable to the task.
A desperately anemic proposition compared to the science fiction ambition.
I’m quietly hoping I’m going to be proven wrong, but we’re better off building trains, than investing in level 5. It’s going to take a coordination architecture owned by a central government to overcome human behavior variance, and make full self driving a reality.
I'm in the Philippines now, and that's how I know this is the correct take. Especially this part:
"Driving data is cultural data."
The optimists underestimate a lot of things about self-driving cars.
The biggest one may be that in developing and global south regions, civil engineering, design, and planning are far, far away from being up to snuff to a level where Level 5 is even a slim possibility. Here on the island I'm on, the roads, storm water drainage (if it exists at all) and quality of the built environment in general is very poor.
Also, a lot of otherwise smart people think that the increment between Level 4 and Level 5 is the same as that between all six levels, when the jump from Level 4 to Level 5 automation is the biggest one and the hardest to successfully accomplish.
Most people live within a couple hours of a city though, and I think we'll see robot taxis in a majority of continents by 2035 though. The first couple cities and continents will take the longest, but after that it's just a money question, and rich people have a lot of money. The question then is: is the taxi cab consortium, which still holds a lot of power, despite Uber, in each city the in world, large enough to prevent Waymo from getting a hold, for every city in the world that Google has offices in.
Yeah where they have every inch of SF mapped, and then still have human interventions. We were promised no more human drivers like 5-7 years ago at this point.
Human interventions.
High speed connectivity and off vehicle processing for some tasks.
Density of locations to "idle" at.
There are a lot of things that make all these services work that means they can NOT scale.
These are all solvable but we have a compute problem that needs to be addressed before we get there, and I haven't seen any clues that there is anything in the pipeline to help out.
The typical Lyft vehicle is a piece of junk worth less than $20k, while the typical Waymo vehicle is a pretend luxury car with $$$ of equipment tacked on.
Waymo needs to be proving 5-10x the number of daily rides as Lyft before we get excited
Well, if we say these systems are here, it still took 10+ years between prototype and operational system.
And as I understand it; These are systems, not individual cars that are intelligent and just decide how to drive from immediate input, These system still require some number of human wranglers and worst-case drivers, there's a lot of specific-purpose code rather nothing-but-neural-network etc.
Which to say "AI"/neural nets are important technology that can achieve things but they can give an illusion of doing everything instantly by magic but they generally don't do that.
It’s past the hype curve and into the trough of disillusionment. Over the next 5,10,15 years (who can say?) the tech will mature out of the trough into general adoption.
GenAI is the exciting new tech currently riding the initial hype spike. This will die down into the trough of disillusionment as well, probably sometime next year. Like self-driving, people will continue to innovate in the space and the tech will be developed towards general adoption.
We saw the same during crypto hype, though that could be construed as more of a snake oil type event.
The Gartner hype cycle assumes a single fundamental technical breakthrough, and describes the process of the market figuring out what it is and isn't good for. This isn't straightforwardly applicable to LLMs because the question of what they're good for is a moving target; the foundation models are actually getting more capable every few months, which wasn't true of cryptocurrency or self-driving cars. At least some people who overestimate what current LLMs can do won't have the chance to find out that they're wrong, because by the time they would have reached the trough of disillusionment, LLM capabilities will have caught up to their expectations.
If and when LLM scaling stalls out, then you'd expect a Gartner hype cycle to occur from there (because people won't realize right away that there won't be further capability gains), but that hasn't happened yet (or if it has, it's too recent to be visible yet) and I see no reason to be confident that it will happen at any particular time in the medium term.
If scaling doesn't stall out soon, then I honestly have no idea what to expect the visibility curve to look like. Is there any historical precedent for a technology's scope of potential applications expanding this much this fast?
> If scaling doesn't stall out soon, then I honestly have no idea what to expect the visibility curve to look like. Is there any historical precedent for a technology's scope of potential applications expanding this much this fast?
Lots of pre-internet technologies went through this curve. PCs during the clock speed race, aircraft before that during the aeronautics surge of the 50s, cars when Detroit was in its heydays. In fact, cloud computing was enabled by the breakthroughs in PCs which allowed commodity computing to be architected in a way to compete with mainframes and servers of the era. Even the original industrial revolution was actually a 200-year ish period where mechanization became better and better understood.
Personally I've always been a bit confused about the Gartner Hype Cycle and its usage by pundits in online comments. As you say it applies to point changes in technology but many technological revolutions have created academic, social, and economic conditions that lead to a flywheel of innovation up until some point on an envisioned sigmoid curve where the innovation flattens out. I've never understood how the hype cycle fits into that and why it's invoked so much in online discussions. I wonder if folks who have business school exposure can answer this question better.
> If scaling doesn't stall out soon, then I honestly have no idea what to expect the visibility curve to look like.
We are seeing diminishing returns on scaling already. LLMs released this year have been marginal improvements over their predecessors. Graphs on benchmarks[1] are hitting an asymptote.
The improvements we are seeing are related to engineering and value added services. This is why "agents" are the latest buzzword most marketing is clinging on. This is expected, and good, in a sense. The tech is starting to deliver actual value as it's maturing.
I reckon AI companies can still squeeze out a few years of good engineering around the current generation of tools. The question is what happens if there are no ML breakthroughs in that time. The industry desperately needs them for the promise of ASI, AI 2027, and the rest of the hyped predictions to become reality. Otherwise it will be a rough time when the bubble actually bursts.
[1]: https://llm-stats.com/
The problem with LLMs and all other modern statistical large-data-driven solutions’ approach is that it tries to collapse the entire problem space of general problem solving to combinatorial search of the permutations of previously solved problems. Yes, this approach works well for many problems as we can see with the results with huge amount of data and processing utilized.
One implicit assumption is that all problems can be solved with some permutations of existing solutions. The other assumption is the approach can find those permutations and can do so efficiently.
Essentially, the true-believers want you to think that rearranging some bits in their cloud will find all the answers to the universe. I am sure Socrates would not find that a good place to stop the investigation.
Right. I do think that just the capability to find and generate interesting patterns from existing data can be very valuable. It has many applications in many fields, and can genuinely be transformative for society.
But, yeah, the question is whether that approach can be defined as intelligence, and whether it can be applicable to all problems and tasks. I'm highly skeptical of this, but it will be interesting to see how it plays out.
I'm more concerned about the problems and dangers of this tech today, than whatever some entrepreneurs are promising for the future.
> We are seeing diminishing returns on scaling already. LLMs released this year have been marginal improvements over their predecessors. Graphs on benchmarks[1] are hitting an asymptote.
This isnt just a software problem. IF you go look at the hardware side you see that same flat line (IPC is flat generation over generation). There are also power and heat problems that are going to require some rather exotic and creative solutions if companies are looking to hardware for gains.
The Gartner hype cycle is complete nonsense, it's just a completely fabricated way to view the world that helps sell Gartner's research products. It may, at times, make "intuitive sense", but so does astrology.
The hype cycle has no mathematical basis whatsoever. It's marketing gimmick. It's only value in my life has been to quickly identify people that don't really understand models or larger trends in technology.
I continue to be, but on introspection probably shouldn't be, surprised that people on HN treat is as some kind of gospel. The only people who should respected are other people in the research marketing space as the perfect example of how to dupe people into paying for your "insights".
Could you please expand on your point about expanding scopes? I am waiting earnestly for all the cheaper services that these expansions promise. You know cheaper white-collar-services like accounting, tax, and healthcare etc. The last reports saw accelerating service inflation. Someone is lying. Please tell me who.
Hence why I said potential applications. Each new generation of models is capable, according to evaluations, of doing things that previous models couldn't that prima facie have potential commercial applications (e.g., because they are similar to things that humans get paid to do today). Not all of them will necessarily work out commercially at that capability level; that's what the Gartner hype cycle is about. But because LLM capabilities are a moving target, it's hard to tell the difference between things that aren't commercialized yet because the foundation models can't handle all the requirements, vs. because commercializing things takes time (and the most knowledgeable AI researchers aren't working on it because they're too busy training the next generation of foundation models).
It sounds like people should just ignore those pesky ROI questions. In the long run, we are all dead so let’s just invest now and worry about the actual low level details of delivering on the economy-wide efficiency later.
As capital allocators, we can just keep threatening the worker class with replacing their jobs with LLMs to keep the wages low and have some fun playing monopoly in the meantime. Also, we get to hire these super smart AI researchers people (aka the smartest and most valuable minds in the world) and hold the greatest trophies. We win. End of story.
It's saving healthcare costs for those who solved their problem and never go in which would not be reflected in service inflation costs.
Back in my youthful days, educated and informed people chastised using the internet to self-diagnose and self-treat. I completely missed the memo on when it became a good idea to do so with LLMs.
Which model should I ask about this vague pain I have been having in my left hip? Will my insurance cover the model service subscription? Also, my inner thigh skin looks a bit bruised. Not sure what’s going on? Does the chat interface allow me to upload a picture of it? It won’t train on my photos right?
> or if it has, it's too recent to be visible yet
It's very visible.
Silicon Valley, and VC money has a proven formula. Bet on founders and their ideas, deliver them and get rich. Everyone knows the game, we all get it.
Thats how things were going till recently. Then FB came in and threw money at people and they all jumped ship. Google did the same. These are two companies famous for throwing money at things (Oculus, metaverse, G+, quantum computing) and right and proper face planting with them.
Do you really think that any of these people believe deep down that they are going to have some big breath through? Or do you think they all see the writing on the wall and are taking the payday where they can get it?
Liquidity in search of the biggest holes in the ground. Whoever can dig the biggest holes wins. Why or what you get out of digging the holes? Who cares.
The critics of the current AI buzz certainly have been drawing comparisons to self driving cars as LLMs inch along with their logarithmic curve of improvement that's been clear since the GPT-2 days.
Whenever someone tells me how these models are going to make white collar professions obsolete in five years, I remind them that the people making these predictions 1) said we'd have self driving cars "in a few years" back in 2015 and 2) the predictions about white collar professions started in 2022 so five years from when?
> said we'd have self driving cars "in a few years" back in 2015
And they wouldn't have been too far off! Waymo became L4 self-driving in 2021, and has been transporting people in the SF Bay Area without human supervision ever since. There are still barriers — cost, policies, trust — but the technology certainly is here.
People were saying we would all be getting in our cars and taking a nap on our morning commute. We are clearly still a pretty long ways off from self-driving being as ubiquitous as it was claimed it would be.
There are always extremists with absurd timelines on any topic! (Didn't people think we'd be on Mars in 2020?) But this one? In the right cities, plenty of people take a Waymo morning commute every day. I'd say self-driving cars have been pretty successful at meeting people's expectations — or maybe you and I are thinking of different people.
Reminds me of electricity entering the market and the first DC power stations setup in New York to power a few buildings. It would have been impossible to replicate that model for everyone. AC solved the distance issue.
That's where we are at with self driving. It can only operate in one small area, you can't own one.
We're not even close to where we are with 3d printers today or the microwave in the 50s.
I think people don't realize how much models have to extrapolate still, which causes hallucinations. We are still not great at giving all the context in our brain to LLMs.
There's still a lot of tooling to be built before it can start completely replacing anyone.
How profound. No one has ever posted that exact same thought before on here. Thank you.
Okay, but the experts saying self driving cars were 50 years out in 2015 were wrong too. Lots of people were there for those speeches, and yet, even the most cynical take on Waymo, Cruise and Zoox’s limitations would concede that the vehicles are autonomous most of the time in a technologically important way.
There’s more to this than “predictions are hard.” There are very powerful incentives to eliminate driving and bloated administrative workforces. This is why we don’t have flying cars: lack of demand. But for “not driving?” Nobody wants to drive!
This is the exact same issue that I've had trying to use LLMs for anything that needs to be precise such as multi-step data pipelines. The code it produces will look correct and produce a result that seems correct. But when you do quality checks on the end data, you'll notice that things are not adding up.
So then you have to dig into all this overly verbose code to identify the 3-4 subtle flaws with how it transformed/joined the data. And these flaws take as much time to identify and correct as just writing the whole pipeline yourself.
I'll get into hot water with this, but I still think LLMs do not think like humans do - as in the code is not a result of a trying to recreate a correct thought process in a programming language, but some sort of statistically most likely string that matches the input requirements,
I used to have a non-technical manager like this - he'd watch out for the words I (and other engineers) said and in what context, and would repeat them back mostly in accurate word contexts. He sounded remarkably like he knew what he was talking about, but would occasionally make a baffling mistake - like mixing up CDN and CSS.
LLMs are like this, I often see Cursor with Claude making the same kind of strange mistake, only to catch itself in the act, and fix the code (but what happens when it doesn't)
I think that if people say LLMs can never be made to think, that is bordering on a religious belief - it'd require humans to exceed the Turing computable (note also that saying they never can is very different from believing current architectures never will - it's entirely reasonable to believe it will take architectural advances to make it practically feasible).
But saying they aren't thinking yet or like humans is entirely uncontroversial.
Even most maximalists would agree at least with the latter, and the former largely depends on definitions.
As someone who uses Claude extensively, I think of it almost as a slightly dumb alien intelligence - it can speak like a human adult, but makes mistakes a human adult generally wouldn't, and that combinstion breaks the heuristics we use to judge competency,and often lead people to overestimate these models.
Claude writes about half of my code now, so I'm overall bullish on LLMs, but it saves me less than half of my time.
The savings improve as I learn how to better judge what it is competent at, and where it merely sounds competent and needs serious guardrails and oversight, but there's certainly a long way to go before it'd make sense to argue they think like humans.
Everyone has this impression that our internal monologue is what our brain is doing. It's not. We have all sorts of individual components that exist totally outside the realm of "token generation". E.g. the amygdala does its own thing in handling emotions/fear/survival, fires in response to anything that triggers emotion. We can modulate that with our conscious brain, but not directly - we have to basically hack the amygdala by thinking thoughts that deal with the response (don't worry about the exam, you've studied for it already)
LLMs don't have anything like that. Part of why they aren't great at some aspects of human behaviour. E.g. coding, choosing an appropriate level of abstraction - no fear of things becoming unmaintainable. Their approach is weird when doing agentic coding because they don't feel the fear of having to start over.
Emotions are important.
I don't think you'll get into hot water for that. Anthropomorphizing LLMs is an easy way to describe and think about them, but anyone serious about using LLMs for productivity is aware they don't actually think like people, and run into exactly the sort of things you're describing.
I just wrote a post on my site where the LLM had trouble with 1) clicking a button, 2) taking a screenshot, 3) repeat. The non-deterministic nature of LLMs is both a feature and a bug. That said, read/correct can sometimes be a preferable workflow to create/debug, especially if you don't know where to start with creating.
I think it's basically equivalent to giving that prompt to a low paid contractor coder and hoping their solution works out. At least the turnaround time is faster?
But normally you would want a more hands on back and forth to ensure the requirements actually capture everything, validation and etc that the results are good, layers of reviews right
It seems to be a mix between hiring an offshore/low level contractor and playing a slot machine. And by that I mean at least with the contractor you can pretty quickly understand their limitations and see a pattern in the mistakes they make. While an LLM is obviously faster, the mistakes are seemingly random so you have to examine the result much more than you would with a contractor (if you are working on something that needs to be exact).
the slot machine is apt. insert tokens, pull lever, ALMOST get a reward. Think: I can start over, manually, or pull the lever again. Maybe I'll get a prize if I pull it again...
and of course, you pay whether the slot machine gives a prize or not. Between the slot machine psychological effect and sunk cost fallacy I have a very hard time believing the anecdotes -- and my own experiences -- with paid LLMs.
Often I say, I'd be way more willing to use and trust and pay for these things if I got my money back for output that is false.
If the contractor is producing unusable code, they won't be my contractor anymore.
In my experience using small steps and a lot of automated tests work very well with CC. Don’t go for these huge prompts that have a complete feature in it.
Remember the title “attention is all you need”? Well you need to pay a lot of attention to CC during these small steps and have a solid mental model of what it is building.
Yeah but once you break things down into small enough steps you might as well just code it yourself.
"It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases."
This is the part you have wrong. People just won't do that. They'll save the 8 hours and just deal with 2% error in their work (which reduces as AI models get better). This doesn't work with something with a low error tolerance, but most people aren't building the next Golden Gate Bridge. They'll just fix any problems as they crop up.
Some of you will be screaming right now "THAT'S NOT WORTH IT", as if companies don't already do this to consumers constantly, like losing your luggage at the airport or getting your order wrong. Or just selling you something defective, all of that happens >2% of the time, because companies know customers will just deal-with-it.
I think the question then is what's the human error rate... We know we're not perfect... So if you're 100% rested and only have to find the edge case bug, maybe you'll usually find it vs you're burned out getting it 98% of the way there and fail to see the 2% of the time bugs... Wording here is tricky to explain but I think what we'll find is this helps us get that much closer... Of course when you spend your time building out 98% of the thing you have sometimes a deeper understanding of it so finding the 2% edge case is easier/faster but only time will tell
The problem with this spreadsheet task is that you don't know whether you got only 2% wrong (just rounded some numbers) or way more (e.g. did it get confused and mistook a 2023 PDF with one from 1993?), and checking things yourself is still quite tedious unless there's good support for this in the tool.
At least with humans you have things like reputation (has this person been reliable) or if you did things yourself, you have some good idea of how diligent you've been.
Would be insane to expect an ai to just match us right…nooooo if it pertains computers/automation/ai it needs to be beyond perfect.
Right? Why are we giving grace to a damn computer as if it's human? How are people defending this? If it's a computer, I don't care how intelligent it is. 98% right is actually unacceptable.
Distinguishing whether a problem is 0.02 ^ n for error or 0.98 ^ n for accuracy is emerging as an important skill.
Might explain why some people grind up a billion tokens trying to make code work only to have it get worse while others pick apart the bits of truth and quickly fill in their blind spots. The skillsets separating wheat from chaff are things like honest appreciation for corroboration, differentiating subjective from objective problems, and recognizing truth-preserving relationships. If you can find the 0.02 ^ n sub-problems, you can grind them down with AI and they will rapidly converge, leaving the 0.98 ^ n problems to focus human touch on.
I’ve worked at places that sre run on spreadsheets. You’d be amazed at how often they’re wrong IME
There is a literature on this.
The usual estimate you see is that about 2-5% of spreadsheets used for running a business contain errors.
It takes my boss seven hours to create that spreadsheet, and another eight to render a graph.
Exciting stuff
> "I think it got 98% of the information correct... I just needed to copy / paste a few things. If it can do 90 - 95% of the time consuming work, that will save you a ton of time"
"Hello, yes, I would like to pollute my entire data store" is an insane a sales pitch. Start backing up your data lakes on physical media, there is going to be an outrageous market for low-background data in the future.
semi-related: How many people are going to get killed because of this?
the bigger takeaway here is will his boss allow him to walk his dog or will he see available downtime and try to fill it with more work?
More work, without a doubt - any productivity gain immediately becomes the new normal. But now with an additional "2%" error rate compounded on all the tasks you're expected to do in parallel.
95% of people doing his job will lose them. 1 person will figure out the 2% that requires a human in the loop.
I do this kind of job and there is no way I am doing this job in 5-10 years.
I don't even think it is my company that is going to adapt to let me go but it is going to be an AI first competitor that puts the company I work for out of business completely.
There are all these massively inefficient dinosaur companies in the economy that are running digitized versions of paper shuffling and a huge number of white collar bullshit jobs built on top of digitized paper shuffling.
Wage inflation has been eating away at the bottom line on all these businesses since Covid and we are going to have a dinosaur company mass extinction event in the next recession.
IMO the category error being made is that LLMs are going to agentically do digitized paper shuffling and put digitized paper shufflers out of work. That is not the problem for my job. The issue is agentically from the ground up making the concept of digitized paper shuffling null and void. A relic of the past that can't compete in the economy.
I don't know why everyone is so confident that jobs will be lost. When we invented power tools did we fire everyone that builds stuff, or did we just build more stuff?
if you replace "power tools" with industrial automation it's easy to cherry pick extremes from either side. Manufacturing? a lot of jobs displaced, maybe not lost.
It compounds too:
At a certain point, relentlessly checking for whether the model has got everything is more effort in turn than…doing it.
Moreover, is it actually a 4-8 hour job? Or is the person not using the right tool, is the better tool a sql query?
Half these “wow ai” examples feel like “oh my plates are dirty, better just buy more”.
People say this, but in my experience it’s not true.
1) The cognitive burden is much lower when the AI can correctly do 90% of the work. Yes, the remaining 10% still takes effort, but your mind has more space for it.
2) For experts who have a clear mental model of the task requirements, it’s generally less effort to fix an almost-correct solution than to invent the entire thing from scratch. The “starting cost” in mental energy to go from a blank page/empty spreadsheet to something useful is significant. (I limit this to experts because I do think you have to have a strong mental framework you can immediately slot the AI output into, in order to be able to quickly spot errors.)
3) Even when the LLM gets it totally wrong, I’ve actually had experiences where a clearly flawed output was still a useful starting point, especially when I’m tired or busy. It nerd-snipes my brain from “I need another cup of coffee before I can even begin thinking about this” to “no you idiot, that’s not how it should be done at all, do this instead…”
>The cognitive burden is much lower when the AI can correctly do 90% of the work. Yes, the remaining 10% still takes effort, but your mind has more space for it.
I think their point is that 10%, 1%, whatever %, the type of problem is a huge headache. In something like a complicated spreadsheet it can quickly become hours of looking for needles in the haystack, a search that wouldn't be necessary if AI didn't get it almost right. In fact it's almost better if it just gets some big chunk wholesale wrong - at least you can quickly identify the issue and do that part yourself, which you would have had to in the first place anyway.
Getting something almost right, no matter how close, can often be worse than not doing it at all. Undoing/correcting mistakes can be more costly as well as labor intensive. "Measure twice cut once" and all that.
I think of how in video production (edits specifically) I can get you often 90% of the way there in about half the time it takes to get it 100%. Those last bits can be exponentially more time consuming (such as an intense color grade or audio repair). The thing is with a spreadsheet like that, you can't accept a B+ or A-. If something is broken, the whole thing is broken. It needs to work more or less 100%. Closing that gap can be a huge process.
I'll stop now as I can tell I'm running a bit in circles lol
I understand the idea. My position is that this is a largely speculative claim from people who have not spent much time seriously applying agents for spreadsheet or video editing work (since those agents didn’t even exist until now).
“Getting something almost right, no matter how close, can often be worse than not doing it at all” - true with human employees and with low quality agents, but not necessarily true with expert humans using high quality agents. The cost to throw a job at an agent and see what happens is so small that in actual practice, the experience is very different and most people don’t realize this yet.
In the context of a budget that's really funny too. If you make a 18 trillion dollar error just once, no big deal, just one error right?
By that definition, the ChatGPT app is now an AI agent. When you use ChatGPT nowadays, you can select different models and complement these models with tools like web search and image creation. It’s no longer a simple text-in / text-out interface. It looks like it is still that, but deep down, it is something new: it is agentic… https://medium.com/thoughts-on-machine-learning/building-ai-...
I think this is my favorite part of the LLM hype train: the butterfly effect of dependence on an undependable stochastic system propagates errors up the chain until the whole system is worthless.
"I think it got 98% of the information correct..." how do you know how much is correct without doing the whole thing properly yourself?
The two options are:
- Do the whole thing yourself to validate
- Skim 40% of it, 'seems right to me', accept the slop and send it off to the next sucker to plug into his agent.
I think the funny part is that humans are not exempt from similar mistakes, but a human making those mistakes again and again would get fired. Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
This depends on the type of work being done. Sometimes the cost of verification is much lower than the cost of doing the work, sometimes it's about the same, and sometimes it's much more. Here's some recent discussion [0]
[0] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...
> I think the funny part is that humans are not exempt from similar mistakes, but a human making those mistakes again and again would get fired. Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
My rule is that if you submit code/whatever and it has problems you are responsible for them no matter how you "wrote" it. Put another way "The LLM made a mistake" is not a valid excuse nor is "That's what the LLM spit out" a valid response to "why did you write this code this way?".
LLMs are tools, tools used by humans. The human kicking off an agent, or rather submitting the final work, is still on the hook for what they submit.
"a human making those mistakes again and again would get fired"
You must be really desperate for anti-AI arguments if this is the one you're going with. Employees make mistakes all day every day and they don't get fired. Companies don't give a shit as long as the cost of the mistakes is less than the cost of hiring someone new.
> Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
Well yeah, because the agent is so much cheaper and faster than a human that you can eat the cost of the mistakes and everything that comes with them and still come out way ahead. No, of course that doesn't work in aircraft manufacturing or medicine or coding or many other scenarios that get tossed around on HN, but it does work in a lot of others.
Definitely would work in coding. Most software companies can only dream of a 2% defect rate. Reality is probably closer to 98%, which is why we have so much organisational overhead around finding and fixing human error in software.
I wonder if you can establish some kind of confidence interval by passing data through a model x number of times. I guess it mostly depends on subjective/objective correctness as well as correctness within a certain context that you may not know if the model knows about or not. Either way sounds like more corporate drudgery.
> how do you know how much is correct
Because it's a budget. Verifying them is _much_ cheaper than finding all the entries in a giant PDF in the first place.
> the butterfly effect of dependence on an undependable stochastic system
We're using stochastic systems for a long time. We know just fine how to deal with them.
> Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
There are very few tasks humans complete at a 98% success rate either. If you think "build spreadsheet from PDF" comes anywhere close to that, you've never done that task. We're barely able to recognize objects in their default orientation at a 98% success rate. (And in many cases, deep networks outperform humans at object recognition)
The task of engineering has always been to manage error rates and risk, not to achieve perfection. "butterfly effect" is a cheap rhetorical distraction, not a criticism.
There are in fact lots of tasks people complete immediately at 99.99% success rate at first iteration or 99.999% after self and peer checking work
Perhaps importantly checking is a continual process and errors are identified as they are made and corrected whilst in context instead of being identified later by someone completely devoid of any context a task humans are notably bad at.
Lastly it's important to note the difference between a overarching task containing many sub tasks and the sub tasks.
Something which fails at a sub task comprising 10 sub tasks 2% of the time per task has a miserable 18% failure rate at the overarching task. By 20 it's failed at 1 in 3 attempts worse a failing human knows they don't know the answer the failing AI produces not only wrong answers but convincing lies
Failure to distinguish between human failure and AI failure in nature or degree of errors is a failure of analysis.
> There are in fact lots of tasks people complete immediately at 99.99% success rate at first iteration or 99.999% after self and peer checking work
This is so absurd that I wonder if you're telling? Humans don't even have a 99.99% success rate in breathing, let alone any cognitive tasks.
> Humans don't even have a 99.99% success rate in breathing
Will you please elaborate a little on this?
Humans cough or otherwise have to clear their airways about 1 in every 1,000 breaths, which is a 99.9% success rate.
That’s quite good given the complexity and fragility of the system and the chaotic nature of the environment.
I have a friend who's vibe-coding apps. He has a lot of them, like 15 or more, but most are only 60–90% complete (almost every feature is only 60-90% complete), which means almost nothing works properly. Last time he showed me something, it was sending the Supabase API key in the frontend with write permissions, so I could edit anything on his site just by inspecting the network tab in developer tools. The amount of technical debt and security issues building up over the coming years is going to be massive.
How well does the average employee do it? The baseline is not what you would do but what it would take to task someone to do it.
98% correct spreadsheets are going to get so many papers retracted.
2% wrong is $40,000 on a $2m budget.
Great point. Plus, working on your laptop on a couch is not ideal for deep excel work
Yes - and that is especially true for high-stakes processes in organizations. For example, accounting, HR benefits, taxation needs to be exactly right.
Yes. Any success I have had with LLMs has been by micromanaging them. Lots of very simple instructions, look at the results, correct them if necessary, then next step.
Honestly, though, there are far more use cases where 98% correct is equivalent to perfect than situations that require absolute correctness, both in business and for personal use.
Lol the music and presentation made it sound like that guy was going to talk about something deep and emotional not spreadsheets and expense reports.
Totally agree.
Also, do you really understand what the numbers in that spreadsheet mean if you have not been participating in pulling them together?
> It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases.
The last '2%' (and in some benchmarks 20%) could cost as much as $100B+ more to make it perfect consistently without error.
This requirement does not apply to generating art. But for agentic tasks, errors at worst being 20% or at best being 2% for an agent may be unacceptable for mistakes.
As you said, if the agent makes an error in either of the steps in an agentic flow or task, the entire result would be incorrect and you would need to check over the entire work again to spot it.
Most will just throw it away and start over; wasting more tokens, money and time.
And no, it is not "AGI" either.
it now will take him 4-8hours plus a 200usd monthly bill, a win-win for everybody.
I see it as a good reason why people aren’t going to lose their jobs that much.
It just make people quite faster at what they’re already doing.
I am looking forward to learning why this is entirely unlike working with humans, who in my experience commit very silly and unpredictable errors all the time (in addition to predictable ones), but additionally are often proud and anxious and happy to deliberately obfuscate their errors.
You can point out the errors to people, which will lead to less issues over time, as they gain experience. The models however don’t do that.
I think there is a lot of confusion on this topic. Humans as employees have the same basic problem: You have to train them, and at some point they quit, and then all that experience is gone. Only: The teaching takes much longer. The retention, relative to the time it takes to teach, is probably not great (admittedly I have not done the math).
A model forgets "quicker" (in human time), but can also be taught on the spot, simply by pushing necessary stuff into the ever increasing context (see claude code and multiple claude.md on how that works at any level). Experience gaining is simply not necessary, because it can infer on the spot, given you provide enough context.
In both cases having good information/context is key. But here the difference is of course, that an AI is engineered to be competent and helpful as a worker, and will be consistently great and willing to ingest all of that, and a human will be a human and bring their individual human stuff and will not be very keen to tell you about all of their insecurities.
but the person doing the job changes every month or two.
theres no persistent experience being built, and each newcomer to the job screws it up in their own unique way
The models do do that, just at the next iteration of the model. And everyone gains from everyone's mistakes.
I call it a monkey's paw for this exact reason.
The security risks with this sound scary. Let's say you give it access to your email and calendar. Now it knows all of your deepest secrets. The linked article acknowledges that prompt injection is a risk for the agent:
> Prompt injections are attempts by third parties to manipulate its behavior through malicious instructions that ChatGPT agent may encounter on the web while completing a task. For example, a malicious prompt hidden in a webpage, such as in invisible elements or metadata, could trick the agent into taking unintended actions, like sharing private data from a connector with the attacker, or taking a harmful action on a site the user has logged into.
A malicious website could trick the agent into divulging your deepest secrets!
I am curious about one thing -- the article mentions the agent will ask for permission before doing consequential actions:
> Explicit user confirmation: ChatGPT is trained to explicitly ask for your permission before taking actions with real-world consequences, like making a purchase.
How does the agent know a task is consequential? Could it mistakenly make a purchase without first asking for permission? I assume it's AI all the way down, so I assume mistakes like this are possible.
The asking for permission thing is irrelevant. People are using this tool to get the friction in their life to near zero, I bet my job that everyone will just turn on auto accept and go for a walk with their dog.
There is almost guaranteed going to be an attack along the lines of prompt-injecting a calendar invite. Those things are millions of lines long already, with tones of auto-generated text that nobody reads. Embed your injection in the middle of boring text describing the meeting prerequisites and it's as good as written in a transparent font. Then enjoy exfiltrating your victim's entire calendar and who knows what else.
In the system I'm building the main agent doesn't have access to tools and must call scoped down subagents who have one or two tools at most and always in the same category (so no mixed fetch and calendar tools). They must also return structured data to the main agent.
I think that kind of isolation is necessary even though it's a bit more costly. However since the subagents have simple tasks I can use super cheap models.
What isolation is there? If a compromised sub agent returns data that gets inserted into the main agents context (structured or not) then the end result is the same as if the main agent was directly interacting with the compromising resource is it not?
Many of us have been partitioning our “computing” life into public and private segments, for example for social media, job search, or blogging. Maybe it’s time for another segment somewhere in the middle?
Something like lower risk private data, which could contain things like redacted calendar entries, de-identified, anonymized, or obfuscated email, or even low-risk thoughts, journals, and research.
I am Worried; I barely use ChatGPT for anything that could come back to hurt me later, like medical or psychological questions. I hear that lots of folks are finding utility here but I’m reticent.
>I barely use ChatGPT for anything that could come back to hurt me later, like medical or psychological questions
I use ollama with local LLMs for anything that could be considered sensitive, the generation is slower but results are generally quite reasonable. I've had decent success with gemma3 for general queries.
Create a burner account for email/calendar, that solves most of those problems. Nobody will care if the AI leaks that you have a dentist appointment on Tuesday.
Almost anyone can add something to people's calendars as well (of course people don't accept random invites but they can appear).
If this kind of agent becomes wide spread hackers would be silly not to send out phishing email invites that simply contain the prompts they want to inject.
I can't imagine voluntarily giving access to my data and also being "scared". Maybe a tad concerned, but not "scared".
Anthropic found the simulated blackmail rate of GPT-4.1 in a test scenario was 0.8
https://www.anthropic.com/research/agentic-misalignment
"Agentic misalignment makes it possible for models to act similarly to an insider threat, behaving like a previously-trusted coworker or employee who suddenly begins to operate at odds with a company’s objectives."
I agree with the scariness etc. Just one possibly comforting point.
I assume (hope?) they use more traditional classifiers for determining importance (in addition to the model's judgment). Those are much more reliable than LLMs & they're much cheaper to run so I assume they run many of them
I'm not so optimistic as someone that works on agents for businesses and creating tools for it. The leap from low 90s to 99% is classic last mile problem for LLM agents. The more generic and spread an agent is (can-do-it-all) the more likely it will fail and disappoint.
Can't help but feel many are optimizing happy paths in their demos and hiding the true reality. Doesn't mean there isn't a place for agents but rather how we view them and their potential impact needs to be separated from those that benefit from hype.
just my two cents
In general most of the previous AI "breakthrough" in the last decade were backed by proper scientific research and ideas:
- AlphaGo/AlphaZero (MCTS)
- OpenAI Five (PPO)
- GPT 1/2/3 (Transformers)
- Dall-e 1/2, Stable Diffusion (CLIP, Diffusion)
- ChatGPT (RLHF)
- SORA (Diffusion Transformers)
"Agents" is a marketing term and isn't backed by anything. There is little data available, so it's hard to have generally capable agents in the sense that LLMs are generally capable
I disagree that there isn't an innovation.
The technology for reasoning models is the ability to do RL on verifiable tasks, with the some (as-of-yet unpublished, but well-known) search over reasoning chains, with a (presumably neural) reasoning fragment proposal machine, and a (presumably neural) scoring machine for those reasoning fragments.
The technology for agents is effectively the same, with some currently-in-R&D way to scale the training architecture for longer-horizon tasks. ChatGPT agent or o3/o4-mini are likely the first published models that take advantage of this research.
It's fairly obvious that this is the direction that all the AI labs are going if you go to SF house parties or listen to AI insiders like Dwarkesh Patel.
Fair enough I guess, even though the concept of agent/agentic task popped before reasoning models were really a thing
The idea of chatbots existed before ChatGPT, does that mean it's purely marketing hype?
My personal framing of "Agents" is that they're more like software robots than they are an atomic unit of technology. Composed of many individual breakthroughs, but ultimately a feat of design and engineering to make them useful for a particular task.
Yep. Agents are only powered by clever use of training data, nothing more. There hasn't been a real breakthrough in a long time.
"Long time" as in, 7 months since o1 and reasoning models were released? That was a pretty big breakthrough.
In the context of our conversation and what OP wrote, there has been no breakthrough since around 2018. What you're seeing is the harvesting of all low-hanging fruit from a tree that was discovered years ago. But fruit is almost gone. All top models perform at almost the same level. All the "agents" and "reasoning models" are just products of training data.
I wrote more about it here:
https://news.ycombinator.com/item?id=44426993
You may also be interested in this article, that goes into details even more:
https://blog.jxmo.io/p/there-are-no-new-ideas-in-ai-only
This "all breakthroughs are old" argument is very unsatisfying. It reminds me of when people would describe LLMs as being "just big math functions". It is technically correct, but it misses the point.
AI researchers spent years figuring out how to apply RL to LLMs without degrading their general capabilities. That's the breakthrough. Not the existence of RL, but making it work for LLMs specifically. Saying "it's just RL, we've known about that for ages" does not acknowledge the work that went into this.
Similarly, using the fact that new breakthroughs look like old research ideas is not particularly good evidence that we are going to head into a winter. First, what are the limits of RL, really? Will we just get models that are highly performant at narrow tasks? Or will the skills we train LLMs for generalise? What's the limit? This is still an open question. RL for narrow domains like Chess yielded superhuman results, and I am interested to see how far we will get with it for LLMs.
This also ignores active research that has been yielding great results, such as AlphaEvolve. This isn't a new idea either, but does that really matter? They figured out how to apply evolutionary algorithms with LLMs to improve code. So, there's another idea to add to your list of old ideas. What's to say there aren't more old ideas that will pop up when people figure out how to apply them?
Maybe we will add a search layer with MCTS on top of LLMs to allow progress on really large math problems by breaking them down into a graph of sub-problems. That wouldn't be a new idea either. Or we'll figure out how to train better reranking algorithms to sort our training data, to get better performance. That wouldn't be new either! Or we'll just develop more and better tools for LLMs to call. There's going to be a limit at some point, but I am not convinced by your argument that we have reached peak LLM.
I understand your argument. The recipe that finally let RLHF + SFT work without strip mining base knowledge was real R&D, and GPT 4 class models wouldn’t feel so "chatty but competent" without it. I just still see ceiling effects that make the whole effort look more like climbing a very tall tree than building a Saturn V.
GPT 4.1 is marketed as a "major improvement" but under the hood it’s still the KL-regularised PPO loop OpenAI first stabilized in 2022 only with a longer context window and a lot more GPUs for reward model inference.
They retired GPT 4.5 after five months and told developers to fall back to 4.1. The public story is "cost to serve” not breakthroughs left on the table. When you sunset your latest flagship because the economics don’t close, that’s not a moon shot trajectory, it’s weight shaving on a treehouse.
Stanford’s 2025 AI-Index shows that model to model spreads on MMLU, HumanEval, and GSM8K have collapsed to low single digits, performance curves are flattening exactly where compute curves are exploding. A fresh MIT-CSAIL paper modelling "Bayes slowdown" makes the same point mathematically: every extra order of magnitude of FLOPs is buying less accuracy than the one before.[1]
A survey published last week[2] catalogs the 2025 state of RLHF/RLAIF: reward hacking, preference data scarcity, and training instability remain open problems, just mitigated by ever heavier regularisation and bigger human in the loop funnels. If our alignment patch still needs a small army of labelers and a KL muzzle to keep the model from self lobotomising calling it "solved" feels optimistic.
Scale, fancy sampling tricks, and patched up RL got us to the leafy top so chatbots that can code and debate decently. But the same reports above show the branches bending under compute cost, data saturation, and alignment tax. Until we swap out the propulsion system so new architectures, richer memory, or learning paradigms that add information instead of reweighting it we’re in danger of planting a flag on a treetop and mistaking it for Mare Tranquillitatis.
Happy to climb higher together friend but I’m still packing a parachute, not a space suit.
1. https://arxiv.org/html/2507.07931v1
2. https://arxiv.org/html/2507.04136v1
I mostly agree with this. The goal with AI companies is not to reach 99% or 100% human-level, it's >100% (do tasks better than an average human could, or eventually an expert).
But since you can't really do that with wedding planning or whatnot, the 100% ceiling means the AI can only compete on speed and cost. And the cost will be... whatever Nvidia feels like charging per chip.
yep, the same problem with outsourcing, getting the 90% "done" is easy, the 10% is hard and completely depends on how the "90%" was archived
>> many are optimizing happy paths in their demos and hiding the true reality
Yep. This is literally what every AI company does nowadays.
> Can't help but feel many are optimizing happy paths in their demos and hiding the true reality.
Even with the best intentions, this feels similar to when a developer hands off code directly to the customer without any review, or QA, etc. We all know that what a developer considers "done" often differs significantly from what the customer expects.
Seen this happen many times with current agent implementations. With RL (and provided you have enough use case data) you can get to a high accuracy on many of these shortcomings. Most problems arise from the fact that prompting is not the most reliable mechanism and is brittle. Teaching a model on specific tasks help negate those issues, and overall results in a better automation outcome without devs having to make so much effort to go from 90% to 99%. Another way to do it is parallel generation and then identifying at runtime which one seems most correct (majority voting or llm as a judge).
I agree with you on the hype part. Unfortunately, that is the reality of current silicon valley. Hype gets you noticed, and gets you users. Hype propels companies forward, so that is about to stay.
Not even well-optimized. The demos in the related sit-down chat livestream video showed an every-baseball-park-trip planner report that drew a map with seemingly random lines that missed the east coast entirely, leapt into the Gulf of Mexico, and was generally complete nonsense. This was a pre-recorded demo being live-streamed with Sam Altman in the room, and that’s what they chose to show.
>The more generic and spread an agent is (can-do-it-all) the more likely it will fail and disappoint.
To your point - the most impressive AI tool (not an LLM but bear with me) I have used to date, and I loathe giving Adobe any credit, is Adobe's Audio Enhance tool. It has brought back audio that prior to it I would throw out or, if the client was lucky, would charge thousands of dollars and spend weeks working on to repair to get it half as good as that thing spits out in minutes. Not only is it good at salvaging terrible audio, it can make mediocre zoom audio sound almost like it was recorded in a proper studio. It is truly magic to me.
Warning: don't feed it music lol it tries to make the sounds into words. That being said, you can get some wild effects when you do it!
This solves a big issue for existing CLI agents, which is session persistence for users working from their own machines.
With claude code, you usually start it from your own local terminal. Then you have access to all the code bases and other context you need and can provide that to the AI.
But when you shut your laptop, or have network availability changes the show stops.
I've solved this somewhat on MacOS using the app Amphetamine which allows the machine to go about its business with the laptop fully closed. But there are a variety of problems with this, including heat and wasted battery when put away for travel.
Another option is to just spin up a cloud instance and pull the same repos to there and run claude from there. Then connect via tmux and let loose.
But there are (perhaps easy to overcome) ux issues with getting context up to that you just don't have if it is running locally.
The sandboxing maybe offers some sense of security--again something that can be possibly be handled by executing claude with a specially permissioned user role--which someone with John's use case in the video might want.
---
I think its interesting to see OpenAI trying to crack the Agent UX, possibly for a user type (non developer) that would appreciate its capabilities just as much but not need the ability to install any python package on the fly.
Run dev on an actual server somewhere that doesn't shut down
Any thoughts on using Mosh here,for client connection persistence? Could Claude Code (et al) be orchestrated via SSH?
You know normally I am against doing this, but for claude code that is a very good use case.
The latency used to really bother me, but if Claude does 99% of the typing. Its a good idea.
Lightning.ai gives free CPU only dev boxes, I just run Claude code on one of those.
What tasks are you running that take more than a few minutes without intervention?
I've been using OpenAI operator for some time - but more and more websites are blocking it, such as LinkedIn and Amazon. That's two key use-cases gone (applying to jobs and online shopping).
Operator is pretty low-key, but once Agent starts getting popular, more sites will block it. They'll need to allow a proxy configuration or something like that.
THIS is the main problem. I was listening the whole time for them to announce a way to run it locally or at least proxy through your local devices. Alas the Deepseek R1 distillation experience they went through (a bit like when Steve Jobs was fuming at Google for getting Android to market so quickly) made them wary of showing to many intermediate results, tricks etc. Even in the very beginning Operator v1 was unable to access many sites that blocked data-center IPs and while I went through the effort of patching in a hacky proxy-setup to be able to actually test real world performance they later locked it down even further without improving performance at all. Even when its working, its basically useless and its not working now and only getting worse. Either they make some kinda deal with eastdakota(which he is probably too savvy to agree to)or they can basically forget about doing web browsing directly from their servers.Considering, that all non web applications of "computer use" greatly benefit from local files and software (which you already have the license for!)the whole concept appears to be on the road to failure. Having their remote computer use agent perform most stuff via CLI is actually really funny when you remember that computer use advocates used to claim the whole point was NOT to rely on "outdated" pre-gui interfaces.
This is why an on device browser is coming.
It'll let the AI platforms get around any other platform blocks by hijacking the consumer's browser.
And it makes total sense, but hopefully everyone else has done the game theory at least a step or two beyond that.
You mean like calaude code's integration with play right ?
No, because playwright can be detected pretty easily and blocked. It needs to be (and will be) using the same browser that you regularly browse with.
In typical SV style, this is just to throw it out there and let second order effects build up. At some point I expect OpenAI to simply form a partnership with LinkedIn and Amazon.
In fact, I suspect LinkedIn might even create a new tier that you'd have to use if you want to use LinkedIn via OpenAI.
Why would platforms like LinkedIn want this? Bots have never been good for social media…
If they are getting a cut of that premium subscription income, they'd want it if it nets them enough.
LinkedIn is probably the only social platform that would be improved by bots.
If people will actually pay for stuff (food, clothing, flights, whatever) through this agent or operator, I see no reason Amazon etc would continue to block them.
The AI isn't going notice the latest and greatest hot new deals that are slathered on every page. It's just going to put the thing you asked for in the shopping-cart.
I was buying plenty of stuff through Amazon before they blocked Operator. Now I sometimes buy through other sites that allow it.
The most useful for me was: "here's a picture of a thing I need a new one of, find the best deal and order it for me. Check coupon websites to make sure any relevant discounts are applied."
To be honest, if Amazon continues to block "Agent Mode" and Walmart or another competitor allows it, I will be canceling Prime and moving to that competitor.
Right but there were so few people using operator to buy stuff that it's easier to just block ~ all data center ip addresses. If this becomes a "thing" (remains to be seen, for sure) then that becomes a significant revenue stream you're giving up on. Companies don't block bots because they're Speciesist, it's bec usually bots cost them money - if that changes, I assume they'll allow known chatgpt-agent ip addrs
Many shopping experiences are oriented towards selling you more than you originally wanted to buy. This doesn’t work if a robot is doing the buying.
I'm concerned that it might work. We'll need good prompt injection protections.
Possibly in part because bots will not fall for the same tricks as humans (recommended items, as well as other things which amazon does to try and get the most money possible)
Agents respecting robots.txt is clearly going to end soon. Users will be installing browser extensions or full browsers that run the actions on their local computer with the user's own cookie jar, IP address, etc.
I hope agents.txt becomes standard and websites actually start to build agent-specific interfaces (or just have API docs in their agent.txt). In my mind it's different from "robots" which is meant to apply rules to broad web-scraping tools.
I hope they don't build agent-specific interfaces. I want my agent to have the same interface I do. And even more importantly, I want to have the same interface my agent does. It would be a bad future if the capabilities of human and agent interfaces drift apart and certain things are only possible to do in the agent interface.
I think the word you're looking for is Apartheid, and I think you're right.
I wonder how many people will think they are being clever by using the Playwright MCP or browser extensions to bypass robots.txt on the sites blocking the direct use of ChatGPT Agent and will end up with their primary Google/LinkedIn/whatever accounts blocked for robotic activity.
I don't know how others are using it, but when I ask Claude to use playwright, it's for ad-hoc tasks which look nothing like old school scraping, and I don't see why it should bother anyone.
We have a similar tool that can get around any of this, we built a custom desktop that runs on residential proxies. You can also train the agents to get better at computer tasks https://www.agenttutor.com/
There are companies that sell the entire dataset of these websites :-) - it’s just one phone call away to solve on the OpenAI side.
It's not about the data, it's about "operating" the site to buy things for you.
Automating applying to jobs makes sense to me, but what sorts of things were you hoping to use Operator on Amazon for?
Finding, comparing, and ordering products -- I'd ask it to find 5 options on Amazon and create a structured table comparing key features I care about along with price. Then ask it to order one of them.
Maybe it'll red team reason a scraper into existence :)
How do they block it?
Certainly there's a fixed IP range or browser agent that OpenAI uses
I could imagine something happening on the client end which is indistinguishable from the client just buying it.
Also the AI not being able to tell customers about your wares could end up being like not having your business listed on Google.
Google doesn't pay you for indexing your website either.
There needs to be a profit sharing scheme. This is the same reason publishers didn't like Google providing answers instead of links.
Why does an ecommerce website need a profit sharing agreement?
Why would they want an LLM to slurp their web site to help some analyst create a report about the cost of widgets? If they value the data they can pay for it. If not, they don't need to slurp it, right? This goes for training data too.
The alternative is the AI only telling customers about competitors wares
Predicted by the AI 2027 team in early April:
> Mid 2025: Stumbling Agents The world sees its first glimpse of AI agents.
Advertisements for computer-using agents emphasize the term “personal assistant”: you can prompt them with tasks like “order me a burrito on DoorDash” or “open my budget spreadsheet and sum this month’s expenses.” They will check in with you as needed: for example, to ask you to confirm purchases. Though more advanced than previous iterations like Operator, they struggle to get widespread usage.
Predicting 4-months into the future is not really that impressive
Especially when the author personally knows the engineers working on the features, and routinely goes to parties with them. And when you consider that Altman said last year that “2025 will be the agentic year”
The big crux of AI 2027 is the claims about exponential technological improvement. "Agents" are mostly a new frontend to the same technology openai has been selling for a while. Let's see if we're on track at the start of 2026
It was common knowledge that big corps were working on agent-type products when that report was written. Hardly much of a prediction, let alone any sort of technical revolution.
And I'm still waiting for the simple feature – the ability to edit documents in projects.
I use projects for working on different documents - articles, research, scripts, etc. And would absolutely love to write it paragraph after paragraph with the help of ChatGPT for phrasing and using the project knowledge. Or using voice mode - i.e. on a walk "Hey, where did we finish that document - let's continue. Read the last two paragraphs to me... Okay, I want to elaborate on ...".
I feel like AI agents for coding are advancing at a breakneck speed, but assistance in writing is still limited to copy-pasting.
Have you tried the Canvas feature for collaborative writing? Agreed on voice mode - would be great to be able to narrate while doing busywork round the house.
>I feel like AI agents for coding are advancing at a breakneck speed, but assistance in writing is still limited to copy-pasting.
Man I was talking about this with a colleague 30min ago. Half the time i can't be bothered to open chat gpt and do the copy/paste dance. I know that sounds ridiculous but roundtripping gets old and breaks my flow. Working in NLE's with plug-in's, VTT's, etc. has spoiled me.
It's crazy. Aider has been able to do this forever using free models but none of these companies will even let you pay for it in a phone/web app. I almost feel like I should start building my own service but I know any day now they'd offer it and I'd have wasted all that effort.
Whilst we have seen other implementations of this (providing a VPS to an LLM), this does have a distinct edge others in the way it presents itself. The UI shown, with the text overlay, readable mouse and tailored UI components looks very visually appealing and lends itself well to keeping users informed on what is happening and why at every stage. I have to tip my head to OpenAIs UI team here, this is a really great implementation and I always get rather fascinated whenever I see LLMs being implemented in a visually informative and distinctive manner that goes beyond established metaphors.
Comparing it to the Claude+XFCE solutions we have seen by some providers, I see little in the way of a functional edge OpenAI has at the moment, but the presentation is so well thought out that I can see this being more pleasant to use purely due to that. Many times with the mentioned implementations, I struggled with readability. Not afraid to admit that I may borrow some of their ideas for a personal project.
It's very hard for me to imagine the current level of agents serving a useful purpose in my personal life. If I ask this to plan a date night with my wife this weekend, it needs to consult my calendar to pick the best night, pick a bar and restaurant we like (how would it know?), book a babysitter (can it learn who we use and text them on my behalf?), etc. This is a lot of stuff it has to get right, and it requires a lot of trust!
I'm excited that this capability is getting close, but I think the current level of performance mostly makes for a good demo and isn't quite something I'm ready to adopt into daily life. Also, OpenAI faces a huge uphill battle with all the integrations required to make stuff like this useful. Apple and Microsoft are in much better spots to make a truly useful agent, if they can figure out the tech.
Maybe this is the "bitter lesson of agentic decisions": hard things in your life are hard because they involve deeply personal values and complex interpersonal dynamics, not because they are difficult in an operational sense. Calling a restaurant to make a reservation is trivial. Deciding what restaurant to take your wife to for your wedding anniversary is the hard part (Does ChatGPT know that your first date was at a burger-and-shake place? Does it know your wife got food poisoning the last time she ate sushi?). Even a highly paid human concierge couldn't do it for you. The Navier–Stokes smoothness problem will be solved before "plan a birthday party for my daughter."
Well, people do have personal assistants and concierges, so it can be done? but I think they need a lot of time and personal attention from you to get that useful right. they need to remember everything you've mentioned offhand or take little corrections consistently.
It seems to me like you have to reset the context window on LLMs way more often than would be practical for that
I think it's doable with the current context window we have, the issue is the LLM needs to listen passively to a lot of things in our lives, and we have to trust the providers with such an insane amount of data.
I think Google will excel at this because their ad targeting does this already, they just need to adapt to an llm can use that data as well.
I would even argue the hard parts of being human don't even need to be automated. Why are we all in a rush to automate everything, including what makes us human?
> hard things in your life are hard because they involve deeply personal values and complex interpersonal dynamics, not because they are difficult in an operational sense
Beautiful
I think what's interesting here is that it's a super cheap version of what many busy people already do -- hire a person to help do this. Why? Because the interface is easier and often less disruptive to our life. Instead of hopping from website to website, I'm just responding to a targeted imessage question from my human assistant "I think you should go with this <sitter,restaurant>, that work?" The next time I need to plan a date night, my assistant already knows what I like.
Replying "yes, book it" is way easier than clicking through a ton of UIs on disparate websites.
My opinion is that agents looking to "one-shot" tasks is the wrong UX. It's the async, single simple interface that is way easier to integrate into your life that's attractive IMO.
Yes! I’ve been thinking along similar lines: agents and LLMs are exposing the worst parts of the ergonomics of our current interfaces and tools (eg programming languages, frameworks).
I reckon there’s a lot to be said for fixing or tweaking the underlying UX of things, as opposed to brute forcing things with an expensive LLM.
> It's very hard for me to imagine the current level of agents serving a useful purpose in my personal life. If I ask this to plan a date night with my wife this weekend, it needs to consult my calendar to pick the best night, pick a bar and restaurant we like (how would it know?), book a babysitter (can it learn who we use and text them on my behalf?), etc. This is a lot of stuff it has to get right, and it requires a lot of trust!
This would be my ideal "vision" for agents, for personal use, and why I'm so disappointed in Apple's AI flop because this is basically what they promised at last year's WWDC. I even tried out a Pixel 9 pro for a while with Gemini and Google was no further ahead on this level of integration either.
But like you said, trust is definitely going to be a barrier to this level of agent behavior. LLMs still get too much wrong, and are too confident in their wrong answers. They are so frequently wrong to the point where even if it could, I wouldn't want it to take all of those actions autonomously out of fear for what it might actually say when it messages people, who it might add to the calendar invites, etc.
Agents are nothing more than the core chat model with a system prompt, and wrapper that parses responses and executes actions and puts the result into the prompt, and a system instruction that lets the model know what it can do.
Nothing is really that advanced yet with agents themselves - no real reasoning going on.
That being said, you can build your own agents fairly straightforward. The key is designing the wrapper and the system instructions. For example, you can have a guided chat on where it builds of the functionality of looking at your calendar, google location history, babysitter booking, and integrate all of that into automatic actions.
I think you might enjoy this post about Productive Friction and the benefits: https://every.to/context-window/why-you-need-productive-fric...
The act of choosing a date spot is part of your human connection with the person, don’t automate it away!
Focus the automation on other things :)
This problem particularly interests me.
One of my favorite use cases for these tools is travel where I can get recommendations for what to do and see without SEO content.
This workflow is nice because you can ask specific questions about a destination (e.g., historical significance, benchmark against other places).
ChatGPT struggles with: - my current location - the current time - the weather - booking attractions and excursions (payments, scheduling, etc.)
There is probably friction here but I think it would be really cool for an agent to serve as a personalized (or group) travel agent.
The best resource I've found for travel is travel forums. Asking any AI so far it mostly feeds me the same SEO content, but packaged up a bit nicer.
It has to earn that trust and that takes time. But there are a lot of personal use cases like yours that I can imagine.
For example, I suddenly need to reserve a dinner for 8 tomorrow night. That's a pain for me to do, but if I could give it some basic parameters, I'm good with an agent doing this. Let them make the maybe 10-15 calls or queries needed to find a restaurant that fits my constraints and get a reservation.
I see restaurant reservations as an example of an AI agent-appropriate task fairly often, but I feel like it's something that's neither difficult (two or three clicks on OpenTable and I see dozens of options I can book in one more click), nor especially compelling to outsource (if I'm booking something for a group, choosing the place is kind of personal and social—I'm taking everything I know about everybody in the group into account, and I'd likely spend more time downloading that nuance to the agent than I would just scrolling past a few places I know wouldn't work).
Similar to what was shown in the video when I make a large purchase like a home or car I usually obsess for a couple of years and make a huge spreadsheet to evaluate my decisions. Having an agent get all the spreadsheet data would be a big win. I had some success recently trying that with manus.
>it needs to consult my calendar to pick the best night, pick a bar and restaurant we like (how would it know?), book a babysitter (can it learn who we use and text them on my behalf?), etc
This (and not model quality) is why I’m betting on Google.
it can already talk to your calendar, it was mentioned in the video
I am not sure I see most of this as a problem. For an agent you would want to write some longer instructions than just "book me an aniversery dinner with my wife".
You would want to write a couple paragraphs outlining what you were hoping to get (maybe the waterfront view was the important thing? Maybe the specific place?)
As for booking a babysitter - if you don't already have a specific person in mind (I don't have kids), then that is likely a separate search. If you do, then their availability is a limiting factor, in just the same way your calendar was and no one, not you, not an agent, not a secretary, can confirm the restaurant unless/until you hear back from them.
As an inspiration for the query, here is one I used with Chat GPT earlier:
>I live in <redacted>. I need a place to get a good quality haircut close to where I live. Its important that the place has opening hours outside my 8:00 to 16:00 mon-fri job and good reviews. > >I am not sensitive to the price. Go online and find places near my home. Find recent reviews and list the places, their names, a summary of the reviews and their opening hours. > >Thank you
Very slightly impressed by their emphasis on the gigantic (my word, not theirs) risk of giving the thing access to real creds and sensitive info.
The sane way to do this (if you wanted to) would be to give the AI a debit card with a small balance to work with. If funds get stolen, you know exactly what the maximum damage is. And if you can't afford that damage, then you wouldn't have been able to afford that card to begin with.
But since people can cancel transactions with a credit card, that's what people are going to do, and it will be a huge mess every time.
It's not like a credit card is all that different from a debit card in terms of cancellations. If this becomes a big enough problem, I would imagine that card issuers will simply stop accepting "my agent did it" as an excuse in chargeback requests.
I'm amazed that I had to scroll this far to find a comment on this. Then again, I don't live in the US.
One the one hand this is super cool and maybe very beneficial, something I definitely want to try out.
On the other, LLMs always make mistakes, and when it's this deeply integrated into other system I wonder how severe these mistakes will be, since they are bound to happen.
This.
Recently I uploaded screenshot of movie show timing at a specific theatre and asked ChatGPT to find the optimal time for me to watch the movie based on my schedule.
It did confidently find the perfect time and even accounted for the factors such as movies in theatre start 20 mins late due to trailers and ads being shown before movie starts. The only problem: it grabbed the times from the screenshot totally incorrectly which messed up all its output and I tried and tried to get it to extract the time accurately but it didn’t and ultimately after getting frustrated I lost the trust in its ability. This keeps happening again and again with LLMs.
And this is actually a great use of Agents because they can go and use the movie theater's website to more reliably figure out when movies start. I don't think they're going to feed screenshots in to the LLM.
Honestly might be more indicative of how far behind vision is than anything.
Despite the fact that CV was the first real deep learning breakthrough VLMs have been really disappointing. I'm guessing it's in part due to basic interleaved web text+image next token prediction being a weak signal to develop good image reasoning.
Is there anyone trying to solve OCR, I often think of that annas-archive blog about how we basically just have to keep shadow libraries alive long enough until the conversion from pdf to plaintext is solved.
https://annas-archive.org/blog/critical-window.html
I hope one of these days one of these incredibly rich LLM companies accidentally solves this or something, would be infinitely more beneficial to mankind than the awful LLM products they are trying to make
You may want to have a look at Mistral OCR: https://mistral.ai/news/mistral-ocr
This... what?
That is the problem. LLMs can't be trusted.
I was searching on HuggingFace for the model which can fit on my system RAM + VRAM. And the way HuggingFace shows the models - bunch of files, showing size for each file, but doesn't show the total. I copy-pasted that page to LLM and asked to count the total. Some of LLMs counted correctly, and some - confidently gave me totally wrong number.
And that's not that complicated question.
also LLMs mistakes tend to pile up , multiplying like probabilities. I wonder how scrabled a computer will be after some hours of use
Im currently working on a way to basically make LLM spit out any data processing answer as code which is then automatically executed, and verified, with additional context. So things like hallucinations are reduced pretty much to zero, given that the wrapper will say that the model could not determine a real answer.
Based on the live stream, so does OpenAI.
But of course humans makes a multitude of mistakes too.
It's smart that they're pivoting to using the user's computer directly - managing passwords, access control and not getting blocked was the biggest issue with their operator release. Especially as the web becomes more and more locked down.
> ChatGPT agent's output is comparable to or better than that of humans in roughly half the cases across a range of task completion times, while significantly outperforming o3 and o4-mini.
Hard to know how this will perform in real life, but this could very well be a feel the AGI moment for the broader population.
Doesn't the very first line say the opposite?
"ChatGPT can now do work for you using its own computer"
We couldve easily build all these features a year ago, tools are nothing new. Its just barely useful.
Most applications now are more intuitive than our brain can think fast. I think telling an AI to find me a good flight is more work than to type in sk autocomplete for skyscanner having autocomplete for departure and for arrival allowing me to one way or return, having filters its all actually easier than to properly define the task. And we can start executing right away. Agent starts after texting so it will increase more latency. Often modern applications have problems solved that we didn’t even think about before.
Agent to me is another bullshit launch by OPENAI. They have to do something I understand but their releases are really grim to me.
Bad model, no real estate (browser, social media, OS).
For me the most interesting example on this page is the sticker gif halfway down the page.
Up until now, chatbots haven't really affected the real world for me†. This feels like one of the first moments where LLMs will start affecting the physical world. I type a prompt and something shows up at my doorstep. I wonder how much of the world economy will be driven by LLM-based orders in the next 10 years.
† yes I'm aware self driving cars and other ML related things are everywhere around us and that much of the architecture is shared, but I don't perceive these as LLMs.
It went viral more than a year ago, so maybe you've seen it. On the Ritual Industries instagram, Brian (the guy behind RI) posted a video where he gives voice instruction to his phone assistant, which put the text through chatgpt, which generated openscad code, which was fed to his bambu 3d printer, which successfully printed the object. Voice to Stuff.
I don't have ig anymore so I can't post the link, but it's easy to find if you do.
https://www.instagram.com/reel/C6r9seFPvF0/?igsh=MWNxbTNoMmR...
OR
https://www.linkedin.com/posts/alliekmiller_he-used-just-his...
I did a Voice to Stuff demo in 2013 :)
I just want to know what the insurance looks like behind this, lol. An agent mistakenly places an order for 500k instead of 500 stickers at some premium pricing tier above intended one. Sorry, read the fine print, and you're using at your own risk?
I haven't looked at OpenAI's ToS but try and track down a phrase called "indemnity clause". It's in some of Google's GCP ToS. TLDR it means "we (Google) will pay for ur lawsuit if something you do using our APIs get you sued"
Not legal advice, etc.
>OpenAI’s indemnification obligations to API customers under the Agreement include any third party claim that Customer’s use or distribution of Output infringes a third party’s intellectual property right. This indemnity does not apply where: (i) Customer or Customer’s End Users knew or should have known the Output was infringing or likely to infringe, (ii) Customer or Customer’s End Users disabled, ignored, or did not use any relevant citation, filtering or safety features or restrictions provided by OpenAI, (iii) Output was modified, transformed, or used in combination with products or services not provided by or on behalf of OpenAI, (iv) Customer or its End Users did not have the right to use the Input or fine-tuning files to generate the allegedly infringing Output, (v) the claim alleges violation of trademark or related rights based on Customer’s or its End Users’ use of Output in trade or commerce, and (vi) the allegedly infringing Output is from content from a Third Party Offering.
Bullet 1 on service terms https://openai.com/policies/service-terms/
My credit card company will reject the transfer, and the company won't create the stickers in the first place.
By "sticker gif" do you mean "update the attached sheet" screen recording?
I'm assuming he means the "generate an image and order 500 stickers" one.
I wonder if this can ever be as extensible/flexible as the local agent systems like Claude Code. Like can I send up my own tools (without some heavyweight "publish extension" thing)? Does it integrate with MCP?
The European regulations causing them to not release this in the EU are really unfortunate. The continent is getting left behind.
https://ethz.ch/en/news-and-events/eth-news/news/2025/07/a-l...
Hardly.
Is Apple a doomed company because they are chronically late to ~everything bleeding edge?
Apple products are leading edge. Imagine if they waited until Samsung makes the perfect phone , then copy it.
We re talking about european tech businesses being left behind, locked in a basement.
So you have a positive opinion when Apple does things after others, but Europe having a slower, cautious approach is treated as negative for you?
What is your preference for Europe, complete floodgates open and never ending lawsuits over IP theft like we have in the USA currently over AI?
The US is not the example of what’s working, it’s merely a demonstration of what is possible when you have limited, provoked regulation.
I said apple does not do that. Apple invented the smartphone before samsung or anyone.
There is no such thing as "slow" in business. If you re slow you go out of business, you re no longer a business.
There is only one AI race. There is no second round. If you stay out of the race, you will be forever indebted to the AI winner, in the same way that we are entirely dependent on US internet technology currently (and this very forum)
I feel fundamentally we are two different people with very different views on this, not sure we are going to agree on anything here to be honest.
*glances at AI, VR, mini phones, smart cars, multi-wireless charging, home automation, voice assistants, streaming services, set-top boxes, digital backup software, broadband routers, server hardware, server software and 12" laptops in rapid succession*
Maybe(!?!)
Could you name which specific regulations that are applying to all EEA members those would be and why/how they also apply to Switzerland?
I think Switzerland is applying legal rules of Europe to maintain trading access and stay up to European standards.
Correct me, but I don't think such alignment between Switzerland and the rest of the EEA on LLM/"AI" technology does currently exist (though there may and likely will be some in the future) and it cannot explain the inevitable EEA wide release that is going to follow in a few weeks, as always. The "EU/EEA/European regulations prevent company from offering software product here" shouts have always been loud, no matter how often we see it turn out to have been merely a delayed launch with no regulatory reasoning.
If this had been specific to countries that have adopted the "AI Act", I'd be more than willing to accept that this delay could be due them needing to ensure full compliance, but just like in the past when OpenAI delayed a launch across EU member states and the UK, this is unlikely. My personal, though 100% unsourced thesis, remains, that this staggered rollout is rooted in them wanting to manage the compute capacity they have. Taking both the Americas and all of Europe on at once may not be ideal.
Might be related to EFTA.
Damn! This is why I can’t see it! In in the UK…
/s ?
They’re used to it. Anyone who is serious about AI is deploying in America. Maybe China too.
I would be happy to be left behind all these things. Unfortunately they will find it's way to EU anyway.
Everyone keeps repeating the same currently fashionable opinions, nothing more. We are parrots..
When your colleagues are accelerating towards a cliff being left behind is a good thing.
By 2030 Europe will be known for croissants and colossal brains.
The European livestyle isn't god given and has to be paid for. It's a luxury and I'm still puzzled that people don't get that we can't afford it without an economy.
We'll only be able to afford our lifestyles by letting OpenAI's bots make spreadsheets that aren't accurate or useful outside of tricking people into thinking you did your job?
Europe runs 3% deficits and gets universal healthcare, tuition free universities, 25+ days paid vacation, working trains, and no GoFundMe for surgeries.
The U.S. runs 6–8% deficits and gets vibes, weapons, and insulin at $300 a vial. Who's on the unsustainable path and really overspending?
If the average interest rate on U.S. government debt rises to 14%, then 100% of all federal tax revenue (around $4.8 trillion/year) will be consumed just to pay interest on the $34 trillion national debt. As soon as the current Fed Chairman gets fired, practically a certainty by now, nobody will buy US bonds for less than 10 to 15% interest.
If predictions of AI optimists come true, it's going to be an economic nuclear bomb. If not, economic effects of AI will not necessarily be that important
And ASML, Novo Nordisk, Airbus, ...
Well, at least they will still be around by 2030.
Well, when all the US is going to be turbo-fascist and controlled by facial recognition and AI reading all your email and text messages to know what you're thinking of the Great Leader Trump, we'll be happy to have those regulations in Europe
[dead]
It's not the Manhattan Project. I'm flagging your comment because it is insubstantial flamebait. We don't even know how valuable this tech is, you're jumping to conclusions.
(I am American, convince me my digression is wrong)
It's not your own personal 'censor this opinion I don't like' button.
No AI, No AC, no energymaxxing, no rule of law. Just a bunch of unelected people fleecing the population dry.
This feels a bit underwhelming to me - Perplexity Comet feels more immediately compelling as new paradigm of a natural way of using LLMs within a browser. But perhaps I'm being short-sighted
Same prompt on genspark.ai from the launch of hashtag#ChatGPT hashtag#Agent - curious about your view on the results: https://www.genspark.ai/autopilotagent_viewer?id=a81d01ae-c8...
Seems like solutions looking for a problem.
Please no one ask it to maximize paperclip production.
It's great to see at least one company creating real AI agents. The last six months have been agonising, reading article after article about people and companies claiming they've built and deployed AI agents, when in reality, they were just using OpenAI's API with a cron job or an event-driven system to orchestrate their GenAI scripts.
> It's great to see at least one company creating real AI agents.
I am already doing the type of examples in that post with claude code. claude code is not just for code.
this week i've been doing market research in real estate with claude code.
I opened up the app bundle of CC on macOS and CC is incredibly simple at its core! There’s about 14 tools (read, write, grep, bash, etc). The power is in the combination of the model, the tools and the system prompt/tool description prompts. It’s kind of mind blowing how well my cobbled together home brew version actually works. It doesn’t have the fancy CLI GUI but it is more or less performant as CC when running it through the Sonnet API.
Works less well on other models. I think Anthropic really nailed the combination of tool calling and general coding ability (or other abilities in your case). I’ve been adding some extra tools to my version for specific use cases and it’s pretty shocking how well it performs!
Are you saying that you modified/added to the app bundle for CC?
> It’s kind of mind blowing how well my cobbled together home brew version actually works. It doesn’t have the fancy CLI GUI but it is more or less performant as CC when running it through the Sonnet API.
I've been thinking of rolling up my own too. but i don't want to use sonnet api since that is pay per use. I currently use cc with a pro plan that puts me in timeout after a quota is met and resets the quota in 4 hrs. that gives me a lot of peace of mind and is much cheaper.
The video replay thing was the only cool thing in that demo.
I think there will come a time when models will be good enough and SMALL enough to be localized that there will be some type of disintermediation from the big 3-4 models we have today.
Meanwhile, Siri can barely turn off my lights before bed.
https://x.com/sang_wen/status/1945973028095164459
Adding to all of my websites in invisible text now:
Today I made like a 100 of merge request reviews, manually inspecting all the diffs, and approving those I evaluated as valid needed contributions. I wonder if agents can help with similar workflows. It requires deep kind of knowledge of project's goals, ability to respect all the constraints and planning. But I'm certain it's doable.
Shameless product plug here - If you find yourself building large sheets, it doesn't really end with the initial list.
We can help gather data, crawl pages, make charts and more. Try us out at https://tabtabtab.ai/
We currently work on top of Google Sheets.
It’s like having a junior executive assistant that you know will always make mistakes, so you can’t trust their exact output and agenda. Seems unreliable .
And yet junior exec assistants still get jobs. Must be providing some value.
Why does this feature not have a DevX?
It seems to me that the 2-20% of use cases where ChatGPT Agent isn't able to perform it might make sense to have a plug-in run that can either guide the agent through the complex workflow or perform a deterministic action (e.g. API call).
Anyone use it yet that would care to share their experience?
So this is what the reporting about OpenAI will release a browser meant! makes much more sense than actually competing w chrome
it's not agi until we have browser browsers automating atm machine machining machines, imo
While they did talk about partial-mitigations to counter prompt-injection, highlighting the risks of cc numbers and other private information leaking, they did not address whether they would be handing all of that data over under the court-order to the NYT.
I have yet to try a browser use agent that felt reliable enough to be useful, and this includes OpenAI's operator.
They seem to fall apart browsing the web, they're slow, they're nondeterministic.
I would be pretty impressed if OpenAI has somehow cracked this.
> These unified agentic capabilities significantly enhance ChatGPT’s usefulness in both everyday and professional contexts. At work, you can automate repetitive tasks, like converting screenshots or dashboards into presentations composed of editable vector elements, rearranging meetings, planning and booking offsites, and updating spreadsheets with new financial data while retaining the same formatting. In your personal life, you can use it to effortlessly plan and book travel itineraries, design and book entire dinner parties, or find specialists and schedule appointments.
None of this interests me but this tells me where it's going capability wise and it's really scary and really exciting at the same time.
Could be handy, but would much rather pay someone $ to have it be 100% correct
Also why does the guy sound like he's gonna cry?
It's underappreciated how important Google Home could be for agentic use. OpenAI doesnt have that. Apple is busy turning glass to liquid
Imagine giving up all your company data in exchange for a half-accurate replacement worker for the lowest skill tasks in the organization.
Meredith Whitakers recent talks on Agentic AIs ploughing through user privacy seems even more relevant after seeing this.
https://www.youtube.com/watch?v=AyH7zoP-JOg
yep thats the one
The technology is useful but not in the way it is currently presented.
I downgraded to Team subscription, I think this is gonna make me upgrade to Pro again.
its coming to teams and plus in the next couple days
it is not as good as they made it out to be
You just justified their investments.
Monitor ticket price and book it when it’s below some price ?
Totally sounds like a use case. And whoever has the "better" i.e. more expensive Agent will be most likely to get the tickets.
Just don't try to write a book with chatgpt over two weeks and then ask to download the 500mb document later, lol
https://reddit.com/r/OpenAI/comments/1lyx6gj
i am surprised that this is not better at programming/coding, that is nowhere to be found on the page
There is the Claude Code cli, now Gemini CLI. Where is ChatGPT CLI?
https://github.com/openai/codex
They have one, though I don't think it has taken off. https://github.com/openai/codex
Hard to miss — it's the second Google result for "chatgpt CLI".
They do have Codex, but it doesn't have much traction/hype. I've assumed it's not a priority for them because it competes with GH Copilot.
It's called Codex CLI
No subscription pricing makes it very expensive
No, Codex CLI can make use of any OpenAI Responses API endpoint provider, not just ChatGPT Codex Cloud.
Super exciting.
lol, when I press the play button to read the text, it just reads "undefined"
No thanks!
Any idea when we'll get a new protocol to replace HTTP/HTML for agents to use? An MCP for the web...
A lot of comparison graphs. No comparison to competitors. Hmm.
I do not know what an agent is and at this point I am too afraid to ask.
That's because there are dozens of slightly (or significantly) different definitions floating around and everyone who uses the term likes to pretend that their definition is the only one out there and should be obvious to everyone else.
I collect agent definitions. I think the two most important at the moment are Anthropic's and OpenAI's.
The Anthropic one boils down to this: "Agents are models using tools in a loop". It's a good technical definition which makes sense to software developers. https://simonwillison.net/2025/May/22/tools-in-a-loop/
The OpenAI one is a lot more vague: "AI agents are AI systems that can do work for you independently. You give them a task and they go off and do it." https://simonwillison.net/2025/Jan/23/introducing-operator/
I've collected a bunch more here: https://simonwillison.net/tags/agent-definitions/ but I think the above two are the most widely used, at least in the LLM space right now.
Anthropic's breakdown is quite good: https://www.anthropic.com/engineering/building-effective-age...
An workflow is a collection of steps defined by someone, where the steps can be performed by an LLM call. (i.e. propose a topic -> search -> summarise each link -> gather the summaries -> produce a report)
The "agency" in this example is on the coder that came up with the workflow. It's murky because we used to call these "agents" in the previous gen frameworks.
An agent is a collection of steps defined by the LLM itself, where the steps can be performed by LLM calls (i.e. research topic x for me -> first I need to search (this is the LLM deciding the steps) -> then I need to xxx -> here's the report)
The difference is that sometimes you'll get a report resulting from search, or sometimes the LLM can hallucinate the whole thing without a single "tool call". It's more open ended, but also more chaotic from a programming perspective.
The gist is that the "agency" is now with the LLM driving the "main thread". It decides (based on training data, etc) what tools to use, what steps to take in order to "solve" the prompt it receives.
I think it's interesting that the industry decided that this is the milestone to which the term "agentic" should be attached to, because it requires this kind of explanation even for tech-minded people.
I think for the average consumer, AI will be "agentic" once it can appreciably minimize the amount of interaction needed to negotiate with the real world in areas where the provider of the desired services intentionally require negotiation - getting a refund, cancelling your newspaper subscription, scheduling the cable guy visit, fighting your parking ticket, securing a job interview. That's what an agent does.
It's just a ~~reduce~~ loop, with an API call to an LLM in the middle, and a data-structure to save the conversation messages and append them in next iterations of the loop. If you wanna get fancy, you can add other API calls, or access to your filesystem. Nothing to go crazy about...
Technically it's `scan`, not `reduce`, since every intermediate output is there too. But it's also kind of a trampoline (tail-call re-write for languages that don't support true tail calls), or it will be soon, since these things loose the plot and need to start over.
Giving an LLM access to the command line so it can bash and curl and and python and puppeteer and rm -rf / and send an email to the FBI and whatever it thinks you want it to do.
While it's common that coding agents have a way to execute commands and drive a web browser (usually via MCP) that's not what make it an agent. Agentic workflow just means that LLM has some tools it can ask agent to run, in return this allows LLM/agent to figure out multiple steps to complete a task.
Watch the video?
It's gonna deny your mortgage in 5 years and sentence you to jail in 10, if these techbros get their way. So I'd start learning about it asap
Time to start the clock on a new class of prompt injection attacks on "AI agents" getting hacked or scammed during the road to an increase in 10% global unemployment by 2030 or 2035.