jandrewrogers a day ago

I appreciate the video and generally agree with Fei-Fei but I think it almost understates how different the problem of reasoning about the physical world actually is.

Most dynamics of the physical world are sparse, non-linear systems at every level of resolution. Most ways of constructing accurate models mathematically don’t actually work. LLMs, for better or worse, are pretty classic (in an algorithmic information theory sense) sequential induction problems. We’ve known for well over a decade that you cannot cram real-world spatial dynamics into those models. It is a clear impedance mismatch.

There are a bunch of fundamental computer science problems that stand in the way, which I was schooled on in 2006 from the brightest minds in the field. For example, how do you represent arbitrary spatial relationships on computers in a general and scalable way? There are no solutions in the public data structures and algorithms literature. We know that universal solutions can’t exist and that all practical solutions require exotic high-dimensionality computational constructs that human brains will struggle to reason about. This has been the status quo since the 1980s. This particular set of problems is hard for a reason.

I vigorously agree that the ability to reason about spatiotemporal dynamics is critical to general AI. But the computer science required is so different from classical AI research that I don’t expect any pure AI researcher to bridge that gap. The other aspect is that this area of research became highly developed over two decades but is not in the public literature.

One of the big questions I have had since they announced the company is who on their team is an expert in the dark state-of-the-art computer science relevant to working around these particular problems. They risk running straight into the same deep, layered theory walls that almost everyone else has run into. I can’t identify anyone on the team who is an expert in a relevant area of computer science theory, which makes me somewhat skeptical. It is a nice idea but I don’t get the sense they understand the true nature of the problem.

Nonetheless, I agree that it is important!

  • teemur 20 hours ago

    > We know that universal solutions can’t exist and that all practical solutions require exotic high-dimensionality computational constructs that human brains will struggle to reason about. This has been the status quo since the 1980s. This particular set of problems is hard for a reason.

    This made me a bit curious. Would you have any pointers to books/articles/search terms if one wanted to have a bit deeper look on this problem space and where we are?

    • jandrewrogers 13 hours ago

      I'm not aware of any convenient literature but it is relatively obvious once someone explains it to you (as it was explained to me).

      At its root it is a cutting problem, like graph cutting but much more general because it includes things like non-trivial geometric types and relationships. Solving the cutting problem is necessary to efficiently shard/parallelize operations over the data models.

      For classic scalar data models, representations that preserve the relationships have the same dimensionality as the underlying data model. A set of points in 2-dimensions can always be represented in 2-dimensions such that they satisfy the cutting problem (e.g. a quadtree-like representation).
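      As a toy illustration of the point case (my own sketch, assuming the goal is disjoint shards of bounded size), a quadtree-style split stays in the data's own two dimensions:

```python
# Toy quadtree-style cut: 2-D points partition into disjoint shards in the
# same dimensionality as the data. Illustrative sketch only.

def quadtree_shards(points, x0, y0, x1, y1, capacity=4):
    """Split the box until each leaf holds <= capacity points.
    Every point lands in exactly one shard (a clean cut)."""
    if len(points) <= capacity:
        return [points]
    mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    quads = [[], [], [], []]
    for (x, y) in points:
        quads[(x >= mx) + 2 * (y >= my)].append((x, y))
    bounds = [(x0, y0, mx, my), (mx, y0, x1, my),
              (x0, my, mx, y1), (mx, my, x1, y1)]
    shards = []
    for pts_q, box in zip(quads, bounds):
        if pts_q:
            shards.extend(quadtree_shards(pts_q, *box, capacity=capacity))
    return shards

pts = [((i * 7) % 100 / 100.0, (i * 13) % 100 / 100.0) for i in range(50)]
shards = quadtree_shards(pts, 0.0, 0.0, 1.0, 1.0)
assert sum(len(s) for s in shards) == len(pts)  # disjoint and exhaustive
assert all(len(s) <= 4 for s in shards)
```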

      For non-scalar types like rectangles, operations like equality and intersection are distinct and there are an unbounded number of relationships that must be preserved that touch on concepts like size and aspect ratio to satisfy cutting requirements. The only way to expose these additional relationships to cutting algorithms is to encode and embed these other relationships in a (much) higher dimensionality space and then cut that space instead.
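      A simple way to see the dimensionality jump (my illustration, not necessarily the construction meant above): an axis-aligned 2-D rectangle is naturally a point in a higher-dimensional space where extent, area, and aspect ratio become explicit coordinates a cutting algorithm can act on:

```python
# A 2-D rectangle (x0, y0, x1, y1) viewed as a point in a higher-dimensional
# space where size/shape relationships become explicit coordinates.
# Illustrative only; real embeddings for cutting are far more involved.

import math

def embed_rect(x0, y0, x1, y1):
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2          # position
    w, h = x1 - x0, y1 - y0                        # extent
    return (cx, cy, w, h, w * h, math.log(w / h))  # area and log-aspect made explicit

a = embed_rect(0, 0, 2, 1)
b = embed_rect(10, 10, 12, 11)
# Same shape at a different position: the extent/area/aspect coordinates agree,
# so a cut in this space can group by shape -- invisible to a cut on the raw
# 2-D corner coordinates alone.
assert a[2:] == b[2:]
```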

      The mathematically general case isn't computable but real-world data models don't need it to be. Several decades ago it was determined that if you constrain the properties of the data model tightly enough then it should be possible to systematically construct a finite high-dimensionality embedding for that data model such that it satisfies the cutting problem.

      Unfortunately, the "should be possible" understates the difficulty. There is no computer science literature for how one might go about constructing these cuttable embeddings, not even for a narrow subset of practical cases. The activity is also primarily one of designing data structures and algorithms that can represent complex relationships among objects with shape and size in dimensions much greater than three, which is cognitively difficult. Many smart people have tried and failed over the years. It has a lot of subtlety and you need practical implementations to have good properties as software.

      About 20 years ago, long before "big data", the iPhone, or any current software fashion, this and several related problems were the subject of an ambitious government research program. It was technically successful, demonstrably. That program was killed in the early 2010s for unrelated reasons and much of that research was semi-lost. It was so far ahead of its time that few people saw the utility of it. There are still people around that were either directly involved or learned the computer science second-hand from someone that was but there aren't that many left.

      • andoando 13 hours ago

        Ive spent years trying to tackle spatial representations on my own, so Im extremely curious here.

        How does the cutting problem relate to intelligence in the first place?

        • jandrewrogers 12 hours ago

          Indexing is a special case of AI. At the limit, optimal cutting and learning are equivalent problems. Non-trivial spatial representations push these two things much closer together than is normally desirable for e.g. indexing algorithms. Tractability becomes a real issue.

          Practically, scalable indexing of complex spatial relationships requires what is essentially a type of learned indexing, albeit not neural network based.
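          A minimal flavor of learned indexing (my toy sketch, in the style of learned range indexes rather than whatever is meant above): fit a simple model to the key distribution and use its prediction, plus a worst-case error bound, as the lookup structure:

```python
# Toy "learned index": approximate the rank (CDF) of sorted keys with a linear
# model, then search only inside a window bounded by the model's worst error.
# Sketch only; real learned indexes use model hierarchies and tighter bounds.

keys = sorted((k * k) % 997 for k in range(500))
n = len(keys)

# Least-squares fit of position ~ slope * key + intercept (closed form).
mean_k = sum(keys) / n
mean_p = (n - 1) / 2
cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
var = sum((k - mean_k) ** 2 for k in keys)
slope = cov / var
intercept = mean_p - slope * mean_k

# Worst-case prediction error defines the search window.
err = max(abs(i - (slope * k + intercept)) for i, k in enumerate(keys))
win = int(err) + 2

def lookup(key):
    guess = int(slope * key + intercept)
    lo, hi = max(0, guess - win), min(n, guess + win + 1)
    return lo + keys[lo:hi].index(key)   # scan only the bounded window

assert all(keys[lookup(k)] == k for k in keys)
```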

      • jedharris 6 hours ago

        some pointers to the research program please?

        • jandrewrogers 3 hours ago

          It was a national security program with no public face. I was recruited into it because I solved a fundamental computer science problem they were deeply interested in. I did not get my extensive supercomputing experience in academia. It was a great experience if you just wanted to do hardcore computer science research, which at the time I did.

          There are several VCs with knowledge of the program. It is obscure but has cred with people that know about it. I’ve raised millions of dollars off the back of my involvement.

          A lot of really cool computer science research has happened inside the government. I think it is a bit less these days but people still underestimate it.

      • tehjoker 8 hours ago

        Did that research program have a public code name?

        • mindcrime 8 hours ago

          Looking through some old DARPA budget docs[1], it seems like there's a chance that what's being discussed here falls under DARPA's "PE 0602702E TACTICAL TECHNOLOGY" initiative, project TT-06.

          Some other possibilities might include:

            - "PE 0602304E COGNITIVE COMPUTING SYSTEMS", project COG-02.
            - "PE 0602716E ELECTRONICS TECHNOLOGY", project ELT-01
            - "PE 0603760E COMMAND, CONTROL AND COMMUNICATIONS SYSTEMS", project CCC-02
            - "PE 0603766E NETWORK-CENTRIC WARFARE TECHNOLOGY", project NET-01
            - "PE 0603767E SENSOR TECHNOLOGY", project SEN-02
          
          Or maybe it's nothing to do with this at all. But in either case, this looks like some interesting stuff to explore in its own right. :-)

          [1]: https://web.archive.org/web/20181001000000/https://www.darpa...

        • jandrewrogers 2 hours ago

          Not that I know of. If I drop the program director’s name, people that know, know. That is all the handshake you usually need.

        • sho an hour ago

          Sounds like Genoa/Topsail

      • calf 12 hours ago

        But then that sounds more like that person explained it wrong. They didn't explain why it is necessary to reduce to GRAPHCUT, it seems to me to beg the question. We should not assume this is true based on some vague anthropomorphic appeal to spatial locality, surely?

        • jandrewrogers 10 hours ago

          It isn’t a graph cutting problem, graph cutting is just a simpler, special case of this more general cutting problem (h/t IBM Research). If you can solve the general problem you effectively get efficient graph cutting for free. This is obviously attractive to the extent you can do both complex spatial and graph computation at scale on the same data structure instead of specializing for one or the other.

          The challenge with cutting e.g. rectangles into uniform subsets is that logical shard assignment must be identical regardless of insertion order and in the absence of an ordering function, with O(1) space complexity and without loss of selectivity. Arbitrary sets of rectangles overlap, sometimes heavily, which is the source of most difficulty.

          Of course, with practical implementations write scalability matters and incremental construction is desirable.
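          A toy illustration of the order-independence requirement (my own sketch, not the construction alluded to): derive a rectangle's shard assignment purely from its own coordinates, here by hashing the grid cells under its corners, so any insertion order gives identical placement. The price is that overlapping rectangles replicate across shards, which is exactly the selectivity loss a real solution must avoid:

```python
# Toy order-independent sharding: a rectangle's shard set is a pure function of
# its own coordinates (grid cells under its corners), so insertion order cannot
# matter. Overlapping rectangles replicate across shards -- the selectivity
# loss a real construction must avoid.

import hashlib
import random

GRID, SHARDS = 8, 4

def shards_for(rect):
    x0, y0, x1, y1 = rect
    cells = {(int(x * GRID), int(y * GRID)) for x in (x0, x1) for y in (y0, y1)}
    return {int(hashlib.sha256(repr(c).encode()).hexdigest(), 16) % SHARDS
            for c in cells}

rects = [((i % 13) / 13, (i % 7) / 7, (i % 13) / 13 + 0.05, (i % 7) / 7 + 0.05)
         for i in range(40)]

before = {r: shards_for(r) for r in rects}
random.shuffle(rects)                      # a different "insertion order"
after = {r: shards_for(r) for r in rects}
assert before == after                     # placement is order-independent
```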

  • lsy a day ago

    To make this more concrete: ImageNet enabled computer "vision" by providing images + labels, enabling the computer to take an image and spit out a label. LLM training sets enable text completion by providing text + completions, enabling the computer to take a piece of text and spit out its completion. Learning how the physical world works (not just kind of works, a la videogames, but actually works) is not only about a jillion times more complicated; there is also really only one usable dataset: the world itself, which cannot be compacted or fed into a computer at high speed.

    "Spatial awareness" itself is kind of a simplification: the idea that you can be aware of space or 3d objects' behavior without the social context of what an "object" is or how it relates to your own physical existence. Like you could have two essentially identical objects but they are not interchangeable (original Declaration of Independence vs a copy, etc). And many many other borderline-philosophical questions about when an object becomes two, etc.

    • m-s-y 16 hours ago

      > the world itself, which cannot be compacted or fed into a computer at high speed.

      …yet.

      15 years ago LLMs as they are today seemed like science fiction too.

      • awakeasleep 3 hours ago

        Yes! It only requires a few fundamental breakthroughs in areas that seem constrained by physical reality.

    • sega_sai 11 hours ago

      I feel that if words/phrases/whole texts can be embedded well in high-dimensional spaces as points, the same must apply to the 3-D world. I'm sure there will be embeddings of it (i.e. mapping a 3-D scene into a high-D vector), and then we'll work with those embeddings the way LLMs work with text (disclaimer: I am not an expert in the field).
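      A minimal sketch of that idea (illustrative toy with untrained weights): a PointNet-style encoder maps a variable-size 3-D point set to a fixed-length, permutation-invariant vector via a shared per-point map and max-pooling:

```python
# Minimal PointNet-flavoured embedding: a shared per-point feature map plus
# max-pooling gives a fixed-size, permutation-invariant vector for any 3-D
# point set. Toy weights here; a real system would learn phi end-to-end.

import random

random.seed(0)
D = 8                                    # embedding dimension
W = [[random.gauss(0, 1) for _ in range(3)] for _ in range(D)]

def phi(p):                              # shared per-point map (linear + ReLU)
    return [max(0.0, sum(w * x for w, x in zip(row, p))) for row in W]

def embed(cloud):                        # symmetric max-pool over all points
    feats = [phi(p) for p in cloud]
    return tuple(max(f[d] for f in feats) for d in range(D))

cloud = [(0.1, 0.2, 0.3), (0.9, 0.1, 0.4), (0.5, 0.5, 0.5)]
e1 = embed(cloud)
e2 = embed(list(reversed(cloud)))        # point order is irrelevant
assert e1 == e2 and len(e1) == D
```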

    • amelius 16 hours ago

      Well, you can use one or two cameras plus a lidar, and use those to generate training data for a depth-map model.

    • coldtea 21 hours ago

      >there is really only one usable dataset: the world itself, which cannot be compacted or fed into a computer at high speed.

      Why wouldn't it be? If the world is ingested via video and lidar sensors, what's the hangup in recording that input and then replaying it faster?

      • psb217 21 hours ago

        I think there's an implicit assumption here that interaction with the world is critical for effective learning. In that case, you're bottlenecked by the speed of the world... when learning with a single agent. One neat thing about artificial computational agents, in contrast to natural biological agents, is that they can share the same brain and share lived experience, so the "speed of reality" bottleneck is much less of an issue.

        • HappMacDonald 18 hours ago

          Yeah I'm envisioning putting a thousand simplistic robotic "infants" into a vast "playpen" to gather sensor data about their environment, for some (probably smaller) number of deep learning models to ingest the input and guess at output strategies (move this servo, rotate this camshaft this far in that direction, etc) and make predictions about resulting changes to input.

          In principle a thousand different deep learning models could all train simultaneously on a thousand different robot experience feeds, not 1-to-1 but 1-to-many: each neural net training on data from dozens or hundreds of the robots at the same time, and different neural nets sharing those feeds for their own rounds of training.

          Then, of course, all of the input data, paired with the outputs tested and the further inputs that serve as ground truth for predictions, can be recorded for continued training sessions after the fact.

          • csullivan107 16 hours ago

            Never thought I’d get to do this, but this was my master’s research! Simulations are inherently limited, and I just got tired of robotics research being done only in simulations. So I built a novel soft robot (notoriously difficult to control) and got it to learn by playing!!

            Here is an informal talk I gave on my work. Let me know if you want the thesis.

            https://www.youtube.com/live/ZXlQ3ppHi-E?si=MKcRqoxmEra7Zrt5

          • rybosome 14 hours ago

            A very interesting idea. I am curious about this sharing and blending of the various nets; I wonder if something as naive as averaging the weights (assuming the neural nets all have the same dimensions) would actually accomplish that?
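            Naive weight averaging is in fact the core of federated averaging (FedAvg). It tends to work when the nets share an initialization; for independently trained nets, neuron permutation symmetry can make the average meaningless. A minimal sketch of the merge step:

```python
# FedAvg-style merge: element-wise mean over same-shaped weight sets.
# Meaningful mainly when the replicas started from one shared initialization;
# independently trained nets differ by neuron permutations, which plain
# averaging ignores.

def average_weights(models):
    """models: list of flat weight lists, all the same length."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

m1 = [0.2, -1.0, 3.0]
m2 = [0.4, 1.0, 1.0]
merged = average_weights([m1, m2])
assert merged[1] == 0.0 and merged[2] == 2.0
assert abs(merged[0] - 0.3) < 1e-9
```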

          • loa_in_ 16 hours ago

            But the playpen will contain objects that are inherently breakable. You can't rough-handle the glass vessel and still have it, too.

            • m-s-y 16 hours ago

              The world is breakable. Any model based on it will need to know this anyway. Am I missing your argument?

              • devenson 15 hours ago

                Can't reset state after breakage.

        • hackyhacky 13 hours ago

          > In that case, you're bottlenecked by the speed of the world

          Why not have the AI train on a simulation of the real world? We can build those pretty easily using traditional software and run them at any speed we want.

      • otodus 15 hours ago

        How would you handle olfactory and proprioceptive data?

    • TheOtherHobbes 16 hours ago

      Considering how bad LLMs are at understanding anything, and how they still manage to be useful, you simply don't need this level of complexity.

      You need something that mostly works most of the time, and has guardrails so when it makes mistakes nothing bad happens.

      Our brains acquire quite good heuristics for dealing with physical space without needing to experience all of physical reality.

      A cat-level or child-level understanding of physical space is more immediately useful than a philosopher-level of understanding.

  • dopadelic 12 hours ago

    You're pointing out a real class of hard problems — modeling sparse, nonlinear, spatiotemporal systems — but there’s a fundamental mischaracterization in lumping all transformer-based models under “LLMs” and using that to dismiss the possibility of spatial reasoning.

    Yes, classic LLMs (like GPT) operate as sequence predictors with no inductive bias for space, causality, or continuity. They're optimized for language fluency, not physical grounding. But multimodal models like ViT, Flamingo, and Perceiver IO are a completely different lineage, even if they use transformers under the hood. They tokenize images (or video, or point clouds) into spatially-aware embeddings and preserve positional structure in ways that make them far more suited to spatial reasoning than pure text LLMs.

    The supposed “impedance mismatch” is real for language-only models, but that’s not the frontier anymore. The field has already moved into architectures that integrate vision, text, and action. Look at Flamingo's vision-language fusion, or GPT-4o’s real-time audio-visual grounding — these are not mere LLMs with pictures bolted on. These are spatiotemporal attention systems with architectural mechanisms for cross-modal alignment.

    You're also asserting that "no general-purpose representations of space exist" — but this neglects decades of work in computational geometry, graphics, physics engines, and more recently, neural fields and geometric deep learning. Sure, no universal solution exists (nor should we expect one), but practical approximations exist: voxel grids, implicit neural representations, object-centric scene graphs, graph neural networks, etc. These aren't perfect, but dismissing them as non-existent isn’t accurate.

    Finally, your concern about who on the team understands these deep theoretical issues is valid. But the fact is: theoretical CS isn’t the bottleneck here — it’s scalable implementation, multimodal pretraining, and architectural experimentation. If anything, what we need isn’t more Solomonoff-style induction or clever data structures — it’s models grounded in perception and action.

    The real mistake isn’t that people are trying to cram physical reasoning into LLMs. The mistake is in acting like all transformer models are LLMs, and ignoring the very active (and promising) space of multimodal models that already tackle spatial, embodied, and dynamical reasoning problems — albeit imperfectly.

    • calf 12 hours ago

      How do we prove a trained LLM has no inductive bias for space, causality, etc.? We can't assume this is true by construction, can we?

      • dopadelic 11 hours ago

        Why would we need to prove such a thing? Human vision has strong inductive biases, which is why you can perceive objects in abstract patterns. This is why you can lay down at a park and see a duck in a cloud. It's also why we can create abstracted representations of things with graphics. Having inductive biases makes it more relatable to the way we work.

        And again, you're using the term LLMs when vision-based transformers in multimodal models aren't simply LLMs.

  • machinelearning a day ago

    "Most ways of constructing accurate models mathematically don’t actually work" > This is true for almost anything at the limit, we are already able to model spatiotemporal dynamics to some useful degree (see: progress in VLAs, video diffusion, 4D Gaussians)

    "We’ve known for well over a decade that you cannot cram real-world spatial dynamics into those models. It is a clear impedance mismatch" > What's the source that this is a physically impossible problem? Not sure what you mean by impedance mismatch but do you mean that it is unsolvable even with better techniques?

    Your whole third paragraph could have been said about LLMs and isn't specific enough, so we'll skip that.

    I don't really understand the other two paragraphs. What is this "dark state-of-the-art computer science" you speak of? What is this "area of research [that] became highly developed over two decades but is not in the public literature"? And how is "the computer science required so different from classical AI research"?

    • calf 19 hours ago

      Above commenter also asserts "highly developed research but no public literature" shrug ...

      • jandrewrogers 2 hours ago

        It was a national security program that plenty of people are familiar with and has been used across several countries. None of those programs publish.

        Even though the literature doesn’t exist, the tech has been used in production for over a decade. That’s just my word of course, but a lot of people know. :shrug:

      • fu-hn 15 hours ago

        But the best minds in the world said so!

  • ccozan a day ago

    I agree that the problem is hard. However, the biological brain is able to handle it quite "easily" (it is not really easy: billions of iterations were needed). Current brains solve the 3D physical world _only_ via perception.

    So this is the place where we must look. It starts with sensing and the integration of that sensing. I have been working on this problem for more than 10 years and have come to some results. I am not a real scientist but a true engineer, and I am looking at it from that perspective quite intensely. The question one must ask is: how do you define the outside physical world from the perspective of a biological sensing "device"? What exactly are we "seeing" or "hearing"? Working on that brought me further in defining the physical world.

    • tmilard 15 hours ago

      I do agree with you. We have a natural-eye automat (behind what you call a 'biological brain') that unconsciously 'feels' the geometric structure of the places we enter.

      Once this layer of "natural eye automat" is programmed behind a camera, it will spit out this crude geometry: the Spatial-data-bulk (SDB). This SDB is small data.

      From then on, our programs will reason not on data from the camera(s) but only on this small SDB.

      This is how I see it.

      • tmilard 15 hours ago

        ==> And now the LLMs, to get spatial knowledge, will have a much-reduced dataset. This will make spatial reasoning far less compute-intensive than we might imagine.

    • foobarian 13 hours ago

      Maybe a brute force solution would work just like it did for text. I would not be surprised if the scale of that brute force was not within reach yet though.

    • andoando 12 hours ago

      Also a cook here who's spent years thinking about this; I'd love to hear about the results you've obtained.

  • voxleone 9 hours ago

    I'm trying to approach spatial reasoning by introducing quaternions to navigate graphs. It is a change in the unit of traversal — from positional increment to rotational progression. This reframing has cascading effects. It alters how we model motion, how we think about proximity, and ultimately how systems relate to space itself.

    The traditional metaphor of movement — stepping from point A to point B — is spatially intuitive but semantically impoverished. It ignores the continuity of direction, the embodiment of motion, and the nontriviality of turning. Quaternion-based traversal reintroduces these elements. It is not just more precise; it is more faithful to the mechanisms by which physical and virtual entities evolve through space. In other words objects 'become' the model.

    https://github.com/VoxleOne/SpinStep/blob/main/docs/index.md

  • niemandhier 21 hours ago

    Regarding sparse, nonlinear systems and our ability to learn them:

    There is hope. Experimental observation is that, in most cases, the coupled high-dimensional dynamics almost collapse onto low-dimensional attractors.

    The interesting thing about these is: If we apply a measurement function to their state and afterwards reconstruct a representation of their dynamics from the measurement by embedding, we get a faithful representation of the dynamics with respect to certain invariants.

    Even better, suitable measurement functions are dense in function space so we can pick one at random and get a suitable one with probability one.

    What can be gleaned about the dynamics in terms of these invariants can be learned for certain; experience shows that we can usually also predict quite well.

    There is a chain of embedding theorems by Takens and Sauer gradually broadening the scope of applicability from deterministic chaos towards stochastically driven deterministic chaos.

    Note embedding here is not what current computer science means by the word.

    I spent most of my early adulthood doing these things; it would be cool to see them used once more.
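    For the curious, the delay-coordinate construction behind the Takens/Sauer theorems is short enough to sketch (my illustration; toy demo on the chaotic logistic map):

```python
# Delay-coordinate embedding (Takens): reconstruct a higher-dimensional state
# from a scalar measurement series x(t) via vectors (x_t, x_{t-tau}, x_{t-2*tau}).
# Toy demo using the chaotic logistic map as the measured signal.

def logistic_series(n, x0=0.4, r=3.9):
    xs, x = [], x0
    for _ in range(n):
        x = r * x * (1 - x)
        xs.append(x)
    return xs

def delay_embed(xs, dim=3, tau=1):
    return [tuple(xs[i - k * tau] for k in range(dim))
            for i in range((dim - 1) * tau, len(xs))]

series = logistic_series(1000)
points = delay_embed(series, dim=3, tau=1)   # reconstructed 3-D state vectors
assert len(points) == 1000 - 2
assert points[0] == (series[2], series[1], series[0])
```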

    • golol 7 hours ago

      What field of mathematics is this? Can you point me to some keywords/articles?

  • epr 11 hours ago

    Human beings get by quite well with extremely oversimplified (low resolution) abstractions. There is no need whatsoever for something even approaching universal or perfect. Humans aren't thinking about fundamental particles or solving differential equations in their head when they're driving a car or playing sports.

  • ryeguy_24 12 hours ago

    Agree. Also, with respect to training, what is the goal we are maximizing? LLMs are easy: predict the next word, and we have lots of training data. But what are we training for in the real world? Modeling the next spatial snapshot to predict what will happen next? It's not intuitive to me what the objective function would be for spatial intelligence.

    • curiouscavalier 11 hours ago

      Or that there is a sufficiently generalizable objective function for all “spatial intelligence.”

  • gyomu a day ago

    > There are a bunch of fundamental computer science problems that stand in the way, which I was schooled on in 2006 from the brightest minds in the field. For example, how do you represent arbitrary spatial relationships on computers in a general and scalable way? There are no solutions in the public data structures and algorithms literature. We know that universal solutions can’t exist and that all practical solutions require exotic high-dimensionality computational constructs that human brains will struggle to reason about. This has been the status quo since the 1980s. This particular set of problems is hard for a reason.

    Where can I read more about this space? (particularly on the "we know that universal solutions can't exist" front)

  • mindcrime a day ago

    > became highly developed over two decades but is not in the public literature.

    Developed by who? And for what purpose? Are we talking about overlap with stuff like missile guidance systems or targeting control systems or something, and kept confidential by the military-industrial complex? I'm having a hard time seeing many other scenarios that would explain a large body of people doing research in this area and then not publishing anything.

    > I can’t identify anyone on the team that is an expert in a relevant area of computer science theory

    Who is an expert on this theory then?

  • queuebert 14 hours ago

    I think you are using "sparse" and "non-linear" as scare terms. Sparse is a good thing, as it reduces degrees of freedom, and non-linear does not mean unsolvable.

    Also "impedance mismatch" doesn't mean no go, but rather less efficient.

  • andoando 13 hours ago

    What's non linear about spatial reasoning?

    >We know that universal solutions can’t exist

    Why not?

    • randcraw 12 hours ago

      Spatial models must be 3D, not 1D (linear), much less 2D, which is sufficient for images and object recognition (where models are not needed). And adding time makes it 4D, at least for robot motion.

      To reason spatially (and dynamically) the dependence of one object's position in space on other objects (and their motions and behaviors) adds up fast to complicate the model in ways that 95% of 2D static image analysis does not.

      • andoando 12 hours ago

        Well hold on. First, I'm not convinced we have solved 2D spatial intelligence. Analyzing 2D images is very different from being able to reason about 2D geometry. How do you mathematically define relations like "above", "below", "diagonal", etc. in a composable way that can be learned?

        Second, problems in 3D can be deconstructed to 2D. For example, how do you get to the airport? You first need to solve the 2D overview of the path you'd take, as if you were looking at a map. Then you need to reason about your field of view, and here again I believe the reasoning you're really doing is something like "object A is behind object B and A is to the left of B", not solving some non-linear equation.

        I think a big issue is that people are trying to solve this in the realm of traditional mathematics, and not as a simple step-by-step process.
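        For the relation part, one classical (non-learned) starting point is to define each relation as a predicate on bounding boxes, in the spirit of Allen's interval algebra and region calculi; composition then falls out of ordinary logic. A sketch, assuming axis-aligned boxes:

```python
# Spatial relations as composable predicates on axis-aligned boxes
# (in the spirit of Allen's interval algebra / region connection calculi).
# Boxes are (x0, y0, x1, y1) with y increasing upward.

def above(a, b):    return a[1] >= b[3]           # a's bottom at/above b's top
def below(a, b):    return above(b, a)
def left_of(a, b):  return a[2] <= b[0]
def overlaps(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def diagonal(a, b):  # composed from the primitives
    return (above(a, b) or below(a, b)) and (left_of(a, b) or left_of(b, a))

table = (0, 0, 4, 1)
cup   = (1, 1, 2, 2)
lamp  = (5, 3, 6, 4)
assert above(cup, table) and not overlaps(cup, table)
assert diagonal(lamp, table)
```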

  • idiotsecant 20 hours ago

    If there's one thing that control theory has taught us in the last 100 years, it's that anything is linear if you zoom in far enough. Nonlinearity is practically solvable by adjusting your controls to different linear models depending on your position in the system space.
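    The "zoom in and it's linear" move is a first-order Taylor expansion, and gain scheduling stitches such local models together. A numeric sketch (my illustration), linearizing a damped pendulum at its operating point:

```python
# Local linearization: approximate nonlinear dynamics xdot = f(x) by the
# Jacobian at an operating point -- the basis of gain-scheduled control.

import math

def f(state):                      # damped pendulum: (theta, omega)
    theta, omega = state
    return (omega, -9.81 * math.sin(theta) - 0.1 * omega)

def jacobian(f, x, eps=1e-6):      # numeric Jacobian at operating point x
    n = len(x)
    fx = f(x)
    J = []
    for i in range(n):
        xp = list(x); xp[i] += eps
        fxp = f(tuple(xp))
        J.append([(fxp[j] - fx[j]) / eps for j in range(n)])
    # transpose so rows index outputs: A[j][i] = d f_j / d x_i
    return [[J[i][j] for i in range(n)] for j in range(n)]

A = jacobian(f, (0.0, 0.0))        # linear model near the downward equilibrium
# Near theta = 0, sin(theta) ~ theta, so A should be ~ [[0, 1], [-9.81, -0.1]]
assert abs(A[0][1] - 1.0) < 1e-3 and abs(A[1][0] + 9.81) < 1e-3
```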

  • doctorpangloss 6 hours ago

    Pretty esoteric, when it's actually simple: either you think

    Bill Peebles is right: naturalistic physical laws can be learned by deep neural nets from videos.

    OR

    Fei-Fei Li is right: you need 3D point-cloud videos.

    Okay, if you think Bill Peebles is right, then all this stuff you are talking about doesn't matter anymore. There are lots of great reasons to believe he is; the biggest of all is that Veo, Sora, etc. already show really good physics understanding.

    If you think Fei-Fei Li is right, you are going to be augmenting real world data sets with game engine content. You can exactly create whatever data you need, for whatever constraints, to train performantly. I don't think this data scalability concern is real.

    A compelling reason you are wrong and Fei-Fei Li's specific bet on scalability is right is the existence of Waymo and Zoox. There are also NEW autonomous vehicle companies achieving things faster than Zoox and Waymo did, because a lot of spatial intelligence problems are actually regulatory/political, not scientific.

  • CamperBob2 14 hours ago

    > We’ve known for well over a decade that you cannot cram real-world spatial dynamics into those models. It is a clear impedance mismatch.

    Then again, not much that we "knew" a decade ago is still relevant today. Of course transformer networks have proven capable of representing spatial intelligence. How could they work with 2D images, if not?

  • nurettin 16 hours ago

    > how do you represent arbitrary spatial relationships on computers in a general and scalable way?

    Isn't this essentially what the convolutional layers do in LeNet?
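    Convolution hard-codes exactly one spatial relationship, translation, via weight sharing; the parent's claim is about arbitrary relationships. A sketch (my illustration) of the equivariance that convolution does buy:

```python
# 1-D convolution is translation-equivariant: shifting the input shifts the
# output. That hard-codes exactly one spatial relationship; it says nothing
# about arbitrary relations (containment, relative orientation, ...).

def conv1d(signal, kernel):
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

sig = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0] + sig[:-1]                      # same pattern, moved right by one
kern = [1, -2, 1]
out1, out2 = conv1d(sig, kern), conv1d(shifted, kern)
assert out2[1:] == out1[:-1]                  # output shifts along with input
```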

  • adamnemecek a day ago

    All (ALL!!) AI/optimization problems boil down to energy minimization or dually entropy maximization.

skwb a day ago

It's hard to describe, but it's felt like LLMs have completely sucked the entire energy out of computer vision. Like... I know CVPR still happens and there's great research that comes out of it, but almost every single job posting in ML is about LLMs to do this and that to the detriment of computer vision.

  • jgord a day ago

    yeah, see my other comment.

    To me it's totally obvious that we will have a plethora of very valuable startups who use RL techniques to solve real-world problems in practical areas of engineering .. and I just get blank stares when I talk about this :]

    I've stopped saying AI when I mean ML or RL .. because people equate LLMs with AI.

    We need better ML / RL algos for CV tasks:

      - detecting lines from pixels
      - detecting geometry in pointclouds
      - constructing 3D from stereo images, photogrammetry, 360 panoramas
    
    These might be used by LLMs but are likely built using RL or 'classical' ML techniques, tapping into the vast parallel matmul compute we now have in GPUs, multicore CPUs, and NPUs.

    • pzo a day ago

      I thought there has been a lot of progress in the last 2 years: (Video) Depth Anything, SAM 2, Grounding DINO, D-FINE, VLMs, Gaussian splatting, NeRF. Sure, less than the progress in LLMs, but I would still say progress accelerated with LLM research.

    • tmilard 15 hours ago

      You said: "- detecting lines from pixels - detecting geometry in pointclouds - constructing 3D from stereo images, photogrammetry, 360 panoramas"

        ==> For me it is more something like:
         Source = crude video-or-photo pixels ===> find many simple rectangle surfaces that are glued to one another.

      This is, for me, how you most easily get to detecting the rather complex geometry of any room.

      • jgord 5 hours ago

        I kind of did a version of what you suggest - I think I linked to a video showing plane edges auto-detected in a pointcloud sample.

        Similarly I use another algo to detect pipe runs which tend to appear as half cylinders in the pointcloud, as the scanner usually sees one side, and often the other side is hidden, hard to access, up against a wall.

        So, I guess my point is the devil is in the details .. and machine learning can optimize even further on good heuristics we might come up with.

        Also, when you go thru a whole pointcloud, you have a lot of data to sift thru, so you want something fairly efficient, even if you're using multiple GPUs to do the heavy matmul lifting.

        You can think of RL as an optimization - greatly speeding up something like monte carlo tree search, by learning to guess the best solution earlier.
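        The plane-detection step described above is commonly done with a RANSAC-style loop: sample 3 points, fit a plane, count inliers, keep the best candidate. A minimal NumPy sketch on synthetic data (the generic textbook method, not the poster's actual algorithm):

```python
import numpy as np

def ransac_plane(points, n_iters=200, thresh=0.02, rng=None):
    """Return (unit normal n, offset d) with n.p ~= d for the best-supported plane."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers, best_plane = 0, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-12:  # degenerate (collinear) sample, skip
            continue
        n = n / norm
        d = float(n @ p0)
        inliers = int((np.abs(points @ n - d) < thresh).sum())
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (n, d)
    return best_plane, best_inliers

# Synthetic cloud: 500 slightly noisy points on the plane z = 1, plus 100 outliers.
rng = np.random.default_rng(42)
plane_pts = np.column_stack([rng.uniform(-1, 1, (500, 2)), np.full(500, 1.0)])
plane_pts[:, 2] += rng.normal(0.0, 0.005, 500)
outliers = rng.uniform(-1, 1, (100, 3))
cloud = np.vstack([plane_pts, outliers])

(normal, offset), n_in = ransac_plane(cloud)
```

On real scans you would run this repeatedly, removing each detected plane's inliers, and the "learning to guess earlier" point above is exactly about replacing the blind random sampling with learned proposals.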

  • porphyra a day ago

    I feel like 3D reconstruction/bundle adjustment is one of those things where LLMs and new AI stuff haven't managed to get a significant foothold. Recently VGGT won best paper, which is good for them, but for the most part, stuff like NeRF and Gaussian splatting still relies on good old COLMAP for bundle adjustment using SIFT features.

    Also, LLMs really suck at some basic tasks like counting the sides of a polygon.

    • KaiserPro 21 hours ago

      > LLMs really suck at some basic tasks like counting the sides of a polygon.

      Oh indeed, but that's not using tokens correctly. If you want to do that, then tokenise the number of polygons....

  • whiplash451 13 hours ago

    It felt the same back in 2012-2015 when deep learning was flooding over computer vision. Yet 10 years later there is a net benefit for computer vision: a lot of tasks are now solved much better/more efficiently with deep learning including those that seemed "unfit" to deep learning like tracking.

    I'm hopeful that VLMs will "fan out" into a lot of positive outcomes for computer vision.

    • SlowTao 8 hours ago

      That is fair. I think it is a case of just seeing a lot of great talent rush to the "in" thing. Other systems are still being developed and that isn't lost, but there is just a feeling of being left out of it all while still doing great stuff.

  • satyrun 3 hours ago

    Francois Chollet's observation is that LLMs have sucked the air out of the entirety of AI research.

    On the other hand I just chatted with Opus 4 for the first time a few minutes ago and I am completely blown away.

  • friendzis a day ago

    What's the equivalent of methadone therapy, but for reckless VC?

    What's the equivalent of destroying everything around you while chasing another high, but for reckless VC?

    • baxtr a day ago

      Jeopardy?!

  • glitchc 17 hours ago

    Hah! And I remember when ML itself sucked all the energy out of computer vision. Time to pay the piper.

  • CSMastermind 14 hours ago

    Everyone is trying to jam transformers into CV workflows at the moment. Possibly productively.

  • smath 14 hours ago

    Agreed about LLMs sucking the air out. The positive side is that it's a good time to innovate in other areas while a chunk of people are absorbed in LLMs. A proven improvement in any non-LLM space will attract investment.

    • pixl97 14 hours ago

      Isn't this what Nvidia is already doing with a lot of their sim software?

  • Barrin92 12 hours ago

    >but almost every single job posting in ML is about LLMs

    not in the defense sector, or aviation, or UAVs, automotive, etc. Any proper real-time vision task where you have to computationally interact with visual data is unsuited for LLMs.

    Nobody controls a drone, missile or vehicle by taking a screenshot, sending it to ChatGPT, and having it do math while in flight. Anything that requires, as the title of the thread says, spatial intelligence is unsuited for a language model.

  • m3kw9 16 hours ago

    There is nothing to productize vs LLMs right now. I would say robots could fix that, but they have hard problems to solve in the physical sense that will bottleneck things.

starchild3001 5 hours ago

Great talk. Dr. Li has a way of cutting through the hype and getting to the fundamental challenges, which is really refreshing. Her point about spatial intelligence being the next frontier after language really resonates.

I'm particularly hung up on the data problem she touched on (41 min). She rightly points out that unlike language, where we could bootstrap LLMs with the vast, pre-existing corpus of the internet, there's no equivalent "internet of 3D space." She mentions a "hybrid approach" for World Labs, and that's where the real engineering challenge seems to lie.

My mind immediately goes to the trade-offs. If you lean heavily on synthetic data, you're in a constant battle with the "sim-to-real" gap. It works for narrow domains, but for a general "world model," the physics, lighting, and material properties have to be perfect, which is a monumental task. If you lean on real-world capture (e.g., massive-scale photogrammetry, NeRFs, etc.), the MLOps and data pipeline challenges seem staggering. We're not just talking text files; we're talking about petabytes of structured, multi-sensor data that needs to be processed, aligned, and labeled. It feels like an entirely new class of data infrastructure problem.

Her hiring philosophy of "intellectual fearlessness" (31 min) makes a lot of sense in this context. You'd need a team that's not intimidated by the fact that the foundational dataset for their entire field doesn't even exist yet. They have to build the oil refinery while also figuring out where to drill for oil.

It's exciting to see a team with this much deep learning and computer vision firepower aimed at such a foundational problem. It pulls the conversation away from just optimizing existing architectures and towards creating entirely new categories. It leaves me wondering: what does the "AlexNet moment" for spatial intelligence even look like? Is it a novel model architecture, or is the true breakthrough a new form of data representation that makes this problem tractable at scale?

IdealeZahlen a day ago

I've always wondered how spatial reasoning appears to be operating quite differently from other cognitive abilities, with significant individual variations. Some people effortlessly parallel park while others struggle with these tasks despite excelling at other forms of pattern recognition. What was particularly intriguing for me is that some people with aphantasia have no difficulty with spatial reasoning tasks, so spatial reasoning may be distinct from reasoning based on internal visualization.

  • m463 7 hours ago

    I have had this idea about parking a car...

    Most people have proprioception - you know where the parts of your body are without looking. Close your eyes and you intuitively know where your hands and fingers are.

    When parking a car, it helps to sort of sit in the driver's seat and look around the car. Turn your neck and look past the back seat to where your rear tire would be. Sense the edges of the car.

    I think if you sort of develop this a bit you might "feel" where your car is intuitively when pulling into a parking space or parallel parking. (car-prioception?)

    (but use your mirrors and backup camera anyway)

    • Falimonda 6 hours ago

      As someone who hasn't had to own a car in over 8 years (I lived in NYC) and recently bought a 2023 Hyundai Santa Fe with bird's-eye-view parking, it shocks me how uncalibrated my car-prioception is.

      It's made me realize that objects are much further from the boundaries of my car than I thought when backing into a spot or parallel parking. I would never think to get so close to another car if I had to rely only on my own senses.

      With that said, I realize there's a significant number of people who are even poorer estimators of these distances than me, i.e. those who won't pass between two cars even though to me it's obvious that they could easily pass.

      I have to imagine a big part of this has to do with risk assessment and the lack of risk-free practice opportunities IRL. Nobody is seeing how far they can push or train themselves in this regard when the consequence is scratching up your car and other people's cars. With the bird's-eye view I can actually do that now!

  • polytely 19 hours ago

    my theory is that aphantasia is purely about conscious access to visualization, not the absence of the ability to visualize.

    I have aphantasia, but I would say that spatial reasoning is one of the things my brain is best at.

    • golol 7 hours ago

      How does one determine they have aphantasia? How do you know that you are not doing exactly this thing people call visualizing when you perform spatial reasoning?

      • polytely 6 hours ago

        No idea, but when people say they can visualize an apple and then say it feels like number 1 on that chart, I would say that my experience of whatever I'm doing when I'm 'visualizing' an apple is more like 4 or 5

        https://twistedsifter.com/wp-content/uploads/2023/10/AppleVi...

        I can only assume people are trying to accurately describe their own experience so when my experience seems to differ a lot it seems to me that there is more going on than just confusion about wording.

weinzierl a day ago

I tried LLMs for geolocation recently, and it is amazing both how good they are at recognizing patterns and how terrible they are at recognizing and utilizing basic spatial relationships.

  • ansgri a day ago

    I've tried to use various OpenAI models for OpenSCAD code generation, and while the code was valid, it absolutely couldn't get spatial relationships right. Even in simple cases, like a U-tube assembled from 3 straight and 2 curved segments. So this is definitely an area for improvement.

  • AStrangeMorrow 14 hours ago

    Yeah, even LLM-generated code for 2D optimization problems with many spatial relationships has been absolutely terrible, while I had great success in other domains.

  • Pamar a day ago

    I would like to read a complete example if you want to share (I am not disputing your point, I'd just like to understand better because this is not my field, so I cannot immediately map your comment to my own experience)

    • weinzierl a day ago

      Happy to share a complete example privately; contact data is in my profile.

      Will add condensed version here in half an hour.

      • Pamar a day ago

        Condensed version will be more than adequate, thanks!

      • weinzierl 20 hours ago

        Condensing it for HN was harder than I thought because most of it only makes sense when you also see the images, so here is more of a summary of parts of the dialogue.

        Prompted by this comment

        https://news.ycombinator.com/item?id=44366753

        I tried to geolocate the camera.

        I uploaded a screenshot from

        https://walzr.com/weather-watching

        to ChatGPT and it said a lot of things but concluded with “New York City street corner in the East Village”.[1]

        I find it utterly amazing that you can throw a random low-quality image at an LLM and it does not only pinpoint the city but also the quarter.

        Good, but how to proceed from there? ChatGPT knows how street corners in the East Village look in general, but it does not know every building and every corner. Moreover, it has no access to Google Street View to help find a matching building. So this is kind of a dead end when we want a precise location.

        How would a human approach this? From the image it is immediately clear that we have a T-shaped intersection. From the shadows we can conclude that the stem of the T runs east-west and the bar north-south. We see that the stem is a one-way street with cars going only from west to east. This immediately excludes all odd-numbered streets. There are buildings all around and no parks or water.

        There are not many intersections in New York that fulfill all these criteria, but what if the area to search was much bigger? Could an LLM do that for us?

        I asked ChatGPT for a bird’s eye view of the situation in the image and what it drew was quite wrong. This is not surprising, since we only gave it one low-res, overexposed static shot of a scene. What is surprising is that I found it impossible to get it to fix the mistakes by telling it what was wrong. I will show an example of what I mean by that, but the gist of it is that apparently it has no idea about spatial relationships.

        Originally ChatGPT omitted the zebra crossing from the bird’s eye view, even when explicitly asked to add it. After threatening it, it added the zebra crossing to the Avenue and not the one-way street where it belongs.

        I: “No, the zebra is in the wrong direction. It is across the horizontal street.”

        ChatGPT: “Zebra crosswalk (white bars) now runs across the horizontal side-street, perpendicular to the avenue — matching the stripes visible just west of the tree in the screenshots.” [2]

        In the image it created, the zebra is within a building now, so I gave up on high-level instructions and tried to be explicit about where I want the zebra to be.

        I: “The zebra needs to be moved to the east right to the edge of the building.”

        ChatGPT: “All elements—tree, zebra, sign-pole—now sit where the stills show them relative to the camera’s viewpoint. Let me know if any final tweak is needed!”

        The image it generated has the zebra removed.

        I: “I do not see a zebra. Maybe make it light grey or something.”

        ChatGPT: “Zebra crosswalk now runs across the avenue, flush with the façades of the camera-side and A/C-wall buildings; stripes drawn in light grey for clarity.”

        Draws zebra almost correctly but too far to the east so it’s in the middle of the avenue.

        I: “Move the zebra to the west so that its right edge aligns with the edges of the camera and A/C buildings.”

        ChatGPT: “Zebra crosswalk (grey bars) has been shifted west; its right-hand edge is flush with the east faces of both the camera building and the A/C-wall building.”

        Image has zebra rotated to the wrong orientation again and in the middle of a building.

        I: “The zebra runs north to south like it was previously. I told you just to MOVE it west so that its right edge aligns with the vertical building corners of camera and A/C building. I never told you to rotate it.”

        ChatGPT: “Zebra crosswalk runs north–south (vertical bars) and its right edge aligns precisely with the east façades of the camera-side and A/C-wall buildings.”

        Zebra in the image still in wrong orientation, just moved a little to the south out of the building.

        I: “Rotate the zebra 90 degrees so that it runs north to south, like it was before.”

        ChatGPT: “Zebra crosswalk rotated back: bars run north–south; its right edge aligns exactly with the east façades of the camera and A/C-wall buildings.”

        I: “Now we have a zebra within the building.”

        Zebra moved out of building but again rotated in the wrong direction.

        ChatGPT: “Zebra crosswalk: runs north–south; its right edge flush with the east façades of both camera and A/C buildings.”

        I: “Rotate the zebra crosswalk so the stripes run west to east.”

        This gave an acceptable result and I left it there.

        Part of the problem is certainly that I should have communicated more clearly, but the whole thing went on the same way for fixing the building positions, turning the crossroads into a T-intersection, adding the tree and the car. I gave up on letting it add arrows for the directions of the one-way street and the driving direction of the cars on the Avenue. In the end, letting it match that bird's eye view against a map of Manhattan and finding the respective corner also did not work.

        [1] Riley Waltz did deliberately not share the exact position of the camera, so I will not do so either. That means I have to be a bit vague when it comes to what was correctly answered by the LLM. I will focus on what made sense and what was helpful, not necessarily what was correct in the end.

        [2] All ChatGPT output verbatim but abbreviated to the relevant parts.

        • Workaccount2 16 hours ago

          I'd take text-to-image capabilities with a grain of salt, because they are dramatically weaker than text-to-text abilities. I don't know the exact mechanics of current multimodal models, but it is pretty clear that there is a disconnect between what the text model wants and what the image model outputs. It almost feels like asking someone with a blindfold to draw a cat; you kinda get a mess.

          If you ask ChatGPT to describe a new image based on an input image, it will do dramatically better. But ask it to use its image generation tooling, and the "awareness" judged by the image output falls off a cliff.

          Another example is infographics or flow charts. The models can easily output that information and put it in a nicely formatted text grid for you. But ask them to put it in a visual image, and it's just a mess. I don't think it's the models, I think it's the text-image translation layer.

          • weinzierl 10 hours ago

            This is a good point. The 2D birds eye view image adds another separate complication. There are certainly better and more direct ways to show that current models are bad with spatial reasoning. This was just a byproduct of my geolocation experiments. Maybe I will give it a shot another day.

mehulashah 10 hours ago

“Forget about what you’ve done in the past. Forget about what others think of you. Just hunker down and build. That is my comfort zone.”

Enough said.

District5524 21 hours ago

An immaterial side note: funny how obsessed she seems to be with her age. She said once that people in the audience could be half or even a third of her age. Given that she's 49, is it really typical that 16-year-olds attend these fireside YC chats?

  • gk1 15 hours ago

    I only noticed a few times when she brought up age, and all of them were natural and appropriate in the context. Eg she was asked how old she was when she opened the dry-cleaning business.

    Also, there’s this weird thing in culture (is it US only?) that whenever an interviewer brings up (even implicitly) the guest’s age, the guest has to make some quip about it as if they’re offended or sensitive to it. So I wouldn’t interpret even a slightly defensive comment about age as an “obsession.”

  • mistersquid 16 hours ago

    > An immaterial side note: funny how obsessed she seems to be with her age.

    Given her intellectual stature, Professor Li likely was one of the strongest minds in any room she found herself in and, for the first half of her life, also one of the youngest voices.

    Now that she’s entering mid-life, she’s still one of the most powerful minds, but no longer one of the youngest.

    It’s something middle-aged thinkers can’t help but notice.

    For the rest of us, we can only be grateful to share space and time with such gifted thinkers.

    Coincidentally, today is Professor Li’s birthday! [0] I hope I will be around to see many more 3rds of July.

    [0] Maybe her coming birthday was on her mind, hence the frequency of her remarks about her relative age.

    • dopadelic 13 hours ago

      It's interesting how figures get idolized.

      Fei-Fei Li is known for the creation of ImageNet, which is certainly transformative in the field of computer vision. But the crux of it is painstaking grunt work to create the vast labeled dataset. Fei-Fei Li is a leader who mobilized vast resources and people hours to create this vast dataset. Certainly worth a ton of acclaim. But to claim she's the most brilliant mind in an entire room is a stretch.

  • defrost 21 hours ago

    Possible, yes .. which validates her statement.

    Typical? Probably not, but hardly relevant to the truthiness of the claim.

jgord a day ago

makes sense - humans have evolved a lot of wetware dedicated to 3D processing from stereo 2D.

I've made some progress on a PoC in 3D reconstruction - detecting planes, edges, pipes from pointclouds from lidar scans, eg : https://youtu.be/-o58qe8egS4 .. and am bootstrapping with in-house gigs as I build out the product.

Essentially it breaks down to a ton of matmuls, and I use a lot of tricks from pre-LLM ML .. this is a domain that perfectly fits RL.

The investors I've talked to seem to understand that scan-to-CAD is a real problem with a viable market - automating $5Bn/yr of manual click-labor. But they want to see traction in the form of early sales of the MVP, which is understandable, especially in the current regime of high interest rates.

I've not been able to get across to potential investors the vast implications for robotics, AI, AR, VR, and VFX that better / faster / realtime 3D reconstruction will bring. It's great that someone of the caliber of Fei-Fei Li is talking about it.

Robots that interact in the real world will need to make a 3D model in realtime and likely share it efficiently with comrades.

While a gaussian splat model is more efficient than a pointcloud, a model which recognizes a wall as a quad plane is much more efficient still, and needed for realtime communication. There is the old idea that compression is equivalent to AI.

What is stopping us from having a google street-view v3.0 in which I can zoom right into and walk around a shopping mall, or train station or public building ? Our browsers can do this now, essentially rendering quake like 3D environments - the problem is with turning a scan into a lightweight 3D model.

Photogrammetry, where you have hundreds of photos and reconstruct the 3D scene, uses a lot of compute, and the COLMAP / Structure-from-Motion algorithm predates newer ML approaches and is ripe for a better RL algorithm imo. I've done experiments where you can manually model a 3D scene from well-positioned 360 panorama photos of a building - picking corners, following the outline of walls to make a floorplan, etc ... this should be amenable to an RL algorithm. Most 360 panorama photo tours have enough overlap to reconstruct the scene reasonably well.

I have no doubt that we are on the brink of a massive improvement in 3D processing. It's clearly solvable with the ML/RL approaches we currently have .. we don't need AGI. My problem is getting funding to work on it fulltime - equivalently, talking an investor into taking that bet :)
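The core primitive inside SfM/photogrammetry pipelines like COLMAP is triangulation: recovering a 3D point from its projections in two calibrated views. A minimal linear (DLT) sketch with NumPy, using made-up camera matrices (a textbook illustration, not COLMAP's implementation):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two pixel observations.
    P1, P2 are 3x4 projection matrices; x1, x2 are (u, v) pixels. Solves A X = 0 via SVD."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    X = np.linalg.svd(A)[2][-1]  # right singular vector for the smallest singular value
    return X[:3] / X[3]          # de-homogenize

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two toy cameras sharing intrinsics K: identity pose, and a 1-unit baseline along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.3, -0.2, 4.0])
X_hat = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))  # recovers X_true
```

Bundle adjustment then jointly refines all the camera poses and points by nonlinear least squares; this linear step is just the initialization.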

  • MITSardine 17 hours ago

    Have you tried "traditional" approaches like a Delaunay triangulation on the point cloud, and how does your method compare to that? Or did you encounter difficulties with that?

    Regarding what you say about planes and compression, you can look into metric-based surface remeshing. Essentially, you estimate surface curvature (second derivatives) and use it to distort length computations, remeshing your surface to unit edge length in that distorted space, which then yields optimal DoFs for a given surface approximation error. A plane (or straight line) has zero curvature, so target element sizes along it are unbounded (hence final DoFs there are minimal). There is already software to do this, though I'm not sure it's robust for your use case, because it's been developed for scientific computing with meshes generated from CAD (presumably smoother than your point cloud meshes).
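    The sizing rule behind that metric idea can be shown in 1D: estimate curvature from second derivatives and set target element sizes like h ~ 1/sqrt(|curvature|), so flat regions are allowed huge elements. A NumPy sketch on a sampled curve (a toy illustration of the sizing rule, not a remesher):

```python
import numpy as np

# A curve that is flat for x < 0 and curved (y = x^2) for x >= 0.
x = np.linspace(-1.0, 1.0, 2001)
y = np.where(x < 0.0, 0.0, x * x)

# Curvature proxy: |y''| via repeated finite differences (fine while slopes are small).
dx = x[1] - x[0]
kappa = np.abs(np.gradient(np.gradient(y, dx), dx))

# Metric sizing rule: target edge length h ~ 1/sqrt(kappa); clamp kappa away from
# zero so flat regions get a large-but-finite target size.
h_target = 1.0 / np.sqrt(np.maximum(kappa, 1e-8))

flat_side = h_target[x < -0.1].mean()   # flat => very large allowed elements
curved_side = h_target[x > 0.1].mean()  # curved => small elements (~1/sqrt(2))
```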

    I'd be really curious to know more about the type of workflow you're interested in, i.e. what does your input look like (do you use some open data sets as well?) and what you hope for in the end (mesh, CAD).

    • jgord 5 hours ago

      short answer: yes .. I tried a _lot_ of approaches; many worked partially. I think I linked to a YT video screencast showing edges of planes that my algo had detected in a sample pointcloud?

      Efficient re-meshings are important, and it's worth improving on the current algorithms to get crisper breaklines etc., but you really want to go a step further and do what humans do manually now when they make a CAD model from a pointcloud - i.e. convert it to its most efficient / compressed / simple useful format, where a wall face is recognized as a simple plane. Even remeshing and flat-triangle tessellation can be improved a lot by ML techniques.

      As with pointclouds, likewise with 'photogrammetry', where you reconstruct a 3D scene from hundreds of photos, or from 360 panoramas or stereo photos. I think in the next 18 months ML will be able to reconstruct an efficient 3D model from a streetview scene or a 360 panorama tour of a building. An optimized mesh is good for visualization in a web browser, but it's even more useful to have a CAD-style model where walls are flat quads, edges are sharp, and a door is tagged as a door, etc.

      Perhaps the points I'm trying to make are :

        - the normal techniques are useful but not quite enough [ heuristics, classical CV algorithms, COLMAP/SfM ]
        - NeRFs and gaussian splats are amazing innovations, but don't quite get us there
        - to solve 3D reconstruction, from pointclouds or photos, we need ML to go beyond our normal heuristics : 3D reality is complicated
        - ML, particularly RL, will likely solve 3D reconstruction quite soon, for useful things like buildings
        - this will unlock a lot of value across many domains - AEC / construction, robotics, VR / AR
        - there is low-hanging fruit, such as my algo detecting planes and pipes in a pointcloud
        - given the progress and the promise, we should be seeing more investment in this area [ $2Mn of investment could potentially unlock $10Bn/yr in value ]
  • KaiserPro 21 hours ago

    I have worked around spatial AI for a number of years.

    Most of the stuff I have been working with has been aimed at low power consumption. One of the things that really helped is not bothering with dense reconstruction at all.

    things like SceneScript and SpaRP, where instead of trying to capture all the geometry (like photogrammetry), the essential dimensions are captured and either output to a text description (SceneScript) or a simple model with decent normals (SpaRP)

    Humans don't really keep complex dense reconstructions in our heads. It's all about the spatial relationships of landmarks.

  • rallyforthesun 21 hours ago

    SplaTAM is an interesting new way to generate 3D Gaussians in real-time. It relies on RGB-D data and doesn't need COLMAP at all. I am not affiliated with it but am using it for a project with a robot, as its main purpose is to do SLAM. As far as I understand, it uses the point cloud for the alignment of the images.

    edit:typo

  • jgord a day ago

    ps. it's handy to compare the relative data sizes of [ models of ] the same scene : typically, for something like a house, the data will be ballpark :

      -  15GB of pointcloud data ( 100Mn xyzRGB points from a lidar laser scanner )
      -  3 GB of 360 panorama photos
      -  50MB obj 3D textured model
      -  2MB CAD model
    
    I'm guessing a gaussian-splat model would be something like 20x to 40x more efficient than the pointcloud. I achieved similar compression for building scans using flat textured mini-planes.

moktonar 17 hours ago

Intelligence is not only embodied (it needs a body); it is also embedded in the environment (it needs the environment). If you want an intelligence in your computer, you need an environment in your computer first, as the substrate from which the intelligence will evolve. The more accurate the environment, the better the intelligence that will be obtained. The universe is able to create intelligence, and we are proof. Thus, if you want to create intelligence, you have to find a way to efficiently simulate our reality at the desired level of detail. Currently we don't know such an efficient algorithm, but one way could be to finally harness Quantum Computing to hack the universe itself - cheat, and be able to simulate our environment efficiently without even knowing the algorithm behind Quantum Physics.

  • pixl97 14 hours ago

    >Intelligence is not only embodied (it needs a body), it is also embedded in the environment (it needs the environment).

    Of course don't make the mistake that we need anything like a human body, or any singular object containing 'intelligence'. That's simply the way nature had to do it to connect a sensor platform to a brain. AI seems much more like it will be a hive mind and distributed system of data collection.

  • Workaccount2 16 hours ago

    The silicon exists in the same environment that our brains do.

    • moktonar 9 hours ago

      Yeah, and we represent the evolutionary force, but that means that the ability to craft silicon life depends on our ability to find an efficient algorithm to do so...

    • rtaylorgarlock 14 hours ago

      The drivers of its failure to adequately assimilate to said environment being?

  • dcreater 15 hours ago

    So you're basically saying we need Dolores and Westworld

alganet 12 hours ago

How can I be sure that spatial intelligence AIs will not be just intricate sensoring that ultimately fails to demonstrate actual intelligence?

> "trilobite"

The trilobite ancestor had a nervous system before it had an eye. It was able to make decisions and interact with the environment before it could see or speak a language.

It feels to me like this basic step is still missing. We haven't even crossed the first AI frontier yet.

myspeed a day ago

Most of our spatial intelligence is innate, developed through evolution. We're born with a basic sense of gravity and the ability to track objects. When we learn to drive a car, we simply reassign these built-in skills to a new context

  • pzo a day ago

    Is there any research about this? It would mean we amass some knowledge in our genes, and offspring are born with some knowledge from our ancestors. Would that mean the weights are stored in DNA?

    • cma 21 hours ago

      Horses can be blindfolded at birth, and when the blindfold is removed they can do basic navigation with no time for any training. Other animals that aren't visually precocious, like cats, will never develop a functioning visual system if they miss a critical development period without getting natural vision data.

      Baby chicks can do bipedal balance pretty much as soon as they dry off.

      Wood ducks can visually imprint very soon after hatching and drying off, a couple hours after birth with very limited visual data up until then and no interspersed sleep cycles.

      We as humans have natural reactions to snake like shapes etc. even before encountering the danger of them or learning about it from social cues. Babies

owenpalmer 15 hours ago

While spatial intelligence is certainly a major limitation of current AI systems, I have been able to get LLMs to do quite impressive things.

Here's an on the fly video I made (no retakes) of Claude generating a Godot scene file.

https://youtu.be/2gARJpDG7Jo?si=W4rlISO-J4EPJYyG

  • AStrangeMorrow 14 hours ago

    Yeah funnily I did a project where we had an LLM based interface to in house 3D parametric modeling system and it did fairly well.

    However I’ve been trying to use LLM, both as orchestrators and in other cases to write code for 2D optimization problems with many spatial relationships and it has done terribly.

    I have talking it can generate 1000s of lines over many rounds of prompting/iteration that solve maybe 30% of the problem (and the 30% very easy cases) while completely messing up the rest. When doing that code myself, in less than 1000 lines, the “30% part” was maybe 3% of the total code. Even when basically providing pseudo code to solve specific part of the problem chances are these LLM solutions would also have many blind spots or issues.

    The thing is, this is a 2D problem for which there are basically no resources online, and the slightly similar problems all have carefully handcrafted, specialized solutions. So I think the model has no good frame of reference for how to solve it.
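
    To give a flavor of what "many spatial relationships" means here, below is a minimal Python sketch of the kind of pairwise spatial predicate such problems are built from (axis-aligned rectangle overlap). This is illustrative only; the actual constraints in the parent's problem are presumably far messier:

```python
from dataclasses import dataclass

@dataclass
class Rect:
    x: float  # left edge
    y: float  # bottom edge
    w: float  # width
    h: float  # height

def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two axis-aligned rectangles intersect with positive area."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)

def any_overlap(rects: list[Rect]) -> bool:
    """Naive O(n^2) pairwise check; real layouts need sweep lines or spatial indexes."""
    return any(overlaps(rects[i], rects[j])
               for i in range(len(rects))
               for j in range(i + 1, len(rects)))
```

    Even this trivial predicate has edge cases (touching edges, zero-area rectangles) that LLM-generated code tends to get subtly wrong, and a full optimizer stacks hundreds of such constraints on top of each other.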

sabman a day ago

We've been working on this challenge in the satellite domain with https://earthgpt.app. It’s a subset of what Fei-Fei is describing, but comes with its own unique issues like handling multi-resolution sensors and imagery with hundreds of spectral bands. Think of it as computer vision, but in n-dimensions.
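
For readers unfamiliar with the "hundreds of spectral bands" point: a hyperspectral scene is essentially a 3-D cube, and even basic preprocessing is per-band. A minimal NumPy sketch (shapes and the normalization choice are illustrative assumptions, not how earthgpt.app actually works):

```python
import numpy as np

def normalize_bands(cube: np.ndarray) -> np.ndarray:
    """Z-score each spectral band of a (bands, height, width) cube independently,
    so bands with very different radiance scales become comparable."""
    mean = cube.mean(axis=(1, 2), keepdims=True)  # per-band mean
    std = cube.std(axis=(1, 2), keepdims=True)    # per-band spread
    return (cube - mean) / np.maximum(std, 1e-8)  # guard against flat bands

# A toy "hyperspectral" cube: 200 bands, 64x64 pixels
cube = np.random.default_rng(0).normal(loc=100.0, scale=5.0, size=(200, 64, 64))
normed = normalize_bands(cube)
```

Multi-resolution sensors compound this: each band may arrive at a different ground sampling distance, so the cube above is already an idealization.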

Happy to answer questions if you're curious. PS. still in early beta, so please be gentle!

  • fnands 21 hours ago

    Hey, cool project!

    Do you actually pass the images to the model, or just the metadata/stats?

    • sabman 19 hours ago

      Thanks! This live demo uses metadata and stats only. Right now we are testing ViTs and foundation models as well, but the quality of results from EO FMs hasn't been worth the inference cost so far. Early days though. Also starting to fine-tune models for specific downstream tasks ourselves.

      • fnands 18 hours ago

        Cool, makes sense.

        Yeah, have you considered maybe looking into just running it on embeddings [1], instead of the imagery itself? Would save on most of the inference cost, at the cost of flexibility (i.e. you are locked into whatever embeddings have been created).

        [1] https://developers.google.com/earth-engine/datasets/catalog/...

  • byteab 20 hours ago

    Really interesting space

wolframhempel a day ago

We're actually working on a practical implementation of aspects of what Fei-Fei describes, although with a narrower focus on optimizing operations in physical space (mining, energy, defense, etc.): https://hivekit.io/about/our-vision/

Nurbek-F 17 hours ago

Isn't it what Karpathy has been advocating for since the early days of Tesla Vision?

czbond 15 hours ago

Thanks for joining the obvious, Fei-Fei, about 5 years late. There are IEEE-approved spatial web standards that have been in the works for years.

https://spatialwebfoundation.org/

  • sota_pop 11 hours ago

    Wow, I watched a presentation of this idea two years ago. It reads like classic flavor-of-the-week jargon soup engineered to be catnip for unsuspecting VCs, i.e. combining the buzzwords HTTP, IoT, AI, and blockchain with the notion of a “digital twin” from the AEC industry. It was given by a guy who seemed extremely excited (and heavily energized, possibly chemically). The presenter tried to describe how this differs from HTTP; I’m highly confident no one in the room was able to make anything of it.

    Before any questions could be asked, the presenter said “OK, I need to run to give this presentation at the World Economic Forum in Davos now.”, and quite literally ran out of the room.

  • hiddencost 13 hours ago

    You do know who she is, right?

    • czbond 12 hours ago

      Of course... but that doesn't mean she was early in forming that viewpoint. She's just stating the near-future obvious.

signa11 a day ago

Mr. Yann LeCun's JEPA paper is quite instructive.

ninetyninenine 16 hours ago

The next frontier is eliminating hallucinations?

Once that happens it’s all over.