The Bitter Lesson Is Misunderstood (obviouslywrong.substack.com)
kushalc 18 hours ago [-]
Hey folks, OOP/original author and 20-year HN lurker here. A friend just told me about this, so I thought I'd chime in.

Reading through the comments, I think there's one key point that might be getting lost: this isn't really about whether scaling is "dead" (it's not), but rather how we continue to scale for language models at the current LM frontier — 4-8h METR tasks.

Someone commented below about verifiable rewards and IMO that's exactly it: if you can find a way to produce verifiable rewards about a target world, you can essentially produce unlimited amounts of data and (likely) scale past the current bottleneck. Then the question becomes, working backwards from the set of interesting 4-8h METR tasks, what worlds can we make verifiable rewards for and how do we scalably make them? [1]
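To make "verifiable rewards" concrete, here's a minimal sketch of the loop I have in mind (toy task, made-up function names, purely illustrative): pick any world where the outcome can be checked programmatically, and you can mint reward signal, and therefore training data, without human labels.

  import random

  random.seed(0)

  def make_task():
      # toy "world": sort a random list; outcomes in this world are fully checkable
      xs = [random.randint(0, 99) for _ in range(8)]
      return {"prompt": f"Sort ascending: {xs}", "input": xs}

  def verify(task, answer):
      # the verifier is cheap, deterministic, and needs no human labels
      return 1.0 if answer == sorted(task["input"]) else 0.0

  def model_answer(task):
      # stand-in for a policy/LLM proposal; imagine sampling from a model here
      return sorted(task["input"]) if random.random() < 0.7 else list(task["input"])

  rewards = []
  for _ in range(1000):  # "unlimited data": just keep generating fresh tasks
      task = make_task()
      rewards.append(verify(task, model_answer(task)))
  print(f"mean reward over generated tasks: {sum(rewards)/len(rewards):.2f}")

The hard part isn't this loop; it's finding verifiers whose worlds overlap with the interesting 4-8h METR tasks.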

Which is to say, it's not about more data in general, it's about the specific kind of data (or architecture) we need to break a specific bottleneck. For instance, real-world data is indeed verifiable and will be amazing for robotics, etc. but that frontier is further behind: there are some cool labs building foundational robotics models, but they're maybe ~5 years behind LMs today.

[1] There's another path with better design, e.g. CLIP that improves both architecture and data, but let's leave that aside for now.

FloorEgg 18 hours ago [-]
10+ years ago I expected we would get AI that would impact blue collar work long before AI that impacted white collar work. Not sure exactly where I got the impression, but I remember some "rising tide of AI" analogy and graphic that had artists and scientists positioned on the high ground.

Recently it doesn't seem to be playing out as such. The current best LLMs I find marvelously impressive (despite their flaws), and yet... where are all the awesome robots? Why can't I buy a robot that loads my dishwasher for me?

Last year this really started to bug me, and after digging into it with some friends I think we collectively realized something that may be a hint at the answer.

As far as we know, it took roughly 100M-1B years to evolve human-level "embodiment" (from single-celled organisms to humans), but only around 100k-1M years for humanity to evolve language, knowledge transfer and abstract reasoning.

So it makes me wonder, is embodiment (advanced robotics) 1000x harder than LLMs from an information processing perspective?

breuleux 17 hours ago [-]
> So it makes me wonder, is embodiment (advanced robotics) 1000x harder than LLMs from an information processing perspective?

Essentially, yes, but I would go further in saying that embodiment is harder than intelligence in and of itself.

I would argue that intelligence is a very simple and primitive mechanism compared to the evolved animal body, and the effectiveness of our own intelligence is circumstantial. We manage to dominate the world mainly by using brute force to simplify our environment and then maintaining and building systems on top of that simplified environment. If we didn't have the proper tools to selectively ablate our environment's complexity, the combinatorial explosion of factors would be too much to model and our intelligence would be of limited usefulness.

And that's what we see with LLMs: I think they model relatively faithfully what, say, separates humans from chimps, but they lack the animal library of innate world understanding which is supposed to ground intellect and stop it from hallucinating nonsense. They're trained on human language, which is basically the shadows in Plato's cave. They're very good at tasks that operate in that shadow world, like writing emails, or programming, or writing trite stories, but most of our understanding of the world isn't encoded in language, except very very implicitly, which is not enough.

What trips us up here is that we find language-related tasks difficult, but that's likely because the ability evolved recently, not because they are intrinsically difficult (likewise, we find mental arithmetic difficult, but it is not intrinsically so). As it turns out, language is simple. Programming is simple. I expect that logic and reasoning are also simple. The evolved animal primitives that actually interface with the real world, on the other hand, appear to be much more complicated (but time will tell).

FloorEgg 15 hours ago [-]
Nicely said. This all aligns with my intuition, with one caveat.

I think you and I are using different definitions of intelligence. I'm bought into Karl Friston's free energy principle and think it's intelligence all the way down. There is no separating embodiment and intelligence.

The LLM distinction is intelligence via symbols as opposed to embodied intelligence, which is why I really like your shadow world analogy. Without getting caught up in subtle differences in our ontologies, I agree wholeheartedly.

breuleux 15 hours ago [-]
You're right, we probably have different ontologies. To me an intelligent system is a system which aims to realize a goal through modelling its environment and planning actions to bring about that intended state. That's more or less what humans do and I think that's more in line with the colloquial understanding of it.

There are basically two approaches to defining intelligence, I think. You can either define it in terms of capability, in which case a system that has no intent and does not plan can be more intelligent than one that does, simply by virtue of being more effective. Or you can define it in terms of mechanism: something is intelligent if it operates in a specific way. But it may then turn out to be the case that some non-intelligent systems are more effective than some intelligent systems. Or you can do both and assume that there is some specific mechanism (human intelligence, conveniently) that is intrinsically better than the others, which is a mistake people commonly make and is the source of a lot of confusion.

I tend to go for the second approach because I think it's a more useful framing to talk about ourselves, but the first is also consistent. As long as we know what the other means.

FloorEgg 14 hours ago [-]
If intelligence is treated as a scale, should it be measured primarily by (a) the diversity of valid actions an entity can take combined with its ability to collect and process information about its environment and predict outcomes, or (b) only by its ability to collect and process information and predict outcomes?

In either case, the smallest unit of intelligence could be seen as a component of a two-field or particle interaction, where information is exchanged and an outcome is determined. Scaled up, these interactions generate emergent properties, and at each higher level of abstraction, new layers of intelligence appear that drive increasing complexity. Under such a view, a less intelligent system might still excel in a narrow domain, while a more intelligent system, effective across a broader range, might perform worse in that same narrow context.

Depending on the context of the conversation, I might go along with some cut-off on the scale, but I don't see why the scale isn't continuous. Maybe it has stacked s-curves though...

We just happen to exist at an interesting spot on the fractal that's currently the highest point we can see. So it makes sense we would start with our own intelligence as the idea of intelligence itself.

GeorgeTirebiter 11 hours ago [-]
I think it's an issue of hierarchies and the Society of Mind (Minsky). If a human's hand, or any animal's end effector, touches a hot stove, a lower-level process instantly pulls the hand/paw away from the heat. There are no doubt thousands of these 'smart body, no brain' interactions that take over in certain situations, conscious thinking not required.

Ken Goldberg shows that getting robots to operate in the real world using the methods that have been successful in getting LLMs to do things we consider smart -- getting huge amounts of training data -- seems unlikely. The vast gap between what little data a company like Physical Intelligence has vs. what GPT-5 uses is shown here: https://drive.google.com/file/d/16DzKxYvRutTN7GBflRZj57WgsFN... (84 seconds)

Ken advocates plenty of Good Old-Fashioned Engineering to help close this gap, and worries that demos like Optimus actually set the field back because expectations are set too high. Like the AI researchers who were shocked by LLMs' advances, it's possible something out of left field will close this training gap for robots. I think it'll be at least 5 more years before robots will be among us as useful in-house servants. We'll see if the LLM hype has spilled over too much into the humanoid robot domain soon enough.

pmontra 11 hours ago [-]
> But it may then turn out to be the case that some non-intelligent systems are more effective than some intelligent systems.

That is surely the case on limited scopes. For example the non neural net chess engines are better at chess than any human.

I think that for neural networks to compare with human intelligence in a fair way, we should limit their training to the number of games that human professionals can reasonably play in their lives. AlphaGo wouldn't be much good after playing, let's say, ten thousand games, even starting from the corpus of existing human games.

coldtea 10 hours ago [-]
>There is no separating embodiment and intelligence.

And yet whatever IQ you have, it can't make you just play the violin without actually having embodied practice first.

thfuran 6 hours ago [-]
If you have sufficient motor control and dexterity, the amount of required practice should be approximately zero. Just calculate the required finger position and bow orientation, pressure, and velocity for optimal production of the desired sound and do that. That is not how humans perform physical tasks though.
highfrequency 7 hours ago [-]
> We manage to dominate the world mainly by using brute force to simplify our environment and then maintaining and building systems on top of that simplified environment. If we didn't have the proper tools to selectively ablate our environment's complexity…

This is very interesting and I feel there is a lot to unpack here. Could you elaborate on this theory with a few more paragraphs (or books / blogs that elucidate this)? In what ways do we use brute force to simplify the environment, and are there not ways in which we use highly sophisticated, leveraged tools to simplify our environment? What proper tools allow us to selectively ablate complexity? Why does our intelligence only operate on simplified forms?

Also, what would convince you that symbolic intelligence is actually “harder” than embodied intelligence? To me the natural test is how hard it is for each one to create the other. We know it took a few billion years to go from embodied intelligence (ie organisms that can undergo evolution, with enough diversity to survive nearly any conditions on Earth) to sophisticated symbolic intelligence. What if it turns out that within 100 years, symbolic intelligence (contained in LLM like systems) could produce the insights to eg create new synthetic life from scratch that was capable of undergoing self-sustained evolution in diverse and chaotic environments? Would this convince you that actually symbolic intelligence is the harder problem?

lucketone 7 hours ago [-]
Not OP, but several examples:

A. Instead of building a house on random terrain with random materials, we prefer to first flatten the site, then use standard materials (e.g. bricks), which were produced from a simple source (e.g. a large and relatively homogeneous deposit of clay).

B. For mental tasks it's often said that a person can handle only 7 items at a time (if you disagree, multiply by 2-3). But when you ride a bike you process far more inputs at the same time: you hear a car behind you, you see a person on the right, you feel your balance, you anticipate your direction, if you feel strong wind or sun on your face you probably squint your eyes, you take a breath of air. On top of that, all the processes of your body adjust to support your riding: heart, liver, stomach…

C. “Spherical cows” in physics. (Google this if needed)

breuleux 3 hours ago [-]
> Why does our intelligence only operate on simplified forms?

Part of the issue with discussing this is that our understanding of complexity is subjective and adapted to our own capabilities. But the gist of it is that the difficulty of modelling and predicting the behavior of a system scales very sharply with its complexity. At the end of the scale, chaotic systems are basically unintelligible. Since modelling is the bread and butter of intelligence, any action that makes the environment more predictable has outsized utility. Someone else gave pretty good examples, but I think it's generally obvious when you observe how "symbolic-smart" people think (engineers, rationalists, autistic people, etc.) They try to remove as many uncontrolled sources of complexity as possible. And they will rage against those that cannot be removed, if they don't flat out pretend they don't exist. Because in order to realize their goals, they need to prove things about these systems, and it doesn't take much before that becomes intractable.

One example of a system that I suspect to be intractable is human society itself. It is made out of intelligent entities, but as a whole I don't think it is intelligent, or that it has any overarching intent. It is insanely complex, however, and our attempts to model its behavior do not exactly have a good record. We can certainly model what would happen if everybody did this or that (aka a simpler humanity), but everybody doesn't do this and that, so that's moot. I think it's an illuminating example of the limitations of symbolic intelligence: we can create technology (simple), but we have absolutely no idea what the long term consequences are (complex). Even when we do, we can't do anything about it. The system is too strong, it's like trying to flatten the tides.

> To me the natural test is how hard it is for each one to create the other.

I don't think so. We already observe that humans, the quintessential symbolic intelligences, have created symbolic intelligence before embodied intelligence. In and of itself, that's a compelling data point that embodied is harder. And it appears likely that if LLMs were tasked to create symbolic intelligences, even assuming no access to previous research, they would recreate themselves faster than they would create embodied intelligences. Possibly they would do so faster than evolution, but I don't see why that matters, if they also happen to recreate symbolic intelligence even faster than that. In other words, if symbolic is harder... how the hell did we get there so quick? You see what I mean? It doesn't add up.

On a related note, I'd like to point out an additional subtlety regarding intelligence. Intelligence (unlike, say, evolution) has goals and it creates things to further these goals. So you create a new synthetic life. That's cool. But do you control it? Does it realize your intent? That's the hard part. That's the chief limitation of intelligence. Creating stuff that is provably aligned with your goals. If you don't care what happens, sure, you can copy evolution, you can copy other methods, you can create literally anything, perhaps very quickly, but that's... not smart. If we create synthetic life that eats the universe, that's not an achievement, that's a failure mode. (And if it faithfully realizes our intent then yeah I'm impressed.)

djmips 16 hours ago [-]
You've captured a lot here with your shadow world summary. Very well done - I've been feeling this and now you've turned it into words and I'm pretty sure you're correct!
programjames 16 hours ago [-]
It took about the same amount of time to evolve human-level intelligence as human-level mobility. Pretty much no other animal walks on two legs...
trescenzi 16 hours ago [-]
This is interesting to think about. It’s basically just birds and primates. Birds have an ancient evolutionary tree as they are dinosaurs, which did actually walk on two legs. But the gap between dinos and primates walking on two feet, I think, is tens of millions of years. So yea pretty long time.
noduerme 15 hours ago [-]
This makes me think something else, though. Once we were able to reason about the physics behind the way things can move, we invented wheels. From there it's a few thousand years to steam engines and a couple hundred more years to jet planes and space travel.

We may have needed a billion years of evolution from a cell swimming around to a bipedal organism. But we are no longer speed limited by evolution. Is there any reason we couldn't teach a sufficiently intelligent disembodied mind the same physics and let it pick up where we left off?

I like the notion of the LLM's understanding being the "shadows on the wall" of the Plato's cave metaphor, and language may be just that. But math and physics can describe the world much more precisely and, if you pair them with the linguistic descriptors, a wall shadow is not very different from what we perceive with our own senses and learn to navigate.

breuleux 14 hours ago [-]
Note that wheels, steam engines, jet planes, spaceships wouldn't survive on their own in nature. Compared to natural structures, they are very simple, very straightforward. And while biological organisms are adapted to survive or thrive in complicated, ever-changing ecosystems, our machines thrive in sanitized environments. Wheels thrive on flat surfaces like roads, jet planes thrive in empty air devoid of trees, and so on. We ensure these conditions are met, and so far, pretty much none of our technology would survive without us. All this to say, we're playing a completely different game from evolution. A much, much easier game. Apples and oranges.

As for limits, in my opinion, there are a few limits human intelligence has that evolution doesn't. For example, intent is a double-edged sword: it is extremely effective if the environment can be accurately modelled and predicted, but if it can't be, it's useless. Intelligence is limited by chaos and the real world is chaotic: every little variation will eventually snowball into large scale consequences. "Eventually" is the key word here, as it takes time, and different systems have different sensitivities, but the point is that every measure has a half-life of sorts. It doesn't matter if you know the fundamentals of how physics work, it's not like you can simulate physics, using physics, faster than physics. Every model must be approximate and therefore has a finite horizon in which its predictions are valid. The question is how long. The better we are at controlling the environment so that it stays in a specific regime, the more effective we can be, but I don't think it's likely we can do this indefinitely. Eventually, chaos overpowers everything and nothing can be done.
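A toy way to see the "finite horizon" point, with the logistic map standing in for any chaotic system (numbers purely illustrative):

  # two trajectories that start 1e-9 apart stay close for a while, then decorrelate completely
  def step(x, r=4.0):          # r = 4 puts the logistic map in its chaotic regime
      return r * x * (1.0 - x)

  a, b = 0.200000000, 0.200000001
  for t in range(1, 61):
      a, b = step(a), step(b)
      if t % 10 == 0:
          print(f"t={t:2d}  |a-b| = {abs(a - b):.2e}")
  # the gap grows roughly exponentially; after a few dozen steps the "model" (b)
  # tells you essentially nothing about the "real" trajectory (a)

No amount of extra cleverness in the model changes that; only controlling the environment so it stays in a tamer regime buys you a longer horizon.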

Evolution, of course, having no intent, just does whatever it does, including things no intelligence would ever do because it could never prove to its satisfaction that it would help realize its intent.

noduerme 13 hours ago [-]
Okay, but (1) we don't need to simulate physics faster than physics to make accurate-enough predictions to fly a plane, in our heads, or build a plane on paper, or to model flight in code. (2) If that's only because we've cleared out the trees and the Canada Geese and whatnot from our simplified model and "built the road" for the wheels, then necessity is also the mother of invention. "Hey, I want to fly but I keep crashing into trees" could lead an AI agent to keep crashing, or model flying chainsaws, or eventually something that would flatten the ground in the shape of a runway. In other words, why are we assuming that agents cannot shape the world (virtual, for now) to facilitate their simplified mechanical and physical models of "flight" or "rolling" in the same way that we do?

Also, isn't that what's actually scary about AI, in a nutshell? The fact that it may radically simplify our world to facilitate e.g. paper clip production?

breuleux 54 minutes ago [-]
> we don't need to simulate physics faster than physics to make accurate-enough predictions to fly a plane

No, but that's only a small part of what you need to model. It won't help you negotiate a plane-saturated airspace, or avoid missiles being shot at you, for example, but even that is still a small part. Navigation models won't help you with supply chains and acquiring the necessary energy and materials for maintenance. Many things can -- and will -- go wrong there.

> In other words, why are we assuming that agents cannot shape the world

I'm not assuming anything, sorry if I'm giving the wrong impression. They could. But the "shapability" of the world is an environment constraint, it isn't fully under the agent's control. To take the paper clipper example, it's not operating with the same constraints we are. For one, unlike us (notwithstanding our best efforts to do just that), it needs to "simplify" humanity. But humanity is a fast, powerful, reactive, unpredictable monster. We are harder to cut than trees. Could it cull us with a supervirus, or by destroying all oxygen, something like that? Maybe. But it's a big maybe. Such brute force requires a lot of resources, the acquisition of which is something else it has to do, and it has to maintain supply chains without accidentally sabotaging them by destroying too much.

So: yes. It's possible that it could do that. But it's not easy, especially if it has to "simplify" humans. And when we simplify, we use our animal intelligence quite a bit to create just the right shapes. An entity that doesn't have that has a handicap.

coldtea 9 hours ago [-]
>Also, isn't that what's actually scary about AI, in a nutshell? The fact that it may radically simplify our world to facilitate e.g. paper clip production?

No, it's more about massive job losses and people left to float alone, mass increase in state control and surveillance, mass brain rot due to AI slop, and full deterioration of responsibility and services through automation and AI as a "responsibility shield".

baq 13 hours ago [-]
Something that isn’t obvious when we’re talking about the invention of the wheel: we aren’t actually talking about the round shape thing, we’re actually talking about the invention of the axle which allowed mounting a stationary cart on moving wheels.
Earw0rm 12 hours ago [-]
And the roadways (later, rails) on which it operates.

Meanwhile, entire civilizations in South America developed with little to no use of wheels, because the terrain was unsuited to roads.

oblio 8 hours ago [-]
It wasn't actually just terrain. It was also the availability of draft animals, climate conditions and, most importantly... economics.

Wheeled vehicles aren't inherently better in a natural environment unless they're more efficient economically than the alternatives: pack animals, people carrying cargo, boats, etc.

South America didn't have good draft animals and lots of Africa didn't have the proper economic incentives: Sahara had bad surfaces where camels were absolutely better than carts and sub Saharan Africa had climate, terrain, tsetse flies and whatnot that made standard pack animals economically inefficient.

Humans are smart and lazy, they will do the easiest thing that lets them achieve their goals. This sometimes leads them to local maxima. That's why many "obvious" inventions took thousands of years to create (the cotton gin, for example).

card_zero 12 hours ago [-]
Yes, only humans, birds, sifakas, pangolins, kangaroos, and giant ground sloths. Only those six groups of creatures, and various lizards including the Jesus lizard which is bipedal on water, just those seven groups and sometimes goats and bears.
trescenzi 6 hours ago [-]
I get what you mean; that’s why the "basically" is there. Most of those, kangaroos and some lemurs in your list being the exceptions, do not move around primarily as bipeds. The ability to walk on two legs occasionally is different from genuinely having two legs and two arms.
coldtea 9 hours ago [-]
And once every while, my cat.
coldtea 9 hours ago [-]
Human-level mobility however is not much to write home about. Just one more variation of the many types seen in animals.

Human level intelligence is, otoh, qualitatively and quantitatively a bigger deal.

oblio 8 hours ago [-]
I wouldn't agree completely. Being bipedal frees up the hands for anything, really.

We're better than most animals because we have tools. We have great tools because we have hands.

delusional 13 hours ago [-]
Talking about "time to evolve something" seems patently absurd and unscientific to me. All of nature evolved simultaneously. Nature didn't first make the human body and then go "that's perfect for filling the dishwasher, now to make it talk amongst itself" and then evolve intelligence. It all evolved at the same time, in conjunction.

You cannot separate the mind and the body. They are the same physiological and material entity. Trying anyway is of course classic western canon.

coldtea 9 hours ago [-]
>Nature didn't first make the human body and then go "that's perfect for filling the dishwasher, now to make it talk amongst itself" and then evolve intelligence. It all evolved at the same time, in conjunction.

Nature didn't make decisions about anything.

But it also absolutely didn't "all evolved at the same time, in conjunction" (if by that you mean all features, regarding body and intelligence, at the same rate).

>You cannot separate the mind and the body. They are the same physiological and material entity

The substrate is. Doesn't mean the nature of abstract thinking is the same as the nature of the body, in the same way the software as algorithm is not the same as hardware, even if it can only run on hardware.

But to the point: this is not about separating the "mind and the body". It's about how you can have the humanoid form and all the typical human body functions for millions of years before you get human-level intelligence, which arrived only after much later evolution.

>Trying anyway is of course classic western canon.

It's also classic eastern canon, and several others besides.

imtringued 9 hours ago [-]
Birds? Bears whose front paws got injured? https://youtu.be/kcIkQaLJ9r8
oblio 8 hours ago [-]
Birds didn't develop hands, neither did bears. Also bears can't walk 100km on their hind legs, but we can.
giardini 13 hours ago [-]
Plato's "Allegory of the cave" was uninteresting and uninformative when I first read it more than 50 years ago. It remains so today.

https://en.wikipedia.org/wiki/Allegory_of_the_cave

Also, other than in sculpture/dentistry/medicine I also find "ablation" to not be a particularly insightful metaphor either. Although I see ablation's application to LLMs I simply had to laugh when I first read about it: I envisioned starting with a Greyhound bus and blowing off parts until it was a Lotus 7 sports car!8-). Good luck with that! Kind of like fixing the TV set by kicking it (but it _does_ work sometimes!).

Perhaps we should refrain somewhat from applying metaphors/simile/allegories to describe LLMs relative to human intelligence unless they provide some insight of significant value.

coldtea 9 hours ago [-]
>Plato's "Allegory of the cave" was uninteresting and uninformative when I first read it more than 50 years ago. It remains so today.

Anything can be uninteresting and uninformative when one doesn't see its interestingness or can't grok its information.

It however stood for millennia as a great device to describe multiple layers of abstraction, deeper reality vs. appearance, and so on, with utility as such in countless domains.

giardini 2 hours ago [-]
No. The Allegory is a fragment of a poor unfinished story and little more. You don't need it to explain "multiple layers of abstractions, deeper reality vs appearance" as you say. In fact, you don't need it for anything at all except to explain Plato's "Allegory of the cave". Sheesh.

coldtea says "...with utility as such in countless domains." So when's the last time you referred to the "Allegory of the cave" in your day, other than on HN?

taneq 11 hours ago [-]
I don’t think that’s what ablation is about. It’s more like blowing parts off a bus until it ceases to be a bus. Then you find the minimal set of bus parts required to still be a bus, and that’s an indication that those parts are important to the central task of being a bus.
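For what it's worth, a tiny numerical sketch of that framing (toy data and a hand-set stand-in for a trained model, nothing real): knock out one component at a time and see which removals actually cost you accuracy.

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(500, 3))                        # 3 input "components"
  y = (2.0 * X[:, 0] + 0.1 * X[:, 2] > 0).astype(int)  # component 0 matters most, 1 not at all

  weights = np.array([2.0, 0.0, 0.1])                  # stand-in for a trained model

  def accuracy(mask):
      logits = X @ (weights * mask)                    # the mask zeroes out "ablated" components
      return np.mean((logits > 0).astype(int) == y)

  full = accuracy(np.ones(3))
  for i in range(3):
      mask = np.ones(3)
      mask[i] = 0.0
      print(f"ablate component {i}: accuracy {accuracy(mask):.2f} (full model: {full:.2f})")

The components whose removal hurts are the bus parts you apparently needed.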
giardini 1 hours ago [-]
taneq says "I don't think that's what ablation is about. It's more like blowing parts off a bus until it ceases to be a bus."

Different people have different goals. You want some form of minimal bus and I want a Lotus 7. There's no guarantee either of us reach our goal.

Ablation is about disassembling something randomly, whether little by little or on an arbitrary scale until [SOMETHING INTERESTING OR DESIRABLE HAPPENS].

https://en.wikipedia.org/wiki/Ablation_(artificial_intellige...

Ablation is laughable but sometimes useful. It is also easy, mostly brainless, NOT guaranteed to provide any useful information (so you've an excuse for the wasted resources), and occasionally provides insight. It's a good tool for software engineers who have no (or seek no) understanding of their system, so I think of ablation as a "last resort" solution (e.g., another being to randomly modify code until it "works") that I disdain.

But I'm old so I'm probably wrong! Burn those CPU towers down, boys and girls!

dragonwriter 16 hours ago [-]
> 10+ years ago I expected we would get AI that would impact blue collar work long before AI that impacted white collar work.

We did.

Like, to the point that the AI that radically impacted blue collar work isn't even part of what is considered “AI” any more.

mikeyouse 16 hours ago [-]
I think it's Benedict Evans who frequently posts about 'blue collar' AI work not looking like humanoid robots but instead like Amazon fulfillment centers keeping track of millions of individual items, or tomato-picking robots with MV cameras keeping only the ripe ones as they pick at absurd rates.

There are endless corners of the physical world right now where it's not worth automating a task if you need to assign an engineer and develop a software competency as a manufacturing or retail company, but it would absolutely be worth it if you had a generalizable model that you could point-and-shoot at them.

noduerme 15 hours ago [-]
Or a generalized model to develop them in a virtual sandbox before deploying them physically, which I think is more likely.
mbac32768 10 hours ago [-]
We think this because ten years ago we were all having our minds blown by DeepMind's game playing achievements and videos of dancing robots and thought this meant blue collar work would be solved imminently.

But most of these solutions were more crude than they let on, and you wouldn't really know unless you were working in AI already.

Watch John Carmack's recent talk at Upper Bound if you want to see him destroy like a trillion dollars worth of AI hype.

https://m.youtube.com/watch?v=rQ-An5bhkrs&t=11303s&pp=2AGnWJ...

Spoiler: we're nowhere close to AGI

Hendrikto 8 hours ago [-]
> But most of these solutions were more crude than they let on, and you wouldn't really know unless you were working in AI already.

Same with LLMs. Despite having seen this play out before, and being aware of this, people are falling for it again.

uncircle 9 hours ago [-]
Thank you for this update. I vividly remember, a few years ago, the excitement of John Carmack announcing he was retreating into his cave to do some deep work on AGI, pushing the boundaries of current AI research. I truly appreciate Carmack's intellectual honesty now in announcing "yeah, no, LLMs are not the way to go to recreate anything remotely close to human intelligence." In fact, and I quote him, "we do not even have a line of sight to [the fundamentals of intelligence]."

I'm honestly relieved that one of the brightest minds in computing, with all the resources and desire to create actual super-intelligences, has had to temper hard his expectations.

hackinthebochs 5 hours ago [-]
I don't think that quote from Carmack represents some deeply considered conclusion. He started off his efforts with embodiment. He either never considered LLMs a path towards AGI, or thought he didn't personally have anything to contribute to LLMs (he talked about it early on in his journey but I don't remember the specifics). He didn't spend a year investigating LLMs and then decide that they weren't the path to AGI. The point is that he has no special insight regarding LLMs' relationship to AGI, and it's misleading to imply that his current effort towards building AGI that eschews LLMs is an expert opinion.
uncircle 3 hours ago [-]
Yes, I meant to say that, for Carmack, no type of modern AI research has figured out the path to actual general intelligence. I just didn't want to use the meaningless "AI" buzzword, and these days all the focus and money is on large language models, especially when talking about the end goal of AGI.
chrchr 14 hours ago [-]
Part of the answer to this puzzle is that your dishwasher itself is a robot that washes dishes, and has had enormous impact on blue collar jobs since its invention and widespread deployment. There are tons of labor saving devices out there doing blue collar work that we don't think of as robots or as AI.
kushalc 18 hours ago [-]
Not a robotics guy, but to the extent that the same fundamentals hold—

I think it's a degrees-of-freedom question. Given the (relatively) low conditional entropy of natural language, there aren't actually that many degrees of (true) freedom. On the other hand, in the real world, there are massively more degrees of freedom both in general (3 dimensions, 6 degrees of movement per joint, M joints, continuous vs. discrete space, etc.) and also given the path dependence of actions, the non-standardized nature of actuators, kinematics, etc.

All in, you get crushed by the curse of dimensionality. Given N degrees of true freedom, you need O(exp(N)) data points to achieve the same performance. Folks do a bunch of clever things to address that dimensionality explosion, but I think the overly reductionist point still stands: although the real world is theoretically verifiable (and theoretically could produce infinite data), in practice we currently have exponentially less real-world data for an exponentially harder problem.
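Back-of-the-envelope version of that explosion (numbers made up purely for illustration): if covering each degree of freedom at some fixed resolution takes k samples, covering N of them jointly takes on the order of k^N.

  # samples needed to naively cover N degrees of freedom at k bins per dimension
  def naive_coverage(n_dof, bins_per_dof=10):
      return bins_per_dof ** n_dof

  for n in (1, 3, 7, 14):  # e.g. a 7-joint arm, or two arms' worth
      print(f"{n:>2} DOF -> {naive_coverage(n):.1e} configurations")
  # 1 DOF -> 1.0e+01, 7 DOF -> 1.0e+07, 14 DOF -> 1.0e+14

Clever priors, sims and augmentation knock chunks off that exponent, but the qualitative picture is the one above.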

Real roboticists should chime in...

simne 2 hours ago [-]
As far as I can see, classic methods (the ones used in teaching children) could create at least an order of magnitude more data than we have now, just by paraphrasing text (classic NLP), but it depends on the language (I'll try to explain).

Text really has a lot of degrees of freedom, but it depends on the language, and even more on the type of alphabet. Modern English, with its phonetic alphabet, is the worst choice because it is the simplest: nearly nobody uses second or third hidden meanings (I've heard estimates of anywhere from 2-3 to 5-6 meanings depending on the source). Hieroglyphic languages are much more information-rich (10-22 meanings), and, interestingly, phonetic languages in totalitarian countries (like Russian) are also much richer (8-12 meanings), because people used them to hide meanings from the government to avoid punishment.

This language difference (more dimensions) could be an explanation for China's current achievements relative to the West, and it could also be a hint at how to boost Western achievements - I mean, use more scientists from Eastern Europe and give more attention to Eastern European languages.

For 3D robots, I see only one way - computationally simulated environments.

criddell 2 hours ago [-]
> in practice we currently have exponentially less real-world data for an exponentially harder problem

Is that where learning comes in? Any actual AGI machine will be able to learn. We should be able to buy a robot that comes ready to learn and we teach it all the things we want it to do. That might mean a lot of broken dishes at first, but it's about what you would expect if you were to ask a toddler to load your dishes into the dishwasher.

My personal bar for when we reach actual AGI is when it can be put in a robot body that can navigate our world, understand spatial relationships, and can learn from ordinary people.

jandrewrogers 16 hours ago [-]
This understates the complexity of the problem. I have built a career modeling/learning entity behavior in the physical world at scale. Language is almost a trivial case by comparison.

Even the existence of most relationships in the physical world can only be inferred, never mind dimensionality. The correlations are often weak unless you are able to work with data sets that far exceed the entire corpus of all human text, and sometimes not even then. Language has relatively unambiguous structure that simply isn't the norm in real space-time data models. In some cases we can't unambiguously resolve causality and temporal ordering in the physical world. Human brains aren't fussed by this.

There is a powerful litmus test for things "AI" can do. Theoretically, indexing and learning are equivalent problems. There are many practical data models for which no scalable indexing algorithm exists in literature. This has an almost perfect overlap with data models that current AI tech is demonstrably incapable of learning. A company with novel AI tech that can learn a hard data model can demonstrate a zero-knowledge proof of capability by qualitatively improving indexing performance of said data models at scale.

Synthetic "world models" so thoroughly nerf the computer science problem that they won't translate to anything real.

noduerme 15 hours ago [-]
But we don't need to know all the things that could happen if M joints moved in every possible way at the same time. We operate within normal constraints. When you see someone trip on a sidewalk and recover before falling on their face, that's still a physical system taking signals and suggesting corrections that could be simulated in a relatively straightforward newtonian virtual reality, and trained a billion times on with however many virtual joints and actuators.

In terms of "world building", it makes sense for the "world" to not be dreamed up by an AI, but to have hard deterministic limits to bump up against in training.

I guess what I mean is that humans in the world constantly face a lot of conditions that can lead to undefined behavior as well, but 99% of the time not falling on your face is good enough to get you a job washing dishes.

amelius 8 hours ago [-]
In other words, self driving cars and robot vacuum cleaners cannot exist. Hmm.
oblio 8 hours ago [-]
LOL. Both of those are very limited and work in 2D spaces in highly constrained environments especially designed for them.
FloorEgg 17 hours ago [-]
Also not a robotics guy, but that all sounds right to me...

What I do have deep experience in is market abstractions and jobs to be done theory. There are so many ways to describe intent, and it's extremely hard to describe intent precisely. So in addition to all the dimensions you brought up that relate to physical space, there is also the hard problem of mapping user intent to action with minimal "error", especially since the errors can have big consequences in the physical world. In other words, the "intent space" also has many dimensions to it, far beyond what LLMs can currently handle.

On one end of the spectrum of consequences, the robot loads my dishwasher such that there is too much overlap and a bunch of the dishes don't get cleaned (what I really want is for the dishes to be clean, not for the dishes to be in the dishwasher); on the other end, we get the robot that overpowers humanity and turns the universe into paperclips.

So maybe we have to master LLMs and probably a whole other paradigm before robots can really be general purpose and useful.

Earw0rm 11 hours ago [-]
Autonomous vehicles are an interesting subset.

Even though the system rules and I/O are tightly constrained, they're still struggling to match human performance in an open-world scenario, after a gigantic R&D investment with a crystal clear path to return.

Fifteen years ago I thought that'd be a robustly solved problem by now. It's getting there, but I think I'll still need to invest in driving lessons for my teenage kids. Which is pretty annoying, honestly: expensive, dangerous for a newly qualified driver, and a massive waste of time that could be used for better things. (OK, track days and mountain passes are fun. 99% of driving is just boring, unnecessary suckage).

What's notable: AVs have vastly better sensors than humans, masses of compute, potentially 10X reaction speed. What they struggle with is nuance and complexity.

Also, AVs don't have to solve the exact same problems as a human driver. For example, parking lots: they don't need to figure out echelon parking or multi-storey lots, they can drop their passengers and drive somewhere else further away to park.

bflesch 9 hours ago [-]
The problem is not the robot loading the dishwasher, it is the dishwasher. The dishwasher (and general kitchen electronics) industry has not innovated in a long time.

My prediction is that a new player will come in who vertically integrates these currently disjoint industries and products. The tableware used should be compatible with the dishwasher, the packaging of my groceries should be compatible with the cooking system. Like a mini-factory.

But current vendors have no financial incentive to do so, because if you take a step back, the whole notion of filling one room of your apartment with random electronics just to cook a meal once in a blue moon is deeply inefficient. End-to-end food automation is coming to the restaurant business, and I hope it pushes prices of meals so far down that having a dedicated room for a kitchen in the apartment is simply not worth it.

That's the "utopia" version of things.

In reality, we see prices for fast food (the most automated food business) going up while quality is going down. Does it make the established players more vulnerable to disruption? I think so.

criddell 2 hours ago [-]
> the whole notion of putting one room of your apartment full with random electronics just to cook a meal once in a blue moon is deeply inefficient

You don't use your kitchen? After the rooms we sleep in, the kitchen is probably the most used space in my home. We are planning an upcoming renovation of our home and the kitchen is where we plan on spending the most money.

> The tableware used should be compatible with the dishwasher

Aside from non-dishwasher safe items, what tableware is incompatible with a dishwasher?

bflesch 17 minutes ago [-]
Yes, of course I use it a lot. It is a great hobby. But I only use it because it is kind of forced upon us. It's just so inefficient nowadays. Cooking used to be for the whole homestead or for the large family. Now it is mostly only for the immediate family. All the machines are not utilized properly. When people discussed car sharing it was exactly the same argument, and I feel it also applies to kitchens.

With the "tableware" argument I meant something like a standardized (magnetic?) adapter for grabbing plates, forks and knives so they can easily be moved by machines/robots.

I feel a company like Ikea is perfectly set up to make this idea a reality, but they'll never do so because they make much more money when every single household buys all these appliances and items for their own kitchen.

Just from the perspective of a single household in a densely populated city I think it'd be nice to have freshly cooked, reproducibly prepared meals with high-quality ingredients available to me. Like an automated soup kitchen with cleanup. Without all the layers of plastic wrapping needed to move produce from large-scale distributors into single-household fridges and so on.

lambdaone 8 hours ago [-]
This exists already in the form of "ready meals" a.k.a. TV dinners. Fast food shops are already substantially mechanised; huge efforts have been made to robotize cooking, but people are still cheaper to hire. It's still nowhere near the quality of home-cooked food.
bflesch 4 hours ago [-]
Yes, there are a lot of garbage microwave food offerings, especially popular with the US population. As a European I'm talking about quality food made with an automated process and end-to-end automation, including ingredient procurement and cleanup.

Not in competition with trash food but with proper food and local ingredients.

ACCount37 17 hours ago [-]
The big robot AI issue is: no data!

There is a lot of high quality text from diverse domains, there's a lot of audio or images or videos around. The largest robotics datasets are absolutely pathetic in size compared to that. We didn't collect or stockpile the right data in advance. Embodiment may be hard by itself, but doing embodiment in this data-barren wasteland is living hell.

So you throw everything but the kitchen sink at the problem. You pre-train on non-robotics data to squeeze transfer learning for all it's worth, you run hard sims, a hundred flavors of data augmentation, you get hardware and set up actual warehouses with test benches where robots try their hand at specific tasks to collect more data.

And all of that combined only gets you to "meh" real world performance - slow, flaky, fairly brittle, and on relatively narrow tasks. Often good enough for an impressive demo, but not good enough to replace human workers yet.

There's a reason why a lot of those bleeding edge AI powered robots are designed for and ship with either teleoperation capabilities, or demonstration-replay capabilities. Companies that are doing this hope to start pushing units first, and then use human operators to start building up some of the "real world" datasets they need to actually train those robots to be more capable of autonomous operation.

Having to deal with Capital H Hardware is the big non-AI issue. You can push ChatGPT to 100 million devices, as long as you have a product people want to use for the price of "free", and the GPUs to deal with inference demand. You can't materialize 100 million actual physical robot bodies out of nowhere for free, GPUs or no GPUs. Scaling up is hard and expensive.

Hendrikto 8 hours ago [-]
> And all of that combined only gets you to "meh" real world performance - slow, flaky, fairly brittle, and on relatively narrow tasks. Often good enough for an impressive demo, but not good enough to replace human workers yet.

Sounds like LLMs to me.

ACCount37 8 hours ago [-]
It's like GPT-3.5 - a proof-of-concept tech demo more than a product.

I don't think further improvements are impossible, not at all. They're just hard to get at.

petralithic 6 hours ago [-]
> 10+ years ago I expected we would get AI that would impact blue collar work long before AI that impacted white collar work.

I'm not sure where people get this impression from, even back decades ago. Hardware is always harder than software. We had chess engines in the 20th century but a robotic hand that could move pieces? That was obviously not as easy because dealing with the physical world always has issues that dealing with the virtual doesn't.

api 7 hours ago [-]
Embodiment is 1000x harder from a physical perspective.

Look at how hard it is for us to make reliable laptop hinges, or at the articulated car door handle trend (started by Tesla), where the handles constantly break.

These are simple mechanisms compared to any animal or human body. Our bodies last up to 80-100 years through not just constant regeneration but organic super-materials that rival anything synthetic in terms of durability within their spec range. Nature is full of this, like spider silk much stronger than steel or joints that can take repeated impacts for decades. This is what hundreds of millions to billions of years of evolution gets you.

We can build robots this good but they are expensive, so expensive that just hiring someone to do it manually is cheaper. So the problem is that good quality robots are still much more expensive than human labor.

The only areas where robots have replaced human labor are where the economics work, like huge-volume manufacturing, or where humans can't easily go or can't perform. The latter includes tasks like lifting and moving things thousands of times larger than humans can, or environments like high temperatures, deep space, the bottom of the ocean, radioactive environments, etc.

zer00eyz 14 hours ago [-]
> 10+ years ago I expected we would get AI that would impact blue collar work long before AI that impacted white collar work. Not sure exactly where I got the impression, but I remember some "rising tide of AI" analogy and graphic that had artists and scientists positioned on the high ground.

The moment you strip away the magical thinking and the humanization (bugs, not hallucinations), what you realize is that this is just progress. Ford in the 1960s putting in the first robot arms vs. auto manufacturing today. The phone: from switchboard operators, to mechanical switching, to digital, to... (I think the phone is in some odd hybrid era with text, but only time will tell). Draftsmen in the 1970s all replaced by AutoCAD by the 90s. Go further back: in 1920, 30 percent of Americans were farmers; today that's less than 2 percent.

Humans, on very human scales, are very good at finding all new ways of making ourselves "busy" and "productive".

foxglacier 16 hours ago [-]
Robots are only harder because they have expensive hardware. We already have robots that can load dishwashers and do other manual work but humans are cheaper so there isn't much of a market for them.

The rising tide idea came from a 1997 paper by Moravec. Here's a nice graphic and subsequent history https://lifearchitect.ai/flood/

Interestingly, Moravec also stated: "When the highest peaks are covered, there will be machines that can interact as intelligently as any human on any subject. The presence of minds in machines will then become self-evident." We pretty much have those today, so by 1997 standards machines have minds, yet somehow we moved the goalposts and decided that doesn't count anymore. Even if LLMs end up being strictly more capable than every human on every subject, I'm sure we'll find some new excuse why they don't have minds or aren't really intelligent.

ewoodrich 14 hours ago [-]
> Interestingly, Moravec also stated: "When the highest peaks are covered, there will be machines that can interact as intelligently as any human on any subject. The presence of minds in machines will then become self-evident

> We pretty much have those today so by 1997 standards, machines have minds, yet somehow we moved the goalposts and decided that doesn't count anymore

What you describe as "moving the goalposts" could also just be explained as simply not meeting the standard of "as intelligently as any human on any subject".

Even in the strongest possible example of LLMs' strengths, applying their encyclopedic knowledge and (more limited) ability to apply that knowledge to a given subject, I don't think they meet that bar. Especially if we're comparing to a human over a time period greater than 30 minutes or so.

Quarrelsome 18 hours ago [-]
> if you can find a way to produce verifiable rewards about a target world

I feel like there's an interesting symmetry here between the pre- and post-LLM world, where I've always found that organisations over-optimise for things they can measure (e.g. balance sheets) and under-optimise for things they can't (e.g. developer productivity), which explains why it's so hard to keep a software product up to date in an average org, as the natural pressure is to run it into the ground until a competitor suddenly displaces it.

So in a post LLM world, we have this gaping hole around things we either lack the data for, or as you say: lack the ability to produce verifiable rewards for. I wonder if similar patterns might play out as a consequence and what unmodelled, unrecorded, real-world things will be entirely ignored (perhaps to great detriment) because we simply lack a decent measure/verifiable-reward for it.

mikewarot 2 hours ago [-]
My focus lately is on the cost side of this. I believe strongly that it's possible to reduce the cost of compute for LLM type loads by 95% or more. Personally, it's been incredibly hard to get actual numbers for static and dynamic power in ASIC designs to be sure about this.

If I'm right (which I give a 50/50 odds to), and we can reduce the power of LLM computation by 95%, trillions can be saved in power bills, and we can break the need for Nvidia or other specialists, and get back to general purpose computation.

olq_plo 3 hours ago [-]
Since you seem to know your stuff, why do LLMs need so much data anyway? Humans don't. Why can't we make models aware of their own uncertainty, e.g. by feeding the variance of the next-token distribution back into the model, as a foundation to guide their own learning? Maybe with that kind of signal, LLMs could develop 'curiosity' and 'rigorousness' and seek out the data that best refines them. Let the AI make and test its own hypotheses, using formal mathematical systems, during training.
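Something like this, concretely (just a sketch; nothing like this is standard training practice): the entropy of the next-token distribution is already available at every step, so in principle it could be logged and used to decide where the model should go gather data next.

  import math

  def entropy_bits(probs):
      # uncertainty of a next-token distribution, in bits
      return -sum(p * math.log2(p) for p in probs if p > 0)

  # two hypothetical next-token distributions from the same model
  confident = [0.90, 0.05, 0.03, 0.02]
  unsure    = [0.30, 0.28, 0.22, 0.20]

  for name, dist in [("confident", confident), ("unsure", unsure)]:
      print(f"{name}: {entropy_bits(dist):.2f} bits")

  # a curiosity-style rule might flag high-entropy contexts for targeted data gathering
  THRESHOLD_BITS = 1.5  # arbitrary, for illustration only
  print("seek more data:", entropy_bits(unsure) > THRESHOLD_BITS)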
w10-1 12 hours ago [-]
> rather how we continue to scale for language models at the current LM frontier — 4-8h METR tasks

I wonder if this doesn't reify a particular business model, of creating a general model and then renting it out SaaS-style (possibly adapted to largish customers).

It reminds me of the early excitement over mainframes, how their applications were limited by the rarity of access, and how vigorously those trained in those fine arts defended their superiority. They just couldn't compete with the hordes of smaller competitors getting into every niche.

It may instead be that customer data and use cases are both the most relevant and the most profitable. An AI that could adopt a small user model and track and apply user use cases would have entirely different structure, and would have demonstrable price/performance ratios.

This could mean if Apple or Google actually integrated AI into their devices, they could have a decisive advantage. Or perhaps there's a next generation of web applications that model use-cases and interactions. Indeed, Cursor and other IDE companies might have a leg up if they can drive towards modeling the context instead of just feeding it as intention to the generative LLM.

JumpCrisscross 10 hours ago [-]
> there are some cool labs building foundational robotics models, but they're maybe ~5 years behind LMs today

Wouldn't the Bitter Lesson be to invest in those models over trying to be clever about eking out a little more oomph from today's language models (and language-based data)?

amelius 9 hours ago [-]
What do you mean by "verifiable rewards"?

Do you mean challenges for which the answer is known?

eab- 17 hours ago [-]
What do you mean about CLIP?
rawgabbit 15 hours ago [-]
I believe he is referring to OpenAI's proposal to move beyond training with pure text and instead train with multimodal data. Instead of only the dictionary definition of an apple, train it with a picture of an apple, train it with a video of someone eating an apple, etc.
godshatter 2 hours ago [-]
Before this AI wave got going, I'd always assumed that AGI would be more about converting words, pictures, video, and lots of sensory data and who knows what else into a model of concepts that it would be putting together and hypothesizing about and testing as it grows. A database of what concepts have been learned, what data they were built from, and what holes it needed to fill in. It would continually be working on this and reaching out to test reality or discuss its findings with people or other AIs instead of waiting for input like a chatbot. I haven't even seen anything like this yet, just ways of faking it by getting better at stringing words together or mashing pixels together based on text tokens.

No one seems to be working on building an AI model that understands, to any real degree, what it's saying or what it's creating. Without this, I don't see how they can even get to AGI.

rawgabbit 37 minutes ago [-]
When I was young, my relatives would make fun of me, saying I had a lot of book learning but had yet to experience the absurdity of the real world. Wait, they said, until I try to apply my fancy book learning to a world controlled by good ole boys, gatekeepers, and double talk. Then I will learn that reality is different from the idealized world of books.
godelski 14 hours ago [-]

  > this isn't really about whether scaling is "dead" 
I think there's a good position paper by Sara Hooker[0] that mentions some of this. Key point being that while the frontier is being pushed by big models with big data, there's a very quiet revolution of models using far fewer parameters (still quite big) and far less data. Maybe "Scale Is All You Need"[1], but that doesn't mean it is practical or even a good approach. It's a shame these research paths have gotten a lot of pushback, especially given today's concerns about inference costs (and this pushback still doesn't seem to be decreasing)

  > verifiable rewards
There's also a current conversation in the community over world models: is it actually a world model if the model does not recover /a physics/? [2] The argument for why they should recover a physics is that this means a counterfactual model must have been learned (no guarantees on whether it is computationally irreducible). A counterfactual model gives far greater opportunities for robust generalization. In fact, you could even argue that the study of physics is the study of compression. In a sense, physics is the study of the computability of our universe[3]. Physics is counterfactual, allowing you to answer counterfactual questions like "What would the force have been if the mass had been 10x greater?" If this were not counterfactual we'd require different algorithms for different cases.

I'm in the recovery camp. Honestly I haven't heard a strong argument against it. Mostly "we just care that things work" which, frankly, isn't that the primary concern of all of us? I'm all for throwing shit at a wall and seeing what sticks, it can be a really efficient method sometimes (especially in early exploratory phases), but I doubt it is the most efficient way forward.

In my experience, having been a person who's created models that require magnitudes fewer resources for equivalent performance, I cannot stress enough the importance of quality over quantity. The tricky part is defining that quality.

[0] https://arxiv.org/abs/2407.05694

[1] Personally, I'm unconvinced. Despite the success of our LLMs, it's difficult to decouple the other variables.

[2] The "a" is important here. There's not one physics per-say. There are different models. This is a level of metaphysics most people will not encounter and has many subtleties.

[3] I must stress that there's a huge difference between the universe being computable and the universe being a computation. The universe being computable does not mean we all live in a simulation.

mdemare 11 hours ago [-]
Just using common sense, if we had a genius, who had tremendous reasoning ability, total recall of memories, and an unlimited lifespan and patience, and he'd read what the current LLMs have read, we'd expect quite a bit more from him than what we're getting now from LLMs.

There are teenagers that win gold medals on the math olympiad - they've trained on < 1M tokens of math texts, never mind the 70T tokens that GPT5 appears to be trained on. A difference of eight orders of magnitude.

In other words, data scarcity is not a fundamental problem, just a problem for the current paradigm.

bob1029 9 hours ago [-]
I think quantization is the simplest canary.

If we can reduce the precision of the model parameters by 2~32x without much perceptible drop in performance, we are clearly dealing with something wildly inefficient.

I'm open to the possibility that over-parameterization is essential as part of the training process, much like how MSAA/SSAA oversample the frame buffer to reduce information aliasing in the final scaled result (also wildly inefficient but very effective generally). However, I think for more exotic architectures (spiking / time domain) these rules don't work the same way. You can't backpropagate through a recurrent SNN, so much of the prevailing machine learning mindset doesn't even apply.
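As a toy illustration of the canary (pure NumPy, random stand-in weights, not any particular model): quantize a layer to int8, a 4x compression from float32, and see how little its output moves.

  import numpy as np

  rng = np.random.default_rng(0)
  W = rng.normal(size=(512, 512)).astype(np.float32)  # stand-in for a trained weight matrix
  x = rng.normal(size=512).astype(np.float32)

  # Naive symmetric int8 quantization of the weights.
  scale = np.abs(W).max() / 127.0
  W_q = np.round(W / scale).astype(np.int8)
  W_deq = W_q.astype(np.float32) * scale

  y_full = W @ x
  y_quant = W_deq @ x
  rel_err = np.linalg.norm(y_full - y_quant) / np.linalg.norm(y_full)
  print(f"relative output error after 4x compression: {rel_err:.4f}")  # roughly 1% here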

jebarker 2 hours ago [-]
It’s not clear that the inefficiency of the current paradigm is in the neural net architectures. It seems just as likely that it’s in the training objective.
voxic11 4 hours ago [-]
Maybe human brains are constantly generating (and training on) massive amounts of synthetic data and that is how they get so smart?
timeinput 49 minutes ago [-]
You mean those like 8 hours of ~~nightmares~~ dreams I have every night?
anthonypasq 4 hours ago [-]
This sentence really struck me in a particular way. Very interesting. It does seem like thoughts/stream of consciousness is just your brain generating random tokens to itself and learning from it lol.
jimbokun 3 hours ago [-]
What experiment could be run to test this hypothesis?
flooo 10 hours ago [-]
Now consider that the genius cannot physically interact with the world or the people therein, and uses her eyes only for reading text.
nosianu 10 hours ago [-]
Yes - we train only on a subset of human communication, the one using written symbols (even voice has much much more depth to it), but human brains train on the actual physical world.

Human students who have only learned some new words but have not (yet) even begun to really comprehend a subject will just throw around random words and sentences that sound great but have no basis in reality, too.

For the same sentence, for example, "We need to open a new factory in country XY", the internal model lighting up inside the brain of someone who has actually participated when this was done previously will be much deeper and larger than that of someone who only heard about it in their course work. That same depth is zero for an LLM, which only knows the relations between words and has no representation of the world. Words alone cannot even begin to represent what the model created from the real-world sensors' data, which on top of the direct input is also based on many times compounded and already-internalized prior models (nobody establishes that new factory as a newly born baby with a fresh neural net; even the newly born has inherited instincts that are all based on accumulated real-world experiences, including the very complex structure of the brain).

Somewhat similarly, situations reported in comments like this one (client or manager vastly underestimating the effort required to do something): https://news.ycombinator.com/item?id=45123810 The internal model for a task of those far removed from actually doing it is very small compared to the internal models of those doing the work, so trying to gauge required effort falls short spectacularly if they also don't have the awareness.

Fargren 6 hours ago [-]
I'm not sure what point you are trying to make. Are you saying that the missing piece for making LLMs better at learning is to make them capable of interacting with the outside world? Give them actuators and sensors?
imtringued 9 hours ago [-]
Also the geniuses get beaten with a stick if they don't memorize and perfectly reproduce the text they've read.
energy123 9 hours ago [-]
> they've trained on < 1M tokens of math texts, never mind the 70T tokens that GPT5 appears to be trained on.

Somewhat apples and oranges given billions of years of evolution behind that human. GPT-5 started off as a blank slate.

TheDong 8 hours ago [-]
This comparison is absolute nonsense.

"How could a telescope see saturn, human eyes have billions of years of evolution behind them, and we only made telescopes a few hundred years ago, so they should be much weaker than eyes"

"How can StockFish play chess better than a human, the human brain has had billions of years of evolution"

Evolution is random, slow, and does not mean we arrive at even a local optima.

Vinnl 6 hours ago [-]
They're not saying that LLMs should be better than smart teenagers; they're saying that smart teenagers can solve some problems without needing massive amounts of data, so apparently those problems are technically solvable without those amounts of data.
mdemare 5 hours ago [-]
Yes. It is astonishing that LLMs can solve problems that only a handful of very smart teenagers can solve, but LLMs do it by consuming a million times as much content as those teenagers. Running out of data is not a reason for despair.

Also consider that during training LLMs spend much less time on processing, say, TAOCP (Knuth), or SICP (Abelson, Sussman, and Sussman), or Probability Theory (Jaynes) than on the entirety that is r/Frugal.

20 thick books turn a smart teenager into a graduate with an MSc. That's what, 10 million tokens?

When we read difficult, important texts, we reflect on them, make exercises, discuss them, etc. We don't know how to make an LLM do that in a way that improves it. Yet.

energy123 8 hours ago [-]
What comparison? I was arguing against a comparison.
skeezyboy 8 hours ago [-]
humans aren't born with memories, you numpty
hackinthebochs 5 hours ago [-]
Neural precursor cells literally move themselves from where they first differentiate to their final location to ensure specific neural structures and information dynamics in the developed brain. It's not declarative memory, but it's a memory of the neural architecture etched out over evolutionary time.
hansvm 6 hours ago [-]
They're born with neural hardware whose architecture has been optimized by evolution. Any choice of architecture imparts some inductive bias, making some problems easier and some problems harder to learn, and humans have the advantage that people with bad architectures (those not matching properties of the world we live in) were more likely to die or to not mate.

You're right that we don't call those inherited thought patterns memories; we call them reflexes, emotions, region-specific brain functions, etc.

sindriava 6 hours ago [-]
This is such a wildly misleading statement that it borders on straight up incorrect.
petralithic 6 hours ago [-]
Humans are not tabulae rasae though. Evolution has hardwired our genius over millions of years.
FloorEgg 20 hours ago [-]
The problem I am facing in my domain is that all of the data is human generated and riddled with human errors. I am not talking about typos in phone numbers, but rather fundamental errors in critical thinking, reasoning, semantic and pragmatic oversights, etc. all in long-form unstructured text. It's very much an LLM-domain problem, but converging on the existing data is like trying to converge on noise.

The opportunity in the market is the gap between what people have been doing and what they are trying to do, and I have developed very specialized approaches to narrow this gap in my niche, and so far customers are loving it.

I seriously doubt that the gap could ever be closed by throwing more data and compute at it. I imagine though that the outputs of my approach could be used to train a base model to close the gap at a lower unit cost, but I am skeptical that it would be economically worth while anytime soon.

mediaman 20 hours ago [-]
This is one reason why verifiable rewards works really well, if it's possible for a given domain. Figuring out how to extract signal and verify it for an RL loop will be very popular for a lot of niche fields.
stego-tech 20 hours ago [-]
This is my current drum I bang on when an uninformed stakeholder tries shoving LLMs blindly down everyone’s throats: it’s the data, stupid. Current data aggregates outside of industries wholly dependent on it (so anyone not in web advertising, GIS, or intelligence) are garbage, riddled with errors and in awful structures that are opaque to LLMs. For your AI strategy to have any chance of success, your data has to be pristine and fresh, otherwise you’re lighting money on fire.

Throwing more compute and data at the problem won’t magically manifest AGI. To reach those lofty heights, we must first address the gaping wounds holding us back.

FloorEgg 19 hours ago [-]
Yes, for me both customers and colleagues continually suggested "hey let's just take all these samples of past work and dump it in the magical black box and then replicate what they have been doing".

Instead I developed a UX that made it as easy as possible for people to explain what they want to be done, and a system that then goes and does that. Then we compare the system's output to their historical data and there is always variance, and when the customer inspects the variance they realize that their data was wrong and the system's output is far more accurate and precise than their process (and ~3 orders of magnitude cheaper). This is around when they ask how they can buy it.

This is the difference between making what people actually want and what they say they want: it's untangling the why from the how.

marlott 18 hours ago [-]
Interesting! Could you give an example with a bit more specific detail here? I take it there's some kind of work output, like a report, in a semi-structured format, and the goal is to automate creation of these. And you would provide a UX that lets them explain what they want the system to create?
FloorEgg 18 hours ago [-]
Yes, essentially.

There are multiple long-form text inputs, one set is provided by User A, and another set by User B. User A inputs act as a prompt for User B, and then User A analyzes User B's input according to the original User A inputs, producing an output.

My system takes User A and B inputs and produces the output with more accuracy and precision than User As do, by a wide margin.

Instead of trying to train a model on all the history of these inputs and outputs, the solution was a combination of goal->job->task breakdown (like a fixed agentic process), and lots of context and prompt engineering. I then test against customer legacy samples, and inspect any variances by hand. At first the variances were usually system errors, which informed improvements to context and prompt engineering, and after working through about a thousand of these (test -> inspect variance -> if system mistake improve system -> repeat) iterations, and benefiting from a couple base-model upgrades, the variances are now about 99.9% user error (bad historical data or user inputs) and 0.1% system error. Overall it took about 9 months to build, and this one niche is worth ~$30m a year revenue easy, and everywhere I look there are market niches like this... it's ridiculous. (and a basic chat interface like ChatGPT doesn't work for these types of problems, no matter how smart it gets, for a variety of reasons)

So to summarize:

Instead of training a model on the historical inputs and outputs, the solution was to use the best base model LLMs, a pre-determined agentic flow, thoughtful system prompt and context engineering, and an iterative testing process with a human in the loop (me) to refine the overall system by carefully comparing the variances between system outputs and historical customer input/output samples.
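A very stripped-down sketch of the shape of that flow (all prompts and names here are made up, and call_llm is a placeholder for whatever chat-completion API you use; the real system carries far more context at each step):

  # Hypothetical fixed goal -> job -> task pipeline around an LLM.

  def call_llm(system_prompt: str, user_prompt: str) -> str:
      raise NotImplementedError("wire this to your LLM provider of choice")

  def analyze_submission(user_a_inputs: str, user_b_inputs: str) -> dict:
      # Job 1: restate User A's free-form criteria as an explicit rubric.
      rubric = call_llm(
          "You turn free-form instructions into a numbered rubric.",
          user_a_inputs,
      )
      # Job 2: apply the rubric to User B's input, one task per criterion.
      analysis = call_llm(
          "Apply each rubric item to the submission; cite the text you relied on.",
          f"RUBRIC:\n{rubric}\n\nSUBMISSION:\n{user_b_inputs}",
      )
      # Job 3: produce the final structured output to compare against legacy samples.
      summary = call_llm("Summarize the analysis as a final assessment.", analysis)
      return {"rubric": rubric, "analysis": analysis, "output": summary}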

marlott 17 hours ago [-]
Thanks a lot for the detailed reply! Makes a lot of sense now. I'm working on similar problems, and have dabbled with this kind of approach.
PLenz 18 hours ago [-]
I've worked in 2 of those domains (I was a geographer at a web advertising company) and let me tell you, the data is only slightly better than the median industry and in the case of the geodata from apps I'd say it's far, far, far worse.
jandrewrogers 12 hours ago [-]
I have bad news about the quality of the data in geospatial, intelligence, and advertising.
Workaccount2 19 hours ago [-]
They'll pay academics to create data, in fact this is already happening.
simianwords 12 hours ago [-]
You just need data to be directionally correct. It doesn’t have to be absolutely correct.

We still got pretty far by scraping internet data which we all know is not fully trustworthy.

incompatible 18 hours ago [-]
When studying human-created data, you always need to be aware of these factors, including bias from doctrines, such as religion, older information becoming superseded, outright lies and misinformation, fiction, etc. You can't just swallow it all uncritically.
cs702 20 hours ago [-]
I don't think Sutton's essay is misunderstood, but I agree with the OP's conclusion:

We're reaching scaling limits with transformers. The number of parameters in our largest transformers, N, is now in the order of trillions, which is the most we can apply given the total number of tokens of training data available worldwide, D, also in the order of trillions, resulting in a compute budget C = 6N × D, which is in the order of D². OpenAI and Google were the first to show these transformer "scaling laws." We cannot add more compute to a given compute budget C without increasing data D to maintain the relationship. As the OP puts it, if we want to increase the number of GPUs by 2x, we must also increase the number of parameters and training tokens by 1.41x, but... we've already run out of training tokens.
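Spelled out as a quick sanity check, the rule of thumb above in code (nothing beyond the formula itself):

  import math

  def required_data_scale_up(compute_multiplier: float) -> float:
      # If C = 6 * N * D and N, D grow in proportion (compute-optimal training),
      # then C ~ D**2, so D (and N) must each grow by sqrt(compute_multiplier).
      return math.sqrt(compute_multiplier)

  print(required_data_scale_up(2.0))   # ~1.41: 2x the GPUs needs ~1.41x the tokens
  print(required_data_scale_up(10.0))  # ~3.16x the tokens for 10x the compute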

We must either (1) discover new architectures with different scaling laws, and/or (2) compute new synthetic data that can contribute to learning (akin to dreams).

antirez 19 hours ago [-]
This is true for the pre-training step. What if advancements in the reinforcement learning steps performed later could benefit from more compute and more model parameters? If right now the RL steps only help with sampling, that is, they only make the model prefer one reply it could already generate over another (there are papers pointing at this: if you generate many replies with just the common sampling methods, and you can verify the correctness of each reply, you discover that what RL helps with is selecting what was already potentially within the model's output), then this would be futile. But maybe advancements in RL will do for LLMs what AlphaZero-like models did for Chess/Go.
cs702 18 hours ago [-]
It's possible. We're talking about pretraining meaningfully larger models past the point at which they plateau, only to see if they can improve beyond that plateau with RL. Call it option (3). No one knows if it would work, and it would be very expensive, so only the largest players can try it, but why the heck not?
charleshn 19 hours ago [-]
> We cannot add more compute to a given compute budget C without increasing data D to maintain the relationship.

> We must either (1) discover new architectures with different scaling laws, and/or (2) compute new synthetic data that can contribute to learning (akin to dreams).

Of course we can; this is a non-issue.

See e.g. AlphaZero [0] that's 8 years old at this point, and any modern RL training using synthetic data, e.g. DeepSeek-R1-Zero [1].

[0] https://en.m.wikipedia.org/wiki/AlphaZero

[1] https://arxiv.org/abs/2501.12948

jeremyjh 19 hours ago [-]
AlphaZero trained itself through chess games that it played with itself. Chess positions have something very close to an objective truth about the evaluation, the rules are clear and bounded. Winning is measurable. How do you achieve this for a language model?

Yes, distillation is a thing but that is more about compression and filtering. Distillation does not produce new data in the same way that chess games produce new positions.

charleshn 19 hours ago [-]
You can have a look at the DeepSeek paper, in particular section "2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model".

But generally the idea is that you need some notion of reward, verifiers, etc.

Works really well for maths, algorithms, and many things actually.

See also this very short essay/introduction: https://www.jasonwei.net/blog/asymmetry-of-verification-and-...

That's why we have IMO gold level models now, and I'm pretty confident we'll have superhuman models for mathematics, algorithms, etc. before long.

Now domains which are very hard to verify - think e.g. theoretical physics etc - that's another story.

skeezyboy 8 hours ago [-]
> But generally the idea is that it's, you need some notion of reward, verifiers etc.

I don't think you're getting the point he's making.

voxic11 19 hours ago [-]
Synthetic data is already widely used to do training in the programming and mathematics domains where automated verification is possible. Here is an example of an open source verified reasoning synthetic dataset https://www.primeintellect.ai/blog/synthetic-1
jeremyjh 18 hours ago [-]
Are they actually producing new data though? This is the sort of thing I called "compression and filtering" because it seems that no new information content is being produced; rather, LLMs are used to distill the information we already have. We need more raw information.
voxic11 14 hours ago [-]
Yes this is new synthetic data which did not exist before. I encourage you to read the link.
jeremyjh 8 hours ago [-]
I think we're talking past each other, I'll try once more. Suppose you train an LLM on a very small corpus of data, such as all the content of the library of congress. Then you have that LLM author new works. Then you train a new LLM on the original corpus plus this new material. Do you really think you've addressed the core issue in the SP? Can more parameters be meaningfully trained even if you add more GPU?

To me, the answer is clearly no. There is no new information content in the generated data. It's just a remix of what already exists.

hackinthebochs 5 hours ago [-]
When it comes to logical reasoning, the difficulty isn't about having enough new information, but about ensuring the LLMs capture the right information. The problem LLMs have with learning logical reasoning from standard training is that they learn spurious relationships between the context and the next token, undermining their ability to learn fully general logical reasoning. Synthetic data helps because spurious associations are undermined by the randomness inherent in the synthetic data, forcing the model to find the right generic reasoning steps.
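A toy illustration of that point (standard library only, everything here made up): generate transitivity problems over throwaway names, so the only stable signal is the reasoning pattern itself rather than any particular entity.

  import random
  import string

  def random_name(rng: random.Random) -> str:
      return "".join(rng.choices(string.ascii_lowercase, k=6))

  def make_transitivity_example(rng: random.Random):
      # Chain like "a is larger than b. b is larger than c. ... Is a larger than z?"
      # The names carry no meaning, so a model can't lean on memorized facts.
      names = [random_name(rng) for _ in range(rng.randint(3, 6))]
      facts = [f"{a} is larger than {b}." for a, b in zip(names, names[1:])]
      question = f"Is {names[0]} larger than {names[-1]}?"
      return " ".join(facts) + " " + question, "Yes"

  rng = random.Random(42)
  prompt, answer = make_transitivity_example(rng)
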
jeremyjh 1 hours ago [-]
I agree! DeepSeek has shown this is incredibly powerful. I think their Qwen 8B model may be as good as GPT4’s flagship. And I can run it on my laptop if it’s not on my lap. But the amount of synthetic data you can generate is bounded by the raw information, so I don’t think it’s an answer to the SP.
voxic11 5 hours ago [-]
Yes if you have some way to verify the quality of the new works and you only include the high quality works in the new LLM's training set.
scotty79 18 hours ago [-]
Simple, you just need to turn language into a game.

You make models talk to each other, create puzzles for each other's to solve, ask each other to make cases and evaluate how well they were made.

Will some of it look like ramblings of pre-scientific philosophers? (or modern ones because philosophy never progressed after science left it in the dust)

Sure! But human culture was once there too. And we pulled ourselves out of this nonsense by the bootstraps. We didn't need to be exposed to 3 alien internets with higher truth.

It's really a miracle that AIs got as much as they did from purely human generated mostly garbage we cared to write down.

jebarker 59 minutes ago [-]
I feel like you're glossing over some very thorny details that it's not obvious we can solve. For example, if you just get two LLMs setting each other puzzles and scoring the other's solutions, how do you stop this just collapsing into nonsense? I.e. where does the actual source of truth for the puzzles come from?
hackinthebochs 5 hours ago [-]
>And we pulled ourselves out of this nonsense by the bootstraps.

Human progress was promoted by having to interact with a physical world that anchored our ramblings and gave us a reward function for coherence and cooperation. LLMs would need some analogous anchoring for it to progress beyond incoherent babble.

cs702 18 hours ago [-]
> Of course we can, ... synthetic data ...

That's option (2) in the parent comment: synthetic data.

FloorEgg 20 hours ago [-]
What about (3): models that interact with the real world?

To be clear I also agree with your (1) and (2).

tliltocatl 20 hours ago [-]
That's the endgame, but on the other hand, we already have one, it's called "humanity". No reason to believe that another one would be much cheaper. Interacting with the real world is __expensive__. It's the most expensive thing of all.
FloorEgg 20 hours ago [-]
Very true. Living cells are ~4-5 orders of magnitude more functional-information-dense than the most advanced chips, and there is a lot more living mass than advanced chips.

But the networking potential of digital compute is a fundamentally different paradigm than living systems. The human brain is constrained in size by the width of the female pelvis.

So while it's expensive, we can trade scope-constrained robustness (replication and redundancy at many levels of abstraction), for broader cognitive scale and fragility (data centers can't repair themselves and self-replicate).

Going to be interesting to see it all unfold... my bet is on stacking S-curves all the way.

Isharmla 12 hours ago [-]
> The human brain is constrained in size by the width of the female pelvis.

https://en.wikipedia.org/wiki/Obstetrical_dilemma

While the width is constrained by bipedal locomotion.

__d 19 hours ago [-]
> The human brain is constrained in size by the width of the female pelvis.

Well, it _was_ until recently.

FloorEgg 19 hours ago [-]
haha yeah I suppose so, but only barely...
socalgal2 18 hours ago [-]
> Living cells are ~4-5 orders of magnitude more functional-information-dense than the most advanced chips,

In what sense is this true? That sounds suspiciously like saying a cubic meter of dirt is more advanced than an iPhone because there are 6-7 orders of magnitude more atoms in the dirt.

FloorEgg 18 hours ago [-]
Well, sort of, but functional-information is a specifically defined term that you can google. "Advanced" is vague and in this context not very helpful.

Functional information is basically the amount of data (bits) necessary to explain all the possible functions matter can perform based on its unique configuration (in contrast to a random one). I am sure I partially butchered this explanation... but hopefully it's close enough to catch my drift.

Life is optimized to process and learn from the real world, and it is insanely efficient at it and functional-information dense. (It might even be at the theoretical limit) Our most advanced technology is still 4-5 orders of magnitude behind it.

The capabilities of your iPhone are extremely narrow when compared to a handful of dirt. To you it may seem the opposite, but you are probably mixing up utility to you with functional capability. Your iPhone has more functional utility to you, but the same amount of dirt has way more general functional utility. (Your iPhone isn't capable of self-replication, self-repair, and self-nonself distinction, aka autopoiesis.)

airstrike 19 hours ago [-]
> Living cells are ~4-5 orders of magnitude more functional-information-dense than the most advanced chips, and there is a lot more living mass than advanced chips.

I believe you but I would love to know where this number came from just so I can read more about it

FloorEgg 17 hours ago [-]
It's napkin math so take it with a pinch of salt, but I am calculating the information stored in the genome, assuming 2 bits per base pair, reducing to an estimated 88% coding fraction to get the functional bits, and then dividing by cell volume. I did this for a few different types of cells and then averaged the result to around 1–10 Mbit/μm³

# If there are any bioinformaticians around please come eviscerate or confirm this calc #

Then I compared it to TSMC's 2nm research memory macro (38.1 Mbit/mm^2), normalized to cell scale: 0.00019 Mbit/μm³

Living Cells: 1–10 Mbit/μm³

Current best chips: 0.00019 Mbit/μm³

https://research.tsmc.com/page/memory/4.html
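Roughly the same napkin math as code, for anyone who wants to poke at it; the cell volume and the assumed memory-layer thickness are my own guesses, so treat the exact numbers loosely:

  # Genome side: bits of functional information per unit cell volume.
  base_pairs = 3.1e9        # human genome, approx.
  bits_per_bp = 2
  coding_fraction = 0.88    # the estimate used above
  cell_volume_um3 = 2000.0  # assumed typical human cell volume (a guess)

  genome_mbit = base_pairs * bits_per_bp * coding_fraction / 1e6
  cell_density = genome_mbit / cell_volume_um3   # ~2.7 Mbit/um^3

  # Chip side: 38.1 Mbit/mm^2, spread over an assumed ~0.2 um active layer
  # (another guess) to get a volumetric figure.
  sram_mbit_per_um2 = 38.1 / 1e6
  chip_density = sram_mbit_per_um2 / 0.2         # ~1.9e-4 Mbit/um^3

  print(cell_density / chip_density)             # ~4 orders of magnitude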

imtringued 8 hours ago [-]
You are comparing the fastest writable memory available (SRAM) vs biological non-volatile memory that is essentially read-only. Samsung's 280-layer NAND reaches 28.5 Gbit per mm^2. I don't know how you would convert that to a volume, but if we simply multiply by 1000x for simplicity, it would be much closer to 0.19 Mbit/μm³, but even then you have to remember that NAND flash is still writable at pretty high speeds.
FloorEgg 1 hours ago [-]
This is true, I was only comparing functional information density (unique functional genome bps), not read/write speeds.

I was also taking the information from a genome and then dividing it by the volume of a cell, but there are many instances of the genome in a cell. I didn't count all instances because they aren't unique.

There's a lot to unpack with this comparison and my approximation was crude, but the more I've dug into it the more apparent it becomes how incredibly efficient life is at managing, processing, and storing information. Especially if you also consider the amino acids, proteins, etc. as information. No matter how you slice it, life seems orders of magnitude more efficient by every metric.

I'd like to think there's a paper somewhere where someone has carefully unpacked this and formally quantified it all.

airstrike 1 minutes ago [-]
[delayed]
incompatible 18 hours ago [-]
> The human brain is constrained in size by the width of the female pelvis.

I think this is an old belief that isn't supported by modern research.

FloorEgg 17 hours ago [-]
Thanks for pointing this out. I wasn't aware, and so I just dug into it.

From what I can tell, science used to point to this as the only/primary limit to human-brain size, but more recently the picture seems a lot less clear, with some indications that pelvis size doesn't place as hard of a constraint as we thought and there are other constraints such as metabolic (how many calories the mother can sustain during pregnancy and lactation).

So overall I'd say you are technically correct, even though this doesn't really materially change the point I was making; which is that the size of the human brain is constrained in ways that the size of data centers are not.

jvanderbot 20 hours ago [-]
Play in the real world generates a data point every few minutes. Seems a bit slow?
pizzly 20 hours ago [-]
Human experience (play in the real world) is multimodal through vision, sound, touch, pressure, muscle feedback, gravitational sense, etc. It's extremely rich in data. It's also not a series of data points but a continuous stream of information. I would also bet that humans synthesize data at the same time: every time we run multiple scenarios in our mind before choosing the one we execute, without even thinking about it, we are synthesizing data. Humans also dream, which is another form of data synthesis. Allowing AI to interact with the real world is definitely a way to go.
mannykannot 19 hours ago [-]
That's true, but still, a single individual or small group living isolated in the real world will, over a lifetime, learn only a tiny fraction of what we can learn from the written knowledge accumulated over millennia.

Having said that, I tend to agree that having AI interact with the world may be key: for one thing, I'm not sure whether there is any sense in which LLMs understand that most of the information content of language is about an external world.

FloorEgg 20 hours ago [-]
What are you basing that statement on?

What exactly are you considering a "data point"?

Are you assuming one model = one agent instance?

I am pretty sure that there is more information (molecular structure) and functional information (I(Ex)) just in the room I am sitting in than all the unique, useful, digitized information on earth.

dosnem 13 hours ago [-]
This seems so simple but I’m totally not understanding it..

If C = D^2, and you double compute, then 2C ==> 2D^2. How do you and the original author get 1.41D from 2D^2?

yberreby 13 hours ago [-]
If C ~ D^2, then D ~ sqrt(C).

In other words, the required amount of data scales with the square root of the compute. The square root of 2 ~= 1.414. If you double the compute, you need roughly 1.414 times more data.

dosnem 6 hours ago [-]
Thanks for clarification!
credit_guy 15 hours ago [-]
> There is no second internet

I don't know about that. LLMs have been trained mostly on text. If you add photos, audio and videos, and later even 3D games, or 3D videos, you get massively more data than the old plain text. Maybe by many orders of magnitude. And this is certainly something that can improve cognition in general. Getting to AGI without audio and video, and 3D perception seems like a non-starter. And even if we think AGI is not the goal, further improvements from these new training datasets are certainly conceivable.

Symmetry 5 hours ago [-]
Also, even if we lacked the data to proceed with Chinchilla-optimal scaling, that wouldn't be the same as being unable to proceed with scaling; it would just require larger models and more flops than we would prefer.
1970-01-01 6 hours ago [-]
Yes. It's a complete oversight and wrong. This paper missed:

darknets, the deep web, Usenet, BBS, Internet2, and all other paywalled archives.

aerospades 13 hours ago [-]
I disagree with the author's thesis about data scarcity. There's an infinite amount of data available in the real world. The real world is how all generally intelligent humans have been trained. Currently, LLMs have just been trained on the derived shadows (as in Plato's allegory of the cave). The grounding to base reality seems like an important missing piece. The other data type missing is the feedback: more than passively training/consuming text (and images/video), being able to push on the chair and have it push back. Once the AI can more directly and recursively train on the real world, my guess is we'll see Sutton's bitter lesson proven out once again.
NooneAtAll3 19 hours ago [-]
while I don't disagree with the facts, I don't understand the... tone?

when Dennard scaling (single-core performance) started to fail in the 90s-00s, I don't think there was a sentiment of "how stupid was it to believe in such scaling at all".

sure, people were compliant (and we still meme about running Crysis), but in the end the discussion resulted in "no more free lunch" - progress in one direction has hit a bottleneck, so it's time to choose some other direction to improve on (and multi-threading has now become mostly the norm)

I don't really see much of a difference?

nightsd01 15 hours ago [-]
I am not an expert in AI by any means but I think I know enough about it to comment on one thing: there was an interesting paper not too long ago that showed if you train a randomly-initialized model from scratch on questions, like a bank of physics questions & answers, the model will end up with much higher quality if you teach it the simple physics questions first, and then move up to more complex physics questions. This shows that in some ways, these large language models really do learn like we do.

I think the next steps will be more along this vein of thinking. Treating all training data the same is a mistake. Some data is significantly more valuable for developing an intelligent model than most other training data, even when you pass quality filters. I think we need to revisit how we 'train' these models in the first place, and come up with a more intelligent/interactive system for doing so.
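A minimal sketch of what that kind of curriculum ordering looks like in a training loop (the difficulty score is assumed to come from some external grader or solve rate, and the training call is a placeholder):

  from dataclasses import dataclass

  @dataclass
  class Example:
      prompt: str
      answer: str
      difficulty: float  # assumed to come from an external grader or solve rate

  def curriculum_stages(examples: list, n_stages: int = 3):
      # Yield training stages from easiest to hardest instead of shuffling it all together.
      ordered = sorted(examples, key=lambda e: e.difficulty)
      stage_size = max(1, len(ordered) // n_stages)
      for i in range(0, len(ordered), stage_size):
          yield ordered[i:i + stage_size]

  # for stage in curriculum_stages(dataset):
  #     train_one_epoch(model, stage)  # hypothetical training step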

a2128 12 hours ago [-]
From my personal experience training models this is only true when the parameter count is a limiting factor. When the model is past a certain size, it doesn't really lead to much improvement to use curriculum learning. I believe most research also applies it only to small models (e.g. Phi)
FloorEgg 15 hours ago [-]
Wow. I really like this take. I've seen how time and time again nature follows the Pareto principle. It makes sense that training data would follow this principle as well.

Further, the idea that the order of training matters is novel to me, and it seems so obvious in hindsight.

Maybe both of these points are common knowledge/practice among current leading LLM builders. I don't build LLMs, I build on and with them, so I don't know.

simianwords 12 hours ago [-]
I have never heard of the order of training data mattering in backpropagation
nikki93 15 hours ago [-]
A relevant paper: https://arxiv.org/abs/2306.11644 -- the Phi models (and many others too) are based on this idea.
lawrencechen 12 hours ago [-]
In the bitter lesson essay [0], the word "data" is not mentioned a single time.

The author fundamentally misunderstands the bitter lesson.

[0] https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...

heresie-dabord 9 hours ago [-]
"We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach."
imtringued 8 hours ago [-]
What if the meta bitter lesson is that data scaling is just a more extreme form of the human-centric approach of building knowledge into agents? After all, we're telling the model what to say, think and how to behave.

A true general method wouldn't rely on humans at all! Human data would be worthless beyond bootstrapping!

heresie-dabord 5 hours ago [-]
Another meta bitter lesson: we don't understand ourselves well enough to define and build something that thinks as we do.
TheDong 8 hours ago [-]
Is the data input into ChatGPT not a large enough source of new data to matter?

People are constantly inputting novel data, telling ChatGPT about mistakes it made and suggesting approaches to try, and so on.

For local tools, like claude code, it feels like there's an even bigger goldmine of data in that you can have a user ask claude code to do something, and when it fails they do it themselves... and then if only anthropic could slurp up the human-produced correct solution, that would be high quality training data.

I know paid claude-code doesn't slurp up local code, and my impression is paid ChatGPT also doesn't use input for training... but perhaps that's the next thing to compromise on in the quest for more data.

johnecheck 6 hours ago [-]
NO! CLEARLY THE ENTIRE CORPUS OF HUMAN LITERATURE AND THE INTERNET DOESN'T CONTAIN ENOUGH INFORMATION TO EDUCATE AN EXPERT!!!! I JUST NEED ANOTHER BILLION DOLLARS PLS PLS PLS I PROMISE THE SCALING LAWS ARE ACTUALLY LAWS THIS TIME
geetee 19 hours ago [-]
I don't understand why we need more data for training. Assuming we've already digitized every book, magazine, research paper, newspaper, and other forms of media, why do we need this "second internet?" Legal issues aside, don't we already have the totality of human knowledge available to us for training?
hansvm 11 hours ago [-]
We don't have anything close to the totality of human knowledge digitized, much less in a form that LLMs can easily take advantage of. Even for easily verifiable facts powering modern industry, details like appropriate lube/speeds/etc for machining molybdenum for this or that purpose just don't exist outside of the minds of the few people who actually do it. Moreover, _most_ knowledge is similarly locked up inside a few people rather than being written down.

Even when written down, without the ability to interact with and probe the world like you did growing up, it's not possible to meaningfully tell the difference between 9/11 hoaxers and everyone else save for how frequently the respective texts appear. They don't have the ability to meaningfully challenge their world model, and that makes the current breadth of written content even less useful than it might otherwise appear.

decimalenough 17 hours ago [-]
The goal/theory behind the LLM investment explosion is that we can get to AGI by feeding them all the data. And to be clear, by AGI I don't mean "superhuman singularity", just "intelligent enough to replace most humans" (and, by extension, hoover up all the money we're spending on their salaries today).

But if we've already fed them all the data, and we don't have AGI (which we manifestly don't), then there's no way to get to AGI with LLMs and the tech/VC industry is about to have a massive, massive problem justifying all this investment.

dr_dshiv 19 hours ago [-]
Let’s keep in mind that we don’t have most of the renaissance through the early modern period (1400-1800) because it was published in neolatin with older typefaces— and only about 10% is even digitized.

We probably don’t have most of the Arabic corpus either — and barely any Sanskrit. Classical Chinese is probably also lacking — only about 1% of it is translated to English.

jacobolus 18 hours ago [-]
The volume of text in English and digitized from the past few years dwarfs the volume of Latin text from all time. Unless you are wondering about a very niche historical topic there’s more written in English than Latin about basically everything.
dr_dshiv 18 hours ago [-]
Well, if you are looking for diversity of perspective— temporal diversity may be valuable.

Marsilio Ficino was hired by the Medici to translate Plato and other classical Greek works into Latin. He directly taught DaVinci, Raphael, Michelangelo, Toscanelli, etc. I mean to say that his ideas and perspectives helped spark the renaissance.

Insofar as we hope for an AI renaissance and not an AI apocalypse, it might benefit us to have the actual renaissance in the training data.

jacobolus 13 hours ago [-]
And here you can e.g. find Ficino's correspondence translated into English, with commentary, https://archive.org/details/lettersofmarsili0000fici

If you make a cursory search you can also find other translations of his works, various biographies, and a wide range of commentary and criticism by later authors.

Many of Ficino's originals are also in the corpus of scanned and OCRed or recently republished texts. I'm sure there are archives here or there with additional materials which have not been digitized, but it seems questionable whether those would make any significant difference to a process as indiscriminate and automatic as LLM training.

dr_dshiv 9 hours ago [-]
Yes, but many of his books are not translated or ocr’d. For instance, La pestilenzia or de mysteriis.

And he is one of the most central figures of the renaissance. Less than 20% of neolatin has been digitized, let alone translated.

It is fine to question whether including neolatin, Arabic or Sanskrit in AI training will make AI better.

But for me, it is a core part of humanism that would be a shame to neglect.

18 hours ago [-]
typpilol 18 hours ago [-]
Don't most models learn from different language sets already?
kbenson 18 hours ago [-]
I interpreted it as a roundabout way of increasing quality. Take any given subreddit. You have posts and comments, and scores, but what if the data quality isn't very good overall? What if instead of using it as is, you instead had an AI evaluate and reason about all the posts, and classify them itself based on how useful the posts and comments are, how well they work out in practice (if easily simulated), etc? Essentially you're using the AI to provide a moderated and carefully curated set of information about the information that was already present. If you then ingest this information, does that increase the quality of the data? Probably(?), since you're throwing compute and AI reasoning at the problem ahead of time, reducing the compute needed later and diluting the low-quality data by adding additional high-quality data.
19 hours ago [-]
j7ake 14 hours ago [-]
The totality of human knowledge is a rounding error compared to what's needed for AGI
grumbelbart2 8 hours ago [-]
That is only true if your path to AGI is to take models similar to current models, and feed them with tons of data.

Advances in architecture and training protocols can and will easily dwarf "more data". I think that is quite obvious from the fact that humans learn to be quite intelligent using only a fraction of the data available to current LLMs. Our advantage is a very good pre-baked model, and feedback-based training.

adwn 10 hours ago [-]
What makes you think that? Especially given the fact that GI (without the 'A') is evidently very much possible with only a tiny fraction of the "totality of human knowledge".
brazzy 18 hours ago [-]
The point is that current methods are unable to get more than the current state-of-the-art models' degree of intelligence out of training on the totality of human knowledge. Previously, the amount of compute needed to process that much data was a limit, but not anymore.

So now, in order to progress further, we either have to improve the methods, or synthetically generate more training data, or both.

geetee 17 hours ago [-]
What does synthetic training data actually mean? Just saying the same things in different ways? It seems like we're training in a way that's just not sustainable.
reasonableklout 17 hours ago [-]
One example: when we want to increase performance on a task which can be automatically verified, we can often generate synthetic training data by having the current, imperfect models attempt the task lots of times, then pick out the first attempt that works. For instance, given a programming problem, we might write a program skeleton and unit tests for the expected behavior. GPT-5 might take 100 attempts to produce a working program; the hope is that GPT-6 would train on the working attempt and therefore take far fewer attempts to solve similar problems.

As you suggest, this costs lots of time and compute. But it's produced breakthroughs in the past (see AlphaGo Zero self-play) and is now supposedly a standard part of model post-training at the big labs.
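A bare-bones version of that loop, assuming the candidate solutions and tests are plain Python files and that `generate` stands in for the current model (both are placeholders):

  import subprocess
  import tempfile
  from pathlib import Path

  def passes_tests(candidate_code: str, test_code: str) -> bool:
      # Verify a candidate by running its unit tests in a scratch directory.
      # test_code is assumed to import solution and exercise it.
      with tempfile.TemporaryDirectory() as d:
          Path(d, "solution.py").write_text(candidate_code)
          Path(d, "test_solution.py").write_text(test_code)
          result = subprocess.run(["python", "-m", "pytest", "-q"],
                                  cwd=d, capture_output=True)
          return result.returncode == 0

  def make_training_example(generate, problem: str, test_code: str, attempts: int = 100):
      # Rejection sampling: keep the first attempt that verifies, discard the rest.
      for _ in range(attempts):
          candidate = generate(problem)  # placeholder for the current model
          if passes_tests(candidate, test_code):
              return {"prompt": problem, "completion": candidate}
      return None  # unsolved for now; maybe revisit with a stronger model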

incompatible 18 hours ago [-]
A lot of newspapers seem to be stuck behind paywalls, even when in the public domain.
sfpotter 15 hours ago [-]
Oh man, I love crazy stuff like this on HN. For a community which espouses rationality and careful thought, somehow an article with "C ~ D^2" has floated to the top. No notes.
4 hours ago [-]
eirikbakke 4 hours ago [-]
Humans require a _lot_ less training data to become, for instance, fluent in English. If a given AI algorithm needs to be trained on the entire Internet to accomplish the same, then it seems safe to assume that the data has not really been "mined out".

Generating more training data from the same original data should not be fundamentally problematic in that sense.

felipeerias 4 hours ago [-]
It only seems that way because much of the data that humans use is not in a format that computers would understand. A toddler learning to talk is engaging their full body.
ausbah 2 hours ago [-]
humans also have billions of years of evolution and trillions of organisms to develop a receptacle biased towards learning language
eirikbakke 2 hours ago [-]
Billions of years of evolution, but still limited to the data that is replicated in human genome/DNA, which is about 3 gigabytes (+epigenome).
nahuel0x 5 hours ago [-]
Also consider how the massive datasets powering LLMs were generated. In the case of text, it was generations of human lives, experiences, and interactions with the real world that coagulated into masses of text and into language itself, not to mention the evolutionary process that made that possible. There is a history of biological computation and interaction behind what seems to be static data.
ausbah 2 hours ago [-]
isn’t the problem of not enough data just a problem of not having grounding in the world? world models and everything feel like they’re just dancing around the problem with thin veneers of human-in-the-loop and the verifiable domains we already have
madrox 17 hours ago [-]
In any field where there is a creative element, progress comes in fits and starts that are difficult to predict in advance. No one can accurately predict when we'll get the cure for cancer, for example, in spite of people working on it.

But that isn't how investors operate. They want to know what they will get in exchange for giving a company a billion dollars. If you're running an AI business, you need to set expectations. How do you do that? Go do the thing you know you can do on a schedule, like standing up a new GPU data center.

I don't think the bitter lesson is misunderstood in quite the way the author describes. I think most are well aware we're approaching the data wall within a couple years. However, if you're not in academia you're not trying to solve that problem; you're trying to get your bag before it happens.

That may sound a little flip, but this is yet another incarnation of the hungry beast: https://stvp.stanford.edu/clips/the-hungry-beast-and-the-ugl...

simianwords 12 hours ago [-]
Why do you assume investors don’t know about this? They know some investments follow the power law - very few of them work out but they bring most value.

The very existence of openAI and Anthropic are proof of it happening.

Imagine you were an investor and you know what you know now (creativity can’t be predicted). How would you then invest in companies? Your answer might converge on existing VC strategies.

frankenstine 19 hours ago [-]
> The path forward: data alchemists (high-variance, 300% lottery ticket) or model architects (20-30% steady gains)

No, the paths forward are: better design, training, feeding in more video, audio, and general data from the outside world. The web is just a small part of our experience. What about apps, webcam streams, radio from all over the world in its many forms, OTA TV, interacting with streaming content via remote, playing every video game, playing board games with humans, feeds and data from robots LLMs control, watching everyone via their phones and computers, car cameras, security footage and CCTV, live weather and atmospheric data, cable television, stereoscopic data, ViewMaster reels, realtime electrical input from various types of brains while interacting with their attached creatures, touch and smell, understanding birth, growth, disease, death, and all facets of life as an observer, observing those as a subject, expanding to other worlds, solar systems, galaxies, etc., affecting time and space, search and communication with a universal creator, and finally understanding birth and death of the universe.

fao_ 19 hours ago [-]
I'll give this comment more or less exactly the level of seriousness as it deserves, and say: lol
hn_acc1 18 hours ago [-]
Reminds me a bit of "Person of Interest" (the TV show).
mehulashah 11 hours ago [-]
I’m surprised by the argument. It’s not wrong. You need more data, but that presumes that the task is to pre-train on data. Additional compute is also useful for unearthing tacit capabilities in the models. This requires inference time scaling and post training usually on specific downstream tasks using RL. Sure that generates data, but it’s not the same as the Internet, and can be scaled.
EZ-Cheeze 8 hours ago [-]
The AI companies won't run out of data to train on. Almost every user interaction is a significant source of data. Chains of interactions are even more significant, especially the longer and more sophisticated they are. Yesterday I was given A/B tests from both GPT5-Thinking and Gemini 2.5 Pro, something neither of them had done before. OpenAI also just acquired Statsig for $1.1 billion. Statsig does A/B testing and other analytics.

The data scraped from the Internet and scanned books served its purpose: it bootstrapped something that we all love talking to and discussing ANYTHING with. That's the new source of data and intelligence.

WesolyKubeczek 1 hours ago [-]
> it bootstrapped something that we all love talking to and discussing ANYTHING with.

We all? Speak for yourself, dude

benlivengood 18 hours ago [-]
I don't think anyone has yet trained on all videos on the Internet. Plenty of petabytes left there to pretrain on, and likely just as useful once the text/audio/image pretraining is done.
simianwords 12 hours ago [-]
It might have been trained on a select set of high-quality videos, say those with more than 10k views, and only on their transcripts.
scrivna 2 hours ago [-]
Seems like reading a transcript of the commentary from a football game; it's obviously missing a lot of information.
benob 10 hours ago [-]
Stop thinking about text being the data. There are so many other sources, even some that you can generate.

https://arxiv.org/pdf/2506.20057

JumpCrisscross 10 hours ago [-]
> Stop thinking about text being the data

Path #2 in TFA.

TheDudeMan 20 hours ago [-]
I interpret The Bitter Lesson as suggesting that you should be selecting methods that do not need all that data (in many domains, we don't know those methods yet).
theahura 17 hours ago [-]
Has HRM really dramatically changed the landscape? My read of the paper thus far is that it is an impressive result, but there have been a few of those in the past that have fizzled out, so I'm still in wait-and-see mode
g42gregory 18 hours ago [-]
The D here is not exactly defined (or maybe I just missed that).

Does synthetic data count? What about making several more passes through already available data?

back2dafucha 19 hours ago [-]
About 28 years ago a wise person said to me: "Data will kill you." Even mainframe programmers knew it.
bwhiting2356 19 hours ago [-]
Audio and video data can be collected from the real world. It won't be immediate and won't be cheap.
d--b 5 hours ago [-]
How does the brain do it?

A baby's brain isn't wired to the entire internet. A 2-year-old has access to at most 2 years of HD video data, plus some other belly-ache and poo-smell stimuli. And a baby's brain has no replay capacity.

That's not a lot to work with.

Yet, a 2-year-old clearly thinks, is conscious, can understand and create sentences, and wants to annihilate everything just as much as Grok.

Sure you can scale data all you want. But there should be enough to work with without scaling like crazy.

Having AI know all the CSS tricks out there is one thing that requires a lot of data; AGI is different.

casey2 16 hours ago [-]
It's a bootstrapping problem. LLMs have shown that we can reproduce data that's already in the form we want, and use that data to solve novel problems. There is no shortage of data; it's just that data in the form you want is hard to come by. You want to create a model that generates steps for a robot with a particular shape? First you have to create a robot with that shape that can walk, then create a million of them and record them walking all over the place. Now you have something that's probably going to be too slow to run. Not feasible in the real world; the closest we have today is something like the driverless car (which is already a solved problem: they are called trains).

This is why I think China will ultimately win the AI race, they will be able to put tens of millions of people to a specific task until there is enough data generated to replace humans on that task in 99.99% of cases, and they have the manufacturing capability to make the millions of IO devices needed for this.

Yes, humanoid robots are a good idea, but only if you can train them with walking data from real people. I think it will probably translate well enough to most humanoid robots, but ideally you are designing the physical robot from the ground up to model human movement as closely as possible. You have to accept that if we go the LM route for AI, the optimal hardware behaves like human wetware. The neuromorphic computing people get it; robotics people should too.

datadrivenangel 11 hours ago [-]
If a problem is worth throwing 10 million people at, it's worth putting into a deterministically solvable form.

Legal AI would be easy if we made our legal code more robust

Mistletoe 18 hours ago [-]
>And herein lies the problem — we’ve basically ingested the entire Internet, and there is no second Internet.

One of the best things I've read in a while about AI.

paulsutter 17 hours ago [-]
Physical simulation is the most important underutilized data source. It's very large, but also finite. And once you've learned the complexity of reality, you won't need more data; you'll be done.
Quarrelsome 18 hours ago [-]
I really enjoyed reading this article as I found its content extremely insightful, but I fear I must whine for far too long about something entirely minor.

As someone who didn't go to expensive maths club, the way people who did talk about maths is disgraceful imho. Consider the equation in this article:

(C ~ 6 N⋅D)

I can look up the symbol for "roughly equals", that was super cool and is a great part of curiosity. But this _implied_ multiplication between the 6 and the N combined with using a fucking diamond symbol (that I already despise given how long it took me to figure out the first time I encountered it) is just gross. I figured it was likely that, but then I was like: "but why not just 6ND? Maybe there's a reason why N⋅D but 6 N? Does that mean there's a difference between those operations"?

Thankfully I can use gippity these days to get by, but before gippity I had to look up an entire list of maths symbols to find the diamond symbol to work out what it meant. It's why I love code: there's considerably less implicit behaviour once you slap down the formula into code and you can play with the input/output.
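
For what it's worth, here's my reading of that formula slapped down as code, assuming (as the scaling-law papers usually mean) that N is the parameter count, D the number of training tokens, and C the training compute in FLOPs:

    def training_flops(n_params: float, n_tokens: float) -> float:
        # C ~ 6 N⋅D, with every multiplication written out explicitly.
        return 6 * n_params * n_tokens

    # Illustrative numbers: a 70B-parameter model trained on 1.4T tokens.
    print(f"{training_flops(70e9, 1.4e12):.2e}")   # ~5.88e+23 FLOPs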

I don't think mathsy people realise how exclusionary their communication is, but it's so frustrating when I end up fumbling around in slow-mo when the maths kicks in. "Oh, the /2 when discussing logarithms in comp sci is _obvious_, so we just don't put it in the equation" just kills me. Idiot me, staring at the equation thinking it actually makes sense as written, not knowing the special implied maths conventions that mean it doesn't actually solve as it reads on the page. Unless of course you went to expensive maths club, where they tell you all this.

What drives me nuts is that every time I spend ages finally grokking something, I realise how obvious it is and how non-trivial it is to explain it simply. Comp sci isn't much better to be honest, where we use CQRS instead of "read here, write there". Which results in thousands of newbies trying to parse the unfathomable complexity of "Command Query Responsibility Segregation" and spending as much time staring at its opaqueness as I did the opening sentence of the wikipedia article on logarithms.

Idk what my point is. I just don't understand what's wrong with 6⋅N⋅D or 6*N*D. Do mathematicians feel ugly if they write something down like that or smth?

marcosdumay 4 hours ago [-]
> I just don't understand what's wrong with 6⋅N⋅D or 6ND

I think most people who read this would be confused and try to find out why there are undisclosed vector operations being applied to what look like scalar numbers.

And yeah, mathematical notation is ugly and confusing. But the fix is not as simple as you think it is.

Apropos CQRS, it's a marketing name. It's hard to understand on purpose. Actual CS-made names tend to be easier.

ghkbrew 17 hours ago [-]
I assume they use N⋅D rather than ND to make it explicit that these are two different variables. That's not necessary for 6N, because variable names don't start with a number by convention.
Quarrelsome 11 hours ago [-]
It's good we all learned this convention. Thanks for teaching it to me, though.

To clarify, if it read:

C ~ X N⋅D

you'd be as confused as me? It's because it's a number that it gets special implied mechanics, where we can skip operators because it's "obvious".

ghkbrew 4 hours ago [-]
Well, no, actually it'd still be clear to me that they mean the multiplication of three different variables: X, N, and D.

I don't think of it as eliding obvious operators. Rather, in mathematics, juxtaposition is used as an operator to represent multiplication. You would never elide an addition operator.

So X next to D still means multiplication as long as you can tell that X and D are separate entities.

I would wonder why they switched conventions in the middle of an expression though.

Chinjut 17 hours ago [-]
What diamond symbol?
Quarrelsome 11 hours ago [-]
Oh, it's a dot. Dots, diamonds, the absence of an operator: anything is multiplication, it seems. While this comment might look like a paragraph, it's actually a lot of maths.
Chinjut 3 hours ago [-]
But diamonds don't denote multiplication. That never happened. That was just you misreading.
throwaway314155 18 hours ago [-]
The scaling laws for transformers _deliberately_ factor in the amount of data as well as the amount of compute needed in order to scale.

The premise of this article, that data is more important than compute, has been obvious to people who are paying attention.

Sorry, but the unnecessary sensationalism in this article was mildly annoying to me. As if the author discovered some novel insight. A bit like that doctor who published a "novel" paper about how to find the area under a curve.

gavmor 18 hours ago [-]
> The premise of this article... has been obvious to people who are paying attention.

Well, forgive me, but I feel that the article is a much-needed injection of context into my thinking around the Bitter Lesson. I like the imperative to preface compute requests with data roadmaps.

I'm not an AI guy. Not an ML engineer. I've been studiously avoiding the low-level stuff, actually, because I didn't want to half-ass it when off-the-shelf solutions were still providing tremendous novelty and value for my customers.

So, for most of my career, "compute" has been practically irrelevant! RAM and disk constraints presented more frequent obstacles than processor cycles did. I would have easily told you that data presents more of a bottleneck to value than CPU. But that's just the era of computing I came up in.

The last few years have been different. Suddenly compute is at a premium, again. So it's easy to think, "if only I had more," and "line goes up!" and forget about s-curves and logarithmic scaling.

Is the article unnecessarily sensationalist? I don't know, maybe you've been overestimating how much the rest of us are "paying attention."[0]

0. https://xkcd.com/2501/
