A knockout blow for LLMs?

Quoth Josh Wolfe, a well-respected venture capitalist at Lux Capital:

Ha ha ha. But what’s the fuss about?

Apple has a new paper; it’s pretty devastating to LLMs, a powerful followup to one from many of the same authors last year.

There’s actually an interesting weakness in the new argument—which I will get to below—but the overall force of the argument is undeniably powerful. So much so that LLM advocates are already partly conceding the blow while hinting at, or at least hoping for, happier futures ahead.

Wolfe lays out the essentials in a thread:

§

In fairness, the paper both GaryMarcus’d and Subbarao (Rao) Kambhampati’d LLMs.

On the one hand, it echoes and amplifies the training distribution argument that I have been making since 1998: neural networks of various kinds can generalize within the training distribution of data they are exposed to, but their generalizations tend to break down outside that distribution. That was the crux of my 1998 paper skewering multilayer perceptrons, the ancestors of current LLMs, by showing out-of-distribution failures on simple math and sentence prediction tasks, and the crux in 2001 of my first book (The Algebraic Mind), which did the same in a broader way, and central to my first Science paper (a 1999 experiment which demonstrated that seven-month-old infants could extrapolate in a way that then-standard neural networks could not). It was also the central motivation of my 2018 Deep Learning: A Critical Appraisal, and my 2022 Deep Learning is Hitting a Wall. I singled it out here last year as the single most important — and most important to understand — weakness in LLMs. (As you can see, I have been at this for a while.)

On the other hand, it also echoes and amplifies a bunch of arguments that Arizona State University computer scientist Subbarao (Rao) Kambhampati has been making for a few years about so-called “chain of thought” and “reasoning models” and their “reasoning traces” being less than they are cracked up to be. For those not familiar, a “chain of thought” is (roughly) the stuff a system outputs as it “reasons” its way to an answer, in cases where the system takes multiple steps; “reasoning models” are the latest generation of attempts to rescue LLMs from their inherent limitations, by forcing them to “reason” over time, with a technique called “inference-time compute”. (Regular readers will remember that when Satya Nadella waved the flag of concession in November on pure pretraining scaling - the hypothesis that my Deep Learning is Hitting a Wall critique addressed - he suggested we might find a new set of scaling laws for inference-time compute.)

Rao, as everyone calls him, has been having none of it, writing a clever series of papers that show, among other things, that the chains of thought that LLMs produce don’t always correspond to what they actually do. Recently, for example, he observed that people tend to over-anthropomorphize the reasoning traces of LLMs, calling it “thinking” when it perhaps doesn’t deserve that name. Another of his recent papers showed that even when reasoning traces appear to be correct, final answers sometimes aren’t. Rao was also perhaps the first to show that a “reasoning model”, namely o1, had the kind of problem that Apple documents, ultimately publishing his initial work online here, with followup work here.

The new Apple paper adds to the force of Rao’s critique (and my own) by showing that even the latest of these new-fangled “reasoning models” — even having scaled beyond o1 — still fail to reason reliably beyond the training distribution, on a whole bunch of classic problems, like the Tower of Hanoi. For anyone hoping that “reasoning” or “inference-time compute” would get LLMs back on track, and take away the pain of multiple failures at getting pure scaling to yield something worthy of the name GPT-5, this is bad news.

§

Hanoi is a classic puzzle with three pegs and multiple discs, in which you need to move all the discs from the left peg to the right peg, one disc at a time, never stacking a larger disc on top of a smaller one.

(You can try a digital version at mathisfun.com.)

If you have never seen it before, it takes a moment or two to get the hang of it. (Hint: start with just a few discs.)

With practice, a bright (and patient) seven-year-old can do it. And it’s trivial for a computer. Here’s a computer solving the seven-disc version, using an algorithm that any intro computer science student should be able to write:
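The algorithm in question is a textbook three-line recursion. A minimal sketch in TypeScript (function and peg names are my own):

```typescript
// Tower of Hanoi: move n discs from `from` to `to`, using `spare` as scratch.
// A solution for n discs always takes exactly 2^n - 1 moves:
// 127 moves for 7 discs, 255 for 8.
function hanoi(n: number, from: string, to: string, spare: string, moves: string[] = []): string[] {
  if (n === 0) return moves;
  hanoi(n - 1, from, spare, to, moves); // clear the n-1 smaller discs out of the way
  moves.push(`move disc ${n} from ${from} to ${to}`);
  hanoi(n - 1, spare, to, from, moves); // restack them on top of the disc we just moved
  return moves;
}

console.log(hanoi(7, "left", "right", "middle").length); // 127
```

Nothing about it requires search or learning; it is pure mechanical recursion.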

Claude, on the other hand, can barely do 7 discs, getting less than 80% accuracy (bottom left panel of the paper’s figure), and pretty much can’t get 8 correct at all.

Apple found that the widely praised o3-mini (high) was no better (see accuracy, top left panel, legend at bottom), and they found similar results across multiple tasks.

It is truly embarrassing that LLMs cannot reliably solve Hanoi. (Even with many libraries of source code to do it freely available on the web!)

And, as the paper’s co-lead author Iman Mirzadeh told me via DM:

it's not just about "solving" the puzzle. In section 4.4 of the paper, we have an experiment where we give the solution algorithm to the model, and all it has to do is follow the steps. Yet, this is not helping their performance at all.

So, our argument is NOT "humans don't have any limits, but LRMs do, and that's why they aren't intelligent". But based on what we observe from their thoughts, their process is not logical and intelligent.

If you can’t use a billion-dollar AI system to solve a problem that Herb Simon (one of the actual “godfathers of AI”, current hype aside) solved with AI in 1957, and that first-semester AI students solve routinely, the chances that models like Claude or o3 are going to reach AGI seem truly remote.

§

That said, I warned you that there was a weakness in the new paper’s argument. Let’s discuss.

The weakness, which was well laid out by an anonymous account on X (usually not the source of good arguments), was this: (ordinary) humans actually have a bunch of (well-known) limits that parallel what the Apple team discovered. Many (not all) humans screw up on versions of the Tower of Hanoi with 8 discs.

But look, that’s why we invented computers, and for that matter calculators: to compute solutions to large, tedious problems reliably. AGI shouldn’t be about perfectly replicating a human; it should (as I have often said) be about combining the best of both worlds, human adaptiveness with computational brute force and reliability. We don’t want an “AGI” that fails to “carry the one” in basic arithmetic just because humans sometimes do. And good luck getting to “alignment” or “safety” without reliability.

The vision of AGI I have always had is one that combines the strengths of humans with the strengths of machines, overcoming the weaknesses of humans. I am not interested in an “AGI” that can’t do arithmetic, and I certainly wouldn’t want to entrust global infrastructure or the future of humanity to such a system.

Whenever people ask me why I (contrary to widespread myth) actually like AI, and think that AI (though not GenAI) may ultimately be of great benefit to humanity, I invariably point to the advances in science and technology we might make if we could combine the causal reasoning abilities of our best scientists with the sheer compute power of modern digital computers.

We are not going to “extract the light cone” of the earth or “solve physics” [whatever those Altman claims even mean] with systems that can’t play Tower of Hanoi with a tower of 8 discs. [Aside from this, models like o3 actually hallucinate a bunch more than attentive humans, struggle heavily with drawing reliable diagrams, etc.; they happen to share a few weaknesses with humans, but on a bunch of dimensions they actually fall short.]

And humans, to the extent that they fail, often fail because of a lack of memory; LLMs, with gigabytes of memory, shouldn’t have the same excuse.

§

What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that LLMs are no substitute for good well-specified conventional algorithms. (They also can’t play chess as well as conventional algorithms, can’t fold proteins like special-purpose neurosymbolic hybrids, can’t run databases as well as conventional databases, etc.)

In the best case (not always reached) they can write Python code, supplementing their own weaknesses with outside symbolic code, but even this is not reliable. What this means for business and society is that you can’t simply drop o3 or Claude into some complex problem and expect it to work reliably.

Worse, as the latest Apple paper shows, LLMs may well work on your easy test set (like Hanoi with 4 discs) and seduce you into thinking they have built a proper, generalizable solution when they have not.

At least for the next decade, LLMs (with and without inference-time “reasoning”) will continue to have their uses, especially for coding and brainstorming and writing. And as Rao told me in a message this morning, “the fact that LLMs/LRMs don't reliably learn any single underlying algorithm is not a complete deal killer on their use. I think of LRMs basically making learning to approximate the unfolding of an algorithm over increasing inference lengths.” In some contexts that will be perfectly fine (in others not so much).

But anybody who thinks LLMs are a direct route to the sort of AGI that could fundamentally transform society for the good is kidding themselves. This does not mean that the field of neural networks is dead, or that deep learning is dead. LLMs are just one form of deep learning, and maybe others — especially those that play nicer with symbols — will eventually thrive. Time will tell. But this particular approach has limits that are clearer by the day.

§

I have said it before, and I will say it again:

Gary Marcus never thought he would live to become a verb. For an entertaining definition of that verb, follow this link.

You should know this before choosing Next.js

Picking the technology stack for a project is an important and consequential decision. In the enterprise space in particular, it often involves a multi-year commitment with long-lasting implications on the roadmap of the project, the pace of its development, the quality of the deliverables, and even the ability to assemble and maintain a happy team.

The open-source software model is a fundamental answer to this. By using software that is developed in the open, anyone is free to extend it or modify it in whatever way fits their use case. More crucially, the portability of open-source software gives developers and organisations the freedom to move their infrastructure between different providers without fear of getting locked into a specific vendor.

This is the expectation with Next.js, an open-source web development framework created and governed by Vercel, a cloud provider that offers managed hosting of Next.js as a service.

There is nothing wrong with a company profiting from open-source software it created, especially when that helps fund the development of the project. In fact, there are plenty of examples of that model working successfully in our industry.

But I think that can only work sustainably if the boundaries between the company and the open-source project are abundantly clear, with well-defined expectations between the maintainers, the hosting providers and the users about how and where each feature of the framework can be used.

I want to explain why I don't think this transparency exists today.

My goal is not to stop anyone from using Next.js, but to lay out as much information as possible so developers and businesses can make an informed decision about their technology stack.

Declaration of interest

Let me lead with a declaration of interest:

  • I work at Netlify and have done so for over four years
  • Netlify is a frontend cloud platform that supports Next.js and other web frameworks as part of its product offering
  • Netlify and Vercel are direct competitors

It's important for me to establish this for a few reasons.

My job involves building the infrastructure and tooling needed to support the full feature set of Next.js on Netlify, which has exposed me to the internals of the framework in a way that most people won't see. Over the years, I have seen concerning patterns of tight coupling between the open-source framework and the infrastructure of the company that builds it.

My employment is also the reason why I have always been very wary of voicing these concerns in public. As a Netlify employee, I don't really get to voice an objective concern about Next.js without people dismissing my claims as Netlify unleashing one of its minions to spread FUD about a competitor.

I'm not keen on exposing myself and the company to that type of debate, so I have always chosen to work behind the scenes in supporting the developers who decide to deploy their sites on Netlify and shield them from all the complexity that goes into making that possible.

But then something happened.

Last weekend, Vercel disclosed a critical security vulnerability with Next.js. This type of issue is normal, but the way Vercel chose to handle it was so poor, reckless and disrespectful to the community that it has exacerbated my concerns about the governance of the project.

For me, things change once your decisions put other people at risk, so I felt the urge to speak up.

Openness and governance

I'll come back to this incident later, but before that I want to back up a little and give you a peek behind the curtain. My history of reservations about the openness and governance of Next.js stem from a series of decisions made by Vercel over the years that make it incredibly challenging for other providers to support the full feature set of the framework.

I'll cover these by laying out a series of facts about how Next.js is built. I'll then add some of my own considerations about how those facts live up to the expectations of an open, interoperable, enterprise-grade software product.

Fact #1: No adapters

Most modern web development frameworks use the concept of adapters to configure the output of the framework to a specific deployment target: Remix, Astro, Nuxt, SvelteKit and Gatsby are just a few examples. This pattern allows developers to keep the core of their applications untouched, and simply swap the adapter if they decide to start deploying to a different provider.

These adapters can be maintained by framework authors, by the hosting providers, by the community, or all of the above. Frameworks are typically structured in such a way that it’s possible for anyone to build their own adapter in case one isn’t available for the provider of their choice.
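To make the pattern concrete, here is roughly what switching providers looks like in Astro: the application code stays untouched and only the adapter changes (package names as in Astro's docs at the time of writing, simplified):

```typescript
// astro.config.mjs — the deployment target is a one-line, swappable concern.
import { defineConfig } from "astro/config";
import netlify from "@astrojs/netlify";
// import vercel from "@astrojs/vercel"; // swap in to target Vercel instead

export default defineConfig({
  adapter: netlify(),
});
```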

Next.js does not have the concept of adapters and they have stated in the past that they would not support them. The output of a Next.js build has a proprietary and undocumented format that is used in Vercel deployments to provision the infrastructure needed to power the application.

Vercel's alternative to this was the Build Output API, a documented specification for the output format of frameworks who wish to deploy to Vercel.

This is not an adapter interface for Next.js, and in fact has nothing to do with Next.js. The announcement blog post said that Next.js supports this format, but as of today that isn’t true.
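For context, the Build Output API is essentially a documented filesystem contract. Simplified from Vercel's public spec (the function name below is illustrative, and the exact fields vary by version), a conforming build emits something like:

```
.vercel/output/
  config.json          routing rules, headers and redirects for the deployment
  static/              assets served directly from the CDN
  functions/
    render.func/       one directory per serverless or edge function
      .vc-config.json  runtime configuration for that function
```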

In November 2023, the Next.js documentation was updated to say that Next.js would adopt the Build Output API in the following major version of the framework (which would be version 15):

Next.js produces a standard deployment output used by managed and self-hosted Next.js. This ensures all features are supported across both methods of deployment. In the next major version, we will be transforming this output into our Build Output API specification.

Next.js 15.0.0 was released in October 2024 without support for the Build Output API.

Vercel have built the Build Output API because they wanted their customers to leverage the rich ecosystem of frameworks in the space, but their own framework doesn't support it to this day.

This means that any hosting providers other than Vercel must build on top of undocumented APIs that can introduce unannounced breaking changes in minor or patch releases. (And they have.)

Late last year, Cloudflare and Netlify joined OpenNext, a movement of different cloud providers that collaborate on open-source adapters for Next.js. Shortly after, Vercel engaged with the movement and committed to building support for adapters. They haven't made any timeline commitments, but have recently said they are actively working on it.

It's important to remember that it's been almost three years since the launch of the Build Output API, and to this day the framework still isn't portable. I'm cautiously optimistic about that actually changing this time.

Fact #2: No official serverless support

The official methods for self-hosting Next.js require running the application in a stateful way, as long-running servers. While technically possible, this is very hard to operate in any real-world production environment where a single instance isn’t sufficient.

The setup needs to be able to dynamically scale up very quickly in order to handle sudden bursts of traffic, while at the same time being able to scale down to zero in order to be cost-effective. This last part is essential when working with server components, for example, where the deep tangling between client and server code can break older clients unless every version of the server code ever deployed is available indefinitely.

One obvious answer to these requirements is serverless computing, as attested by official Next.js documentation that confirms the benefits of this model:

Serverless allows for distributed points of failure, infinite scalability, and is incredibly affordable with a "pay for what you use" model.

This clearly advantageous computing paradigm is precisely how Vercel has run Next.js sites in their own infrastructure for years. Given that Next.js is an open framework, it is reasonable to expect that you'd be able to use that same model in any serverless provider of your choice. But it's not that simple.

Next.js once had a serverless mode that you could enable with a configuration property, but it was removed without further explanation in October 2022. No equivalent mode was ever introduced.

The official React documentation, which the Next.js team help maintain, says that Next.js can be deployed to «any serverless hosting», but there is no official documentation whatsoever for this.

This means that any providers who want to offer support for Next.js with the same computing model that the framework itself promotes must reverse-engineer their way to a custom implementation.
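Concretely, the reverse-engineered implementations tend to start by wrapping the same Node request handler that next start uses inside the provider's own function primitive. A heavily simplified sketch, assuming a Node-style serverless runtime that invokes the exported handler with (req, res) pairs; real adapters, such as the OpenNext ones, additionally have to handle static assets, caching, image optimization and middleware splitting:

```typescript
// Sketch: running Next.js inside a serverless function instead of a
// long-running server. Not an official deployment mode.
import next from "next";
import type { IncomingMessage, ServerResponse } from "http";

const app = next({ dev: false });
const handle = app.getRequestHandler();
const ready = app.prepare(); // warm once per cold start, reused across invocations

export async function handler(req: IncomingMessage, res: ServerResponse): Promise<void> {
  await ready;
  await handle(req, res);
}
```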

Fact #3: Vercel-specific code paths

Next.js has code paths that are only ever executed for sites deployed to Vercel. An example of this is a private flag called minimal mode, which allows Vercel to shift work away from the framework and run it on their edge infrastructure.

Here's an example of why that matters. Next 12 introduced middleware, a way to address use cases such as feature flags, A/B tests and advanced routing. What's common in all of these use cases is the need to run logic on the hot path, behind the cache, with very low latency.
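For readers who haven't used it, the middleware API itself is small; an authentication gate, one of those flagship use cases, looks roughly like this (API per the Next.js docs; the cookie name and route matcher are illustrative):

```typescript
// middleware.ts — runs on the hot path, before the cached response.
import { NextResponse } from "next/server";
import type { NextRequest } from "next/server";

export function middleware(request: NextRequest) {
  // Redirect unauthenticated visitors to the login page.
  if (!request.cookies.get("session")) {
    return NextResponse.redirect(new URL("/login", request.url));
  }
  return NextResponse.next();
}

export const config = {
  matcher: ["/dashboard/:path*"], // only guard protected routes
};
```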

The announcement included this:

This works out of the box using next start, as well as on Edge platforms like Vercel, which use Edge Middleware.

In practice, this means that you have two options: use next start and run middleware alongside the rest of your application in your origin server (which is typically running in a single region, after the cache), or use one of the «Edge platforms like Vercel» to run middleware at the edge, before the cache, unlocking all the incredible use cases that Vercel boasted in the resources linked in the announcement.

The phrase «Edge platforms like Vercel» surely means that there are many alternatives out there because other providers were given the option to also implement middleware at the edge, right? No.

This secret minimal mode is what allowed Vercel to break out middleware from the rest of the application so they could run it at the edge, but only Vercel has access to it.

Netlify does support running middleware at the edge, but we've done it at the expense of having a full team of engineers dedicated to reverse-engineering the framework and building our own edge middleware implementation on top of undocumented APIs. This type of commitment is just impossible for smaller companies that simply do not have the resources to fight this battle, which makes most of them stop trying.

As far as I know, Netlify is the only cloud provider to support the full feature set of Next.js outside of Vercel, which doesn't make sense to me. With Next.js having such a sizeable share of the market, I would expect a lot more hosting options, which would foster competition and innovation across the board, ultimately benefitting users and the web.

So why is there a hidden door in Next.js for which only Vercel holds the key? I think it's expected that the framework maintainers regularly experiment with features before they're launched, but minimal mode isn't that. We're talking about an entirely different operation mode for the framework, which has been in the code base for many years and which unlocks capabilities that are reserved for the for-profit company that owns the framework.

If WordPress had a privileged code path that was only accessible to sites deployed to Automattic properties, would it be trusted as a truly open project and would it have the dominance it has today?

Security posture

Let's go back to the security incident. On Friday, March 21st at 10:17 AM (UTC), Vercel published a CVE for a critical security incident, ranked with a severity of 9.1 out of 10.

In essence, it was possible for anyone to completely bypass Next.js middleware by sending a specific header in the request. This is important because authentication was one of the flagship use cases of middleware, and this exploit meant that anyone could bypass the authentication layer and gain access to protected resources.
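The header in question was x-middleware-subrequest, which Next.js uses internally to keep middleware from re-running on its own subrequests. For deployments that couldn't upgrade immediately, the widely circulated stopgap was to strip that header from all external traffic before it reaches the application. A minimal sketch, assuming an Express-style proxy sits in front of the app:

```typescript
// Stopgap for the middleware bypass (CVE-2025-29927): drop the internal
// header from incoming requests so it cannot be used to skip middleware.
import type { Request, Response, NextFunction } from "express";

export function stripMiddlewareSubrequest(req: Request, _res: Response, next: NextFunction): void {
  delete req.headers["x-middleware-subrequest"];
  next();
}
```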

As the incident unravelled, a few things became apparent. First of all, the vulnerability was reported to the Next.js team on February 27th, but it wasn't until March 14th that the team started looking into it. Once they did, they started pushing fixes for Next 14 and Next 15 within a couple of hours.

So by March 14th (at the latest), Vercel knew they had a serious incident on their hands. The responsible thing to do at that point would be to immediately disclose the vulnerability to other providers, so that they could assess the impact to their own customers and take any necessary actions to protect them as quickly as possible. At times like these, our duty to protect users should rise above any competition between companies.

That is not what happened. It took Vercel 8 (eight) days to reach out to Netlify. In that time, they managed to push patches to Next.js, cut two releases, and even write a blog post that framed the incident as something that Vercel's firewall had «proactively protected» their customers from (even though their CTO later said that their firewall had nothing to do with it).

I think it's incredibly disingenuous to spin a critical security vulnerability in your open-source project as a strength of your product, with absolutely no consideration for whether users in other providers were also affected and what they should do to mitigate. In fact, they wouldn't even know this, because they hadn't even reached out to us at this point.

After being called out on social media, Vercel have rewritten the blog post to remove any mention of their firewall and clarify which providers had been affected and whether their customers had to take any action.

Vercel then released a postmortem where they said — for the first time — that on March 21st they were able to «verify Netlify and Cloudflare Workers were not impacted». This is directly contradicted by their staff reaching out to Netlify on March 22nd offering help to «get a patch up». If we were not impacted, what was there to patch?

This lack of consideration for any users outside of Vercel has created unnecessary anxiety and confusion for a lot of people, leaving some providers scrambling to find a solution and then having to partially roll it back, others announcing that they were not vulnerable when in reality they were, etc.

As you read this, it's impossible for anyone to know how many sites out there are still vulnerable to this exploit, many of which would've been safe if things were handled differently.

And at the height of all this mess, Vercel's leadership had... a different focus.

But Vercel owns Next.js

They do. And they have every right to make a business out of the framework that they've put so much work, talent, time and energy into building and growing. I'm not disputing that.

But that growth holds them to a high bar of standards that, in my opinion, they have repeatedly failed to meet.

«If Vercel own Next.js, what incentive do they have to open it up to other providers?» is a question I sometimes see and which I find intriguing. What incentives does Redis have for opening up their software when they own Redis Cloud? Why make Grafana open when Grafana Cloud is owned by the same company? Or WordPress, ClickHouse and many others?

The incentive is that they have to do those things if they choose to publish their software as open-source and not as a closed, proprietary solution. Their success is intrinsically associated with their users having the guarantee that they are free to choose whatever provider offers the service that meets their needs at any given time.

Wrapping up

It's not my business to say which framework you should use. If you like Next.js and you still think it's the best tool for the problem you need to solve, you should absolutely use it. But I hope that this information helps you feel more confident about your decision, whichever way you're leaning.

As for me, I'll keep doing my job to help support the developers who choose to deploy their sites to Netlify, whatever their framework of choice is. And competition aside, I'm genuinely looking forward to helping Vercel make Next.js more open and interoperable through the OpenNext movement. ∎

Update (March 26th): Added a note about Vercel's most recent postmortem.

Update (March 28th): Vercel have committed «to not introduce any new privileged code paths and to either remove or fully document the ones that exist today, such as minimal mode». As for timelines, they are «hoping to get it done this year».

Update (April 23rd): I have submitted a PR to fix incorrect information about Next.js deployment options in the React documentation.

My AI Skeptic Friends Are All Nuts

Lessons From Cursor's System Prompt

Volcano

And other mountains.

Which New Language Should I Learn for Web Development?
