OpenAI Says AGI is Close. The Hardest AI Test Says They Are at 4%.
I've watched the AGI conversation change dramatically in the past three years. It went from a fringe research topic to daily headlines — and somewhere in that transition, the term lost most of its meaning. Companies are now claiming they're "close to AGI" while the benchmarks specifically designed to measure AGI capability score their best systems at 4%. Those two facts can't both be true unless the definitions are fundamentally different. They are. Here's what's actually happening.
The gap between AI marketing claims about AGI and the scientific benchmarks designed to measure AGI capability is the most important and least-discussed story in AI in 2026.
Artificial General Intelligence is the version of AI that can think the way humans can — not just at one specific task, but flexibly, across novel domains, with the ability to learn new skills from minimal examples just as a person would.
We don't have that yet. But we also can't agree on what "having it" would even look like. And that definitional vacuum is being actively exploited.
What AGI Actually Means — And Why the Definition Matters So Much
The term "artificial general intelligence" was popularized in the early 2000s to distinguish a hypothetical future AI from the narrow, task-specific AI systems that already existed. Narrow AI (what we have today) is superhuman at chess, image recognition, protein folding, and language generation — but only at those specific tasks it was trained on.
AGI, in the original academic sense, would be able to do what humans can: transfer knowledge between domains, learn new skills from minimal examples, reason about novel problems it's never encountered, and adapt to entirely new environments.
The Four Most Used AGI Definitions — and Why They Produce Different Answers
This is the detail most coverage skips: there is no single definition of AGI, and different organizations deliberately use different ones.
- Cognitive Definition: AI that can perform any intellectual task a human can — the original academic meaning. Requires genuine generalization across all domains.
- Economic Definition (OpenAI's charter): "highly autonomous systems that outperform humans at most economically valuable work." No requirement for general reasoning, just economic productivity.
- Behavioral Definition: If it reliably passes any test a human would pass, it's AGI — regardless of the underlying mechanism. This is often called the "Turing Test" framing.
- Efficiency Definition (Chollet/ARC): AGI must demonstrate the ability to acquire new skills efficiently from small amounts of data — the ability to generalize to genuinely novel tasks, not just pattern-match from training.
The economic definition is the easiest to satisfy: a collection of narrow AI systems, each handling a specific job category, could in principle "outperform humans at most economically valuable work" without any single system reasoning generally. The efficiency definition is the most demanding, and current AI clearly fails it.
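To make the point concrete, here is a minimal sketch of how one system can pass one definition and fail the others. Everything in it is an assumption for illustration: the `SystemProfile` fields, the numeric thresholds standing in for words like "most" and "reliably", and the example profile itself (only the 0.04 figure echoes the roughly 4% ARC-AGI-2 score discussed in this article).

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SystemProfile:
    """Hypothetical measurements for one AI system (all fields are illustrative)."""
    economic_task_share: float   # fraction of economically valuable work where it beats humans
    human_test_pass_rate: float  # fraction of human-passable tests it also passes
    novel_task_accuracy: float   # accuracy on unseen tasks learned from a few examples
    all_domains_covered: bool    # can it handle any intellectual task a human can?


# Each predicate encodes one definition. The thresholds are assumptions chosen
# only to make the wording ("most", "reliably") concrete.
DEFINITIONS: Dict[str, Callable[[SystemProfile], bool]] = {
    "cognitive": lambda p: p.all_domains_covered,
    "economic": lambda p: p.economic_task_share > 0.5,       # "most" work
    "behavioral": lambda p: p.human_test_pass_rate >= 0.99,  # "reliably passes"
    "efficiency": lambda p: p.novel_task_accuracy >= 0.95,   # near a human baseline
}

# A made-up profile: strong economically, weak at novel generalization.
profile = SystemProfile(
    economic_task_share=0.6,
    human_test_pass_rate=0.8,
    novel_task_accuracy=0.04,
    all_domains_covered=False,
)

verdicts = {name: rule(profile) for name, rule in DEFINITIONS.items()}
for name, is_agi in verdicts.items():
    print(f"{name:>10}: {'AGI' if is_agi else 'not AGI'}")
```

With these assumed numbers, only the economic definition returns a positive verdict. The exact thresholds matter less than the structure: the same measurements produce different answers depending on which predicate you pick.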
⚡ The Overlooked Conflict of Interest in AGI Declarations
OpenAI's nonprofit charter explicitly states that if AGI is achieved, the commercial investors' returns are capped — the nonprofit mission takes over, and capped-profit investors don't get unlimited upside. This creates an unusual incentive: there is financial pressure to either declare AGI sooner (triggering the nonprofit governance structure) or to keep the definition just loose enough to never quite trigger it. Understanding this structural tension explains why "close to AGI" language appears frequently in OpenAI communications without ever resulting in a formal AGI declaration.
The ARC-AGI Benchmark — The Most Honest Test Nobody Talks About
In 2019, François Chollet — a Google AI researcher and creator of Keras, one of the most widely used deep learning frameworks — published a paper arguing that existing AI benchmarks were fundamentally flawed for measuring AGI progress.
His argument: any benchmark that uses training-data-accessible patterns can be "solved" by memorization and pattern matching, not genuine reasoning. To measure true general intelligence, you need tasks that require novel generalization from minimal examples — tasks you literally cannot memorize your way through.
What ARC Tasks Actually Test
ARC (Abstraction and Reasoning Corpus) presents visual grid puzzles. You're shown a few input-output example pairs (typically around three). You then have to identify the abstract transformation rule and apply it to a new input.
Humans, including children, solve most of these with little effort because people naturally abstract a rule from a handful of examples. AI systems backed by enormous compute budgets consistently struggle because they rely on statistical pattern matching from training, and these tasks are specifically constructed to defeat that approach.
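The task format described above can be sketched in a few lines. This is a toy illustration, not the benchmark's actual tooling: the grids, the three candidate transformations, and the `infer_rule` helper are all hypothetical. It assumes a grid is a list of rows of integer colour codes.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]  # each cell holds an integer colour code


def flip_horizontal(g: Grid) -> Grid:
    """Mirror each row left-to-right."""
    return [row[::-1] for row in g]


def flip_vertical(g: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [list(row) for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*g)]


# A fixed, hand-written menu of candidate rules (an assumption of this toy).
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "flip_horizontal": flip_horizontal,
}


def infer_rule(pairs: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the first candidate consistent with every example pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in pairs):
            return name
    return None


# Two demonstration pairs, then a new input to solve.
train_pairs = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0], [0, 4, 0]], [[0, 3, 3], [0, 4, 0]]),
]
test_input = [[5, 0, 0], [0, 6, 0]]

rule = infer_rule(train_pairs)
prediction = CANDIDATES[rule](test_input) if rule else None
print(rule, prediction)
```

The toy works only because the hidden rule happens to be on the hand-written menu. The article's point about ARC is that each task uses a rule the solver has not seen before, so a fixed menu, whether hand-coded or absorbed from training data, is the approach the benchmark is built to defeat.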
ARC-AGI-1 results: in late 2024, OpenAI's o3 model scored 75.7% within the benchmark's standard compute budget and 87.5% in a high-compute configuration that used far more compute than that budget allows. This was impressive, but a top score that depends on elevated compute is very different from reliable human-level performance at normal compute.
ARC-AGI-2 results: The updated version, released in 2025 with tasks even more resistant to pattern-matching, produces AI scores of approximately 4%. Humans score near 100%.
OpenAI's Five-Level Framework — What Each Stage Actually Means
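OpenAI's five-level framework, shared internally and reported publicly in 2024, describes five stages of AI development toward AGI. Understanding these levels explains how the company communicates about its position:
- Level 1 (Chatbots): conversational AI
- Level 2 (Reasoners): human-level problem solving
- Level 3 (Agents): systems that can take actions on a user's behalf
- Level 4 (Innovators): AI that can aid in invention
- Level 5 (Organizations): AI that can do the work of an organization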
OpenAI's economic definition of AGI roughly maps to Level 3-4 of this framework. The cognitive/scientific definition most researchers mean maps closer to Level 4-5.
Where Current AI Actually Sits Relative to AGI
| Capability | Current AI | AGI Requires | Gap |
|---|---|---|---|
| Language understanding | Excellent (narrow) | Genuine comprehension across all domains | Semantic vs. statistical |
| Novel task learning | Poor — needs vast training data | Learn new skills from 1–10 examples | Large |
| Abstract reasoning (ARC-AGI-2) | ~4% on benchmark | ~100% (human baseline) | Enormous |
| Long-horizon planning | Improving with agents | Reliable multi-week autonomous goals | Significant |
| Scientific discovery | AlphaFold-style narrow wins | General cross-domain innovation | Large |
| Common-sense physical reasoning | Consistently fails edge cases | Reliable intuitive physics model | Large |
| Economic productivity | Outperforms humans (many tasks) | OpenAI's definition threshold | Near (by this definition) |
The AGI Facts Most Articles Ignore
💡 The "Situational Awareness" Argument: The Internal Bullish Case
In mid-2024, Leopold Aschenbrenner (a former OpenAI safety researcher) published a 165-page document titled "Situational Awareness" arguing that AGI would arrive by 2027 based on compute scaling trajectories. The document circulated widely in Silicon Valley and influenced investment decisions. Aschenbrenner left OpenAI and the document was explicitly not an OpenAI position — but it represents the most detailed articulation of the accelerationist case. The core argument: AI capability improvements follow predictable scaling laws, and the curves, if extrapolated, cross human-level performance in specific domains on documented timelines. The counterargument: scaling laws may not extrapolate indefinitely, and ARC-AGI-2 suggests we haven't found the architectural breakthrough needed for genuine generalization.
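The extrapolation step in this argument, and why the counterargument bites, can be shown with a toy fit. The data points below are synthetic and chosen only so the arithmetic is easy to follow; they are not measurements of any real model, and `HUMAN_LEVEL` is an assumed threshold.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic trend: benchmark score versus log10 of relative training compute.
log10_compute = [0.0, 1.0, 2.0, 3.0]
score = [10.0, 20.0, 30.0, 40.0]

slope, intercept = fit_line(log10_compute, score)

HUMAN_LEVEL = 90.0  # assumed threshold, for illustration only
crossing_log10_compute = (HUMAN_LEVEL - intercept) / slope

print(f"slope={slope:.1f} points per 10x compute, intercept={intercept:.1f}")
print(f"straight-line trend reaches {HUMAN_LEVEL:.0f} at 10^{crossing_log10_compute:.0f}x compute")
```

The fit says nothing about whether the line keeps going. A curve that flattens after the last observed point would match the same four data points and never reach the threshold, which is the substance of the scaling skeptics' objection.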
💡 AlphaFold and AlphaGeometry Are Not AGI Evidence: They're Narrow AI Achievements
Google DeepMind's AlphaFold solved the protein folding problem. AlphaGeometry solved International Math Olympiad geometry problems. Both are extraordinary scientific achievements. Neither is evidence of AGI. Both systems are trained specifically for their narrow domains and cannot transfer their capabilities to unrelated tasks. A system that can predict protein folding cannot answer a geometry question. A system that can solve geometry cannot fold proteins. The conflation of narrow AI breakthroughs with AGI progress is one of the most common errors in mainstream AI coverage.
💡 Yann LeCun's "World Model" Argument: The Dissenting View Nobody Covers
Yann LeCun, Meta's Chief AI Scientist and one of the founding fathers of modern deep learning, has consistently argued that current transformer-based LLMs cannot achieve AGI — and that the field is fundamentally heading in the wrong direction. His argument: genuine intelligence requires a world model, a learned representation of physical and causal reality. LLMs predict text; they don't model the world. He proposes a different architecture based on self-supervised learning from video — learning physics and causality from observational data rather than from text. In 2026, this remains a minority view among industry practitioners but a significant voice in the research community. If LeCun is right, AGI requires a foundational architectural shift, not more compute scaling.
⚠️ The Safety Consideration That's Actually Underappreciated
Most AGI safety coverage focuses on superintelligent AI taking over. The underappreciated near-term concern: an AI system that is capable enough to be deployed in high-stakes domains (medical diagnosis, legal advice, critical infrastructure) but not capable enough to reliably know the limits of its own knowledge. The "confident hallucination" problem — systems that are wrong with high confidence — is most dangerous at exactly the capability level between current AI and true AGI. This is the safety research priority most deserving attention in 2026, not science-fiction scenarios about superintelligence.
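One common way to quantify "wrong with high confidence" is calibration: comparing how confident a system says it is with how often it is actually right. The sketch below computes expected calibration error (ECE) over binned confidence scores. The two prediction sets are invented for illustration, and the bin count is an arbitrary choice.

```python
from typing import List, Tuple

Prediction = Tuple[float, bool]  # (stated confidence in [0, 1], was the answer correct?)


def expected_calibration_error(preds: List[Prediction], n_bins: int = 5) -> float:
    """Weighted average gap between stated confidence and observed accuracy."""
    bins: List[List[Prediction]] = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))

    total = len(preds)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Invented examples: both systems claim 90% confidence on every answer.
calibrated = [(0.9, True)] * 9 + [(0.9, False)] * 1     # right 90% of the time
overconfident = [(0.9, True)] * 5 + [(0.9, False)] * 5  # right 50% of the time

ece_calibrated = expected_calibration_error(calibrated)
ece_overconfident = expected_calibration_error(overconfident)
print(f"calibrated: {ece_calibrated:.2f}, overconfident: {ece_overconfident:.2f}")
```

A system can be fairly accurate and still dangerous if this gap is large in the high-confidence bins, because those are the answers a human operator is least likely to double-check.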
The Timeline — What Researchers Actually Say
Expert disagreement on AGI timelines is not a sign of ignorance — it's a reflection of genuine scientific uncertainty. The honest answer in 2026 is that nobody knows, and anyone claiming certainty in either direction (imminent or impossible) is overconfident.
- Sam Altman (OpenAI): "A few years" from now — consistently the most bullish public estimate
- Demis Hassabis (Google DeepMind): "Probably a decade away" — more cautious, notes multiple unsolved problems
- Yann LeCun (Meta AI): "Not with current architectures" — requires fundamental new approaches
- 2022 AI researcher survey (AI Impacts), aggregate forecast: 50% probability of high-level machine intelligence by ~2059 (enormous range)
- François Chollet: Current AI is far from AGI as measured by genuine abstract reasoning benchmarks
Frequently Asked Questions
What is artificial general intelligence (AGI)?
AGI refers to a hypothetical AI system capable of performing any intellectual task a human can, including reasoning, learning, planning, and creativity across novel domains without task-specific training. Unlike current narrow AI, which excels only at tasks it was trained for, AGI would generalize flexibly. There is no universally agreed definition: OpenAI's charter defines it as "highly autonomous systems that outperform humans at most economically valuable work" (an economic definition), while researchers like François Chollet define it as systems that efficiently acquire new skills from minimal data, a much higher bar that current AI cannot meet.
Has AGI been achieved in 2026?
No — not by any widely accepted scientific definition. The ARC-AGI-2 benchmark — specifically designed to measure AGI-level abstract reasoning — scores current AI systems at approximately 4% while humans score near 100%. While companies like OpenAI describe themselves as "close to AGI" using economic productivity definitions, the cognitive and reasoning capabilities required by more demanding scientific definitions remain far out of reach. The gap between marketing claims and measurable benchmark performance is the central tension in AGI discourse in 2026.
What is the ARC-AGI benchmark and why does it matter?
ARC (Abstraction and Reasoning Corpus) is a benchmark created by François Chollet (then at Google) specifically designed to measure general fluid intelligence by requiring novel rule abstraction from minimal examples. Tasks can't be memorized from training data; they require genuine generalization. ARC-AGI-1 was partially solved by OpenAI's o3 in 2024 (75.7% within the standard compute budget, 87.5% at elevated compute). ARC-AGI-2 (2025) produces AI scores of ~4% with humans scoring near 100%. It's considered the most scientifically honest test of AGI-relevant capabilities currently available.
What is OpenAI's definition of AGI?
OpenAI's charter defines AGI as "highly autonomous systems that outperform humans at most economically valuable work." This is an economic and functional definition, not a cognitive one. OpenAI's five-level framework places AGI roughly at Level 3 (Agents) to Level 4 (Innovators). Importantly, OpenAI's charter caps investor returns if AGI is declared achieved, creating a structural tension around when and how AGI is formally recognized. OpenAI assessed itself as "approaching Level 3" in 2025.
When will AGI be achieved?
Expert estimates range enormously: Sam Altman says "a few years," Demis Hassabis says "probably a decade," Yann LeCun says "not with current architectures." A 2022 survey of AI researchers produced an aggregate estimate of ~2059 for a 50% probability of high-level machine intelligence. The honest answer is that nobody knows, and anyone expressing certainty is overconfident. The timeline debate is largely unresolvable until AGI has a clearer agreed definition, which it currently lacks.
The AGI debate in 2026 is more about definitions and incentives than about actual capability gaps — though the capability gaps are real and measurable. Understanding the difference between the economic definition, the cognitive definition, and the benchmark reality gives you a more accurate map of where AI actually is.