UC Berkeley's New AI Benchmark Shows Models Still Struggle on Real-World Jobs
While the AI world was buzzing about IPOs and federal warnings, a lab at UC Berkeley quietly rolled out a test that could shake up the industry.
The benchmark, dubbed Agents’ Last Exam (ALE), was built with the help of more than 300 industry experts. It pits state‑of‑the‑art AI models against tasks that mirror actual professional work across more than 50 different fields—from audio processing to theoretical physics. The goal is simple yet ambitious: see whether an AI can earn a perfect score on a whole set of real‑world jobs.
In the first round of results, OpenAI’s GPT‑5.5 topped the leaderboard with a 24% pass rate. Claude Fable 5 from Anthropic followed closely at 22%, while Google Gemini, DeepSeek, and Grok each fell below 16%. A “pass” means the AI agent nailed every single task in a given run.
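To make the all-or-nothing scoring concrete, here is a minimal sketch of how such a pass rate could be computed, assuming each run is recorded as a list of per-task outcomes. The function names and the sample data are hypothetical illustrations, not ALE's actual scoring code or results.

```python
from typing import Sequence


def run_passes(task_results: Sequence[bool]) -> bool:
    """A run counts as a pass only if it has tasks and every one succeeded."""
    return len(task_results) > 0 and all(task_results)


def pass_rate(runs: Sequence[Sequence[bool]]) -> float:
    """Fraction of runs in which every task succeeded."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(run_passes(run) for run in runs) / len(runs)


# Made-up data (an assumption for illustration): four runs of three tasks each.
example_runs = [
    [True, True, True],     # every task succeeded: pass
    [True, False, True],    # one failure: the whole run fails
    [False, False, False],
    [True, True, False],
]

print(f"{pass_rate(example_runs):.0%}")  # 1 of 4 runs passes: 25%
```

Under this rule a single failed task sinks the entire run, which helps explain why scores can stay low even when an agent handles most individual tasks correctly.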
The test designers are careful to keep the evaluation fresh. They update the tasks regularly to reduce the risk of contamination—where training data overlaps with the test data—so the results reflect genuine ability rather than memorization.
The ALE project is led by the Berkeley Center for Responsible, Decentralized Intelligence, co‑directed by computer‑science professor Dawn Song and Haas School of Business professor Christine Parlour. Thirteen advisers from academia and industry—among them University of Southern California materials‑science professor Zhenglu Li—help shape the benchmark.
Yiyou Sun, a postdoc in Song’s group who runs the ALE initiative, says the test is meant to track AI’s impact on economically relevant tasks. “These tasks are actual jobs that experts have worked on,” he explains.
Li says the benchmark stays unbiased because its contributors aren’t tied to any AI company. She also notes that the low pass rates expose a shortage of interdisciplinary expertise in current AI training. “Companies can fine‑tune models to perform well on model‑specific benchmarks, but they might not perform as well on some general tasks,” she says.
A Stanford Ph.D. student and test collaborator, Benjamin Liu, raised a red flag about how AI failures manifest. In an email, he warned that agents often produce answers that look plausible but are subtly wrong. “In science a confident wrong answer is more dangerous than no answer, because someone might build on it,” Liu cautioned.
The benchmark results arrived at a tense moment for the two leading AI firms. Earlier in June, OpenAI and Anthropic filed for initial public offerings. Last Friday, Anthropic received a federal warning that forced it to shut down access to its latest models.
Kunyang (Oliver) Sun, a postdoc studying computational chemistry at Berkeley, sees ALE as a useful incentive. “Having a benchmark where all the frontier leading models are sitting at 20% is a good incentive for these models to continue becoming better,” he says.
The ALE leaderboard is publicly available and is designed to measure performance on long‑horizon, economically valuable tasks. The authors hope the benchmark will spur the development of AI agents that can reliably perform real professional workflows.
For now, the top pass rate of 24% indicates that even the most advanced commercial AI systems still struggle to handle complex, real‑world tasks consistently. The findings underscore the need for continued research into AI evaluation and for models that can reliably support a wide range of professional activities.