Artificial Intelligence Systems Are Passing Increasingly Complex Exams

Can static tests still capture intelligence when AI systems adapt rapidly and match or exceed human baselines on some benchmarks?

Artificial intelligence systems are achieving record-breaking scores on benchmarks designed to evaluate AI capabilities. Yet a growing body of research suggests these results may be misleading. Models that appear to reason, explain, and even outperform humans on benchmark evaluations often fail when deployed in unfamiliar or real-world conditions. The problem, according to a new survey published in IEEE Access, is not the intelligence of the machines but the exams themselves.

In “The Artificial Intelligence Cognitive Examination: A Survey on the Evolution of Multimodal Evaluation From Recognition to Reasoning,” first author Mayank Ravishankara, an independent researcher based in San Francisco, argues that AI evaluation has entered a critical phase. The paper reframes the history of AI benchmarks as a series of increasingly complex cognitive examinations, each designed to expose the weaknesses of the previous generation. As Ravishankara and co-author Varindra V. Persad Maharaj explain, AI systems have become adept at passing static tests without demonstrating genuine understanding.

The article analyses how AI evaluation has evolved from simple recognition tasks to sophisticated multimodal reasoning benchmarks. It also raises a pressing question for researchers, clinicians, and policymakers alike. If today’s most advanced AI systems can rapidly saturate even the hardest reasoning tests, how should intelligence be measured in the future?

The rise of recognition and the illusion of understanding

The modern AI revolution began with a deceptively simple question. Can a machine recognise what it sees? Early benchmarks such as ImageNet and PASCAL VOC were designed to answer this by testing whether models could correctly identify objects in images.

In 2012, deep neural networks reduced error rates on ImageNet by unprecedented margins, triggering widespread optimism about artificial intelligence. These recognition benchmarks provided a common yardstick for progress in computer vision. They were statistically reliable, easy to score, and computationally efficient. For several years, rising accuracy numbers were treated as direct evidence that machines were learning to see the world more like humans.

However, cracks soon appeared. Researchers discovered that models could achieve high accuracy by exploiting superficial cues rather than learning meaningful visual concepts. Texture bias, background correlations, and dataset-specific artefacts allowed systems to perform well on tests while failing under even minor distribution shifts. A cow on a beach or a medical scan from a different hospital could cause catastrophic errors.

From what to why and how

As these limitations became apparent, the focus of AI evaluation began to change. Rather than asking what a model can recognise, researchers started asking why it reached a particular conclusion and how it combined different sources of information. This shift mirrors developments in human education, where rote memorisation has gradually given way to assessments of reasoning and comprehension.

Benchmarks such as Visual Question Answering and GQA marked this transition. Instead of labelling images, models were required to answer natural language questions that demanded spatial reasoning, counting, and relational understanding. These tasks aimed to test whether AI systems could integrate perception with language and logic.

Yet even these reasoning benchmarks proved vulnerable. Large models learned to exploit statistical patterns in questions rather than grounding answers in visual evidence. When answer distributions were deliberately altered, performance collapsed. High scores once again masked fragile reasoning strategies, reinforcing the idea that passing a test does not guarantee understanding.

The problem of shortcut learning in AI systems

One of the central themes of Ravishankara’s survey is shortcut learning. This phenomenon occurs when models rely on spurious correlations that are sufficient to solve a benchmark but irrelevant to the underlying task. Shortcut learning is not unique to artificial intelligence, but its consequences are amplified by scale and automation.

In multimodal AI, shortcuts often arise from language priors. If most questions that begin with a particular phrase have the same answer, models can succeed without analysing the image at all. Diagnostic benchmarks such as VQA-CP (Visual Question Answering under Changing Priors) were explicitly designed to expose this weakness by reversing these statistical patterns between training and testing.

State-of-the-art systems that performed well on standard benchmarks suffered dramatic drops when these shortcuts were removed. The implication is clear: without careful evaluation design, AI systems may appear intelligent while lacking robust reasoning capabilities.
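To make the idea concrete, here is a minimal Python sketch (not taken from the survey) of an image-blind baseline that memorises the most common answer for each question prefix. It looks competent when the training and test sets share the same answer statistics, but collapses when those priors are reversed, in the spirit of changing-priors diagnostics such as VQA-CP. The data and function names are purely illustrative.

```python
from collections import Counter, defaultdict

def train_prior_baseline(examples):
    """Learn the most frequent answer for each question prefix,
    ignoring the image entirely (a deliberate 'shortcut' model)."""
    answers_by_prefix = defaultdict(Counter)
    for question, _image, answer in examples:
        prefix = " ".join(question.lower().split()[:3])  # e.g. "what color is"
        answers_by_prefix[prefix][answer] += 1
    return {p: c.most_common(1)[0][0] for p, c in answers_by_prefix.items()}

def accuracy(model, examples):
    """Score the prior-only model; it never inspects the image."""
    correct = 0
    for question, _image, answer in examples:
        prefix = " ".join(question.lower().split()[:3])
        if model.get(prefix) == answer:
            correct += 1
    return correct / len(examples)

# Toy data: in training, "what color is ..." is usually answered "white".
train = [("What color is the plate?", "img1", "white")] * 8 + \
        [("What color is the plate?", "img2", "blue")] * 2
# A standard split keeps the same answer prior; a changing-priors split flips it.
test_same_prior = [("What color is the cup?", "img3", "white")] * 8 + \
                  [("What color is the cup?", "img4", "blue")] * 2
test_changed_prior = [("What color is the cup?", "img5", "blue")] * 8 + \
                     [("What color is the cup?", "img6", "white")] * 2

model = train_prior_baseline(train)
print("Same priors:   ", accuracy(model, test_same_prior))     # high (0.8)
print("Changed priors:", accuracy(model, test_changed_prior))  # low  (0.2)
```

The baseline never looks at an image, yet it scores 80 per cent when the priors match, which is exactly the kind of inflated number that changing-priors benchmarks are designed to puncture.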

Why reasoning benchmarks also saturate

The survey highlights a recurring pattern in AI evaluation. Each new benchmark initially appears challenging, only to be rapidly saturated by the next generation of models. Benchmarks such as BIG-Bench Hard and later extensions were once considered gold standards for reasoning. Today, leading multimodal large language models achieve near perfect scores on many of these tests.

This saturation is not merely a sign of progress. It reveals a fundamental limitation of static evaluation. When a benchmark becomes a target, models are optimised to perform well on its specific format. Prompt engineering, decoder strategies, and training data overlap can inflate scores without reflecting true generalisation.

The authors invoke Goodhart’s Law to describe this dynamic. When a measure becomes a target, it ceases to be a good measure. In the context of artificial intelligence, accuracy metrics that once signalled genuine advances may now reward test-specific optimisation rather than cognitive ability.

As multimodal AI moves into real products, the biggest risk is mistaking benchmark scores for real-world reliability. Evaluation needs to stress-test models for robustness, compositional reasoning, and failure modes – not just reward high performance on static datasets.

—Mayank Ravishankara (Lead Author)

Multimodal intelligence and expert-level evaluation

Modern AI systems are no longer limited to single modalities. Multimodal large language models process images, text, audio, and video simultaneously. Evaluating such systems requires benchmarks that can assess cross-modal integration, reasoning chains, and explanatory fidelity.

Expert-level benchmarks such as MMBench, SEED-Bench, and MMMU attempt to meet this challenge. They draw on tasks from mathematics, science, and professional exams, pushing models beyond simple perception. These evaluations reflect the growing ambition of AI research to build systems capable of complex reasoning across domains.

However, the survey argues that even these benchmarks rely too heavily on outcome-based metrics. Accuracy alone cannot capture whether a model reached the right answer for the right reasons. Process-based evaluation, including analysis of chain-of-thought reasoning, is increasingly necessary to distinguish genuine understanding from sophisticated pattern matching.
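As a toy illustration (not a method from the paper), the sketch below contrasts an outcome-only metric with a process-aware one that also checks whether the stated reasoning steps mention the evidence a grader requires. The prediction format, field names, and evidence keywords are hypothetical.

```python
def outcome_score(prediction, gold_answer):
    """Outcome-only metric: did the final answer match the gold answer?"""
    return float(prediction["answer"].strip().lower() == gold_answer.lower())

def process_score(prediction, gold_answer, required_evidence):
    """Process-aware metric: the answer must be correct AND every required
    piece of evidence must appear in the stated chain of thought."""
    if prediction["answer"].strip().lower() != gold_answer.lower():
        return 0.0
    reasoning = " ".join(prediction["steps"]).lower()
    grounded = all(evidence.lower() in reasoning for evidence in required_evidence)
    return 1.0 if grounded else 0.0

# Hypothetical model output for "How many chairs are at the table?"
prediction = {
    "answer": "3",
    "steps": ["The image shows a round table.",
              "I count three chairs around it."],
}
print(outcome_score(prediction, "3"))                             # 1.0
print(process_score(prediction, "3", ["table", "three chairs"]))  # 1.0: right answer, grounded
print(process_score(prediction, "3", ["four legs"]))              # 0.0: right answer, ungrounded
```

Even this crude check separates a correct answer reached for the stated reasons from a correct answer that a grader cannot trace back to the evidence.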

The limits of static tests in a dynamic field

A central conclusion of the paper is that static benchmarks are ill-suited to a rapidly evolving AI landscape. As training data grows and models become more powerful, fixed test sets are increasingly vulnerable to contamination and memorisation. Post hoc audits can detect overlap, but they cannot fully restore the diagnostic value of a saturated benchmark.
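A crude version of such a post hoc audit can be sketched in a few lines; the n-gram size, threshold, and toy data below are illustrative assumptions rather than the survey's methodology.

```python
def ngrams(text, n=8):
    """Set of word n-grams in a lightly normalised text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(test_items, training_corpus, n=8, threshold=1):
    """Fraction of test items sharing at least `threshold` word n-grams
    with the training corpus: a rough proxy for train-test overlap."""
    corpus_grams = ngrams(training_corpus, n)
    flagged = sum(
        1 for item in test_items
        if len(ngrams(item, n) & corpus_grams) >= threshold
    )
    return flagged / len(test_items)

training_corpus = "the quick brown fox jumps over the lazy dog near the river bank today"
test_items = [
    "the quick brown fox jumps over the lazy dog near the river",   # overlaps with training
    "a completely novel question about counting chairs in a kitchen scene",
]
print(contamination_rate(test_items, training_corpus))  # 0.5: half the items are flagged
```

Audits like this can flag suspicious test items, but they cannot fully restore the diagnostic value of a benchmark once it has been saturated or leaked into training data.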

To address this, the authors point to emerging approaches such as dynamic evaluation and human-in-the-loop testing. Platforms that continuously generate adversarial examples can stay ahead of model capabilities. Time-sensitive question answering and embodied environments introduce elements that cannot be memorised in advance.

These approaches reflect a broader shift in evaluation philosophy. Intelligence is no longer treated as a static trait that can be measured once and for all. Instead, evaluation becomes an ongoing adversarial process, where benchmarks evolve in response to model behaviour.

Reference

Ravishankara, M., & Maharaj, V. V. P. (2025). The Artificial Intelligence Cognitive Examination: A Survey on the Evolution of Multimodal Evaluation From Recognition to Reasoning. IEEE Access, 14, 2690–2725. https://doi.org/10.1109/ACCESS.2025.3649182

Mayank Ravishankara

Mayank Ravishankara, lead author of the survey, reviewed this article for technical accuracy.

Mayank Ravishankara is a software engineer and AI researcher focused on making multimodal AI systems more reliable and trustworthy. His work spans evaluation of vision-language models, reasoning and robustness testing, and practical frameworks for measuring real-world performance beyond benchmark scores. He has published research on multimodal evaluation and contributes to the research community through peer review and mentoring. He is also building privacy-first productivity and wellness tools that apply AI responsibly in everyday settings.

Key Insights

AI models often ace benchmarks but fail in unpredictable real-world settings.
Static AI tests risk rewarding pattern exploitation over true reasoning.
Shortcut learning lets AI achieve high scores without genuine understanding.
Modern multimodal AI demands evaluation beyond simple accuracy metrics.
Dynamic, adversarial testing may better measure real AI intelligence.
