
### Unveiling the Significance of AI Metrics: A Comprehensive Analysis

The most commonly used AI benchmarks haven't been adapted or updated to reflect how models are actually used today.

On Tuesday, the startup Anthropic introduced a new series of generative AI models, claiming they have achieved top-notch performance. Shortly after, Inflection AI, a rival company, revealed its own model, suggesting that it comes close to rivaling some of the most advanced models available, such as OpenAI’s GPT-4.

Both Anthropic and Inflection are among the many AI companies asserting that their models beat the competition by some objective measure. Google made a similar case for its Gemini models, as did OpenAI for GPT-4 and its predecessors, including GPT-3, GPT-2, and GPT-1.

However, what exactly do these claims of state-of-the-art performance or quality entail? And more importantly, will a technically superior model actually translate to a noticeably enhanced user experience?

The answer to the latter question is likely no.

The issue lies in the benchmarks utilized by AI companies to measure a model’s strengths and weaknesses.

Current benchmarks for AI models, particularly those powering chatbots like OpenAI’s ChatGPT and Anthropic’s Claude, often fail to capture how the average individual interacts with these models. For instance, one benchmark mentioned by Anthropic, GPQA (“A Graduate-Level Google-Proof Q&A Benchmark”), consists of numerous PhD-level questions in fields like biology, physics, and chemistry. Yet, most users engage with chatbots for tasks such as email responses, cover letter writing, and personal conversations.

Jesse Dodge, a scientist at the Allen Institute for AI, highlighted that the industry is facing an “evaluation crisis.”

He explained, “Benchmarks tend to be static and narrowly focused on assessing a single capability, like a model’s accuracy in a specific domain or its aptitude for solving multiple-choice questions on mathematical reasoning.” Dodge emphasized that many benchmarks are outdated, originating from a time when AI systems were primarily used for research purposes and had limited real-world applications. Moreover, as generative AI models are increasingly positioned for mass-market use, these old benchmarks become less relevant.

David Widder, a postdoctoral researcher at Cornell specializing in AI and ethics, pointed out that many common benchmarks test skills that are not applicable to the majority of users, such as solving elementary math problems or identifying anachronisms in sentences.

Widder added that evaluating a model directly on the specific task it will be used for becomes harder as models grow more general: "As systems are increasingly seen as 'general purpose,' this is less possible, so we increasingly see a focus on testing models on a variety of benchmarks across different fields."

Apart from misalignment with user scenarios, there are doubts about whether some benchmarks accurately measure what they claim to assess.

For instance, an analysis of HellaSwag, a test for evaluating commonsense reasoning in models, revealed that over a third of the questions contained errors and nonsensical content. Similarly, MMLU (Massive Multitask Language Understanding), a benchmark used by companies like Google, OpenAI, and Anthropic, was criticized for asking questions that could be answered through memorization rather than genuine understanding.

Widder explained, “[Benchmarks like MMLU] are more about memorization and associating keywords, rather than true comprehension or reasoning abilities.”
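To see why a high multiple-choice score can coexist with shallow "understanding," it helps to look at how such benchmarks are typically scored. The sketch below is illustrative only: the two question items are invented, and the "model" is a deliberately naive keyword matcher standing in for a real system, not any vendor's actual evaluation harness. The scoring loop simply checks whether the returned letter matches the answer key, so it cannot tell reasoning apart from memorization or keyword association.

```python
# Illustrative sketch of MMLU-style multiple-choice scoring.
# The questions are made up, and keyword_model() is a naive stand-in
# for a real language model.

questions = [
    {
        "question": "Which gas makes up most of Earth's atmosphere?",
        "choices": {"A": "Oxygen", "B": "Nitrogen", "C": "Carbon dioxide", "D": "Argon"},
        "answer": "B",
    },
    {
        "question": "Nitrogen is roughly what fraction of the atmosphere by volume?",
        "choices": {"A": "21%", "B": "78%", "C": "1%", "D": "50%"},
        "answer": "B",
    },
]

def keyword_model(question: str, choices: dict[str, str]) -> str:
    """Stand-in 'model': picks the choice sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(choices, key=lambda letter: len(q_words & set(choices[letter].lower().split())))

def accuracy(items, answer_fn) -> float:
    correct = 0
    for item in items:
        predicted = answer_fn(item["question"], item["choices"])
        # The benchmark only compares the chosen letter with the answer key;
        # it has no way to check how the answer was produced.
        correct += predicted == item["answer"]
    return correct / len(items)

if __name__ == "__main__":
    print(f"Accuracy: {accuracy(questions, keyword_model):.2f}")
```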

The flaws in existing benchmarks raise the question: Can they be rectified?

Dodge believes that a combination of evaluation benchmarks and human assessment could pave the way for improvement.

He suggested, "The right approach is a blend of evaluation benchmarks and human assessment, where a model is presented with a real user query and then rated by a human evaluator based on the quality of the response."
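As a rough illustration of what that blend might look like in practice, here is a minimal sketch of a human-rating loop: a model answers a real user query, and a human evaluator scores the response. The get_model_response() placeholder, the sample queries, and the 1-to-5 rating scale are assumptions made for this example, not part of Dodge's proposal or any established protocol.

```python
# Minimal sketch of human assessment of model responses to real user queries.
# get_model_response() is a placeholder for whatever model is being evaluated.

from statistics import mean

def get_model_response(query: str) -> str:
    """Placeholder: replace with a call to the model under evaluation."""
    return "(model response would appear here)"

def human_rating(query: str, response: str) -> int:
    """Ask a human evaluator to rate response quality on a 1-5 scale."""
    print(f"\nUser query: {query}\nModel response: {response}")
    while True:
        raw = input("Rate the response from 1 (poor) to 5 (excellent): ")
        if raw.isdigit() and 1 <= int(raw) <= 5:
            return int(raw)
        print("Please enter a whole number between 1 and 5.")

def evaluate(real_user_queries: list[str]) -> float:
    """Return the mean human rating across a sample of real queries."""
    scores = []
    for query in real_user_queries:
        response = get_model_response(query)
        scores.append(human_rating(query, response))
    return mean(scores)

if __name__ == "__main__":
    sample_queries = [
        "Help me reply politely to this email declining a meeting.",
        "Draft a short cover letter for a junior data analyst role.",
    ]
    print(f"Mean human rating: {evaluate(sample_queries):.2f}")
```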

Widder, for his part, is skeptical that current benchmarks can be improved enough to be genuinely informative, even if obvious flaws like typos are fixed. He proposed that evaluations should instead focus on models' real-world impacts and on whether the people affected perceive those impacts as beneficial.

He concluded, “We need to assess specific contextual goals for AI models and evaluate their success in those contexts. Moreover, we should evaluate whether AI should be used in such contexts at all.”
