Bangkok Post

Artificial intelligence has a measurement problem

Kevin Roose ©2024 The New York Times. Kevin Roose is a technology columnist for The New York Times.

There’s a problem with leading artificial intelligence tools such as ChatGPT, Gemini and Claude: We don’t really know how smart they are. That’s because, unlike companies that make cars or drugs or baby formula, AI companies aren’t required to submit their products for testing before releasing them to the public. There’s no Good Housekeeping seal for AI chatbots, and few independent groups are putting these tools through their paces in a rigorous way.

Instead, we’re left to rely on the claims of AI companies, which often use vague, fuzzy phrases like “improved capabilities” to describe how their models differ from one version to the next. And while there are some standard tests given to AI models to assess how good they are at, say, math or logical reasoning, many experts have doubts about how reliable those tests really are.

This might sound like a petty gripe. But I’ve become convinced that a lack of good measurement and evaluation for AI systems is a major problem.

For starters, without reliable information about AI products, how are people supposed to know what to do with them?

I can’t count the number of times I’ve been asked in the past year, by a friend or a colleague, which AI tool they should use for a certain task. Does ChatGPT or Gemini write better Python code? Is DALL-E 3 or Midjourney better at generating realistic images of people?

I usually just shrug in response. Even as someone who writes about AI for a living and tests new tools constantly, I’ve found it maddeningly hard to keep track of the relative strengths and weaknesses of various AI products. Most tech companies don’t publish user manuals or detailed release notes for their AI products. And the models are updated so frequently that a chatbot that struggles with a task one day might mysteriously excel at it the next.

Shoddy measurement also creates a safety risk. Without better tests for AI models, it’s hard to know which capabilities are improving faster than expected, or which products might pose real threats of harm.

In this year’s AI Index — a big annual report put out by Stanford University’s Institute for Human-Centered Artificial Intelligence — the authors describe poor measurement as one of the biggest challenges facing AI researchers.

“The lack of standardised evaluation makes it extremely challenging to systematically compare the limitations and risks of various AI models,” the report’s editor-in-chief, Nestor Maslej, told me.

One of the most common tests given to AI models today — the SAT for chatbots, essentially — is a test known as Massive Multitask Language Understanding, or MMLU.

The MMLU, which was released in 2020, consists of a collection of roughly 16,000 multiple-choice questions covering dozens of academic subjects, ranging from abstract algebra to law and medicine. It’s supposed to be a kind of general intelligence test — the more of these questions a chatbot answers correctly, the smarter it is.

It has become the gold standard for AI companies competing for dominance.

(When Google released its most advanced AI model, Gemini Ultra, earlier this year, it boasted that it had scored 90% on the MMLU — the highest score ever recorded.)
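
As an illustration of what a score like that actually measures, here is a minimal sketch, not the official MMLU harness, of how a multiple-choice benchmark of this kind is typically scored: the model picks a letter for each question, and the headline number is simply the fraction it gets right. The ask_model function and the sample question below are hypothetical stand-ins for a real chatbot API and the real test items.

```python
# A minimal sketch of how an MMLU-style multiple-choice benchmark is scored.
# ask_model() is a hypothetical stand-in for a real chatbot API call.

QUESTIONS = [
    {
        "question": "Which gas makes up most of Earth's atmosphere?",
        "choices": {"A": "Oxygen", "B": "Nitrogen", "C": "Carbon dioxide", "D": "Argon"},
        "answer": "B",
    },
    # ... the real benchmark has roughly 16,000 such items across dozens of subjects
]

def ask_model(prompt: str) -> str:
    """Hypothetical placeholder: send the prompt to a chatbot and return a letter A-D."""
    return "B"

def score(questions) -> float:
    correct = 0
    for item in questions:
        options = "\n".join(f"{letter}. {text}" for letter, text in item["choices"].items())
        prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
        if ask_model(prompt).strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(questions)  # the headline "MMLU score" is just this accuracy

print(f"Accuracy: {score(QUESTIONS):.0%}")
```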

Dan Hendrycks, an AI safety researcher who helped develop the MMLU while in graduate school at the University of California, Berkeley, told me that the test was never supposed to be used for bragging rights. He was alarmed by how quickly AI systems were improving, and wanted to encourage researchers to take it more seriously.

Mr Hendrycks said that while he thought MMLU “probably has another year or two of shelf life,” it will soon need to be replaced by different, harder tests. AI systems are getting too smart for the tests we have now, and it’s getting more difficult to design new ones.

(The New York Times has sued OpenAI, the maker of ChatGPT, and its partner, Microsoft, on claims of copyright infringement involving AI systems that generate text.)

There may also be problems with the tests themselves. Several researchers I spoke to warned that the process for administering benchmark tests such as MMLU varies slightly from company to company, and that various models’ scores might not be directly comparable.

There is also a problem known as “data contamination”, in which the questions and answers for benchmark tests are included in an AI model’s training data, essentially allowing it to cheat. And there is no independent testing or auditing process for these models, meaning that AI companies are essentially grading their own homework.
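
For illustration only, the toy check below shows the basic idea behind a contamination audit, assuming a purely hypothetical training corpus and question: if a benchmark question appears nearly verbatim in the training text, the model may have memorised the answer rather than reasoned its way to it. Real audits are considerably more sophisticated.

```python
# A deliberately simple illustration of a data-contamination check:
# does a benchmark question appear (near-)verbatim in the training text?
# Real audits use n-gram overlap and fuzzier matching at much larger scale.

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def is_contaminated(benchmark_question: str, training_documents: list[str]) -> bool:
    needle = normalize(benchmark_question)
    return any(needle in normalize(doc) for doc in training_documents)

# Hypothetical example data
training_documents = [
    "Study guide: Which gas makes up most of Earth's atmosphere? Answer: Nitrogen.",
]
question = "Which gas makes up most of Earth's atmosphere?"

print(is_contaminated(question, training_documents))  # True: the model has seen the answer
```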

In short, AI measurement is a mess — a tangle of sloppy tests, apples-to-oranges comparisons and self-serving hype that has left users, regulators and AI developers themselves grasping in the dark.

The solution here is likely a combination of public and private efforts.

Governments can, and should, come up with robust testing programs that measure both the raw capabilities and the safety risks of AI models, and they should fund grants and research projects aimed at coming up with new, high-quality evaluations.

(In its executive order on AI last year, the White House directed several federal agencies, including the National Institute of Standards and Technology, to create and oversee new ways of evaluating AI systems.)

Some progress is also emerging out of academia. Last year, Stanford researchers introduced a new test for AI image models that uses human evaluators, rather than automated tests, to determine how capable a model is. And a group of researchers from the University of California, Berkeley, recently started Chatbot Arena, a popular leaderboard that pits anonymous, randomised AI models against one another and asks users to vote on the best model.
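
As a rough sketch of how such head-to-head votes can become a ranking, the example below applies an Elo-style rating update, the kind of scheme chess uses and broadly similar in spirit to how Chatbot Arena aggregates votes; the model names, starting ratings and K-factor here are arbitrary illustration values, not anything from the actual leaderboard.

```python
# A sketch of turning pairwise votes into a leaderboard with an Elo-style update.
# Ratings and the K-factor are arbitrary illustration values.

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return new ratings after one head-to-head vote between model A and model B."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

ratings = {"model_x": 1000.0, "model_y": 1000.0}  # hypothetical anonymous models

# A user prefers model_x's answer in one anonymous battle:
ratings["model_x"], ratings["model_y"] = elo_update(
    ratings["model_x"], ratings["model_y"], a_won=True
)
print(ratings)  # model_x edges ahead; over many votes a leaderboard emerges
```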

AI companies can also help by committing to work with third-party evaluators and auditors to test their models, by making new models more widely available to researchers and by being more transparent when their models are updated. And in the media, I hope some kind of Wirecutter-style publication will eventually emerge to take on the task of reviewing new AI products in a rigorous and trustworthy way.

Researchers at AI company Anthropic wrote in a blog post last year that “effective AI governance depends on our ability to meaningfully evaluate AI systems.”

I agree. Artificial intelligence is too important a technology to be evaluated on the basis of vibes.

Until we get better ways of measuring these tools, we won’t know how to use them, or whether their progress should be celebrated or feared.

Photo (AFP): The logo of ChatGPT, a language model-based chatbot developed by OpenAI, on a smartphone in Mulhouse, eastern France.
