Daily Southtown

Tiny chips, giant headaches

As computer networks grow more complex, components’ reliability comes under fire

By John Markoff

Imagine for a moment that the millions of computer chips inside the servers that power the largest data centers in the world had rare, almost undetectable flaws. And the only way to find the flaws was to throw those chips at giant computing problems that would have been unthinkable just a decade ago.

As the tiny switches in computer chips have shrunk to the width of a few atoms, the reliability of chips has become another worry for the people who run the biggest networks in the world. Companies like Amazon, Facebook, Twitter and many other sites have experienced outages over the last year.

The outages have had several causes, like programming mistakes and congestion on the networks. But there is growing anxiety that as cloud-computing networks have become larger and more complex, they are still dependent, at the most basic level, on computer chips that are now less reliable and, in some cases, less predictable.

In the past year, researchers at Facebook and Google have published studies describing computer hardware failures whose causes have not been easy to identify. The problem, they argued, was not in the software; it was somewhere in the computer hardware made by various companies. Google declined to comment on its study, and Facebook did not return requests for comment on its own.

“They’re seeing these silent errors, essentially coming from the underlying hardware,” said Subhasish Mitra, a Stanford University electrical engineer who specializes in testing computer hardware. Increasingly, Mitra said, people believe that manufacturing defects are tied to these so-called silent errors that cannot be easily caught.

Researchers worry that they are finding rare defects because they are trying to solve bigger and bigger computing problems, which stresses their systems in unexpected ways.

There is growing evidence that the problem is worsening with each new generation of chips. A report published in 2020 by chip maker Advanced Micro Devices found that the most advanced computer memory chips at the time were about 5.5 times less reliable than the previous generation. AMD did not respond to requests for comment on the report.

Until now, computer designers have tried to deal with hardware flaws by adding special error-correcting circuits to chips.

The circuits automatically detect and correct bad data. Such errors were once considered exceedingly rare. But several years ago, Google production teams began to report errors that were maddeningly difficult to diagnose. Calculation errors would happen intermittently and were difficult to reproduce, according to their report.

A team of researchers attempted to track down the problem, and last year they published their findings. They concluded that the company’s vast data centers, composed of computer systems based upon millions of processor “cores,” were experiencing new errors that were probably a combination of a couple of factors: smaller transistors that were nearing physical limits and inadequate testing.

In their paper “Cores That Don’t Count,” the Google researchers noted that the problem was challenging enough that they had already dedicated the equivalent of several decades of engineering time to solving it.

Increasing complexity in processor design was one important cause of failure, according to Google. But the engineers also said that smaller transistors, three-dimensional chips and new designs that create errors only in certain cases all contributed to the problem.

In a similar paper released last year, Facebook researchers noted that some processors would pass manufacturers’ tests but then begin exhibiting failures in the field.

LEAH NASH/THE NEW YORK TIMES 2018 Large data centers have experienced outages that may be partly the result of chip errors.
