Albuquerque Journal

A slow neutron beats a flipping fast bit

Bombardment from space may be issue

BY SUZANNE NOWICKI AND NATHAN DEBARDELEBEN

Once every minute and for no good reason, a bit flips in a supercomputer at Los Alamos National Laboratory, causing an error. All of a sudden, say, 1 + 1 = 3.

Uh-oh.

Bits are the basic currency of all digital information. They come in two flavors, zeroes and ones. As a computer does its work, bits are called from disk storage, zip through processors and park temporarily in memory. When a bit randomly jumps from 0 to 1, it might alter a calculation or hide a piece of information. Computer engineers call it a single-event upset or a fault.
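To see how one flipped bit can turn 1 + 1 into 3, here is a minimal Python sketch. The function name `flip_bit` is made up for illustration. It imitates an upset in software by inverting a single bit of a stored number with an exclusive-or; a real upset happens in the hardware, not in a program.

```python
def flip_bit(value: int, position: int) -> int:
    """Return value with the bit at `position` inverted (0 is the lowest bit)."""
    return value ^ (1 << position)


correct = 1 + 1                    # 2, which is 10 in binary
corrupted = flip_bit(correct, 0)   # the lowest bit jumps from 0 to 1, giving 11

print(f"{correct:02b} -> {corrupted:02b}")
print(f"1 + 1 = {corrupted}")
```

Flipping the same bit a second time restores the original value, which is why a fault that is caught in time can be undone.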

These upsets are tripping computers of all sizes more frequently, not just at the lab, but in the broader computing world, too. For Los Alamos, with a dozen-plus supercomputers running jobs vital to national security and other important science missions, single-event upsets are a fact of life because of the density of components.

A fault can play out a few ways. Sometimes nothing happens — the hardware corrects itself. Other times, the program or even the entire system crashes in a detectable event. It’s like a flat tire, wasting lots of time as system administrators restore programs and data. In the worst case, an upset goes unnoticed. A scientific calculation might come back with the wrong answer, but nobody knows. That’s rare, but it happens.

Upsets can result from excessive heat, a voltage spike, or — get ready for it — particles from outer space. Those particles are neutrons, and a team of physicists, space scientists and computer engineers at Los Alamos is researching just how much trouble they cause. Are cosmic-ray neutrons a major culprit or a minor irritant? Understanding that will help the team create strategies for best managing the upsets.

Normally, protons and neutrons stick together to form the nucleus at the center of an atom. The trouble starts when high-energy cosmic rays — mostly protons from remote cosmic cataclysms — knock neutrons and other particles loose from atoms in the atmosphere. Every hour, eighty-some cosmic-ray neutrons strike a surface the size of a computer’s central processing unit, or CPU.

Most neutrons miss the nuclei of atoms in a CPU and pass right through. Eventually, though, a neutron hits a nucleus. If it’s a very high-energy, or fast, neutron, it bounces the struck nucleus right out of its home in the silicon chip. An upset occurs, corrupting data. Upsets are more likely to happen in supercomputers because they are densely packed with tens of thousands of CPUs — that’s what makes them super.
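A rough back-of-the-envelope calculation shows why almost all neutrons must pass through harmlessly. The sketch below uses two figures from this article, about 80 neutron strikes per hour on a CPU-sized surface and one bit flip a minute. The CPU count of 20,000 is an assumption standing in for "tens of thousands," and the comparison ignores the fact that not every fault is caused by a neutron.

```python
NEUTRONS_PER_CPU_PER_HOUR = 80   # "eighty-some" in the article, rounded down
ASSUMED_CPU_COUNT = 20_000       # assumption: a stand-in for "tens of thousands"
FAULTS_PER_HOUR = 60             # one bit flip a minute, from the article

# Total neutrons crossing all CPU surfaces in the machine each hour.
strikes_per_hour = NEUTRONS_PER_CPU_PER_HOUR * ASSUMED_CPU_COUNT

# How many neutrons arrive for each fault that is actually observed.
strikes_per_fault = strikes_per_hour / FAULTS_PER_HOUR

print(f"Neutron strikes per hour across the machine: {strikes_per_hour:,}")
print(f"Strikes for every observed fault: {strikes_per_fault:,.0f}")
```

With these inputs the machine sees about 1.6 million strikes an hour, or roughly 27,000 for every fault, so only a tiny fraction of neutrons can be doing damage.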

As the backbone of the nation’s Stockpile Stewardship Program, the largest Los Alamos supercomputers hum away night and day in a data center the size of a football field. Their primary job is running the physics simulations related to assuring the nuclear stockpile is a safe, secure and effective deterrent in the absence of nuclear testing, but they also run jobs for a wide range of other scientific disciplines. Engineers track the errors. Often when an upset strikes during the billions of calculations happening every second, complex engineering in the hardware and software corrals the problem. But that can mean lost time and diminished productivity.

As the Los Alamos team began studying single-event upsets, they saw more faults at the top of the vertically stacked component racks than at the bottom. They wondered, was it because the computers are cooled from the bottom to the top? Or are the top racks exposed to more fast neutrons while also, in effect, shielding the lower racks? Is it both?

The Lab team approached the problem from a few angles. They needed to measure and benchmark error rates caused by fast neutrons, so they bombarded the computer parts in the neutron beam at the Los Alamos Neutron Science Center (LANSCE). Another part of the team purchased neutron detectors the size of a one-liter soda bottle and will soon start using them in the supercomputing center to measure the background fast-neutron rates. Others on the team are applying a computer code developed for modeling nuclear physics to study how cosmic rays interact with computers and buildings, which will help the team understand how the neutrons from outer space interact with the supercomputing center and the computers in it.

Using data from the detectors, the team will compare the number of neutrons hitting the computers to the number of faults in the componentry. Information from the LANSCE tests will tell the team how many of those faults were likely caused by fast neutrons. Having a more complete picture of the neutrons and their impact will support developing new ways of detecting the faults, blocking them and cleaning up the computer systems afterwards.
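The comparison described above can be sketched in a few lines of Python. This is not the team's code, and every number in it is an invented placeholder rather than a measurement. The names `upsets_per_neutron`, `expected_neutron_faults` and `neutron_share` are hypothetical. The idea is that a beam test yields the chance that a single neutron causes an upset, the detectors yield how many neutrons arrive, and multiplying the two gives the number of faults neutrons should account for.

```python
def expected_neutron_faults(neutrons_per_hour: float, upsets_per_neutron: float) -> float:
    """Faults per hour that neutrons alone would be expected to cause."""
    return neutrons_per_hour * upsets_per_neutron


def neutron_share(expected: float, observed: float) -> float:
    """Fraction of observed faults attributable to neutrons, capped at 100 percent."""
    if observed <= 0:
        raise ValueError("observed fault count must be positive")
    return min(expected / observed, 1.0)


# Placeholder inputs, not measurements.
neutrons_per_hour = 1_600_000       # would come from the detectors
upsets_per_neutron = 1e-5           # would come from the beam test
observed_faults_per_hour = 60       # would come from the machine's error logs

expected = expected_neutron_faults(neutrons_per_hour, upsets_per_neutron)
share = neutron_share(expected, observed_faults_per_hour)

print(f"Expected neutron-caused faults per hour: {expected:.1f}")
print(f"Estimated share of all faults: {share:.0%}")
```

Whatever share is left over would point to the other causes named earlier, such as heat or voltage spikes.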

Keeping the Los Alamos supercomputers running at peak efficiency directly supports national security. These single-event upsets will be everyone’s problem someday as miniaturization increases the density of CPUs and memory chips. The upward trending curve of errors gets steeper as computers become more widespread in our phones, tablets, smart-house systems, airplanes, cars, all the controllers in the internet of things — the list seems endless. Gaining a better understanding of this neutron bombardment from space is a strong first step toward keeping our digital world humming along.

Suzanne Nowicki is a nuclear physicist in the Space Science and Applications group at Los Alamos who designs instruments for spacecraft. Nathan DeBardeleben is a computer engineer in High Performance Computing Design who studies resilience and radiation effects in high-performance computing systems.

COURTESY OF LOS ALAMOS NATIONAL LABORATORY: Nathan DeBardeleben and Suzanne Nowicki
