The Mercury News Weekend

BART blames a faulty switch for system shutdown

Backup computer center expected to be fully operational in a couple of months

By Erin Baldassari and Mark Gomez, staff writers

BART is a couple of months away from activating a new backup computer center that will prevent another network failure like the one last weekend that crippled the entire system for hours, an agency official said Thursday.

During a meeting of the BART board of directors, Tamar Allen, the assistant general manager of operations, said a faulty switch in the computer network was to blame for the systemwide shutdown Saturday.

When the switch failed, there was no backup system in place. Allen told BART directors the transit agency is “very close to completing” a new “redundant center” that would have served as a backup to the computer network that failed Saturday.

“In the event of a failure, the traffic would have been automatically redirected to that center, and we would have been able to keep operating,” Allen said.

The new data center is expected to be built out within a month and fully operational within a couple of months, BART officials said.

BART is also “upgrading its computer hardware and the network infrastructure to take advantage of emerging technology, protocols and standards for data management and cyber security.”

The last time BART experienced a failure of this type that caused a major service disruption was in March 2006. In that case, the failure was triggered by human error during a software update, BART officials said.

The switch that failed last weekend is part of a complex computer network that runs the operations control center, or OCC for short, located beneath the Lake Merritt station, said BART spokeswoman Alicia Trost. It’s been called the “brains” of the BART system, but is perhaps more closely analogous to its central nervous system, because it dispatches information to trains and their operators in the field.

About 2:45 a.m. Saturday, one switch in a complex system failed, Allen said in a statement.

“Essentially, instead of processing and passing on data, the switch kept recirculating data, generating an unmanageable data spike,” she said. “In this case the number of data packages requiring processing quickly increased from a norm of about 200 to more than 54,000 per millisecond. This overwhelmed the failed switch and had a cascading impact on other switches in the network, resulting in a loss of communication between the Operations Control Center and all systems and devices in the field.”

As a result, no signals passed between the OCC, the automatic train control system, which tells trains when to start and stop and which route to take, and the traction power system, which powers the trains. The trains were supposed to start moving around 6 a.m. but stood still for another three hours while crews worked to fix the glitch. The system was fully operational by around 11 a.m.

Once the system failed Saturday, BART brought in “all of our engineers who work in communications and computer engineering” and several maintenance workers to go into the field and help reset the system, Allen told the board on Thursday.

“It was a large effort, and a lot of people contributed to the successful bringing back up of this system,” she said.

In March last year, there were more than 166,000 passengers, on average, each Saturday, according to BART.
