BART blames a faulty switch for system shutdown
Backup computer center expected to be fully operational in a couple of months
BART is a couple of months away from activating a new backup computer center that will prevent another network failure like the one last weekend that crippled the entire system for hours, an agency official said Thursday.
During a meeting of the BART board of directors, Tamar Allen, the assistant general manager of operations, said a faulty switch in the computer network was to blame for the systemwide shutdown Saturday.
When the switch failed, there was no backup system in place. Allen told BART directors the transit agency is “very close to completing” a new “redundant center” that would have served as a backup to the computer network that failed Saturday.
“In the event of a failure, the traffic would have been automatically redirected to that center, and we would have been able to keep operating,” Allen said.
The new data center is expected to be built out within a month and fully operational within a couple of months, BART officials said.
BART is also “upgrading its computer hardware and the network infrastructure to take advantage of emerging technology, protocols and standards for data management and cyber security.”
The last time BART experienced this type of failure resulting in a major disruption to service was in March 2006. In that case, the failure was triggered by human error while updating software, BART officials said.
The switch that failed last weekend is part of a complex computer network that runs the operations control center, or OCC for short, located beneath the Lake Merritt station, said BART spokeswoman Alicia Trost. It’s been called the “brains” of the BART system, but is perhaps more closely analogous to its central nervous system, because it dispatches information to trains and their operators in the field.
About 2:45 a.m. Saturday, one switch in a complex system failed, Allen said in a statement.
“Essentially, instead of processing and passing on data, the switch kept recirculating data, generating an unmanageable data spike,” she said. “In this case the number of data packages requiring processing quickly increased from a norm of about 200 to more than 54,000 per millisecond. This overwhelmed the failed switch and had a cascading impact on other switches in the network, resulting in a loss of communication between the Operations Control Center and all systems and devices in the field.”
As a result, no signals passed between the OCC, the automatic train control system, which tells trains when to start and stop and which route to take, and the traction power system, which powers the trains. The trains were supposed to start moving around 6 a.m., but stood still for another three hours while crews worked to fix the glitch. The system was fully operational by around 11 a.m.
Once the system failed Saturday, BART brought in “all of our engineers who work in communications and computer engineering” and several maintenance workers to go into the field and help reset the system, Allen told the board on Thursday.
“It was a large effort, and a lot of people contributed to the successful bringing back up of this system,” she said.
In March of last year, BART carried more than 166,000 passengers, on average, each Saturday, according to the agency.