Denver was ground zero for CenturyLink’s recent chaotic network outage

The Denver Post - FRONT PAGE - By Aldo Svaldi

ATMs failed in Idaho, Wyoming delayed lottery results, and 911 call centers in Washington, Arizona, Missouri and other states struggled with busy signals, dropped calls and missing location information.

At the Northern Colorado Medical Center in Greeley, staff couldn’t access vital patient records online. And in parts of New Mexico and Montana, Verizon faced service disruptions through no fault of its own.

Press reports have linked a long list of troubles to network problems suffered by telecommunications company CenturyLink, based in Monroe, La., two days after Christmas.

For about 30 hours, from the early morning hours of Dec. 27 until late on Dec. 28, chaos reigned on CenturyLink’s system. Western states that depend most heavily on the company’s fiber-optic system were hardest hit, but reports of outages and slower speeds came in from Alaska to Florida, according to downdetec

“CenturyLink experienced a network event on one of our six transport networks beginning on December 27 that impacted voice, IP, and transport services for some of our customers. The event also impacted CenturyLink’s visibility into our network management system, impairing our ability to troubleshoot and prolonging the duration of the outage,” the company said in a statement.

Technicians scrambled to pinpoint the root cause and lost time on fixes that didn’t work. New Orleans was an early suspect as ground zero, and then it was San Antonio. Teams, which had to make physical site visits, went into action in Kansas City, Mo., and then Atlanta, and so on.

But as they tried fixes in different areas, the problem didn’t go away. Making matters worse, the reporting system that gathered customer complaints also failed.

The source of all that turmoil and hours of angst for affected customers came down to one piece of equipment: a faulty third-party network management card in Denver, according to the company.

But how could one bad piece of equipment in Denver disrupt internet and phone service in large swaths of the country and impair critical services to thousands of customers for hours on end? And could it happen again?

Those are two questions the Federal Communications Commission, which has launched an investigation, wants answered, not to mention state utility regulators, computer scientists and irate customers.

A sorcerer’s apprentice

In the classic Disney film “Fantasia,” Mickey Mouse casts a spell on a broom to get it to carry the water buckets that he, as the apprentice, is using to fill a cistern for the sorcerer, who has just left the room.

Mickey then falls asleep and things go horribly wrong. The broom carries way too much water. Waking and realizing his predicament, Mickey tries to smash the broom to pieces. But the splinters turn into dozens of new brooms, carrying hundreds of buckets of water. The chamber gets flooded.

Computer scientists borrowed the term “Sorcerer’s Apprentice Syndrome” to describe what happens when a part of a network sends out “packets” of bad information that then get replicated and sent out over and over, said Craig Partridge, chair of the computer science department at Colorado State University in Fort Collins and a member of the Internet Hall of Fame.

Eventually, the system gets bogged down and can crash until the source of the problem is identified and the bad packets, which can keep ricocheting around, are cleared out of the system.

“The packet has a mistake. It thinks it is supposed to make lots of copies and send it anywhere. It then overloads the whole network,” said Partridge.

Partridge said he doesn’t have any specific knowledge of this outage, but based on public reports, CenturyLink appears to have suffered from a well-known problem that has plagued digital networks since their earliest days.

CenturyLink said the card was propagating “invalid frame packets” that were sent out over its secondary network, which controlled the flow of data traffic.

Here is a description of the Sorcerer’s Apprentice Syndrome at work, in the more technical terms provided by the company:

“Once on the secondary communication channel, the invalid frame packets multiplied, forming loops and replicating high volumes of traffic across the network, which congested controller card CPUs (central processing units) network-wide, causing functionality issues and rendering many nodes unreachable,” the company said in a statement.

Once the syndrome gets going, it can be difficult to stop and to trace back to its original source, a big reason networks are designed to isolate failures early and contain them.
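The replication the company describes can be illustrated with a toy model. The sketch below is a minimal Python simulation, not CenturyLink’s topology or the behavior of its equipment: the four-node full mesh (`TOPOLOGY`), the `flood` function and its `drop_invalid` switch are hypothetical names introduced only for illustration. Each node forwards every copy it receives to all neighbors except the one that sent it, so the number of copies in flight keeps growing. With the switch on, receiving nodes discard the bad frame instead of forwarding it.

```python
# Hypothetical four-node full mesh; an assumption for illustration only.
TOPOLOGY = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "C"],
}


def flood(source, rounds, drop_invalid=False):
    """Return the number of frame copies in flight after each round."""
    # Each entry is (node holding a copy, node the copy came from).
    in_flight = [(source, None)]
    history = []
    for _ in range(rounds):
        next_round = []
        for node, came_from in in_flight:
            if drop_invalid and came_from is not None:
                # A filter at the receiving node discards the bad frame.
                continue
            for neighbor in TOPOLOGY[node]:
                if neighbor != came_from:
                    # Forward a copy to every neighbor except the sender.
                    next_round.append((neighbor, node))
        in_flight = next_round
        history.append(len(in_flight))
    return history


if __name__ == "__main__":
    print(flood("A", 5))                     # unchecked replication
    print(flood("A", 5, drop_invalid=True))  # copies filtered on receipt
```

In this model, `flood("A", 5)` returns `[3, 6, 12, 24, 48]`, doubling every round, while `flood("A", 5, drop_invalid=True)` returns `[3, 0, 0, 0, 0]` because the copies die at the first hop.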

“We have learned through experience about these different types of failure modes. We build our systems to try and localize those failures,” Partridge said. “I would hope that what is going on is that CenturyLink is trying to understand why a relatively well-known failure mode has bit them.”

To resolve the problem, CenturyLink said it removed the network card at fault, disabled the channels that allowed for invalid traffic to get replicated across its network, and put in filters to catch the bad data.

It set up a more intense monitoring plan to spot problems faster and to terminate rogue packets before they can propagate. That took care of the bulk of problems, but a small group of customers had issues that were fixed case by case into a third day.

“CenturyLink teams worked around the clock until the issue was resolved,” said spokeswoman Linda Johnson. CenturyLink, which purchased Qwest Communications and Level 3 Communications, is an important employer in metro Denver.

A question of trust

When an airplane crashes, federal investigators will look for the black box and painstakingly reassemble every piece they can find to determine precisely what went wrong. If it was a mechanical issue, an order will go out on an inspection, fix or replacement. If it was a pilot error, new training rules are put in place.

The nation’s vital communication networks, however, are much less regulated than the airways and power grid. Even if similar protocols were in place after a failure, problems in the flow of light packets and voice signals are much more ephemeral and tougher to pin down.

“It is so unlikely they can reproduce the situation,” said Dirk Grunwald, a professor of computer science at the University of Colorado Boulder, who has witnessed scenarios where problematic components get plugged back in and work fine.

All hell might have broken loose because one bit of information in a packet came in sequence with another specific bit while the card was operating at a certain speed. A few milliseconds later or at a slightly different speed and the wicked spell may not have been cast, Grunwald said.

A more pertinent line of investigation would be why the card didn’t signal it was having problems and take itself out of the game as it was supposed to. The card was also encapsulating the faulty data, which allowed it to keep moving across the network, an issue the outside vendor is trying to understand, according to CenturyLink.

Beyond that, why didn’t other network safeguards keep the problem from getting out of hand?

Dan Massey, a computer science professor at the University of Colorado Boulder, said networks operate from an implicit assumption of trust as they communicate: “Be conservative in what you send and liberal in what you accept.”

Components assume the information they are receiving is coming from good players, not rogue or defective ones.

Most of the time, pick up a phone or go online and the process is smooth and seamless. What isn’t readily known is that technicians are constantly chasing problems and replacing parts and the system is making adjustments. It might even happen in the middle of a call, without a blip.

What networks struggle with is when a component goes bad but pretends to be normal, a failure known as a Byzantine fault. If that fault happens in the “control plane,” the system that manages the flow of data and the problem detection systems, then things can spiral down quickly, Massey said.

Imagine cars on the road as bundles of information moving to where they need to go. If too many cars are in motion, then traffic will crawl to a halt. There might even be an accident. But communications networks are designed with a lot of spare capacity and an ability to clear accidents quickly and reroute traffic when jams appear.

That’s if the control plane is working. Now imagine if the traffic lights start acting erratically, like turning all the lights at an intersection red, or even worse, all of them green. That is a simplified way of describing the chaos CenturyLink technicians were dealing with.

But it didn’t take everything down. One of six transport systems in CenturyLink’s network had problems, according to the company. That is why customers in Greeley and some mountain towns reported issues, while many customers in Denver and other areas didn’t notice anything amiss.

Don’t fail with 911

It is one thing if people can’t play Fortnite or binge “The Marvelous Mrs. Maisel” because of slow speeds. It is an entirely different problem when 911 calls are disrupted, a reason CenturyLink is now facing an investigation from the FCC.

Johnson said that 911 calls were “largely completed” but in some cases, the location information didn’t tag along. But press reports say some callers to 911 centers faced busy signals and dropped calls. Utility regulators in Wyoming and Washington state have said they will launch inquiries.

“The Colorado PUC has not opened its own investigation. However, the FCC has asked the states to help it gather information regarding the extent and impact of the outages, and PUC staff is assisting with the FCC’s investigation,” said Terry Bote, a spokesman for the state’s utility regulator.

Massey, who worked on cybersecurity issues at the Department of Homeland Security before joining CU, said most states have invested very little in cybersecurity and other safeguards when it comes to 911 centers. They are not as failproof as they need to be.

The transition from analog to digital has left the nation’s 911 call centers much more capable, allowing them to better handle calls from mobile phones and even signals from automobiles involved in a crash. But it has also left those centers much less robust, as the problems on Dec. 27 showed.

Partridge said a deeper examination may show CenturyLink was doing everything right and it was hit by an entirely new and unexpected kind of failure. If so, the company, its vendors, and the computer science community will work on fixes.

But if an old-style Sorcerer’s Apprentice Syndrome was at fault, then blaming an outside party won’t fly.

“The network should not be so fragile that when you install third-party equipment and it fails, your network fails. Your network needs to be robust. That is standard operating procedure,” he said.

Illustration by Jeff Neumann, The Denver Post; photo by Thinkstock for Getty Images
