Linux Format

GitLab catastrophic error

Developers and companies that rely on GitLab are left without data after a serious failure.


GitLab is a popular and important resource for many companies and developers, such as Intel and Red Hat, so when the service went offline at the end of January 2017 for what the GitLab Status Twitter account called ‘emergency database maintenance’, many people were understandably concerned. A later tweet offered more clarification about the problem: “We accidentally deleted production data and might have to restore from backup.” Not a terribly reassuring message and, after six hours of downtime and concerns over data loss, a better picture of what happened started to emerge.

A detailed blog post by GitLab (http://bit.ly/GitLabDBIncident) explained that the first incident occurred on January 31 at 6pm, when a number of spammers began attacking the database by creating snippets, making it unstable. Troubleshooting began in earnest, but three hours later the attacks escalated, locking up writes to the database and bringing it down. An hour later a second incident occurred: although the spammers had been blocked, database replication had fallen too far behind and essentially stopped because of a spike in writes.

Around an hour later a third incident occurred, where backups were failing. Unfortunately, an employee accidentally removed a directory on the wrong database while trying to fix the problem, leading to more data loss. At this point GitLab was taken offline.

This unfortunate combination of spammers, software problems and human error turned into a rather alarming situation, and wasn’t helped by the fact that snapshots and backups are only taken once every 24 hours. While some of the data was recovered, GitLab learned some harsh lessons, and they are ones we can all learn from: make backups regularly, keep those backups safe and avoid leaving overworked, tired engineers to fix problems. To GitLab’s credit, it was transparent about what was going on, both through its in-depth blog post detailing what happened and through regular Twitter updates that kept its users informed during the incident.

GitLab ran into a series of problems that most of us dread happening to us. Let’s just hope lessons were learnt.
