GitLab catastrophic error
Developers and companies that rely on GitLab are left without data after a serious failure.
GitLab is a popular and important resource for many companies and developers, such as Intel and Red Hat, so when the service went offline at the end of January 2017 for what the GitLab Status Twitter account called ‘emergency database maintenance’, many people were understandably concerned. A later tweet clarified the problem: “We accidentally deleted production data and might have to restore from backup.” It was not a terribly reassuring message, and after six hours of downtime and concerns over data loss, a better picture of what happened started to emerge.
A detailed blog post by GitLab (http://bit.ly/GitLabDBIncident) explained that the first incident occurred on January 31 at 6pm, when a number of spammers began attacking the database by creating snippets, making it unstable. Troubleshooting began in earnest, but three hours later the attacks escalated, locking up writes to the database and bringing it down. An hour later, a second incident occurred: although the spammers had been blocked, database replication was lagging too far behind and essentially stopped due to a spike in writes.
Around an hour later a third incident occurred, in which backups were failing. Unfortunately, while trying to fix the problem, an employee accidentally removed a directory on the wrong database, leading to more data loss. At this point GitLab was taken offline.
This unfortunate combination of spammers, software problems and human error turned into a rather alarming problem, and it wasn’t helped by the fact that snapshots and backups were taken only once every 24 hours. While some of the data was recovered, GitLab learned some harsh lessons, and they are ones we can all learn from: make backups regularly, keep those backups safe, and try not to leave overworked and tired engineers to fix problems. To GitLab’s credit, it was transparent about what was going on, both with its in-depth blog post detailing what happened and with regular Twitter updates during the incident that kept its users informed.