How Rust got back online after its servers literally caught fire
When Rust producer Alistair McFarlane went to bed well past midnight, it seemed like a brief outage of Rust's European servers was annoying and abnormal, but not severe. The Facepunch team in the US would be able to periodically check on the problem and bring the servers back up as soon as they could. "These problems never last more than a few hours," McFarlane told PC Gamer in an email.
Rust is one of the biggest PvP survival games, a mainstay of the genre and of Steam's most-played games list for years now. Each of the 24 servers that had gone down held nearly 10,000 people during peak hours, so more than a few players were inconvenienced.
By the time McFarlane was awake at 7 am, his phone was flooded with messages about what had occurred overnight: A fire at an OVH datacenter had affected their servers. It couldn't be too bad, he figured. "Data centres these days are built to high standards with advanced fire suppression systems," said McFarlane.
When he got to his PC, however, he discovered that it was far worse than he'd imagined. A massive fire had gutted the Strasbourg data center hosting Rust's EU servers.
It was a fire so large that a pump boat was deployed to ensure enough water could flow to the local firefighters. They'd been fighting it through the night and were still pumping water onto the site the next morning. Nobody was injured, but potentially millions of websites were offline.
"I was utterly shocked when I saw the pictures on Twitter of the building fully engulfed on fire," said McFarlane. "It was clear we needed a plan to get the affected servers back online and communicate to our players."
Simply identifying which servers were affected wasn't easy. "I'd guess everyone was trying to log in at once to find any information," he said, which meant that the server company's web panel was nearly unresponsive.
"We confirmed a total loss of our servers and immediately started sourcing new servers," said McFarlane. That was less simple than they'd have liked. "We were not the only ones doing this, stock of OVH servers in other locations began selling out," he continued. Complicating things, Rust's servers see large DDoS attacks regularly, and as a result they're particular about the hardware they use. On top of the sudden scarcity, they couldn't take just any servers.
"We had to settle for OVH Poland," he said. That's some 1,000 kilometers from Strasbourg, the server's original location. It wasn't ideal for Rust's players, or for McFarlane's team. "We'll need to move the servers again in the coming weeks, but it's all we could source with the quantity we required quickly," he said.
Within 11 hours of the first Rust servers going down, the Facepunch team had gotten all of them back up and running. That was just four hours after they became fully aware of the fire. "We're always prepared to have servers go down but never en masse," said McFarlane.
As for the data on the servers, Facepunch had both on-site and offsite backups, but player progression data wasn't backed up offsite. "Lessons learnt," said McFarlane. "We need to ensure player progression data is backed up offsite, and we'll make sure this happens going forward."
"I feel we did everything else right," he continued. "We did the best we could with the time we had, running on the little sleep we had."
Some players were upset that they lost days of progress in a game that runs in real time whether you're logged in or not, but McFarlane said that, in general, players understood that it was an exceptional situation.
McFarlane estimated that around 30 dedicated Rust servers were lost during the fire, though all of those are now back online. It's not yet known how the fire started, or whether OVH was at fault. Regardless, McFarlane says OVH has done well for Facepunch in the past and that it'll continue using OVH in the future.
Fire at the OVH Cloud site in #Strasbourg: around a hundred #Pompiers67 firefighters were mobilized overnight, including the Franco-German fireboat Europa1. The actions taken made it possible to save most of the company's buildings. COD activated. No injuries. pic.twitter.com/RKjI6F9DB7 (March 10, 2021)