New Year's Day Network Outages

Written by: Dustin Beason

Hello, and Happy New Year, friends!

As you know, on January 1st, another customer in the datacenter where Rails Machine hosts all of our equipment was hit by a massive, sophisticated, and sustained DDoS attack. This resulted in numerous outages for the entire datacenter, which affected all Rails Machine customers. Below is the RCA we received today for that event. We recognize that events like this create serious problems for you, our customers, and for your own customers, and we take this very seriously. We are also well aware that these issues have become more frequent, which is why, each time they have occurred, we have reached out to our provider for more information and pushed for an expedient solution. Each time, we have been told the issue was being addressed.

I spoke with Zayo’s lead network engineer throughout the New Year’s Day event, and this attack was unlike any previous attack, except that the targeted customer was the same. Because the attack was aimed at that one customer, it was decided that the best way to restore service for the rest of us was to completely remove that customer from the shared network, which they did. Since that time, traffic has returned to normal.

I reached out to Zayo during the event to set up a meeting and found that they have indeed taken steps to address the problem. They had already purchased a large amount of carrier-class equipment to increase bandwidth, harden the routing infrastructure, and help identify and mitigate these issues more quickly in the future. I’ve been informed that the equipment had already been delivered, racked, and cabled, and was scheduled to be rolled into production next week (after the holidays). So while they have done the right things, those actions have taken time to research, purchase, and implement, and during that time the network remained vulnerable. While the threat of DDoS can never be completely eliminated, I do feel that the steps they have taken are moving in the right direction, and I believe the widespread effects can be narrowed with this approach.

Again, our sincerest apologies for the inconvenience, and our deepest appreciation for your patience and understanding. If you have any questions or concerns, please reach out to me directly and I’ll be happy to do whatever I can to help.

Best,

Dustin


Begin RCA —

Reason For Outage (RFO) Report
Ticket Number: TTN-0000892619
Event Start: January 1, 2016, 10:17 AM EST
Event End: January 1, 2016, 6:35 PM EST
Outage Duration: 8 hours 20 minutes from beginning of latency until restoration.
Total Event Duration: 8 hours 20 minutes until normal service was restored.
Services Impacted: Customers on the zColo network in the 1100 South White Street Datacenter.

Outage Summary:
At 10:15 AM EST we began to monitor excessive utilization on one of the backbone connections providing Internet access for the 1100 South White Street Datacenter. Network engineers were on the devices and began looking for sources of the excess traffic. It was determined that a large inbound traffic spike in excess of 10 Gbps was saturating this Internet connection. The attacked IP was identified and the attack mitigated. At 11:15 AM EST traffic returned to normal. At 12:35 PM another attack began against the IP addresses of our Internet connections to each provider. Because the attack was against the specific router addresses, the providers had to be involved to mitigate this traffic. At 3:40 PM the datacenter began operating on the single Zayo connection, as Zayo had implemented filters to prevent the issue. From 4:30 PM to 6:35 PM the remaining providers were gradually restored and any attacks mitigated until the DC network was restored to normal operation.

Outage Sequence:
At 10:17 AM EST January 1st 2016 – Latency increased and some traffic began to see loss due to climbing utilization of one of the datacenter backbone connections.

At 11:15 AM – Successful null routing of the destination IP returned traffic to normal.

At 12:35 PM – New attacks began against the multiple IPs of our upstream Internet connections.

At 12:55 PM – The customer involved in the majority of the attacks was identified.

At 3:40 PM – After working with each provider to get mitigation in place, we brought up the Zayo Internet connection.

At 6:35 PM – After mitigating the attacks and bringing up the remaining Internet connections, all traffic resumed its normal state.
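
For readers who want to check the timeline arithmetic, here is a minimal Python sketch that computes how long each phase of the Outage Sequence lasted. It is not part of the RFO. The timestamps are copied from the sequence above; the phase labels and function names are illustrative shorthand of my own. Measured from the 10:17 AM start of latency, the event comes to 8 hours 18 minutes. The 8:20 figure stated in the report appears to be measured from the 10:15 AM first observation mentioned in the Outage Summary.

```python
from datetime import datetime

# Timestamps copied from the Outage Sequence (all on January 1, 2016, EST).
# The labels are shorthand for this sketch, not wording from the report.
TIMELINE = [
    ("latency begins", "10:17 AM"),
    ("null route restores traffic", "11:15 AM"),
    ("attacks on upstream IPs begin", "12:35 PM"),
    ("attacked customer identified", "12:55 PM"),
    ("Zayo connection brought up", "3:40 PM"),
    ("all traffic back to normal", "6:35 PM"),
]


def parse(clock):
    """Turn a clock string such as '3:40 PM' into a datetime on the event day."""
    return datetime.strptime("2016-01-01 " + clock, "%Y-%m-%d %I:%M %p")


def minutes_between(start, end):
    """Whole minutes elapsed between two clock strings on the event day."""
    return int((parse(end) - parse(start)).total_seconds() // 60)


def fmt(minutes):
    """Format a minute count as hours and minutes."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"


def phase_durations(timeline):
    """Pair each entry with the next one and compute the minutes in between."""
    return [
        (a_label, b_label, minutes_between(a_time, b_time))
        for (a_label, a_time), (b_label, b_time) in zip(timeline, timeline[1:])
    ]


if __name__ == "__main__":
    for start_label, end_label, minutes in phase_durations(TIMELINE):
        print(f"{start_label} -> {end_label}: {fmt(minutes)}")
    total = minutes_between(TIMELINE[0][1], TIMELINE[-1][1])
    print(f"total from first latency to restoration: {fmt(total)}")
```

The longest stretches are the two after the attacked customer was identified, which is consistent with the report's note that the upstream providers had to be involved before links could be brought back.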

Conclusion / Root Cause:
A directed and specific DDoS attack was carried out against one of the Datacenter’s customers. Simple null route mitigation proved insufficient once the attacker brought down the BGP sessions running on the Internet connections. After working with the carriers, we were able to bring each DC uplink online over time. Once all attacks were mitigated and the links were back online, the network resumed normal operation.

Future Mitigation:
zColo is in the process of upgrading all core routing gear to carrier-class equipment in order to reduce the impact of events such as these. While we cannot eliminate the threat of DDoS entirely, we can harden the network to prevent an attack from saturating connections and to minimize the impact on other customers. Not only will this harden the routing infrastructure itself, but the modern systems also offer much more visibility into traffic, allowing quicker and more accurate analysis of attack vectors and traffic types. This will allow zColo not only to identify an attack more quickly but to mitigate it more quickly as well. Finally, combined with this new infrastructure, we are also implementing full out-of-band access to prevent catastrophic events from impeding network management.