Committee Secretary
Senate Standing Committees on Environment and Communications
Parliament House
Canberra

Brisbane, 16 November 2023
Submission: General Analysis of the Optus Network Outage Incident
I have been closely monitoring the developments of the Optus network outage since the early hours of Wednesday, 8 November. In response to this incident, I have participated in over a dozen TV and radio interviews throughout the following week, providing updates on the situation.
My insights have been reinforced by the valuable input from a global network of colleagues. While occurrences of such incidents are infrequent, they happen with enough regularity to discern patterns and identify underlying causes. The Optus event followed a familiar trajectory, and as more information surfaced, it became increasingly evident that there were fundamental issues at the core of the incident.
The information presented below aims to encourage a thorough inquiry into the root cause of the event, prompting questions about why such disruptions can occur and seeking clarification on the measures Optus has taken or is currently implementing to prevent similar incidents in the future.
In addition to directing these queries to Optus, it is essential to extend the conversation to the broader industry. Exploring the possibility of leveraging alternative networks in emergency situations could be a viable solution to mitigate such problems. The industry already works together in the case of bushfires and other natural disasters, using roaming between mobile networks; there may be similar opportunities for collective industry solutions here.
Given the national impact of the Optus event, resulting in significant economic and public repercussions, it is prudent to scrutinise internal structures within the company. This includes a review of their disaster preparedness plan and an evaluation of the effectiveness of their Board structure.
As we reflect on the aftermath of this network outage, it prompts a broader discussion about industry resilience, emergency response strategies, and the need for continuous improvement to safeguard against future disruptions.
The following assessment has been produced by Senior Network Architect Owen deLong
Assessment by Owen deLong
Most of what I know is third-hand information from people running networks outside of Optus.
Some of them have talked directly to colleagues at Optus, most of them have not.
Probably about 80% of what I have said is best described as somewhere between inference and a very well-educated guess based on the accumulation of that “data”.
My confidence in the accuracy of my description of what happened is about 85% based solely on Optus public statements and about 95+% when I add in the comments from other engineers.
This statement is 100% speculation on my part: Optus is playing an interesting game here, trying to balance blame-shifting the problem to their parent company, appearing to be transparent about what happened, and being very cagey, telling the smallest part of the story they think they can get away with.
With those caveats in place, this is my assessment:
From the outside looking in, based on discussions of technical colleagues on several mailing lists as well as the public statements from Optus, here is what I believe happened:
- One of their peers (parent company Singtel) made a configuration change to their BGP announcements which resulted in the announcement of many additional prefixes to Optus.
- It appears that this increase in prefixes was within the limits on the eBGP peering sessions between Optus and said peer, so the additional prefixes were accepted into the Optus routers that are connected to Singtel (likely a high percentage of Optus border routers).
- These routers then attempted to share those routes internally with other BGP-speaking Optus routers. Apparently, Optus had maximum prefix limits on these internal peering sessions as well, and the number of added prefixes from this peer exceeded the headroom in those limits on the Optus routers.
- Once a maximum prefix limit is reached on a peering session, the receiving peer terminates the session and will not allow it to be re-established without manual intervention (an operator or management system must log into the router and reset the session).
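To make the mechanism above concrete, this is roughly what such per-neighbour limits look like in Cisco IOS-style syntax. The ASN, addresses, and limit values below are illustrative placeholders only; they are not Optus's actual configuration, which has not been made public:

```
router bgp 64500
 ! eBGP session to an upstream/parent network:
 ! generous limit, warning logged at 90% of the limit
 neighbor 192.0.2.1 maximum-prefix 950000 90
 ! iBGP session to an internal route reflector:
 ! a tighter limit with less headroom, the kind of
 ! setting implicated in this incident
 neighbor 10.0.0.2 maximum-prefix 500000 90
```

On platforms with this syntax, the default behaviour when the limit is exceeded is to tear the session down and leave it down until an operator manually clears it (e.g. `clear ip bgp <neighbor>`); an optional `restart <minutes>` keyword would allow automatic re-establishment, which is consistent with the manual-intervention requirement described above.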
All of that is almost certainly fact. However, if that were the end of the story, changing the prefix limit and resetting the sessions would be quick, easy, and relatively painless. Unfortunately, because these limits were triggered on the internal BGP and not the external BGP sessions, there are several likely side effects that are less obvious. ALL of these are pure speculation on my part, but I have high confidence that at least some of them occurred, or the outage would have been resolved much faster…
- Since the Optus routers can no longer receive routes to substantial portions of their own network, they may lose the ability to do things like:
- Authenticate users (TACACS, RADIUS, etc.).
- Be reached directly by network automation systems tasked with rectifying the situation.
- Provide appropriate alerts to monitoring systems about why the peering sessions dropped.
- If administrators and automation tools can’t log into the boxes in question to reset the sessions, life gets a lot more complicated (This almost certainly occurred based on the Optus public comment that resolution in some cases required dispatching technicians to remote router locations).
- Because the network is essentially unable to forward traffic, which also takes down the company's cellular and telecom capabilities, the ability of management and technical staff to communicate with each other is likely severely impacted. Imagine this scenario: your cell phone, your internet, and your land line all stop working at the same time. You need to wake up a hundred technicians all over Australia to get them working on this problem. How do you proceed? This is another area where redundancy needs to be planned and provided for.
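The failure mode described in the assessment can be sketched in a few lines of code. This is a toy model, not router software: it captures only the relevant behaviour, namely that a session tripping its maximum-prefix limit goes down, discards its routes, and stays down until someone manually resets it. The class name, limit values, and route labels are all invented for illustration:

```python
# Toy model of BGP maximum-prefix behaviour (illustrative only):
# exceeding the configured limit tears the session down, and it
# stays down until an operator manually resets it.

class BgpSession:
    def __init__(self, max_prefixes):
        self.max_prefixes = max_prefixes
        self.prefixes = set()
        self.established = True

    def receive(self, new_prefixes):
        """Accept routes from the peer; trip the limit if exceeded."""
        if not self.established:
            return False  # a down session accepts nothing
        self.prefixes.update(new_prefixes)
        if len(self.prefixes) > self.max_prefixes:
            # Limit exceeded: drop the session and discard its routes.
            self.established = False
            self.prefixes.clear()
        return self.established

    def manual_reset(self):
        """Models an operator logging in and clearing the session."""
        self.established = True
        self.prefixes.clear()


# An internal session sized with limited headroom over normal load:
session = BgpSession(max_prefixes=500_000)
session.receive({f"internal-{i}" for i in range(400_000)})
print(session.established)  # True: within the limit

# A peer's configuration change floods in additional prefixes:
session.receive({f"leaked-{i}" for i in range(200_000)})
print(session.established)  # False: limit tripped, session down

# The session does not recover on its own:
print(session.receive({"internal-0"}))  # False

# Only manual intervention restores it:
session.manual_reset()
print(session.established)  # True
```

The last two steps are the crux of the outage timeline: nothing in the model (or in the protocol behaviour it imitates) brings the session back automatically, which is consistent with Optus needing to dispatch technicians to reset routers in person.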
Senior Network Architect