The Enemy of Uptime: An Expired SSL Certificate

Every day thousands of SSL certificates expire. In most cases, these seemingly insignificant moments go unnoticed. Our websites stay secure, our servers keep humming, and it’s business as usual.

Then it happens. Another SSL certificate expires. Only this time, someone forgot to renew it. What happens next? In this blog, we’ll unpack why certificates expire, what happens when they do, and how to respond effectively.

2021 Ponemon Report State of Machine Identity Management

SSL certificates are an important factor in securing websites, applications, and machine-to-machine connections but, because of their widespread use, they are also incredibly difficult to track.

Why keep track of certificates? It comes down to one event – certificate expiration.

Every SSL certificate has a set expiry date. They used to last up to five years, then it was two. Now publicly issued certificates are only valid for 398 days. Unlike your Netflix subscription that auto-renews as you get your stream on, SSL certificates don’t renew themselves. That responsibility typically falls on the PKI or security team.
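As a rough illustration of what tracking an expiry date involves, here is a minimal Python sketch that uses only the standard library. The function names (`parse_not_after`, `days_remaining`, `fetch_not_after`) are our own for this example and are not part of any Keyfactor product. Note that `fetch_not_after` opens a live TLS connection, and with default verification the handshake will fail outright if the certificate has already expired.

```python
import socket
import ssl
from datetime import datetime, timezone
from typing import Optional


def parse_not_after(not_after: str) -> datetime:
    """Convert a certificate 'notAfter' string (e.g. 'Apr  6 12:00:00 2021 GMT')
    into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def days_remaining(not_after: str, now: Optional[datetime] = None) -> int:
    """Whole days until expiry; the value is negative once the certificate has expired."""
    now = now or datetime.now(timezone.utc)
    return (parse_not_after(not_after) - now).days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the 'notAfter' field from the certificate a server presents.
    Illustrative helper: requires network access and a certificate that still validates."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling `days_remaining(fetch_not_after("example.com"))` on a schedule is the simplest form of monitoring; the hard part, as the rest of this post argues, is knowing which hosts to ask.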

If an organization is using just a handful of certificates, they may track and renew them manually. The reality, though, is that most companies have thousands upon thousands of SSL certificates. Keeping track of where these certificates reside, who owns them, and when they expire becomes extremely difficult to do manually at scale.

So why all the trouble? Why must certificates expire? And what happens when they do?

The importance (and risk) of certificate expiration

Certificate expiration is a good thing. For the same reason we renew our passport, personal ID, and passwords, certificates have an expiration date and must be renewed after a set validity period to ensure they are accurate, up-to-date, and trusted.

Certificates with shorter lifespans (90 days or less) are generally considered safer and better suited to today’s fast-changing environments, because a compromised key or outdated certificate details remain valid for less time. However, these short-lived certificates also increase the burden on teams responsible for issuing and renewing them.
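To get a feel for that added burden, here is a back-of-the-envelope sketch in Python. It assumes, for simplicity, that every certificate is renewed exactly once per validity period; in practice teams renew earlier, so real numbers would be higher. The function name and the 1,000-certificate figure are illustrative only.

```python
def renewals_per_year(cert_count: int, validity_days: int) -> float:
    """Estimated renewals per year, assuming one renewal per validity period."""
    if validity_days <= 0:
        raise ValueError("validity_days must be positive")
    return cert_count * 365 / validity_days


# Compare the 398-day maximum with 90-day certificates for 1,000 certificates.
for days in (398, 90):
    estimate = renewals_per_year(1000, days)
    print(f"{days}-day certificates: about {estimate:.0f} renewals per year")
```

Under these assumptions, moving from 398-day to 90-day certificates takes 1,000 certificates from about 917 renewals a year to about 4,056, which is more than four times the work if every renewal is manual.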

Outages happen when certificate expiration goes unnoticed, and website owners forget to renew the certificate in time – despite multiple emails and warning messages. The severity and impact of these certificate outages can range from a single user unable to access Wi-Fi to a global network or service disruption that impacts millions of customers.

To put things in perspective, Keyfactor recently kicked off a report on the State of Machine Identity Management. In the report, we asked respondents how often they experience certificate-related outages and what impact they have on their organization. Here is what we found:

  • 88% of companies continue to experience unplanned outages due to expired certificates
  • On average, they experienced more than 3 certificate outages within the past two years
  • 40% of respondents say their organization has a high likelihood of experiencing more outages
  • 59% of respondents say they are worried about the increased risk of outages due to shorter SSL/TLS lifespans

One expired certificate, one Epic outage

Despite the importance of up-to-date firmware to the safe and reliable operation of your fleet of IoT connected devices, firmware is a commonly unprotected attack surface that adversaries exploit to intrude into networks, compromise data, or even take control of devices to cause harm or disruption. Bottom line: insecure firmware equates to an insecure IoT device.

Cybercriminals are eager to exploit weaknesses and gaps in IoT security, not always to attack the devices themselves, but to launch other malicious actions, such as DDoS attacks, malware distribution, or data breaches.

OWASP Top 10 IoT highlights that the lack of a secure firmware update mechanism is one of the top vulnerabilities affecting IoT security:

“Lack of ability to securely update the device. This includes lack of firmware validation on device, lack of secure delivery (un-encrypted in transit), lack of anti-rollback mechanisms, and lack of notifications of security changes due to updates.”

To mitigate this vulnerability, it is essential that businesses ensure that IoT firmware can be updated via OTA updates reliably, securely, and regularly. However, there are certain key factors impacting the security of IoT firmware updates, including:

Signing compromise: Unauthorized access to code-signing keys or firmware signing mechanisms can enable attackers to impersonate trust and deliver malicious updates to devices that appear trusted. It’s essential to shore up defenses around your code signing keys and ensure that firmware signatures are verified before allowing the code to execute on the device.

Insecure coding: Buffer overflows may happen because of insecure device programming. Adversaries seek out these coding flaws to cause erratic application behavior or crashes that can lead to a security breach. Buffer overflows can allow criminals to remotely access devices and can be weaponized to create DDoS or malware-injection attacks.

Insecure software supply chain: The development of IoT devices relies heavily on software supply chains and extensive use of open-source components. Lack of procedures to secure the supply chain results in employing insecure open-source components, with embedded vulnerabilities, which attackers are eager to exploit. The latest SolarWinds attack is a fine example of what can go wrong with insecure software supply chains.

Forgotten testing services in production devices: The debugging services and credentials that developers use during the development and testing of IoT devices should not migrate to the final production device, because they give adversaries potentially easy access.

Nearly every business struggles with certificate outages, but very few of them disclose when they happen or why. That is, until Epic Games – maker of fan favorites like Fortnite, Rocket League, and Houseparty – experienced a massive outage due to (you guessed it) an expired SSL certificate.

“On April 6, 2021, we had a wildcard TLS certificate unexpectedly expire. It is embarrassing when a certificate expires, but we felt it was important to share our story here in hopes that others can also take our learnings and improve their systems. If you or your organization are using certificate monitoring, this may be a good reminder to check for gaps in those systems.”

What happens when an SSL certificate expires

Instead of downplaying the incident, Epic Games turned a bad situation into a lesson learned by sharing a play-by-play of what happened and how it could have been prevented.

Here is a quick overview of the incident timeline:

  • Expiration: At 12:00 PM UTC, an internal certificate expired. The expired wildcard certificate was installed across hundreds of backend services, resulting in widespread outages across Fortnite, Rocket League, Houseparty, Epic Online Services, and Epic Games Store.
  • Response: Once their incident management process kicked off, it took their teams only 12 minutes to discover the expired certificate as the source of the issue, and begin the renewal process.
  • Remediation: At 12:37 PM UTC, the updated certificate was reissued and rolled out across their backend services over the next 15 minutes. At this point, they had 25 people directly involved in fixing the issue and many more fighting fires in Player Support, Community, Engineering, and Production services.
  • Fallout: Their teams were able to renew the certificate and recover most services within the hour. However, the initial outage exposed a series of other issues in their IT infrastructure, causing further disruption to the Epic Games Launcher and Epic Games Store.

All in all, it took Epic Games nearly five and a half hours to fully recover. Unfortunately, this isn’t a unique incident. Certificate expirations have been behind numerous recent high-profile and lengthy outages, including those affecting Microsoft Teams, Azure AD, and Google Voice.

The root causes of SSL certificate outages

Insecure IoT firmware can have dire consequences for the consumers of these connected devices. Even more problematic is the fact that these types of malicious attacks do not require physical access; an attacker can carry them out remotely and without warning.

Here are just a few examples of how insecure updates resulted in an exposed firmware vulnerability or firmware attack:

Every outage is different, but the underlying problems that lead up to them are consistently the same: limited visibility and lack of automation. This incident is no different…

Limited visibility

“The DNS zones for this internal service-to-service communication were not actively monitored by our certificate monitoring services, an oversight by us.”

Certificate discovery is one of the most important parts of certificate management, if not the most important. After all, you can’t renew a certificate you don’t know about. That said, 53% of companies still don’t know exactly how many keys and certificates (including self-signed) they have.
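The kind of gap described here, hosts that sit outside the DNS zones your monitoring covers, can be checked with a simple comparison. The sketch below is a hypothetical illustration in Python, not Epic Games’ tooling or a Keyfactor feature. It assumes you already have a list of discovered hostnames and a list of monitored zones, and all names are made up for the example.

```python
from typing import Iterable, List


def is_covered(host: str, monitored_zones: Iterable[str]) -> bool:
    """True if the host equals a monitored zone or is a subdomain of one."""
    host = host.lower().rstrip(".")
    for zone in monitored_zones:
        zone = zone.lower().rstrip(".")
        if host == zone or host.endswith("." + zone):
            return True
    return False


def find_unmonitored(hosts: Iterable[str], monitored_zones: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated hosts that no monitored zone covers."""
    zones = list(monitored_zones)  # materialize so the zones can be scanned per host
    return sorted(h for h in set(hosts) if not is_covered(h, zones))


# Hypothetical inventory: public zone is monitored, internal zone is not.
discovered = ["www.example.com", "store.example.com", "auth.svc.example.internal"]
print(find_unmonitored(discovered, ["example.com"]))
```

Any hostname this prints is serving a certificate that nobody is watching, which is exactly where an unexpected expiration comes from.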

Lack of process and automation

“Automated renewals were not enabled for this internal certificate, and work needed to accomplish this had not been prioritized when identified earlier this year.”

To renew a certificate, certificate owners typically need to generate a new CSR, have it signed by a CA, install the issued certificate, verify that it’s active, and then return to live operation. If you’re handling these processes manually, it’s nearly impossible to respond effectively when a certificate expires unexpectedly.
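One way to think about automating these steps is as an ordered pipeline that stops at the first failure, so a certificate is never put back into service half-renewed. The Python sketch below only shows that control flow. Every step is a placeholder that returns `True`, and the step names are our own labels for the stages listed above, not calls into any real CA or tool.

```python
from typing import Callable, List, Optional, Tuple

# A step is a label plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_renewal(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order; return (completed step names, name of failed step or None)."""
    completed: List[str] = []
    for name, action in steps:
        try:
            ok = action()
        except Exception:
            ok = False  # treat any error as a failed step
        if not ok:
            return completed, name
        completed.append(name)
    return completed, None


def _placeholder() -> bool:
    """Stand-in for real work (hypothetical); always succeeds."""
    return True


RENEWAL_STEPS: List[Step] = [
    ("generate_csr", _placeholder),
    ("submit_csr_to_ca", _placeholder),
    ("install_certificate", _placeholder),
    ("verify_certificate_active", _placeholder),
    ("return_to_live_operation", _placeholder),
]

completed, failed = run_renewal(RENEWAL_STEPS)
print("completed:", completed, "failed:", failed)
```

In a real system each placeholder would be replaced by an integration with your CA and deployment targets, and the failed step name would feed an alert rather than a print statement.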

Wildcard certificates

“The service-to-service wildcard certificate used was installed across hundreds of different production services, and because of this, the impact was very broad.”

Wildcard SSL certificates are convenient, but they create serious security challenges. If the private key is compromised or if the certificate is allowed to expire, the blast radius is multiplied by the number of servers it is installed on.
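If you have an inventory of which certificate each service presents, the blast radius is easy to measure: group services by certificate fingerprint and count. The following Python sketch assumes the inventory is a list of `(service, fingerprint)` pairs. The service names, fingerprints, and threshold are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def blast_radius(deployments: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each certificate fingerprint to the sorted list of services that use it."""
    by_cert = defaultdict(set)
    for service, fingerprint in deployments:
        by_cert[fingerprint].add(service)
    return {fp: sorted(services) for fp, services in by_cert.items()}


def widely_shared(deployments: Iterable[Tuple[str, str]], threshold: int = 2) -> Dict[str, List[str]]:
    """Keep only certificates installed on at least `threshold` services."""
    return {fp: svcs for fp, svcs in blast_radius(deployments).items() if len(svcs) >= threshold}


# Hypothetical inventory: one wildcard certificate shared by three services.
inventory = [
    ("matchmaking", "AA:11"),
    ("store", "AA:11"),
    ("accounts", "AA:11"),
    ("status-page", "BB:22"),
]
print(widely_shared(inventory, threshold=3))
```

A certificate that appears on hundreds of services in a report like this is the one to prioritize for monitoring and automated renewal, or to replace with per-service certificates.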

The need for certificate lifecycle automation

Unfortunately, it’s all too easy for IT and security leaders to view certificate outages as an unexpected event, rather than a symptom of a much bigger underlying issue – ad hoc and manual certificate management processes.

It’s not uncommon for organizations to rely on a mix of spreadsheets, CA interfaces, and homegrown tools to track and manage their certificates. In fact, only about one-third (36%) of companies use a dedicated certificate lifecycle management (CLM) solution. That means most organizations are still stuck in manual and siloed processes that fail to deliver the visibility they need across their IT environment.

An effective certificate management strategy should actively monitor every certificate, enable automated renewals and deployment to workloads and endpoints, and limit the use of unknown, self-signed, and wildcard certificates that increase the risk and impact of an outage.
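These checks can be expressed as a small policy function that runs over every certificate in an inventory. The Python sketch below is a simplified illustration under our own assumptions, not a description of any CLM product. The `CertRecord` fields, the 30-day warning window, and the finding labels are all hypothetical. It also treats "subject equals issuer" as self-signed and only looks at the common name for wildcards, both of which are simplifications.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class CertRecord:
    """Hypothetical inventory record for one certificate."""
    common_name: str
    subject: str
    issuer: str
    not_after: datetime
    owner: Optional[str] = None
    auto_renew: bool = False


def policy_findings(cert: CertRecord, now: datetime, warn_days: int = 30) -> List[str]:
    """Return the policy issues found for one certificate, in a fixed order."""
    findings: List[str] = []
    if cert.not_after <= now:
        findings.append("expired")
    elif cert.not_after - now <= timedelta(days=warn_days):
        findings.append("expiring_soon")
    if not cert.auto_renew:
        findings.append("no_auto_renewal")
    if cert.common_name.startswith("*."):
        findings.append("wildcard")  # simplification: ignores wildcard SAN entries
    if cert.subject == cert.issuer:
        findings.append("self_signed")  # simplification: subject == issuer
    if not cert.owner:
        findings.append("unknown_owner")
    return findings


now = datetime(2021, 3, 27, 12, 0, tzinfo=timezone.utc)
record = CertRecord(
    common_name="*.svc.example.internal",
    subject="CN=*.svc.example.internal",
    issuer="CN=Example Internal CA",
    not_after=now + timedelta(days=10),
)
print(policy_findings(record, now))
```

For this example record the function reports that the certificate is expiring soon, has no automated renewal, is a wildcard, and has no known owner: the same combination of conditions that preceded the outage described above.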

Find out why certificate lifecycle automation is now a must-have for digital business. Get insights from more than 1,100 IT and security professionals in the first-ever State of Machine Identity Management report and share it with your team.

Ryan Sanders

Senior Product Marketing Manager