The Enemy of Uptime: An Expired SSL Certificate

Every day thousands of SSL certificates expire. In most cases, these seemingly insignificant moments go unnoticed. Our websites stay secure, our servers keep humming, and it’s business as usual.

Then it happens. Another SSL certificate expires. Only this time, someone forgot to renew it. What happens next? In this blog, we’ll unpack why certificates expire, what happens when they do, and how to respond effectively.

SSL certificates are an important factor in securing websites, applications, and machine-to-machine connections, but because they are so widely used, they are also incredibly difficult to track.

Why keep track of certificates? It comes down to one event – certificate expiration.

Every SSL certificate has a set expiry date. They used to last up to five years, then it was two. Now publicly issued certificates are only valid for 398 days. Unlike your Netflix subscription that auto-renews as you get your stream on, SSL certificates don’t renew themselves. That responsibility typically falls on the PKI or security team.

If an organization is using just a handful of certificates, they may track and renew them manually. The reality, though, is that most companies have thousands upon thousands of SSL certificates. Keeping track of where these certificates reside, who owns them, and when they expire becomes extremely difficult to do manually at scale.
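For a sense of what tracking expiry looks like for a single certificate, here is a minimal Python sketch using only the standard library. The helper names `days_until_expiry` and `fetch_not_after` are ours, not part of any product, and the sketch assumes the endpoint is reachable over TLS and still presents a certificate that passes validation.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    for example "Apr  6 12:00:00 2021 GMT". A negative result
    means the certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read notAfter from the certificate a live endpoint presents.

    Assumption: the default context validates the chain, so an endpoint
    whose certificate has already expired fails the handshake with an
    ssl.SSLCertVerificationError instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling `days_until_expiry(fetch_not_after("example.com"), datetime.now(timezone.utc))` covers one endpoint. The hard part at scale is knowing which thousands of endpoints to ask.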

So why all the trouble? Why must certificates expire? And what happens when they do?

The importance (and risk) of certificate expiration

Certificate expiration is a good thing. For the same reason we renew our passport, personal ID, and passwords, certificates have an expiration date and must be renewed after a set validity period to ensure they are accurate, up-to-date, and trusted.

Certificates with shorter lifespans (90 days or less) are safer and more favorable in today’s fast-changing environments. However, these short-lived certificates also increase the burden on teams responsible for issuing and renewing them.
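To get a rough feel for that added burden, here is a back-of-the-envelope sketch in Python. The inventory of 1,000 certificates is a hypothetical figure, and the calculation assumes each certificate is renewed exactly once per validity period. Teams usually renew earlier, so real numbers would be higher.

```python
def renewals_per_year(cert_count: int, validity_days: int) -> float:
    """Approximate renewals per year if each certificate is renewed
    once per validity period (a simplifying assumption)."""
    return cert_count * 365 / validity_days


# Hypothetical inventory of 1,000 certificates.
for validity_days in (398, 90):
    per_year = renewals_per_year(1000, validity_days)
    print(f"{validity_days}-day certificates: about {per_year:.0f} renewals per year")
```

Under these assumptions, moving from 398-day to 90-day certificates raises the renewal workload from roughly 917 to roughly 4,056 per year, which is more than a fourfold increase.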

Outages happen when a certificate’s expiration goes unnoticed and its owner fails to renew it in time, despite multiple emails and warning messages. The severity and impact of these certificate outages can range from a single user unable to access Wi-Fi to a global network or service disruption that impacts millions of customers.

To put things in perspective, Keyfactor recently kicked off a report on the State of Machine Identity Management. In the report, we asked respondents how often they experience certificate-related outages and what impact they have on their organization. Here is what we found:

  • 88% of companies continue to experience unplanned outages due to expired certificates
  • On average, they experienced more than 3 certificate outages within the past two years
  • 40% of respondents say their organization has a high likelihood of experiencing more outages
  • 59% of respondents say they are worried about the increased risk of outages due to shorter SSL/TLS lifespans

One expired certificate, one Epic outage

Nearly every business struggles with certificate outages, but very few of them disclose when they happen or why. That is until Epic Games – maker of fan favorites like Fortnite, Rocket League, and Houseparty – experienced a massive outage due to (you guessed it) an expired SSL certificate.

In Epic Games’ own words: “On April 6, 2021, we had a wildcard TLS certificate unexpectedly expire. It is embarrassing when a certificate expires, but we felt it was important to share our story here in hopes that others can also take our learnings and improve their systems. If you or your organization are using certificate monitoring, this may be a good reminder to check for gaps in those systems.”

What happens when an SSL certificate expires

Instead of downplaying the incident, Epic Games turned a bad situation into a lesson learned by sharing a play-by-play of what happened and how it could have been prevented.

Here is a quick overview of the incident timeline:

  • Expiration: At 12:00 PM UTC, an internal certificate expired. The expired wildcard certificate was installed across hundreds of backend services, resulting in widespread outages across Fortnite, Rocket League, Houseparty, Epic Online Services, and Epic Games Store.
  • Response: Once their incident management process kicked off, it took their teams only 12 minutes to identify the expired certificate as the source of the issue and begin the renewal process.
  • Remediation: At 12:37 PM UTC, the updated certificate was reissued and rolled out across their backend services over the next 15 minutes. At this point, they had 25 people directly involved in fixing the issue and many more fighting fires in Player Support, Community, Engineering, and Production services.
  • Fallout: Their teams were able to renew the certificate and recover most services within the hour. However, the initial outage exposed a series of other issues in their IT infrastructure, causing further disruption to the Epic Games Launcher and Epic Games Store.

All in all, it took Epic Games nearly five and a half hours to fully recover. Unfortunately, this isn’t a unique incident. Expired certificates have caused numerous recent high-profile and lengthy outages, including ones affecting Microsoft Teams, Azure AD, and Google Voice.

The root causes of SSL certificate outages

Every outage is different, but the underlying problems that lead up to them are consistently the same: limited visibility and lack of automation. This incident is no different…

Limited visibility

Epic Games: “The DNS zones for this internal service-to-service communication were not actively monitored by our certificate monitoring services, an oversight by us.”

Certificate discovery is one of the most important parts of certificate management, if not the most important. After all, you can’t renew a certificate you don’t know about. That said, 53% of companies still don’t know exactly how many keys and certificates (including self-signed) they have.

Lack of process and automation

Epic Games: “Automated renewals were not enabled for this internal certificate, and work needed to accomplish this had not been prioritized when identified earlier this year.”

To renew a certificate, certificate owners typically need to generate a new CSR, have a CA sign it, install the issued certificate, verify that it’s active, and then return to live operation. If you’re handling these steps manually, it’s nearly impossible to respond effectively when a certificate expires unexpectedly.

Wildcard certificates

Epic Games: “The service-to-service wildcard certificate used was installed across hundreds of different production services, and because of this, the impact was very broad.”

Wildcard SSL certificates are convenient, but they create serious security challenges. If the private key is compromised or the certificate is allowed to expire, the blast radius grows with the number of servers it is installed on.
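One way to make that blast radius visible is to group a certificate inventory by fingerprint and count the hosts behind each one. The Python sketch below assumes you already have (host, fingerprint) pairs from a discovery scan. The function name, hostnames, and fingerprints are all hypothetical.

```python
from collections import defaultdict


def blast_radius(deployments):
    """Map each certificate fingerprint to the sorted list of hosts serving it.

    `deployments` is an iterable of (host, fingerprint) pairs, for example
    the output of a discovery scan (hypothetical input format).
    """
    hosts_by_cert = defaultdict(set)
    for host, fingerprint in deployments:
        hosts_by_cert[fingerprint].add(host)
    return {fp: sorted(hosts) for fp, hosts in hosts_by_cert.items()}


# Hypothetical inventory: one wildcard certificate shared by three services.
inventory = [
    ("auth.internal.example.com", "AA:11"),
    ("store.internal.example.com", "AA:11"),
    ("chat.internal.example.com", "AA:11"),
    ("www.example.com", "BB:22"),
]

for fingerprint, hosts in blast_radius(inventory).items():
    print(f"{fingerprint}: {len(hosts)} host(s) affected if this certificate expires")
```

Any fingerprint that maps to a long list of hosts is a single point of failure worth prioritizing for automated renewal.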

The need for certificate lifecycle automation

Unfortunately, it’s all too easy for IT and security leaders to view certificate outages as an unexpected event, rather than a symptom of a much bigger underlying issue – ad hoc and manual certificate management processes.

It’s not uncommon for organizations to rely on a mix of spreadsheets, CA interfaces, and homegrown tools to track and manage their certificates. In fact, only about one-third (36%) of companies use a dedicated certificate lifecycle management (CLM) solution. That means most organizations are still stuck in manual and siloed processes that fail to deliver the visibility they need across their IT environment.

An effective certificate management strategy should actively monitor every certificate, enable automated renewals and deployment to workloads and endpoints, and limit the use of unknown, self-signed, and wildcard certificates that increase the risk and impact of an outage.
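As an illustration only, those policy checks could be expressed as a small audit function over inventory records. This Python sketch is not how any particular CLM product works. The `CertRecord` fields, the 30-day warning window, and the subject-equals-issuer heuristic for self-signed certificates are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CertRecord:
    """Hypothetical inventory record; field names are assumptions."""
    subject: str
    issuer: str
    san_names: list[str]
    not_after: datetime
    auto_renew: bool


def audit(cert: CertRecord, now: datetime,
          warn_window: timedelta = timedelta(days=30)) -> list[str]:
    """Return the policy findings for one certificate."""
    findings = []
    if cert.not_after <= now:
        findings.append("expired")
    elif cert.not_after - now <= warn_window:
        findings.append("expiring soon")
    if not cert.auto_renew:
        findings.append("no automated renewal")
    if any(name.startswith("*.") for name in cert.san_names):
        findings.append("wildcard")
    # Heuristic: identical subject and issuer usually means self-signed.
    if cert.subject == cert.issuer:
        findings.append("self-signed")
    return findings


now = datetime(2021, 3, 27, 12, 0, tzinfo=timezone.utc)
record = CertRecord(
    subject="CN=*.internal.example.com",
    issuer="CN=Example Internal CA",
    san_names=["*.internal.example.com"],
    not_after=now + timedelta(days=10),
    auto_renew=False,
)
print(audit(record, now))
```

A record like this one, a wildcard certificate ten days from expiry with no automated renewal, would surface three findings well before it could cause an outage.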

Find out why certificate lifecycle automation is now a must-have for digital business. Get insights from more than 1,100 IT and security professionals in the first-ever State of Machine Identity Management report and share it with your team.