
How I Lost Control of My PKI

My alarm sounds like an acoustic guitar. I really need to change that tone – it’s starting to get on my nerves. A quick email scan before I begin my morning routine. Justin was finally able to get the domain-joined machines into the correct group policy – nice. Next email: a support ticket stating access to the vacation request system was denied. Probably just another user error; I’d get to it later.

On a blissful drive into the office, I notice the budding birch trees and wonder when my allergies will peak – about two weeks, I guess. Arriving at my desk, I receive a few nods from peers. Engage in small talk with a coworker who is considering a new car. Another who is excited about a weekend getaway with his soon-to-be fiancé. All status quo morning activities. Coffee refill. System boot. BitLocker entry. Certificate-protected PIN entry to start up all my workstations. Like a rhythmic cadence, I begin to check off my daily list of monitoring tasks:

  1. Check upcoming certificate expirations
  2. Check the CA logs
  3. Check the status of known CRLs
  4. Review and assign support tickets

Check, Check, Check, Check
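
Looking back, the first of those checks never needed to live in a spreadsheet. Even something as small as the sketch below would have done it – written here in Python with the cryptography library, and assuming (hypothetically) that our issued certificates are exported as PEM files into a local "certs" folder. It’s an illustration, not the tooling we actually ran.

    # A rough sketch of check #1: scan a folder of exported PEM certificates
    # and flag anything close to expiring. The "certs" folder name and the
    # 30-day warning window are assumptions for illustration.
    import datetime
    import pathlib
    import sys

    from cryptography import x509  # pip install cryptography

    WARN_WINDOW = datetime.timedelta(days=30)  # flag anything expiring within 30 days

    def main(cert_dir):
        now = datetime.datetime.utcnow()
        for path in sorted(pathlib.Path(cert_dir).glob("*.pem")):
            cert = x509.load_pem_x509_certificate(path.read_bytes())
            remaining = cert.not_valid_after - now  # both values are naive UTC
            if remaining <= WARN_WINDOW:
                print(f"WARNING  {path.name}: expires "
                      f"{cert.not_valid_after:%Y-%m-%d} ({remaining.days} days left)")

    if __name__ == "__main__":
        main(sys.argv[1] if len(sys.argv) > 1 else "certs")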

A few minor support tickets continue to flow through the system. One states something is awry with our secondary phone system – odd, but not too alarming. The system has always been flaky, along with our manual monitoring processes, SHA-1 PKI, and outrageous certificate count. Terry, our old “PKI guy” (deemed the PKI master among his many other responsibilities), had been with the company for 22 years and retired just six weeks ago, leaving me with:

  • a three-page manual
  • copies of our CP/CPS from 2006
  • some chicken-scratch architecture notes
  • and a couple of hours of “knowledge transfer” on his way out the door

Just like that I became the “new PKI guy.”

The first day after Terry left, my first suggestion was to get a handle on our certificates, CRLs, and CAs and the overall management of their lifecycle: gather an inventory of both the internally issued certificates and the externally trusted certificates we’d purchased, followed by a better way to enroll, renew, and revoke. That did not go over well with upper management.

“Why fix something that isn’t broken? Keep up to date with your spreadsheets and notes”… “Terry did this for years with no issues – not sure why you can’t do the same.” In my four years with this company, I’d learned to keep my head down and not be too ambitious when it came to suggesting new automation platforms or methodologies for the following reasons:

  • no budget
  • too many affected departments
  • pain of change from current “system”

Don’t rock the boat.

A few deep breaths, and I snap back to reality: I cannot boil the ocean. A sense of peace rushes over me. I might actually have time to review this year’s top security predictions and threats (yes, I know it’s June). I know I *should* be more forward-thinking, but I can’t help feeling overwhelmed by our ticking time bomb – an outdated, under-managed PKI – and the manual tasks bogging me down daily. I wonder (just for an instant) how long it had been ticking before I inherited it. Maybe ignorance really is bliss.

A few hours slip away. I start to think about lunch. Jimmy John’s delivery again? Sure, why not. Another email scan. I realize the remediation efforts for the support tickets I assigned this morning are coming back as failures. My email to the person responsible for our vacation request system came back with no user error or server issues. Same with the secondary phone system contact. Strange. I go to ping Justin for further escalation when I notice my internet access is down. Just the Wi-Fi? Nope, the entire network. I look up at my desk phone and see an error message there as well. Slow at first, then all at once, the access denial alerts begin to hit me like a ton of bricks. Nobody can access the network. Impending doom.

My cell phone rings, not once but twice, and then a third time in rapid succession. Email starts to flow in from the external alerting system monitoring our domain access. My heart starts to race. I begin to hope I missed a memo about a planned outage from our internet provider – but the situation is too much like “The Perfect Storm” for that to be the case.

I start to compile a list of access denials (as it, sadly, continues to grow):

  • company-wide network
  • company-wide Wi-Fi
  • main phone system
  • secondary phone system
  • client portal
  • internal file server
  • vacation request system

This was a tried-and-true fire drill. I frantically alert our senior team and start digging. I scroll through the logs and re-run my “checks” – did I miss something earlier?

A light bulb flicks on as I flash back to a Gartner white paper I read months ago. What about an expired certificate? Ha – which one? We have thousands. Focus. I begin to check expiration dates for the certificates serving the affected systems. Valid dates – maybe an unavailable or out-of-date certificate revocation list (CRL)? Geez, I have no idea where those lists are even kept. This never happened when Terry was here. Why me?

The reason this never happened before was simple, which made it all the more frustrating. Here’s the kicker: Terry hosted the CRL for numerous applications on a server that had since been decommissioned. Turns out the IT ops team saw a stray server, owned by Terry (gone), clearly not needed anymore, and decided to pull the plug. The one web server hosting the CRL that everything pointed to went down without a fight – no warnings. How delightful – all of this havoc caused by one random CRL. Which raises an existential question: if a CRL becomes unreachable or a certificate expires and no one is around to see it, does it cause a problem? Absolutely.

I host the CRL elsewhere and launch into disaster recovery mode. Ensure access is restored across the board. Damage control with the internal teams. Personally ensure our CEO’s access is restored. Address the company-wide loss of productivity. But the backlash was just beginning. Our client success and sales teams barge angrily into my office: “How do I explain this to our clients? How can we ensure this doesn’t happen again?” I cower, giving an apologetic yet generic response, wondering in my head how I *really* was going to ensure this wouldn’t happen again. This time it was a CRL issue, but I had no idea where the other CRLs were hosted, where the thousands of certificates we’d issued lived, or what they were even for. A pounding headache joined my twisted stomach.
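
If I could rewind, I’d have had a script crawl every issued certificate, pull out its CRL distribution points, and confirm each CRL was still reachable and not past its nextUpdate – that alone would have answered “where are the other CRLs hosted?” Here’s a rough sketch of the idea, again in Python with the cryptography library; the folder of exported PEM certificates is my own hypothetical setup, not anything we had in place.

    # A rough sketch: walk a folder of exported PEM certificates, pull each
    # CRL distribution point URI, and report whether that CRL is reachable
    # and still inside its nextUpdate window. The "certs" folder is assumed.
    import datetime
    import pathlib
    import sys
    import urllib.request

    from cryptography import x509  # pip install cryptography

    def crl_urls(cert):
        """Yield every CRL distribution point URI embedded in a certificate."""
        try:
            ext = cert.extensions.get_extension_for_class(x509.CRLDistributionPoints)
        except x509.ExtensionNotFound:
            return
        for dp in ext.value:
            for name in dp.full_name or []:
                if isinstance(name, x509.UniformResourceIdentifier):
                    yield name.value

    def check_crl(url):
        """Fetch one CRL and report whether it is reachable and still current."""
        try:
            der = urllib.request.urlopen(url, timeout=10).read()
        except OSError as exc:
            return f"UNREACHABLE ({exc})"
        crl = x509.load_der_x509_crl(der)
        if crl.next_update and crl.next_update < datetime.datetime.utcnow():
            return f"STALE (nextUpdate was {crl.next_update:%Y-%m-%d})"
        return "OK"

    def main(cert_dir):
        results = {}
        for path in pathlib.Path(cert_dir).glob("*.pem"):
            cert = x509.load_pem_x509_certificate(path.read_bytes())
            for url in crl_urls(cert):
                if url not in results:  # check each distribution point only once
                    results[url] = check_crl(url)
        for url, status in sorted(results.items()):
            print(f"{status:<14} {url}")

    if __name__ == "__main__":
        main(sys.argv[1] if len(sys.argv) > 1 else "certs")

Point something like that at the folder of exported certificates, and anything marked UNREACHABLE or STALE is tomorrow’s outage waiting to happen.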

That night, I started to evaluate just how out of hand my PKI had been from the start and how much worse it had become over the last six weeks. I had stood up a third CA to support secure Wi-Fi authentication. We were issuing a massive number of certificates for every device connecting to Wi-Fi, rendering my old way of manually tracking certificate and CRL data in a spreadsheet completely useless. On top of that, people from completely separate departments, in entirely different buildings, were issuing certificates unbeknownst to me. I was appalled at my own irresponsibility. How could I let such a powerful authentication attribute go completely unmanaged? I had lost control of my PKI in the blink of an eye. While blissfully supporting fantastic new use cases and making our CISO happy with the added layers of security from the certificates we’d issued off our own internal PKI, I had unknowingly shot myself in the foot by letting those certificates and CRLs grow unwieldy.

That day will go down in infamy for me and my now-former company. Please learn from my mistakes – put a certificate lifecycle and PKI management platform in place before it’s too late. I sure wish I had.

Sincerely,

Ex-PKI Guy from ABC Corp.