Major Outage
Incident Report for Wakam Services
Postmortem

What happened?

Between 03:15 UTC and 10:30 UTC on 16 Nov 2023, many partners experienced Transport Layer Security (TLS) exceptions when using our APIs, such as “The request was aborted: Could not create SSL/TLS secure channel”.

What went wrong?

Our cloud provider, Microsoft Azure, faced a global issue with one of the services we use in our infrastructure: API Management. This issue impacted the following regions: North Europe, West Europe and West US. Microsoft determined that a recent deployment on the service introduced a configuration bug that caused failures when trying to create client connections. This bug prevented TLS interactions from succeeding, returning exceptions for a subset of users.

How did we respond?

The issue was only reported at 07:32 UTC because our internal probes were not impacted and therefore did not detect it. After completing checks on our own infrastructure, we contacted Microsoft’s support at 08:37 UTC, and they confirmed the issue at 10:13 UTC. They rolled back the recent deployment of the API Management service to mitigate the issue. At 10:30 UTC, the service was up again.

What happens next?

Following this outage, we have decided on two actions:

  • Put in place new monitoring in addition to the current external probes
  • Improve our operational procedures to mitigate the impact of this kind of incident
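
As an illustration of the first action, the sketch below shows one possible shape for an additional probe: it performs a full TLS handshake from outside the platform, so that a failure of the kind seen in this incident would surface even when HTTP-level checks on other paths stay green. It is a minimal sketch and not a description of the monitoring actually deployed. The names `ProbeResult` and `probe_tls`, the default port and timeout, and the host `api.example.com` mentioned afterwards are all assumptions made for illustration.

```python
import socket
import ssl
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Outcome of one handshake attempt (hypothetical structure)."""
    host: str
    ok: bool
    tls_version: Optional[str] = None
    error: Optional[str] = None


def probe_tls(host: str, port: int = 443, timeout: float = 5.0) -> ProbeResult:
    """Open a TCP connection and attempt a TLS handshake with certificate checks."""
    # Default context: verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # The handshake happens inside wrap_socket; a failure raises ssl.SSLError.
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return ProbeResult(host, True, tls_version=tls.version())
    except OSError as exc:
        # ssl.SSLError, timeouts, DNS errors and refused connections all derive from OSError.
        return ProbeResult(host, False, error=f"{type(exc).__name__}: {exc}")
```

A scheduler could call `probe_tls("api.example.com")` (a placeholder host) at a fixed interval from several locations and raise an alert when `ok` is false for consecutive runs.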

NB: the official issue summary published by Microsoft Azure is available at https://app.azure.com/h/6LSL-JCG/b34b8a.

Posted Dec 13, 2023 - 11:01 CET

Resolved
We confirm the service is now back to normal.
A detailed resolution statement will be provided in the coming days.
Posted Nov 16, 2023 - 12:30 CET
Monitoring
It is confirmed that the outage was global and affected all Microsoft Azure clients, including Wakam.
Microsoft Azure has rolled back our API Management service to mitigate the issue.
Service is progressively returning to normal, and we are still actively monitoring it.
More details on the issue will be provided later.
Posted Nov 16, 2023 - 12:11 CET
Identified
We have identified a potential root cause for this issue in the API Management service, and our cloud provider, Microsoft Azure, is rolling back a recent deployment to mitigate it.
An update will be provided in 30 minutes or as events warrant.
Posted Nov 16, 2023 - 11:18 CET
Update
We are developing alternative mitigation strategies to provide a workaround.
We will provide further updates.
Posted Nov 16, 2023 - 10:39 CET
Update
We have identified an issue with a key component that provides access to our pricing API platform. It prevents some of our partners from generating quotes and from accessing our developer portal.
We are investigating along with our service provider Microsoft Azure.
Posted Nov 16, 2023 - 10:01 CET
Update
We have received reports that some of our partners are unable to access our API. We are investigating further and will provide updates via this channel.
Next update to come by 10:00.
Posted Nov 16, 2023 - 09:36 CET
Investigating
We have received reports that some of our partners are unable to access our API. We are investigating further and will provide updates via this channel.
Next update to come by 10:00.
Posted Nov 16, 2023 - 09:35 CET
This incident affected: Pricing API and Developer Portal.