Hybrid TLS rollout model: lessons from a 12-month plan

Running ML-KEM alongside ECDH in TLS 1.3 across distributed infrastructure is straightforward in theory. In practice, the planning is dominated by certificate chain sizes, middlebox interference, and HSM firmware constraints, as this composite rollout model shows.

The following is a composite rollout model for hybrid TLS — X25519 + ML-KEM-768 — across a regulated financial-services estate. It uses representative scale assumptions: hundreds of load balancer endpoints, thousands of internal service-to-service connections, and a mix of on-premises HSMs and cloud-based key management services.

Month 1–3: inventory and baseline

The first three months were not about TLS at all. They were about discovering what we were actually running. The organisation had a nominal CMDB, but its cryptographic records were only about 30% accurate. We found RSA-1024 certificates in production, years after the organisation's policy had mandated RSA-2048 as a minimum. We found TLS 1.0 still enabled on a legacy payment gateway that had been earmarked for decommissioning in 2019.

The tooling for cryptographic inventory has improved significantly. Automated scanning with tools that perform TLS handshake analysis across all IP ranges gives you a reasonable picture of external-facing services within a few days. Internal service-to-service is harder — it requires either network traffic analysis or systematic interrogation of service mesh configurations. Neither approach is perfect; both are necessary.
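As a rough illustration of the handshake-analysis step, the Python sketch below probes one endpoint with the standard-library ssl module and checks the result against a minimum policy. The Finding record, the violations helper, and the thresholds are hypothetical names chosen for this example, not part of any particular scanner. A single probe reports only the best version both sides agree on, so catching a still-enabled TLS 1.0 would need extra probes with the maximum version pinned lower. The standard library also does not expose the certificate key size, so rsa_bits is assumed to come from a separate certificate parser.

```python
import socket
import ssl
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical policy thresholds; replace with your own standard.
MIN_TLS_VERSION = "TLSv1.2"
MIN_RSA_BITS = 2048
_TLS_RANK = {"SSLv3": 0, "TLSv1": 1, "TLSv1.1": 2, "TLSv1.2": 3, "TLSv1.3": 4}


@dataclass
class Finding:
    endpoint: str
    tls_version: str                 # string as returned by SSLSocket.version()
    cipher: str
    rsa_bits: Optional[int] = None   # None if unknown or not an RSA certificate


def probe(host: str, port: int = 443, timeout: float = 5.0) -> Finding:
    """Handshake with one endpoint and record what was negotiated."""
    ctx = ssl.create_default_context()
    # Inventory only: collect data even from endpoints with invalid certificates.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cipher_name, _protocol, _secret_bits = tls.cipher()
            return Finding(f"{host}:{port}", tls.version(), cipher_name)


def violations(finding: Finding) -> List[str]:
    """Return human-readable policy violations for one finding."""
    problems = []
    rank = _TLS_RANK.get(finding.tls_version, -1)  # unknown versions rank lowest
    if rank < _TLS_RANK[MIN_TLS_VERSION]:
        problems.append(f"{finding.tls_version} below {MIN_TLS_VERSION}")
    if finding.rsa_bits is not None and finding.rsa_bits < MIN_RSA_BITS:
        problems.append(f"RSA-{finding.rsa_bits} below RSA-{MIN_RSA_BITS}")
    return problems
```

Keeping the policy check separate from the probe means the same rules can be applied to findings gathered from traffic analysis or service mesh configuration, not only from active scans.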

Month 4–7: the middlebox problem

In pilot environments, hybrid TLS can surface a failure mode that teams often miss during lab testing: deep packet inspection appliances may terminate connections when they see the enlarged ClientHello. The hybrid key share adds approximately 1,200 bytes to the TLS ClientHello, which can take it from a single TCP segment to two or three. Some DPI appliances drop ClientHellos that span multiple TCP segments entirely, treating the split as malformed traffic.
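The size arithmetic can be checked with a few lines of Python. The key sizes are the published ones (a 1,184-byte ML-KEM-768 encapsulation key and a 32-byte X25519 public key); the baseline ClientHello sizes and the 1,460-byte MSS are assumptions for illustration, and the sketch ignores the few bytes of key-share and record framing.

```python
import math

MLKEM768_ENCAPS_KEY_BYTES = 1184   # ML-KEM-768 encapsulation key (FIPS 203)
X25519_PUBLIC_KEY_BYTES = 32


def hybrid_share_bytes() -> int:
    """Size of the client's X25519 + ML-KEM-768 key share, excluding framing."""
    return MLKEM768_ENCAPS_KEY_BYTES + X25519_PUBLIC_KEY_BYTES


def tcp_segments(client_hello_bytes: int, mss: int = 1460) -> int:
    """Number of TCP segments needed to carry a ClientHello of the given size.

    The default MSS assumes a 1,500-byte MTU with no IP or TCP options;
    real paths (tunnels, TCP timestamps) often have a smaller MSS.
    """
    return math.ceil(client_hello_bytes / mss)


if __name__ == "__main__":
    # Assumed baseline ClientHello sizes in bytes, for illustration only.
    for baseline in (300, 500, 1800):
        total = baseline + hybrid_share_bytes()
        print(f"baseline={baseline} hybrid={total} "
              f"segments {tcp_segments(baseline)} -> {tcp_segments(total)}")
```

Running the same calculation with the MSS actually observed on each network path is a quick way to predict which paths will see a multi-segment ClientHello.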

The fix involved three changes: first, firmware updates to affected appliances (available for one vendor; a workaround was required for the other); second, enabling TLS connection coalescing at the load balancer tier to reduce the number of connections triggering the DPI path; third, a fallback negotiation path that allows the server to downgrade to classical-only TLS for clients that fail the hybrid handshake after two retries.
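The retry-then-fallback rule can be sketched as a small policy function. This is one interpretation, not the actual implementation: "two retries" is read as one initial hybrid attempt plus two more, and the handshake callable is a hypothetical stand-in for whatever TLS stack performs the connection. The group names are the standard identifiers for the hybrid and classical key exchanges.

```python
from typing import Callable, Optional, Tuple

HYBRID_GROUP = "X25519MLKEM768"
CLASSICAL_GROUP = "X25519"
MAX_HYBRID_RETRIES = 2   # fall back after the initial attempt plus two retries


def negotiate(
    handshake: Callable[[str], bool],
    max_hybrid_retries: int = MAX_HYBRID_RETRIES,
) -> Tuple[Optional[str], int]:
    """Try hybrid first, then classical-only.

    `handshake` is a hypothetical callable that attempts a connection with
    the given key-exchange group and returns True on success.
    Returns (group_used, attempts_made); group_used is None if all failed.
    """
    attempts = 0
    for _ in range(1 + max_hybrid_retries):
        attempts += 1
        if handshake(HYBRID_GROUP):
            return HYBRID_GROUP, attempts
    # Hybrid is exhausted: one classical-only attempt.
    attempts += 1
    if handshake(CLASSICAL_GROUP):
        return CLASSICAL_GROUP, attempts
    return None, attempts
```

Because any downgrade path is security-relevant, it is worth logging every connection that ends up on the classical group so that fallbacks can be counted and the path eventually removed.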

The middlebox problem is under-documented in the public literature. Anyone planning a large-scale hybrid TLS rollout should budget significant time for network path analysis and appliance compatibility testing before enabling hybrid negotiation at scale.

Month 8–10: HSM constraints

The on-premises HSMs — hardware security modules that store private keys and perform cryptographic operations — were a significant constraint. The installed firmware on three of the four HSM models did not support ML-KEM key generation or storage. Two vendors released firmware updates during the deployment window; one did not, and those units required offloading ML-KEM operations to a software implementation with the HSM used only for classical key storage.

This is a real risk for organisations with HSM-heavy infrastructure. The FIPS 140-3 validation path for ML-KEM implementations is still maturing, and HSM support needs to be confirmed for the exact module, firmware, and boundary in scope. Plan for software fallback paths and account for re-validation timelines when scoping your migration programme.
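One way to make the software fallback explicit is a per-model capability table that decides where each key operation runs. The sketch below is a minimal illustration under assumed names: HsmModel, key_backend, and the model labels are hypothetical, and a real deployment would take capabilities from the vendor's documentation for the exact firmware in scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HsmModel:
    name: str               # hypothetical label, not a real product name
    supports_ml_kem: bool   # confirmed for the installed firmware version


def key_backend(model: HsmModel, algorithm: str) -> str:
    """Decide where a key operation runs: 'hsm' or 'software'.

    ML-KEM operations go to software only when the HSM firmware cannot
    handle them; classical keys always stay in the HSM.
    """
    if algorithm.startswith("ML-KEM") and not model.supports_ml_kem:
        return "software"
    return "hsm"


if __name__ == "__main__":
    fleet = [HsmModel("model-a", True), HsmModel("model-b", False)]
    for model in fleet:
        for algorithm in ("X25519", "ML-KEM-768"):
            print(model.name, algorithm, "->", key_backend(model, algorithm))
```

Keeping the decision in one table also gives risk and compliance teams a single place to see which keys sit outside the HSM boundary.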

Month 11–12: stabilisation and metrics

After resolving middlebox and HSM issues in staging, teams should expect the deployment to stabilise before broad rollout. Representative pilot metrics are reassuring: hybrid TLS handshake latency is typically a few milliseconds higher than classical TLS 1.3 at the median, with low single-digit millisecond overhead at the 99th percentile on modern server hardware.
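To compare pilot measurements in the same terms, the overhead at the median and at the 99th percentile can be computed from two sets of handshake timings. This is a minimal sketch using the standard-library statistics module; the function names are chosen for this example and the inputs are assumed to be latencies in milliseconds collected under comparable load.

```python
import statistics
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Return the pct-th percentile (1..99) using inclusive interpolation."""
    cut_points = statistics.quantiles(samples, n=100, method="inclusive")
    return cut_points[pct - 1]


def overhead_ms(classical: Sequence[float], hybrid: Sequence[float]) -> Dict[str, float]:
    """Hybrid minus classical handshake latency at the median and p99."""
    return {
        "p50": statistics.median(hybrid) - statistics.median(classical),
        "p99": percentile(hybrid, 99) - percentile(classical, 99),
    }
```

Tail percentiles need many samples to be meaningful, so a p99 figure from a short pilot should be treated as indicative only.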

Certificate chain sizes did increase: each ML-DSA-87 signature is 4,627 bytes, so moving intermediate certificates to ML-DSA-87 added roughly that much per certificate. With a chain of three certificates, that adds about 13.9 KB of signature data to the initial connection, before public keys, certificate metadata, and encoding overhead. For latency-sensitive applications, this is worth profiling. It is not a problem in most contexts, but it is a measurable difference.
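The chain figure is simple multiplication, shown here as a small Python sketch so it can be rerun for other chain lengths. The signature and public-key sizes are the published ML-DSA-87 values; whether each certificate in a given chain also carries an ML-DSA-87 public key depends on the PKI design, so that term is optional and off by default.

```python
ML_DSA_87_SIGNATURE_BYTES = 4627    # FIPS 204
ML_DSA_87_PUBLIC_KEY_BYTES = 2592   # FIPS 204


def chain_bytes(n_certs: int, include_public_keys: bool = False) -> int:
    """Bytes of ML-DSA-87 material in a chain of n_certs certificates.

    Counts one signature per certificate and, optionally, one public key
    per certificate. Metadata and DER encoding overhead are not included.
    """
    per_cert = ML_DSA_87_SIGNATURE_BYTES
    if include_public_keys:
        per_cert += ML_DSA_87_PUBLIC_KEY_BYTES
    return n_certs * per_cert


if __name__ == "__main__":
    signatures_only = chain_bytes(3)
    print(signatures_only, "bytes =", round(signatures_only / 1000, 1), "KB")
    print(chain_bytes(3, include_public_keys=True), "bytes with public keys")
```

For three certificates the signatures alone come to 13,881 bytes, which matches the 13.9 KB quoted above when a kilobyte is taken as 1,000 bytes.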

A twelve-month programme should aim to deliver hybrid TLS across the highest-risk external endpoints first, then internal service-to-service connections in waves. Legacy systems that cannot absorb larger handshakes should remain on a documented remediation track. The durable outcome is not only hybrid TLS coverage, but a cryptographic inventory that is genuinely accurate and a migration framework that engineering, risk, and compliance teams can maintain.