
Here’s the thing nobody’s saying about Fabric Eventstream’s new Custom CA and mTLS support: it isn’t really a security feature. Or rather, it is, but the teams who’ll benefit most aren’t security teams. They’re Spark engineers who’ve been running shadow pipelines for months because the “secure” path was also the “impossible” path.
Let me explain.
The Workaround Tax
If you’ve been running Spark workloads against Kafka clusters in any regulated environment (banking, healthcare, telecom), you already know the drill. Your Kafka brokers sit behind certificates signed by an internal Certificate Authority. Your infosec team mandates mutual TLS. And until recently, Fabric Eventstream’s Kafka connectors only trusted the system-predefined CA list. Full stop.
So what did teams actually do? They built workarounds. They stood up intermediate proxy layers that terminated mTLS and re-encrypted with publicly trusted certs. They ran sidecar containers that handled the certificate dance outside of Eventstream. Some teams gave up on Eventstream entirely and wrote custom Spark Structured Streaming jobs that managed their own TrustStores and KeyStores. Jobs that worked, but that nobody wanted to maintain.
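For a sense of what those hand-rolled jobs carried, here is a sketch of the TLS plumbing a custom Spark Structured Streaming ingestion job typically needed. The broker address, topic name, paths, and passwords are placeholders; the `kafka.ssl.*` option names are the standard ones the Spark Kafka connector passes through to the Kafka client.

```python
# Hypothetical TLS configuration a hand-rolled Spark ingestion job had to
# own and maintain. All values below are placeholders.
kafka_ssl_options = {
    "kafka.bootstrap.servers": "broker1.internal:9093",
    "kafka.security.protocol": "SSL",
    # Trust the internal CA that signed the broker certificates.
    "kafka.ssl.truststore.location": "/mnt/secrets/kafka.truststore.jks",
    "kafka.ssl.truststore.password": "changeit",
    # Present a client certificate for mutual TLS.
    "kafka.ssl.keystore.location": "/mnt/secrets/kafka.keystore.jks",
    "kafka.ssl.keystore.password": "changeit",
    "subscribe": "payments.raw",
}

# Inside the job this would be applied as:
#   spark.readStream.format("kafka").options(**kafka_ssl_options).load()
```

Every one of those paths implies a distribution problem: the truststore and keystore files have to exist on every executor, and stay current through every rotation.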
Every one of those workarounds carried a cost. Not just in engineering hours, but in latency. Every extra hop between your Kafka broker and your Spark processing layer adds milliseconds. In a world where Spark Structured Streaming microbatches are measured in seconds, those milliseconds compound. A proxy hop that adds 15ms to every fetch costs a job triggering once per second roughly 21 minutes of pure relay overhead per day (86,400 triggers at 15ms each). That's not a rounding error. That's the difference between a batch window that closes on time and one that doesn't.
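The overhead figure above is back-of-the-envelope arithmetic, easy to redo with your own numbers:

```python
# Accumulated relay overhead from an extra proxy hop.
# Assumed inputs: 15 ms extra latency per fetch, one trigger per second.
def daily_proxy_overhead_minutes(hop_latency_ms: float, triggers_per_day: int) -> float:
    """Total added wait, in minutes, over one day of microbatch triggers."""
    return hop_latency_ms * triggers_per_day / 1000 / 60

overhead = daily_proxy_overhead_minutes(hop_latency_ms=15, triggers_per_day=86_400)
print(f"{overhead:.1f} minutes of overhead per day")  # 21.6
```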
What Actually Changed
The preview announcement covers three Kafka-based Eventstream sources: Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), and Confluent Cloud for Apache Kafka, plus Confluent Schema Registry. All of them now support two capabilities that were previously missing:
Custom CA certificates. You can import your internal CA certificate into Azure Key Vault in PEM format and reference it when configuring a Kafka source in Eventstream. The connector runtime fetches the certificate and trusts it for the TLS handshake. No more proxy layers to bridge the trust gap.
Mutual TLS (mTLS). Beyond custom CAs, you can import a client certificate and private key into the same Key Vault. Eventstream presents this client certificate during the TLS handshake, and the Kafka broker validates it. Two-way authentication without a single line of custom code.
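At the TLS layer, these two capabilities map to two well-known settings. The sketch below shows them with Python's standard `ssl` module, as a mental model of what the connector now handles for you; the file paths are hypothetical, and passing `None` skips loading real files.

```python
import ssl

def build_mtls_context(ca_pem_path=None, client_pem_path=None):
    """Mental model of the two settings Eventstream now manages:
    trusting a custom CA, and presenting a client certificate (mTLS).
    Paths are hypothetical placeholders; None skips the file load."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    ctx.verify_mode = ssl.CERT_REQUIRED  # always validate the broker's cert
    if ca_pem_path:
        # Custom CA: trust your internal CA, not just the system list.
        ctx.load_verify_locations(cafile=ca_pem_path)
    if client_pem_path:
        # mTLS: client cert + private key presented during the handshake.
        ctx.load_cert_chain(certfile=client_pem_path)
    return ctx

ctx = build_mtls_context()  # dry run, no files loaded
```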
The decision to anchor everything on Azure Key Vault solves a problem that’s plagued Spark teams for years: certificate distribution. In a traditional Spark cluster, you’d bake certificates into Docker images, mount them as secrets in Kubernetes, or distribute them through DBFS. Every rotation cycle meant redeploying or restarting jobs. With the Key Vault approach, you update the certificate once. Every Eventstream connector that references it picks up the new version automatically. No redeployment. No restart. No 3 AM pages because someone forgot to rotate the cert on the staging cluster.
The Spark Engineer’s Migration Checklist
If you’re currently running workaround pipelines, here’s the concrete path to cutting them over. This isn’t theoretical; it’s the sequence that will save you the most time with the fewest surprises.
Step 1: Audit your certificate chain. Before you touch Eventstream, document what you have. What CA signed your Kafka broker’s certificate? Is it a single root CA, or is there an intermediate chain? For mTLS, where does your client certificate live today, and who manages the private key? You need this inventory before anything else, because the Key Vault import requires PEM format, and many internal PKI systems export in PKCS#12 or DER.
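Since the Key Vault import requires PEM and internal PKI exports often aren't, a quick format check on each exported blob saves a failed import later. This is a rough heuristic, not a parser: PEM is base64 text with BEGIN/END markers, while DER and PKCS#12 are binary ASN.1.

```python
def guess_cert_encoding(blob: bytes) -> str:
    """Rough check of whether an exported certificate is already PEM."""
    if blob.lstrip().startswith(b"-----BEGIN"):
        return "PEM"
    # DER-encoded X.509 and PKCS#12 both open with an ASN.1 SEQUENCE tag.
    if blob[:1] == b"\x30":
        return "DER or PKCS#12 (convert before importing)"
    return "unknown"

print(guess_cert_encoding(b"-----BEGIN CERTIFICATE-----\nMIIB...\n"))  # PEM
```

Anything flagged as binary needs converting to PEM (for example with openssl) before it goes anywhere near the Key Vault import.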
Step 2: Set up Azure Key Vault with proper RBAC. Create or identify a Key Vault. Import your CA certificate as a certificate object, not a secret. This distinction matters because Eventstream’s certificate fetching logic expects it. If you’re using mTLS, import the client certificate and private key together as a single PEM bundle. Assign the “Key Vault Certificate User” role to the identity that Eventstream uses. For the initial import, use “Key Vault Administrator” and rotate down to least privilege afterward.
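Before importing the mTLS bundle, it's worth sanity-checking that the single PEM file really contains both pieces. A minimal sketch, assuming the common PEM block labels:

```python
def validate_mtls_bundle(pem_text: str) -> bool:
    """Check a single PEM bundle carries both pieces mTLS needs:
    at least one certificate block and exactly one private-key block."""
    certs = pem_text.count("-----BEGIN CERTIFICATE-----")
    keys = sum(pem_text.count(f"-----BEGIN {label}-----")
               for label in ("PRIVATE KEY", "RSA PRIVATE KEY", "EC PRIVATE KEY"))
    return certs >= 1 and keys == 1

bundle = (
    "-----BEGIN CERTIFICATE-----\nplaceholder cert\n-----END CERTIFICATE-----\n"
    "-----BEGIN PRIVATE KEY-----\nplaceholder key\n-----END PRIVATE KEY-----\n"
)
print(validate_mtls_bundle(bundle))  # True
```

A bundle that fails this check (cert without key, or two keys concatenated by a copy-paste accident) is the kind of thing that otherwise only surfaces as an opaque handshake failure later.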
Step 3: Handle the private network case. If your Kafka brokers sit inside a VNet or on-premises network, you need Eventstream’s VNet injection. Create an Azure virtual network that can reach your Kafka brokers and also has a private endpoint to your Key Vault. The ordering matters: configure the VNet and private endpoints first, then configure the Eventstream source. If you reverse the order, the connector will fail silently during certificate fetch because it can’t reach the Key Vault through the public endpoint.
Step 4: Configure and test one source. Pick your lowest-risk Kafka topic, something with a steady, predictable message rate, and configure it as an Eventstream source with the Custom CA settings. For mTLS, enable both the trusted CA certificate and the client certificate references. Run it for 24 hours. Watch for two things: authentication errors in the Eventstream monitoring (which mean your certificate chain is incomplete or your Key Vault permissions are wrong) and message latency compared to your workaround pipeline (which should be lower, since you’ve eliminated the proxy hop).
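For the latency comparison during that 24-hour soak, percentiles beat averages. A small sketch using the standard library; the sample lists are synthetic placeholders standing in for measurements from your monitoring.

```python
import statistics

def p95(samples_ms):
    """95th percentile via statistics.quantiles (19th of 19 cut points)."""
    return statistics.quantiles(samples_ms, n=20)[18]

# Synthetic placeholder samples; in practice these come from monitoring
# the workaround path and the new Eventstream path side by side.
workaround = [100 + i for i in range(100)]
eventstream = [50 + i for i in range(100)]
improvement_ms = p95(workaround) - p95(eventstream)
print(f"p95 improved by {improvement_ms:.1f} ms")
```

If the Eventstream path's p95 isn't at least as good as the workaround's, something else (often the VNet path or Key Vault fetch) deserves a look before migrating more topics.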
Step 5: Migrate incrementally. Once your test source proves stable, migrate topics one at a time. Keep your workaround pipelines running in parallel until each Eventstream source has been stable for at least 48 hours under production load. When you decommission a workaround, don’t just turn it off. Remove the infrastructure. Proxy layers and sidecar containers have a way of becoming permanent fixtures if you leave them around.
What This Means for Your Spark Processing Layer
Here’s where it gets interesting for Spark engineers specifically. When Eventstream handles the mTLS connection directly, your Spark Structured Streaming jobs downstream no longer need to manage TLS configuration. The data arrives in Eventstream already authenticated and decrypted. Your Spark jobs read from Eventstream’s output—a KQL Database, a Lakehouse, or a derived event stream—without caring about the certificate logistics upstream.
This changes your Spark job’s failure domain. Before, a certificate expiration on your Kafka broker could cascade into a Spark Structured Streaming job failure that looked like a network timeout. Your on-call engineer would spend 45 minutes digging through logs before realizing it was a cert issue, not a cluster issue. With Eventstream handling the connection, certificate-related failures surface in Eventstream’s monitoring, not in your Spark job logs. The blast radius shrinks. Mean time to diagnosis drops.
There’s also a capacity planning angle. If you were running proxy layers or custom Spark Structured Streaming ingestion jobs to handle mTLS, you were burning Spark capacity on what’s essentially an I/O concern. Those compute units get freed up. On a Fabric F64 capacity, redirecting even a small percentage of compute from certificate-wrangling proxy jobs to actual analytics can measurably impact your batch completion times.
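The capacity math is simple but worth making explicit. The 5% share below is an assumed figure for illustration, not a measurement; an F64 capacity provides 64 capacity units.

```python
def freed_capacity_units(total_cu: int = 64, proxy_share: float = 0.05) -> float:
    """Compute units returned to analytics if an assumed share of an F64
    capacity was going to certificate-handling proxy/ingestion jobs."""
    return total_cu * proxy_share

print(freed_capacity_units())  # 3.2
```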
The Risks You Should Actually Worry About
This is a preview feature, and previews carry specific risks that experienced Spark engineers should plan around.
Certificate fetch latency at connector startup. The Eventstream connector fetches certificates from Key Vault at runtime. If your Key Vault has high latency (common with geo-replicated vaults under load), connector startup will be slower. This probably won’t affect steady-state streaming, but it could affect recovery time after a connector restart. Test your connector’s cold-start time under realistic Key Vault conditions.
Key Vault throttling under rotation. If you rotate certificates frequently (some compliance regimes require 90-day rotation), the moment you update a certificate in Key Vault, every connector referencing it will re-fetch. If you have dozens of Eventstream sources pointing at the same Key Vault, you could hit throttling limits during rotation. Stagger your connector restarts or use separate Key Vault instances for high-fanout scenarios.
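Staggering can be as simple as jittering restart offsets across a window so a rotation doesn't trigger a synchronized re-fetch. A minimal sketch, with hypothetical connector names and a deterministic seed so the schedule is reproducible:

```python
import random

def staggered_restart_schedule(connector_ids, window_minutes=30, seed=42):
    """Spread connector restarts across a window after a certificate
    rotation, so many sources don't hit the same Key Vault at once.
    Returns (connector_id, minute_offset) pairs in restart order."""
    rng = random.Random(seed)
    offsets = sorted(rng.uniform(0, window_minutes) for _ in connector_ids)
    return list(zip(connector_ids, offsets))

schedule = staggered_restart_schedule([f"source-{i}" for i in range(12)])
```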
Preview-to-GA contract changes. Microsoft may change the configuration interface, the certificate format requirements, or the Key Vault integration pattern between preview and GA. Don’t build automation that scrapes the Eventstream UI. If you need to automate connector provisioning, use the Fabric REST APIs and wrap them in a layer you can update when the API contract changes.
The Bigger Picture
What makes this feature worth paying attention to isn’t the TLS handshake mechanics. It’s the architectural shift it represents. Fabric is steadily pulling infrastructure concerns out of the Spark processing layer and into the platform layer. First it was storage with OneLake. Then compute scheduling with Fabric capacities. Now it’s connection security. Each time, the pattern repeats: something that Spark engineers used to manage through custom code and tribal knowledge becomes a platform capability with a configuration interface.
The teams that will get the most value from this aren’t the ones with the most sophisticated workarounds. They’re the ones who recognize that the workaround was always the wrong layer of abstraction and move quickly to eliminate it. The proxy layer was never the product. The data pipeline was.
Start with one topic. Prove it works. Then move fast.
This post was written with help from anthropic/claude-opus-4-6
