
Use PEM certificates loaded from secrets for Kafka #11447

Merged
scholzj merged 2 commits into strimzi:main from tinaselenge:use-pem-kafka
Oct 9, 2025
Conversation

@tinaselenge
Contributor

@tinaselenge tinaselenge commented May 19, 2025

Type of change

  • Refactoring

Description

  • Use KubernetesSecretConfigProvider to access secrets directly when configuring the Kafka truststore and keystore used by nodes to authenticate each other and their clients.
  • OAuth and authorization server configurations still use PKCS12 certificates generated by the script, because they are deprecated and will be removed in the CRD v1 release. Once they are removed, the script for preparing TLS certificates can be removed completely.
  • Remove volume mounts and environment variables for configuring the truststore and keystore, as they are no longer needed now that secrets are accessed directly.
  • Refactored KafkaAgent to directly access the cluster CA and node certificates and use them to configure the HTTP server, instead of using PKCS12 certificates generated by the script. Added a util class for creating JKS keystores from secrets.

Resolves part of #11294
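To illustrate the first bullet, a hedged sketch of what a PEM-based listener configuration resolved through KubernetesSecretConfigProvider could look like. This is not the operator's actual generated config; the namespace, secret, and field names (`myproject`, `my-cluster-kafka-brokers`, `my-cluster-cluster-ca-cert`) are illustrative assumptions:

```properties
# Register the Strimzi Kubernetes secret config provider so broker
# configuration values can be resolved from Secret fields at start-up,
# without first converting the PEM material to PKCS12 on disk.
config.providers=strimzisecrets
config.providers.strimzisecrets.class=io.strimzi.kafka.KubernetesSecretConfigProvider

# Keystore and truststore configured as PEM, pulled straight from secrets
# (all names below are placeholders, not the operator's real output).
listener.name.replication-9091.ssl.keystore.type=PEM
listener.name.replication-9091.ssl.keystore.certificate.chain=${strimzisecrets:myproject/my-cluster-kafka-brokers:my-cluster-kafka-0.crt}
listener.name.replication-9091.ssl.keystore.key=${strimzisecrets:myproject/my-cluster-kafka-brokers:my-cluster-kafka-0.key}
listener.name.replication-9091.ssl.truststore.type=PEM
listener.name.replication-9091.ssl.truststore.certificates=${strimzisecrets:myproject/my-cluster-cluster-ca-cert:ca.crt}
```

With this shape, the third bullet follows naturally: because the broker reads certificate material through the config provider, the volume mounts and environment variables that previously fed the keystore-generation script are no longer needed.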

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

@tinaselenge tinaselenge marked this pull request as ready for review May 28, 2025 14:12
@tinaselenge tinaselenge requested review from katheris and ppatierno and removed request for ppatierno May 28, 2025 14:12
@ppatierno ppatierno added this to the 0.47.0 milestone Jun 4, 2025
@tinaselenge
Contributor Author

@ppatierno @katheris can you please review this PR when you get a chance? Thank you :)

Comment thread kafka-agent/src/main/java/io/strimzi/kafka/agent/KafkaAgent.java Outdated
@tinaselenge tinaselenge force-pushed the use-pem-kafka branch 2 times, most recently from 3d7de64 to 3363f8a (June 18, 2025 09:15)
Comment thread kafka-agent/src/main/java/io/strimzi/kafka/agent/KafkaAgent.java Fixed
@tinaselenge
Contributor Author

Thank you so much, @ppatierno, for reviewing the PR. I have now addressed your comments.

Could you also please kick off the regression tests?

@im-konge
Member

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@im-konge
Member

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@katheris katheris left a comment


The changes look pretty good to me; I just had a couple of questions and suggestions, which I added.

@ppatierno
Member

@tinaselenge I restarted failed regression tests, not sure if they were related to the PR but there were quite a few. Let's see the next run.

@im-konge
Member

@tinaselenge I restarted failed regression tests, not sure if they were related to the PR but there were quite a few. Let's see the next run.

They failed even for the previous runs, so I guess they are related to the PR.

@tinaselenge
Contributor Author

Yes, they are definitely related, as they failed locally for me as well. I fixed the OAuth-related failures but am still trying to fix some failures in ListenersST, which tests listeners with custom certificates. I will update the PR once I have it passing locally.

@katheris
Member

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj scholzj modified the milestones: 0.47.0, 0.48.0 Jul 10, 2025
@tinaselenge tinaselenge force-pushed the use-pem-kafka branch 2 times, most recently from bfede7c to 8ae072a (July 23, 2025 14:28)
Comment thread CHANGELOG.md Outdated
*/
@SuppressWarnings("deprecation") // OAuth authentication is deprecated
private void configureAuthentication(String listenerName, List<String> securityProtocol, boolean tls, KafkaListenerAuthentication auth) {
private void configureAuthentication(String listenerName, List<String> securityProtocol, boolean tls, KafkaListenerAuthentication auth, String clusterName) {
Member


Did you consider storing the cluster name at the object level, given that it seems to be needed all over the place now? What happens if the Secret is deleted, or if the fields inside it are renamed, and the broker Pod restarts (not through the operator but for some other reason)?

Contributor Author


Did you consider storing the cluster name at the object level, given that it seems to be needed all over the place now?

I did, but the cluster name seems to be passed into most of the with* methods, so making it an object-level field would mean refactoring all of those methods. Maybe that should be done in a separate PR?

What happens if the Secret is deleted, or if the fields inside it are renamed, and the broker Pod restarts (not through the operator but for some other reason)?

Pods will restart, but brokers would fail to authenticate clients, I guess? Don't we have a similar risk today though? We generate p12 files based on the volume-mounted secrets with the specific fields. If the broker pod restarts but the secret does not exist, the pod would not start; or if the *.crt field does not exist, it would not find the volume-mounted file to generate the p12 files.

Member


Pods will restart, but brokers would fail to authenticate clients, I guess? Don't we have a similar risk today though? We generate p12 files based on the volume-mounted secrets with the specific fields. If the broker pod restarts but the secret does not exist, the pod would not start; or if the *.crt field does not exist, it would not find the volume-mounted file to generate the p12 files.

Does it fail the clients? Or does it make the brokers crashloop because the initialization fails? I think those are two different outcomes.

You are right that today the broker would end up pending, I guess. But that does not mean we cannot improve on it.

Member


Also, I think I added the second part to the wrong comment; it should probably have been added to the one about copying the custom server certificates.

Contributor Author


I think the broker will go into a crashloop because it fails to initialise, since the Kubernetes config provider will run and fail to fetch the custom cert secret, or its field, if they are missing. So do we think copying the custom cert secrets into our internal secret would help us in case the secret is deleted or its field has changed?

If we copy them into the existing internal per-broker secret, I wonder how we should reconcile it. We would append the key and cert with their original field names, as some listeners might still use the internal per-broker cert. If the field has changed, do we keep appending the new one and then remove the old field at some point?

I do agree that we should improve on it, but as the PR is already quite big, I wonder if we should tackle it in a separate PR with more discussion, unless people think this is a blocker for this PR.

Member


You would need to use a separate field for it, but I'm not sure the CA reconciliation wouldn't wipe it out. Maybe the easiest thing would be to keep it as is and open a new issue for this? We can probably get back to it later and think about how best to fix it. And it would not block this PR any further.

Contributor Author


Opened an issue for this here #12000.

Member

@scholzj scholzj left a comment


LGTM. Thanks.

@scholzj
Member

scholzj commented Oct 6, 2025

/gha run pipeline=upgrade,regression

@github-actions

github-actions Bot commented Oct 6, 2025

⏳ System test verification started: link

The following 10 job(s) will be executed:

  • regression-brokers-and-security-amd64 (oracle-vm-8cpu-32gb-x86-64)
  • regression-operators-amd64 (oracle-vm-8cpu-32gb-x86-64)
  • regression-operands-amd64 (oracle-vm-8cpu-32gb-x86-64)
  • regression-brokers-and-security-arm64 (oracle-vm-8cpu-32gb-arm64)
  • regression-operators-arm64 (oracle-vm-8cpu-32gb-arm64)
  • regression-operands-arm64 (oracle-vm-8cpu-32gb-arm64)
  • upgrade-azp_kraft_upgrade-amd64 (oracle-vm-4cpu-16gb-x86-64)
  • upgrade-azp_kafka_upgrade-amd64 (oracle-vm-4cpu-16gb-x86-64)
  • upgrade-azp_kraft_upgrade-arm64 (oracle-vm-4cpu-16gb-arm64)
  • upgrade-azp_kafka_upgrade-arm64 (oracle-vm-4cpu-16gb-arm64)

Tests will start after successful build completion.

@scholzj
Member

scholzj commented Oct 6, 2025

@strimzi/system-test-contributors Any chance you can run STs for this on some FIPS cluster?

@github-actions

github-actions Bot commented Oct 6, 2025

❌ System test verification failed: link

@tinaselenge
Contributor Author

Looks like the system tests failed due to a flaky test that is unrelated to this PR; the flaky test was fixed by #11986. Should we kick off the tests again?

@scholzj
Member

scholzj commented Oct 7, 2025

Looks like the system tests failed due to a flaky test that is unrelated to this PR; the flaky test was fixed by #11986. Should we kick off the tests again?

I'm not sure we need to re-run them as we know the failure is unrelated. Let's see if @ppatierno has any more comments. Maybe we can re-run them afterwards.

Member

@ppatierno ppatierno left a comment


LGTM.

@ppatierno
Member

/gha run pipeline=regression

@github-actions

github-actions Bot commented Oct 8, 2025

⏳ System test verification started: link

The following 6 job(s) will be executed:

  • regression-brokers-and-security-amd64 (oracle-vm-8cpu-32gb-x86-64)
  • regression-operators-amd64 (oracle-vm-8cpu-32gb-x86-64)
  • regression-operands-amd64 (oracle-vm-8cpu-32gb-x86-64)
  • regression-brokers-and-security-arm64 (oracle-vm-8cpu-32gb-arm64)
  • regression-operators-arm64 (oracle-vm-8cpu-32gb-arm64)
  • regression-operands-arm64 (oracle-vm-8cpu-32gb-arm64)

Tests will start after successful build completion.

@ppatierno
Member

@tinaselenge I re-ran the regression pipeline but just noticed there is a conflict to resolve in the CHANGELOG. Of course, it won't have an impact on the test results.

Comment thread CHANGELOG.md
@github-actions

github-actions Bot commented Oct 8, 2025

❌ System test verification failed: link

Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
Signed-off-by: Gantigmaa Selenge <tina.selenge@gmail.com>
@tinaselenge
Contributor Author

Not sure why an ST failed, but when running it locally, it passes. Can we please kick off the tests again? I only rebased and updated the CHANGELOG.md since the last successful ST run.

@scholzj
Member

scholzj commented Oct 8, 2025

I think it failed because the PR was not rebased when Paolo started it 🙄.

@ppatierno
Member

/gha run pipeline=regression

@github-actions

github-actions Bot commented Oct 8, 2025

⏳ System test verification started: link

The following 6 job(s) will be executed:

  • regression-brokers-and-security-amd64 (oracle-vm-8cpu-32gb-x86-64)
  • regression-operators-amd64 (oracle-vm-8cpu-32gb-x86-64)
  • regression-operands-amd64 (oracle-vm-8cpu-32gb-x86-64)
  • regression-brokers-and-security-arm64 (oracle-vm-8cpu-32gb-arm64)
  • regression-operators-arm64 (oracle-vm-8cpu-32gb-arm64)
  • regression-operands-arm64 (oracle-vm-8cpu-32gb-arm64)

Tests will start after successful build completion.

@github-actions

github-actions Bot commented Oct 8, 2025

❌ System test verification failed: link

@github-actions

github-actions Bot commented Oct 9, 2025

🎉 System test verification passed: link

@scholzj scholzj merged commit 6c17f27 into strimzi:main Oct 9, 2025
29 checks passed
@tinaselenge tinaselenge deleted the use-pem-kafka branch October 9, 2025 07:55
@scholzj scholzj added this to Roadmap Oct 12, 2025
@scholzj scholzj moved this to 0.49.0 (Work in Progress) in Roadmap Oct 12, 2025
Comment thread CHANGELOG.md
If you want to deploy and run the Heartbeat connector, you can use separate `KafkaConnect` and `KafkaConnector` custom resources.
* The `.spec.build.output.additionalKanikoOptions` field in the `KafkaConnect` custom resource is deprecated and will be removed in the future.
* Use `.spec.build.output.additionalBuildOptions` field instead.
* Kafka nodes are now configured with PEM certificates instead of P12/JKS for keystore and truststore.
Contributor


Just a nitpicking bit of feedback. We recently upgraded to 0.49.1, and one of the brokers got stuck crashlooping with errors about corrupted PEM files. It turned out we were still using PKCS#1 keys, while Kafka's native PEM support only accepts PKCS#8.

I'm not sure whether it's worth a note in the release notes or upgrade tips, but I wanted to share my experience and findings here in case it matters, or in case anyone else gets trapped by PKCS#1.
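For anyone hitting the same trap: a PKCS#1 ("traditional" OpenSSL) key announces itself with a `BEGIN RSA PRIVATE KEY` header, while the PKCS#8 encoding Kafka's PEM support expects starts with `BEGIN PRIVATE KEY`. A minimal shell sketch of detecting and converting such a key; the file names are placeholders for the real listener key material, not anything Strimzi generates:

```shell
# Generate a demo RSA key, then ensure it is PKCS#8-encoded.
# (OpenSSL 1.1.1 emits PKCS#1 by default; OpenSSL 3 already emits PKCS#8.)
cd "$(mktemp -d)"
openssl genrsa -out key.pem 2048 2>/dev/null

if grep -q "BEGIN RSA PRIVATE KEY" key.pem; then
  # PKCS#1 key: re-encode as unencrypted PKCS#8
  openssl pkcs8 -topk8 -nocrypt -in key.pem -out key-pkcs8.pem
else
  cp key.pem key-pkcs8.pem   # already PKCS#8, keep as-is
fi

head -1 key-pkcs8.pem   # a PKCS#8 key starts with "-----BEGIN PRIVATE KEY-----"
```

Checking the header before feeding a custom key to a PEM-configured listener is a cheap way to avoid the crashloop described above.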
