-
Notifications
You must be signed in to change notification settings - Fork 106
Description
Created on 2024-11-29 by Christoph Schwägerl (@c-pius).
Decision log
| Name | Description |
|---|---|
| Title | Aligned approach for Kyma controllers to revert additions to Kyma-managed resources |
| Due date | ? |
| Status | Re-proposed on 2025-01-08 (v0.0.2) |
| Decision type | Choice |
| Affected decisions | Extends #941 |
Changelog
v0.0.1- initial proposal
v0.0.2- clarified that SRE changes are also expected to be reverted unless
skip-reconcilation=true; this leads to a new challenge - tested that also changes applied via client-side apply are reverted
- added Option 4 to prevent unexpected changes via a ValidatingWebhook
- added Decision and Consequences for Option 2 - Reverting added fields with SSA
- clarified that SRE changes are also expected to be reverted unless
Context
Controllers within Kyma often use Server-Side Apply (SSA) for reconciling the resources they manage.
Reasons: shift load from the controller to the API server; see also benefits described by K8s
Recently, issues with using SSA have been observed when it comes to reverting resources that are managed by controller A, but where another actor B adds fields to the resource that controller A isn't aware of and that should be reverted by controller A. For example:
- lifecycle-manager reconciles a deployment
kind: Deployment spec: replicas: 1 template: spec: # ...
- another actor configures a volume for this deployment; lifecycle-manager isn't aware of this added volume (it is not part of the deployment specified in the module's raw manfiest)
kind: Deployment spec: replicas: 1 template: spec: # ... volumes: - configMap: defaultMode: 420 name: my-cfg name: config
- continued reconciles by lifecycle-manager do NOT remove this volume
Real examples where this has been problematic:
- Lifecycle-manager cannot patch modified resource correctly lifecycle-manager#1545 (lifecycle-manager)
- <internal-repo>/kyma/backlog/issues/6293 (istio-manager)
- Handling Module CRD Upgrade Regarding Conversion Webhook Setup #941 (lifecycle-manager + module managers)
The goal of this ADR is to align a uniform approach within Kyma that enables our controllers to revert undesired additions to Kyma-managed objects.
Expectations
The following cases summarize the expected scenarios where Kyma controllers are either expected to revert changes or to keep changes.
Expectation 1 - Revert resources to their Kyma-managed state
In general, it is expected that Kyma controllers revert additions to the resources they manage. This is also expressed by the disclaimer "DO NOT EDIT - This resource is managed by Kyma. Any modifications are discarded and the resource is reverted to the original state." present on resources managed by lifecycle-manager (currently not used uniformly). Any additions to those resources by the customer must be reverted. E.g.,
- resources from the module's raw manifest
- internal resources like SKR webhook-related ones
However, there are exceptions where Kyma provides the initial state of a resource, and it is expected that the customer extends this. In those cases, the changes must not be reverted. E.g.,
- KymaCR where the customer modifies the modules
- default CRs where the customer may change configuration
Expectation 2 - Don't revert additions by K8s
For some resources, K8s itself adds information to the resources. This is mostly limited to .metadata and .state. Those changes must not be reverted. E.g.,
kube-controller-managerextends.metadata.annotationsand.statusof a Deployment
Expectation 3 - Revert additions by SRE unless skip-reconcilation=true
SRE or other Kyma-internal actors may need to modify resources while troubleshooting. Those changes are also reverted unless the operator.kyma-project.io/skip-reconciliation=true label is present on the resource.
Expectation 4 - Don't revert modifications from other Kyma controllers
Some resources may be managed by multiple Kyma-internal controllers. It must be ensured that those don't end up in an endless "add-revert loop". E.g.,
- Namespace
kyma-systemis controlled by lifecycle-manager, but environment-broker adds labels - CustomResourceDefinitions (CRDs) are controlled by lifecycle-manager, but module managers may add a conversion webhook
Expectation 5 - Revert resources when managing a previously unmanaged module
When re-managing a previously unmanaged module (see feature request), the Kyma controller takes care of a resource initially provided by the customer. In this case, all extensions by the customer not part of the original config shall be reverted.
Key Assumptions
Key Assumption 1 - Managers don't give up ownership properly
When working with SSA, the clean approach would be to give up ownership. For fields that should remain to exist, this would be done by transferring ownership to another manager. For fields that were added and are not necessary anymore, this should be done by sending another SSA without the field included. Experience from the tickets linked in Context shows that this is not always the case. Also, this is not the case when customers edit the resources manually.
Option 1 - Recreating the object
The situation could be solved by re-creating the object. The problem is however that this may cause temporary disruptions until the object is re-created and all expected managers have added their patches. Therefore, this is considered as a non-option.
Option 2 - Reverting added fields with SSA
💡 It is recommended to first get a rough understanding of how SSA and especially the mechanism of "Field management" works before reading further. See Server-Side Apply in K8s docs.
In summary, SSA allows the caller to modify a resource by providing the subset of fields the caller cares for. The K8s runtime takes care of either creating a new resource if it doesn't exist yet, or patching an existing resource with the provided fields while maintaining the remaining state. To keep track of who is controlling what part of the single resource, K8s stores the caller's identity as "field manager" together with the fields this manager controls in .metadata.managedFields. If multiple managers try to modify the same field, this results in a conflict and the update is rejected. For controllers, it is recommended that they always force conflicts on objects they own meaning that the controller overwrites the other managers value.
Due to the way how SSA is designed to work, a controller A using SSA to reconcile a resource is not able to revert an addition by actor B that controller A is not aware of without further consideration. In the following, a possible way of enabling our controllers to revert such additions is outlined.
Approach - Detect unexpected managers, remove them, and take full ownership of the resource
This option consists of three steps that are executed consecutively, and a fourth step which is the regular reconcile of the resource:
- the controller detects that there are unexpected field managers present on the resource
- the controller removes the unexpected field managers from the resource
- the controller re-applies the full resource including the unknown fields to claim ownership of those fields
- the controller re-applies its desired state of the object without the unknown fields (regular reconcile); this removes those fields again as the controller previously claimed their ownership and the consecutive call without those fields is interpreted as desire to remove those fields
In pseudo code, this may look as follows:
func ssaObject(obj info.Object) error {
// 0. regularly reconcile the obj, the obj argument is the desired state as retrieved from the module's raw manifest
// after this call, obj is filled with the full state as currently in the cluster
clnt.Patch(ctx, obj, client.Apply, client.ForceOwnership, OWNER_NAME)
managedFields := obj.GetManagedFields()
needToTakeOwnership := false
for i := 0; i < len(managedFields); i++ {
// 1. detect unexpected field manager
if isUnexpectedFieldManager(managedFields[i].Manager) {
// 2. remove unexpected field manager from managedFields
patch := []byte(fmt.Sprintf(`[{"op": "remove", "path": "/metadata/managedFields/%d"}]`, i))
clnt.Patch(ctx, obj, client.RawPatch(types.JSONPatchType, patch))
needToTakeOwnership = true
}
}
// 3. if there was an unexpected field manager, take ownership by re-applying the full obj
// as we don't want to change fields, just claim ownership, we also can omit the client.ForceOwnership
if needToTakeOwnership {
clnt.Patch(ctx, obj, client.Apply, OWNER_NAME)
}
// 4. exit, the next regular recocnile will remove the unexpected fields
return nil
}In general, this approach works. It has been tested successfully with both, actor B using SSA and client-side apply. It does NOT work when controller A is using client-side-apply! However, it comes with the following challenges:
Challenge 1 - Detecting what field managers are unexpected
The controller needs to have information what field managers are expected per resource and which ones are not. Maintaining an allow-list per object appears too cumbersome.
As an alternative, it would be possible to create a convention that all controllers in Kyma MUST use a pattern for their field manager name like operator.kyma-project.io/<manager-name>. By this, all field managers not complying to this pattern could be filtered out as unexpected ones. Additionally, this would require SRE and any other Kyma-internal actor making changes to the object via kubectl to specify a field manager with the aligned prefix (kubectl flag --field-manager).
Also, the K8s internal managers like kube-controller-manager MUST be allow-listed.
Challenge 2 - Temporarily taking ownership from other expected managers
When re-sending the full object (step 3.), the controller also claims ownership of the fields owned by another expected controller. This should not be a problem since the controller gives up ownership again with the regular reconcile that doesn't include the other field managers fields (step 4.).
Challenge 3 - Automatic or on-demand action
It needs to be decided whether this reverting approach is executed automatically, or if it is executed on demand. On-demand would mean to actively trigger this approach to be executed. This could be done by adding an annotation to the resource. With this approach, it could also be decided whether the revert is executed on all unexpected field managers as described above, or if only a field manager named in the annotation value is removed.
Challenge 4 - Giving up ownership
If decided to follow an approach where we detect known field managers based on a name prefix, Kyma controller A does not revert additions from Kyma controller B. Consequently, it would not automatically solve the problem described in #941.
The proper approach for this case would be that the module-manager (controller B) still gives up ownership properly by sending an SSA with an empty object.
Option 3 - Not reverting added fields
Additions to the resource are not reverted. Instead, additions made by the user are ignored. If they lead to errors in the cluster, they are considered user errors.
When it comes to multiple Kyma controllers handling the same resource, the controllers must ensure that they properly give up ownership of patches they added when not needed anymore. I.e, they must send an SSA with an empty object. This will remove their additions to the object and the field manager entry.
Option 4 - Prevent updates with a validation webhook
Instead of reverting changes done by unkown actors, they could also be prevented in the first place.
The idea is to configure validating webhoooks that intercept all changes on kyma-managed resources.
If an unkown actor is trying to change the resource, the validating webhook should fail.
Challenge 1 - Detecting what resources to validate
It must be determined what resources to validate. This may be achieved by namespaceSelector and objectSelector (see ValidatingWebhookConfiguration) filtering all objects labelled with operator.kyma-project.io/managed-by: kyma.
In the webhook, we would need to be able to configure exceptions for KymaCR and DefaultCRs.
Challenge 2 - Detecting what actors are unexpected.
This could be solved in the same way as proposed in Option 2. AdmissionRequests contain an options field listing the FieldManager, e.g., see CreateOptions.
Challenge 3 - Handling the load
The validation webhook may need to handle a lot of load and should not add much latency to the change requests.
A central webhook in KCP managing all clusters may be facing too much load.
Instead, there should be one webhook per SKR where lifecycle manager ensures the webhook is present in the SKR.
Challenge 4 - Rotating the certificates
Webhooks must secure communication via certificates that need to be rotated.
This may be achieved by re-using the existing certificates for the SKR webhook that lifecycle manager ensures in the SKR.
Challenge 5 - Single Point of Failure
If this webhook breaks down for whatever reason, this would prevent all changes to Kyma-managed resources in the SKR.
A mitigation is to use failurePolicy: ignore. This will make the admission request succeed on unknown errors from the webhook. The downside is that this allows unexpected changes while the webhook is down or unavailable, e.g. when the cluster starts. Those changes are not reverted in the future but stick.
Decisions
It is decided to pursue Option 2 - Reverting added fields with SSA. Kyma controllers shall revert changes from unexpected managers via the approach outlined. Is is explicitly accepted that the approach for identifying expected managers by the naming pattern operator.kyma-project.io/<manager-name> is not providing genuine authenticity.
Major arguments against Option 4 - Prevent updates with a validation webhook are the added latency to each request, and the more native behavior of Option 2 (we reconcile to a desired state).
In a first step, lifecycle-manager will implement the detection of unexpected managers (PoC). It will not revert those changes but write observability data to get an understanding of the changes that would be reverted. Once verified that this matches the expectations, the approach will be rolled out to revert these changes.
Consequences
- Jellyfish (PoC):
- implement the detection of unexpected managers (first with an allow-list, later with the naming pattern)
- write observability data of what changes would be reverted
- Jellyfish (once approach is confirmed by PoC):
- rename their field-managers to follow the pattern
operator.kyma-project.io/<manager-name> - implement the reverting logic
- check whether the pattern could be implemented as a re-use library
- rename their field-managers to follow the pattern
- Kyma-teams with operators writing SKR resources (once approach is confirmed by PoC):
- rename their field-managers to follow the pattern
operator.kyma-project.io/<manager-name> - implement the reverting logic where applicable
- give up ownership properly where needed (see Option 2, Challenge 4)
- rename their field-managers to follow the pattern
Addendum
Notes
- client-side applying an existing object with a new field creates a field manager entry for this new field but DOES NOT take ownership of the existing field
- in contrast, SSA DOES take ownership of the existing fields