
module: add listener_max_connection_lifetime_ms option#5583

Open
dethi wants to merge 1 commit into emissary-ingress:master from aristanetworks:listener-max-conn-lifetime

Conversation

@dethi
Contributor

@dethi dethi commented Feb 26, 2024

Description

Add listener_max_connection_lifetime_ms, similar to how listener_idle_timeout_ms is implemented. This limits the maximum age of downstream connections by adding max_connection_duration to the listener's common_http_protocol_options.

This option could already be set on the clusters (upstream connections), but setting it on downstream connections could be useful in multiple scenarios:

  • to periodically rebalance load across multiple emissary-ingress instances, especially when most clients use HTTP/2/gRPC and connections stick around for a very long time.

  • when using mTLS (client certificates), to ensure that clients reconnect with their renewed certificates, as Envoy doesn't close the connection when the certificate expires.
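The mapping described above can be sketched as a small helper, mirroring how listener_idle_timeout_ms is handled. The function name and dict shapes here are illustrative assumptions, not the actual emissary-ingress source:

```python
# Illustrative sketch (not the real emissary-ingress code): build the
# listener's common_http_protocol_options from Module config, converting
# millisecond options into Envoy Duration strings (expressed in seconds).

def common_http_protocol_options(amod_config: dict) -> dict:
    opts = {}

    idle_ms = amod_config.get("listener_idle_timeout_ms")
    if idle_ms:
        opts["idle_timeout"] = "%0.3fs" % (idle_ms / 1000.0)

    # The new option from this PR: cap the total lifetime of a
    # downstream connection, not just its idle time.
    lifetime_ms = amod_config.get("listener_max_connection_lifetime_ms")
    if lifetime_ms:
        opts["max_connection_duration"] = "%0.3fs" % (lifetime_ms / 1000.0)

    return opts
```

When max_connection_duration elapses, Envoy drains the connection gracefully (sending GOAWAY on HTTP/2), which is what makes the rebalancing and certificate-renewal scenarios above work.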

Related Issues

Fix #2900

Testing

  • Added automated tests
  • Not yet deployed in a real cluster

Checklist

  • Does my change need to be backported to a previous release?

    • No, this is a new feature.
  • I made sure to update CHANGELOG.md.

    Remember, the CHANGELOG needs to mention:

    • Any new features
    • Any changes to our included version of Envoy
    • Any non-backward-compatible changes
    • Any deprecations
  • This is unlikely to impact how Ambassador performs at scale.

    Remember, things that might have an impact at scale include:

    • Any significant changes in memory use that might require adjusting the memory limits
    • Any significant changes in CPU use that might require adjusting the CPU limits
    • Anything that might change how many replicas users should use
    • Changes that impact data-plane latency/scalability
  • My change is adequately tested.

    Remember when considering testing:

    • Your change needs to be specifically covered by tests.
      • Tests need to cover all the states where your change is relevant: for example, if you add a behavior that can be enabled or disabled, you'll need tests that cover the enabled case and tests that cover the disabled case. It's not sufficient just to test with the behavior enabled.
    • You also need to make sure that the entire area being changed has adequate test coverage.
      • If existing tests don't actually cover the entire area being changed, add tests.
      • This applies even for aspects of the area that you're not changing – check the test coverage, and improve it if needed!
    • We should lean on the bulk of code being covered by unit tests, but...
    • ... an end-to-end test should cover the integration points
  • I updated DEVELOPING.md with any special dev tricks I had to use to work on this code efficiently.

  • The changes in this PR have been reviewed for security concerns and adherence to security best practices.

@dethi dethi force-pushed the listener-max-conn-lifetime branch from d102148 to fe369e6 Compare February 26, 2024 12:26
@dethi
Contributor Author

dethi commented Feb 26, 2024

Two questions:

  • any wording recommendation for the CHANGELOG?
  • how can I update the doc?

@rgs1

rgs1 commented Oct 8, 2024

@dethi any chance you could rebase this? It might be nice to support this on upstream connections too, since the same type of issues apply there as well (e.g.: stickiness, ensuring mTLS still works, etc). Thanks!

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 8, 2024
@rgs1

rgs1 commented Oct 9, 2024

@kflynn any chance we could get this one reviewed/merged? It solves a couple of issues we've seen in prod...

@dethi
Contributor Author

dethi commented Oct 9, 2024

The upstream option already exists today.

I can try to rebase, but I think I was also missing something in this change for the CRD side of things. My testing kind of stopped when there wasn't much maintainer activity on the PR.

But I can try to get back to it. It would be nice to have the CI tests running to see what's missing.

@rgs1

rgs1 commented Oct 9, 2024

The upstream option already exists today.

Oh nice, missed that (https://github.com/emissary-ingress/emissary/blob/master/python/ambassador/envoy/v3/v3cluster.py#L122 for others wondering where it is).
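For comparison, the cluster-side handling linked above applies the same milliseconds-to-Envoy-Duration conversion on upstream connections. The sketch below mirrors it in spirit only; the helper name is an assumption, not the actual v3cluster.py code:

```python
# Illustrative sketch of the existing upstream (cluster-side) option:
# the same conversion as the listener case, applied per cluster.

def cluster_max_connection_lifetime(amod_config: dict) -> dict:
    opts = {}
    lifetime_ms = amod_config.get("cluster_max_connection_lifetime_ms")
    if lifetime_ms:
        # Envoy Durations are expressed in seconds.
        opts["max_connection_duration"] = "%0.3fs" % (lifetime_ms / 1000.0)
    return opts
```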

I can try to rebase, but I think I was also missing something in this change for the CRD side of things. My testing kind of stopped when there wasn't much maintainer activity on the PR.

Yeah, maybe the commit that added upstream support is a good reference for what's needed on the CRD side:

69bdabf

But I can try to get back to it. It would be nice to have the CI tests running to see what's missing.

cc: @kflynn or @cindymullins-dw to kick CI 🙏

Add listener_max_connection_lifetime_ms, similar to how
listener_idle_timeout_ms is implemented. This limits the maximum age of
downstream connections by adding max_connection_duration to the
listener's common_http_protocol_options.

This option could already be set on the clusters (upstream connections),
but setting it on downstream connections could be useful in multiple
scenarios:

- to periodically rebalance load across multiple emissary-ingress
  instances, especially when most clients use HTTP/2/gRPC and
  connections stick around for a very long time.

- when using mTLS (client certificates), to ensure that clients
  reconnect with their renewed certificates, as Envoy doesn't close the
  connection when the certificate expires.

Fix emissary-ingress#2900

Signed-off-by: Thibault Deutsch <thibault@arista.com>
@dethi dethi force-pushed the listener-max-conn-lifetime branch from fe369e6 to e69102e Compare October 12, 2024 10:59
@dethi
Contributor Author

dethi commented Oct 12, 2024

Rebased the change on master. Thanks for the example implementation. I took some time to dig into my change again, and now I remember what confused me last time: listener_idle_timeout_ms, which I was using as inspiration (since it's in the same Envoy config section and also applies to listeners), doesn't show up at all in the CRD YAML. It's only present in the Python code.

Now, with the commit you shared, I took a look at the Go section, and from what I can see, the Module CRD accepts a config key that is completely untyped, which is why I couldn't find it last time. So based on this, I don't think I need to change anything else.
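Since the Module's config blob is untyped at the CRD level, the Python side has to validate whatever it finds. A minimal sketch of that defensive lookup (the helper name is hypothetical, not from the emissary codebase):

```python
from typing import Optional

def lookup_timeout_ms(config: dict, key: str) -> Optional[int]:
    """Fetch a millisecond timeout from the untyped Module config blob.

    Because the CRD doesn't type-check the `config` key, values may be
    missing or the wrong type; accept only positive integers.
    """
    value = config.get(key)
    if value is None:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(
            f"{key} must be a positive integer (milliseconds), got {value!r}"
        )
    return value
```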

I will get it running on one of our dev clusters to validate the config, and hopefully we can get the CI tests to run as well.


Labels

size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Upper bound downstream connection lifetime in ambassador

2 participants