
Functions being created w/ default TerminationGracePeriodSeconds #853

Closed · 1 of 3 tasks
kevin-lindsay-1 opened this issue Oct 14, 2021 · 15 comments · Fixed by #869

Comments

@kevin-lindsay-1
Contributor

kevin-lindsay-1 commented Oct 14, 2021

When using the function CRD, pods that are created have the default terminationGracePeriodSeconds of 30. This means that if I have a function that needs to run for longer than 30 seconds, the pod may be forcefully killed before being allowed to complete its work.

I do not know if this behavior is exclusive to functions created with the Function CRD.

Expected Behaviour

When a function pod is terminating, it should be gracefully shut down and allowed to exit after the exec_timeout.

Current Behaviour

Deployments generated by the Function CRD are created with the default TerminationGracePeriodSeconds (30).

This causes pods with timeouts > 30s to be SIGKILLed while they're still trying to gracefully exit.

Are you a GitHub Sponsor (Yes/No?)

Check at: https://github.com/sponsors/openfaas

List All Possible Solutions and Workarounds

Explicitly make this configurable, or compute it from one of the environment variables inside the YAML, such as exec_timeout.

Then actually use that value, because I don't see any reference to TerminationGracePeriodSeconds when searching across this repo.

Which Solution Do You Recommend?

There are already a number of timeout config variables to keep track of; this might as well be another one.

If there's a good way to combine all the existing timeouts, that'd be great, but otherwise just being able to manually specify the termination grace period seconds would be fine.
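
A minimal sketch of the computed approach, assuming a helper that reads a timeout such as exec_timeout; the helper name and placement are illustrative, not existing faas-netes code:

package main

import (
	"fmt"
	"time"
)

// gracePeriodFromTimeout is a hypothetical helper: it turns a timeout value
// such as exec_timeout ("2m") into a terminationGracePeriodSeconds value,
// falling back to the Kubernetes default when nothing usable is set.
func gracePeriodFromTimeout(raw string, fallback int64) int64 {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return fallback
	}
	// Round up to whole seconds so the grace period never undercuts the timeout.
	return int64((d + time.Second - 1) / time.Second)
}

func main() {
	fmt.Println(gracePeriodFromTimeout("2m", 30)) // 120
	fmt.Println(gracePeriodFromTimeout("", 30))   // 30
}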

Steps to Reproduce (for bugs)

  1. Create a function using the Function CRD
  2. Get the deployment's YAML
  3. Look at the terminationGracePeriodSeconds

Alternatively:

  1. Create a sleep function with a duration of 2m (see the sleep handler sketch below). Invoke it once.
  2. Delete the pod after the start of execution.
  3. You should not see a result returned by that pod, because it receives a SIGKILL from Kubernetes 30s after entering the terminating state.
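
For reference, a minimal sketch of such a sleep handler, assuming the golang-http template from the docs (the handler layout and import follow that template's conventions):

package function

import (
	"fmt"
	"net/http"
	"time"

	handler "github.com/openfaas/templates-sdk/go-http"
)

// Handle sleeps for the duration given in the request body (e.g. "2m"),
// simulating a long-running invocation.
func Handle(req handler.Request) (handler.Response, error) {
	d, err := time.ParseDuration(string(req.Body))
	if err != nil {
		return handler.Response{
			Body:       []byte("invalid duration: " + err.Error()),
			StatusCode: http.StatusBadRequest,
		}, err
	}

	time.Sleep(d)

	return handler.Response{
		Body:       []byte(fmt.Sprintf("slept for %s", d)),
		StatusCode: http.StatusOK,
	}, nil
}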

Context

Pods are being forcefully killed before finishing their work, which effectively prevents autoscaling from being enabled, because there is no guarantee that a pod, once created, will have time to do its job.

Your Environment

  • FaaS-CLI version ( Full output from: faas-cli version ):
    0.13.13

  • Docker version (docker version, e.g. Docker 17.05):
    20.10.8

  • What version and distribution of Kubernetes are you using? kubectl version
    server v1.21.3
    client v1.22.2

  • Operating System and version (e.g. Linux, Windows, MacOS):
    MacOS

  • Link to your project or a code example to reproduce issue:

  • What network driver are you using and what CIDR? i.e. Weave net / Flannel

@kevin-lindsay-1 kevin-lindsay-1 changed the title Function CRD creating deployment with default TerminationGracePeriodSeconds Function CRD creates deployment w/ default TerminationGracePeriodSeconds Oct 22, 2021
@kevin-lindsay-1 kevin-lindsay-1 changed the title Function CRD creates deployment w/ default TerminationGracePeriodSeconds Functions being created w/ default TerminationGracePeriodSeconds Oct 22, 2021
@alexellis
Member

Hi @kevin-lindsay-1

Please can you provide an actual example of commands we can run here?

Let's say the Go sleep function from the docs for instance?

Alex

@alexellis
Member

Can you demonstrate with logs and output of how your function gets scaled down and doesn't complete? I.e. with the above sleep function.

@alexellis
Member

/add label: support,question

@kevin-lindsay-1
Contributor Author

I'll see if I can put together a quick repro repo

@kevin-lindsay-1
Contributor Author

Also, while digging a little it looks like this was brought up previously. #637

@kevin-lindsay-1
Contributor Author

kevin-lindsay-1 commented Oct 22, 2021

@alexellis https://github.com/kevin-lindsay-1/openfaas-repro-gracePeriod

Please let me know if you have any issues starting anything or reproducing.

@alexellis
Member

These steps do not show an error or a problem affecting you; they just describe a configuration.

Create a function using the Function CRD
Get the deployment's YAML
Look at the terminationGracePeriodSeconds

You also mention:

Delete the pod after the start of execution.

This is not what happens in OpenFaaS; we need to use the system's own logic, i.e. scaling to fewer replicas.

kubectl scale deploy/function -n openfaas-fn --replicas=0

We need to see a specific error, i.e. console output, as you would experience it in production after running the scale-down command.

It would help if you provided: a sample function, a command for invoking it, a command to scale down to fewer replicas, and output showing that the function failed to complete.

@kevin-lindsay-1
Contributor Author

kevin-lindsay-1 commented Oct 27, 2021

This is not what happens in OpenFaaS, we need to use the specified logic of the system i.e. scaling to fewer replicas.

Ok, I have used said kubectl scale command on the provided sleep function in the repo, which produces the same result.

That said, I agree that I should have provided the exact command I used in order to delete the pod.

However, one small point where I disagree with that exact phrasing is the implication (and I'm not putting words in your mouth) that administrators need to use the commands that OF uses internally. I disagree because OF uses Kubernetes' API, which is designed to terminate pods gracefully in a standard way. That is to say, graceful termination without overrides via the Kubernetes API is just graceful termination, and the onus is on OF to be predictable when used in Kubernetes, and predictable to Kubernetes administrators, rather than the other way around. For example, if an administrator or a node drain forcefully kills a pod mid-execution, I believe OF should do what it can within reason to handle that case robustly, even if that situation is technically not supposed to happen during normal operation (that said, this issue isn't a bizarre case such as that).

Again, I should have provided the exact command I used, which probably would have prevented the request to use OF's internally-used k8s commands.

We need to see a specific error i.e. console output, experienced as you would in production after running the scale down command.

in shell A:

$ echo '2m' | faas invoke sleep
// waiting

in shell B:

$ kubectl scale deploy/sleep -n openfaas-fn --replicas=0

pod logs:

sleep 2021/10/27 18:23:50 Version: 0.8.4    SHA: bbd2e96214264d6b87cc97745ee9f604776dd80f
sleep 2021/10/27 18:23:50 Forking: ./handler, arguments: []
sleep 2021/10/27 18:23:50 Started logging: stderr from function.
sleep 2021/10/27 18:23:50 Started logging: stdout from function.
sleep 2021/10/27 18:23:50 Watchdog mode: http
sleep 2021/10/27 18:23:50 Timeouts: read: 5m0s, write: 5m0s hard: 5m0s.
sleep 2021/10/27 18:23:50 Listening on port: 8080
sleep 2021/10/27 18:23:50 Writing lock-file to: /tmp/.lock
sleep 2021/10/27 18:23:50 Metrics listening on port: 8081
sleep 2021/10/27 18:23:53 sleeping 2m0s
sleep 2021/10/27 18:24:03 SIGTERM received.. shutting down server in 5m0s
sleep 2021/10/27 18:24:03 Removing lock-file : /tmp/.lock
sleep 2021/10/27 18:24:03 [entrypoint[] SIGTERM received.. shutting down server in 5m0s
sleep 2021/10/27 18:24:08 Upstream HTTP request error: Post "http://127.0.0.1:8082/": context canceled
sleep 2021/10/27 18:24:08 Upstream HTTP killed due to exec_timeout: 5m0s

shell A:

$ echo '2m' | faas invoke sleep
// done waiting
// this is returned at 18:24:08
Server returned unexpected status code: 500 - Can't reach service for: sleep.

The pod is killed at 18:24:33.

Based on this, I think there might actually be two issues here:

  • context is canceled and the http request fails, even though the pod is still gracefully shutting down following a SIGTERM. I imagine that this is happening because of the scale to 0 specifically, which hopefully wouldn't be a problem during normal usage with the faas-idler.
    • I have noticed this error when scaling up from 0 as well, due to a pod not being ready for the first second or two of being online. Perhaps a future issue, or perhaps this is happening due to Istio. I'll hold off on this for now to see what's up.
  • SIGKILL is sent 30s after SIGTERM regardless of exec_timeout, and does not appear to be configurable.

@alexellis
Member

Having had a more detailed call with Kevin, we reproduced the problem using the tutorial from the OpenFaaS docs on extended timeouts.

  • 2 min limit set on every core component
  • 2 min limit set on the function
  • The workload was a 60 second wait

Scaling the pod down caused the watchdog to stop accepting connections, and print a message that it was going to wait for 2 minutes.

After 30 seconds, the watchdog was killed, short of its 2 minute grace period.

We then set a termination grace period on the Deployment of the function, and the watchdog was then able to complete its work. There was a side effect, where the watchdog then remained idle for the remainder of the grace period.

In another example, the grace period might be 1 hour but the function could complete within 1 minute; we'd then have a pod in a terminating status for 59 minutes.

The watchdog could be updated to track active invocations, and then be able to exit immediately after all work had been drained.

https://github.com/openfaas/of-watchdog/blob/master/main.go#L110
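
A minimal sketch of that idea (an in-flight counter plus a drain loop), assuming a plain net/http server rather than of-watchdog's actual internals:

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

var inflight int64 // number of invocations currently being served

// track counts requests so the shutdown path knows when work has drained.
func track(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt64(&inflight, 1)
		defer atomic.AddInt64(&inflight, -1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(10 * time.Second) // stand-in for slow function work
		w.Write([]byte("done"))
	})

	srv := &http.Server{Addr: ":8080", Handler: track(http.DefaultServeMux)}
	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	<-sig

	// Stop accepting new connections, then wait only as long as work remains,
	// bounded by a write_timeout-style grace window.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	go srv.Shutdown(ctx)

	for atomic.LoadInt64(&inflight) > 0 && ctx.Err() == nil {
		time.Sleep(time.Second)
	}
	log.Println("in-flight work drained, exiting")
}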

To Kevin's point on adding new configurations for timeouts, we have plenty already. The watchdog uses the write_timeout to choreograph its graceful shutdown sequence, so I would suggest we read this value and use it to set the termination grace period in the Pod spec in faas-netes also.

func makeDeploymentSpec(request types.FunctionDeployment, existingSecrets map[string]*apiv1.Secret, factory k8s.FunctionFactory) (*appsv1.Deployment, error) {
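
A sketch of how that could look, assuming the grace period is derived from the write_timeout environment variable; the helper name and its wiring into makeDeploymentSpec are illustrative, not the merged change:

package k8s

import (
	"time"

	appsv1 "k8s.io/api/apps/v1"
)

// setTerminationGracePeriod is a hypothetical helper that would be called
// from makeDeploymentSpec with the function's environment variables.
func setTerminationGracePeriod(deployment *appsv1.Deployment, envVars map[string]string) {
	grace := int64(30) // Kubernetes' default when nothing is specified

	if v, ok := envVars["write_timeout"]; ok {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			// Round up and add a second of headroom so the kubelet never
			// SIGKILLs the watchdog before its own shutdown sequence ends.
			grace = int64((d+time.Second-1)/time.Second) + 1
		}
	}

	deployment.Spec.Template.Spec.TerminationGracePeriodSeconds = &grace
}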

Kevin mentioned the PR from the Istio team: istio/istio#35059

@alexellis alexellis self-assigned this Nov 1, 2021
alexellis added a commit to openfaas/of-watchdog that referenced this issue Nov 2, 2021
Prior to this change the watchdog would attempt to wait for
the write_timeout value twice, even if there were no connections
in flight.

This adds a new metric for in-flight connections and removes
the explicit wait.

Testing with killall of-watchdog -s SIGTERM performs as
expected, and clients that already connected complete their
work before the termination.

This change is a prerequisite for a change in faas-netes
openfaas/faas-netes#853

Signed-off-by: Alex Ellis (OpenFaaS Ltd) <[email protected]>
@LucasRoesler
Member

I think that having the termination grace period match the write timeout makes a lot of sense, maybe with a small bump of 50 or 100ms so that there is no possibility of those competing with each other. But it seems reasonable that the longest a function could possibly do meaningful work is the write timeout.

@alexellis
Member

To clarify:

  1. This change means updating faas-netes to set a Termination Grace Period on pods
  2. The watchdog has been updated to finish its exit sequence quicker, if there are no remaining active connections

I've completed 2) which does not fix this issue, but makes it more tolerable if/when it's released.

@kevin-lindsay-1
Contributor Author

kevin-lindsay-1 commented Nov 3, 2021

I have tested this behavior using the golang-http template, which uses of-watchdog:0.9.0, and I have verified that the container indeed exits early when its healthcheck_interval is set.

I do notice, however, that when healthcheck_interval is not set, it defaults to the value of write_timeout, perhaps for the sake of backwards compatibility. This was unintuitive at first, because the pod was effectively waiting a full execution duration before even checking whether any executions were actually running.

In my test I used timeout values of 24h0m0s to make the effect overtly obvious, and updated terminationGracePeriodSeconds to match; the result was that the pod was effectively never going to exit early.

Maybe the healthcheck interval should default to something smaller, like 10s (or a percentage of write_timeout), instead of write_timeout itself? I propose this because my terminationGracePeriodSeconds would probably be a little over write_timeout, meaning the function would only "exit early" by a couple of seconds if you don't specify a healthcheck_interval manually.

In addition, it may be useful to perform one check outside of that tick interval, as soon as the shutdown process begins. This way, if you already have 0 active connections, the pod doesn't wait a tick, and instead immediately shuts down.
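
A small sketch of that shutdown loop (names are illustrative, not of-watchdog's API): check once immediately, then on each tick until drained or the deadline passes.

package main

import (
	"log"
	"sync/atomic"
	"time"
)

// waitForDrain returns as soon as there are no in-flight invocations,
// checking once up front and then on every tick, bounded by max.
func waitForDrain(inflight *int64, interval, max time.Duration) {
	deadline := time.Now().Add(max)

	if atomic.LoadInt64(inflight) == 0 {
		return // nothing running: no need to wait a full tick
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		if atomic.LoadInt64(inflight) == 0 || time.Now().After(deadline) {
			return
		}
	}
}

func main() {
	var inflight int64 = 2
	go func() {
		time.Sleep(3 * time.Second)
		atomic.StoreInt64(&inflight, 0) // pretend the last invocations finished
	}()

	waitForDrain(&inflight, time.Second, 30*time.Second)
	log.Println("drained, shutting down")
}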

@alexellis
Member

Thanks for sharing your thoughts Kevin.

Let me think about how this could be integrated into the implementation.

Did you also see the work drain and complete as expected?

@kevin-lindsay-1
Contributor Author

I just tested a pod being given 10 sleep requests at the same time via a simple "batch" function, each with a duration of (10 + 10*i)s, where i is the index of the invocation (so effectively a start of 10s, a step of 10s, and 10 iterations).

I invoked all of them in parallel with a healthcheck_interval of 1s; each invocation completed 10 seconds after the previous one, and after all 10 executions were finished the pod exited 0 within the healthcheck interval. All good here.

I also tested that batching those invocations twice also worked (first batch, scale 0, scale 1, wait for pod to be ready, second batch), in that none of the executions for the second batch were sent to the pod that was shutting down.

@kevin-lindsay-1
Contributor Author

kevin-lindsay-1 commented Nov 3, 2021

To follow up on my previous comment regarding the default healthcheck interval: I would argue that for most functions 1s or so would probably be fine. Assuming the pod only performs this internal healthcheck during the shutdown process, the resources freed up by terminating early would likely outweigh the resources used by the healthcheck itself, and that ratio of "freed on termination" to "used on healthcheck" would grow as the pod drains, because an individual pod becomes less "useful" as its number of concurrent executions approaches 0.

In other words, the healthcheck probably uses 1-10m of CPU for a brief period, whereas the overall pod probably requests 100m+, so it might be advantageous during shutdown to "try to die" as fast as possible.
