
KubeCPUOvercommit doesn't take node pools #481

Open
Duologic opened this issue Aug 5, 2020 · 6 comments

@Duologic
Contributor

Duologic commented Aug 5, 2020

{
  alert: 'KubeCPUOvercommit',
  expr: |||
    sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{%(ignoringOverprovisionedWorkloadSelector)s})
      /
    sum(kube_node_status_allocatable_cpu_cores)
      >
    (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)
  ||| % $._config,
  labels: {
    severity: 'warning',
  },
  annotations: {
    message: 'Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.',
  },
  'for': '5m',
},

The KubeCPUOvercommit alert doesn't take node pools and tolerations into account, and it might even be a stretch to cover that. Does anyone have thoughts on this?
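For illustration only, a pool-aware variant could group both sides of the comparison by a node-pool label joined in from kube-state-metrics' kube_node_labels. This is just a sketch, not a proposal for the mixin: label_pool is a hypothetical, provider-specific label name, and it assumes the raw kube_pod_container_resource_requests_cpu_cores metric carries a node label (the namespace:...:sum recording rule used above does not):

  sum by (label_pool) (
      kube_pod_container_resource_requests_cpu_cores
    * on(node) group_left(label_pool) kube_node_labels
  )
    /
  sum by (label_pool) (
      kube_node_status_allocatable_cpu_cores
    * on(node) group_left(label_pool) kube_node_labels
  )
    >
  (count by (label_pool) (kube_node_labels) - 1)
    /
  count by (label_pool) (kube_node_labels)

Whether such a label exists at all depends on the provider, which is part of why this is hard to cover generically.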

@brancz
Member

brancz commented Aug 5, 2020

I definitely love where you're going.

I have a feeling that only an effort like kubernetes/enhancements#1916 could potentially go in this direction. Unfortunately, node pools are not well defined. I actually asked around on sig-node yesterday about whether there might be possibilities to standardize this, but even if that were to lead to something, it's at a very early stage.

I think this would make sense for node pools where you are aware of the scheduling constraints. I don't think going as far as tolerations is really reasonable or possible, though, as we would essentially be reimplementing the scheduler.

@Duologic
Contributor Author

Duologic commented Aug 5, 2020

I think this would make sense for node pools where you are aware of the scheduling constraints. I don't think going as far as tolerations is really reasonable or possible, though, as we would essentially be reimplementing the scheduler.

Agreed; it's probably sufficient to link a pod to a node pool through a recording rule.
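A minimal sketch of what that recording rule could look like, assuming the pool is exposed as a (hypothetical) label_pool node label on kube_node_labels; the record name is made up as well:

  {
    // hypothetical rule name; joins each pod to its node's pool label
    record: 'pod:kube_pod_info:node_pool',
    expr: |||
      kube_pod_info * on(node) group_left(label_pool) kube_node_labels
    |||,
  },

The alert expression could then aggregate requests per label_pool using this intermediate series instead of repeating the join.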

@nhuray

nhuray commented Aug 8, 2020

Hi guys,

Maybe my question is not directly related, but I don't understand the expression in this alerting rule.

I have a very small cluster with just one node, and that rule is always firing because (count(kube_node_status_allocatable_cpu_cores)-1) returns 0.

For example:

Let's say my containers request a total of 2 CPUs on a single 4-CPU node. The alerting rule gives:

sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{%(ignoringOverprovisionedWorkloadSelector)s})
  /
sum(kube_node_status_allocatable_cpu_cores)                                                         = 0.5
  >
(count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)   = 0

but in reality I have only committed 50% of the CPU resources of the entire cluster,

so I think the expression should be:

sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{%(ignoringOverprovisionedWorkloadSelector)s})
  /
sum(kube_node_status_allocatable_cpu_cores)
  > 1

Am I misunderstanding something here?

@Duologic
Contributor Author

Duologic commented Aug 8, 2020

The alert message says:

... and cannot tolerate node failure

So this alert is not applicable if you don't intend to run more than one node.
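To make the intent concrete (numbers purely illustrative): with 3 nodes of 4 allocatable cores each, the threshold is (3-1)/3 ≈ 0.67, so the alert fires once total requests exceed 8 of the 12 cores, i.e. more than would fit on the 2 nodes that remain after losing one. With a single node the threshold is (1-1)/1 = 0, so any non-zero requests trip the alert.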

@nhuray

nhuray commented Aug 9, 2020

I get your point, but then the message should be:

Cluster has overcommitted CPU resource requests for Pods or cannot tolerate node failure.


github-actions bot commented Nov 5, 2024

This issue has not had any activity in the past 30 days, so the stale label has been added to it.

  • The stale label will be removed if there is new activity
  • The issue will be closed in 7 days if there is no new activity
  • Add the keepalive label to exempt this issue from the stale check action

Thank you for your contributions!

github-actions bot added the stale label Nov 5, 2024
@skl self-assigned this Nov 5, 2024
github-actions bot removed the stale label Nov 6, 2024