
KubeCPUOvercommit doesn't take node pools #481

Open
Duologic opened this issue Aug 5, 2020 · 6 comments

@Duologic
Contributor

Duologic commented Aug 5, 2020

{
  alert: 'KubeCPUOvercommit',
  expr: |||
    sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{%(ignoringOverprovisionedWorkloadSelector)s})
      /
    sum(kube_node_status_allocatable_cpu_cores)
      >
    (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)
  ||| % $._config,
  labels: {
    severity: 'warning',
  },
  annotations: {
    message: 'Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.',
  },
  'for': '5m',
},

The KubeCPUOvercommit alert doesn't take node pools and tolerations into account, and it might even be a stretch to cover that. Does anyone have thoughts on this?
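For illustration only, a pool-aware variant could group both sides of the comparison by a node-pool label joined in from kube-state-metrics' kube_node_labels. This is just a sketch, not a proposal for the mixin: label_pool is a hypothetical, provider-specific label name, and it assumes the raw kube_pod_container_resource_requests_cpu_cores metric carries a node label (the namespace:...:sum recording rule used above does not):

  sum by (label_pool) (
      kube_pod_container_resource_requests_cpu_cores
    * on(node) group_left(label_pool) kube_node_labels
  )
    /
  sum by (label_pool) (
      kube_node_status_allocatable_cpu_cores
    * on(node) group_left(label_pool) kube_node_labels
  )
    >
  (count by (label_pool) (kube_node_labels) - 1)
    /
  count by (label_pool) (kube_node_labels)

Whether such a label exists at all depends on the provider, which is part of why this is hard to cover generically.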

@brancz
Member

brancz commented Aug 5, 2020

I definitely love where you're going.

I have a feeling that only an effort like kubernetes/enhancements#1916 could potentially go in this direction. Unfortunately, node pools are not well defined. I actually asked around on sig-node yesterday about whether there might be possibilities to standardize this, but even if that were to lead to something, it's at a very early stage.

I think this would make sense for node pools where you are aware of the scheduling constraints. I don't think going as far as tolerations is really reasonable or possible, though, as we would essentially be reimplementing the scheduler.

@Duologic
Contributor Author

Duologic commented Aug 5, 2020

I think this would make sense for node pools where you are aware of the scheduling constraints. I don't think going as far as tolerations is really reasonable or possible, though, as we would essentially be reimplementing the scheduler.

Agreed; it's probably sufficient to link a pod to a node pool through a recording rule.
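A minimal sketch of what that recording rule could look like, assuming the pool is exposed as a (hypothetical) label_pool node label on kube_node_labels; the record name is made up as well:

  {
    // hypothetical rule name; joins each pod to its node's pool label
    record: 'pod:kube_pod_info:node_pool',
    expr: |||
      kube_pod_info * on(node) group_left(label_pool) kube_node_labels
    |||,
  },

The alert expression could then aggregate requests per label_pool using this intermediate series instead of repeating the join.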

@nhuray

nhuray commented Aug 8, 2020

Hi guys,

Maybe my question is not directly related, but I don't understand the expression in this alerting rule.

I have a very small cluster with just one node, and that rule is always firing because (count(kube_node_status_allocatable_cpu_cores)-1) returns 0.

For example:

Let's say my containers request a total of 2 CPUs on a single 4-CPU node. The alerting rule gives:

sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{%(ignoringOverprovisionedWorkloadSelector)s})
  /
sum(kube_node_status_allocatable_cpu_cores)                                                         = 0.5
  >
(count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)   = 0

but in reality I have only committed 50% of the CPU resources of the entire cluster,

so I think the expression should be:

sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{%(ignoringOverprovisionedWorkloadSelector)s})
  /
sum(kube_node_status_allocatable_cpu_cores)
  > 1

Am I misunderstanding something here?

@Duologic
Contributor Author

Duologic commented Aug 8, 2020

The alert message says:

... and cannot tolerate node failure

So this alert is not applicable if you don't intend to run more than one node.
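To make the intent concrete (numbers purely illustrative): with 3 nodes of 4 allocatable cores each, the threshold is (3-1)/3 ≈ 0.67, so the alert fires once total requests exceed 8 of the 12 cores, i.e. more than would fit on the 2 nodes that remain after losing one. With a single node the threshold is (1-1)/1 = 0, so any non-zero requests trip the alert.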

@nhuray

nhuray commented Aug 9, 2020

I get your point, but then the message should be:

Cluster has overcommitted CPU resource requests for Pods or cannot tolerate node failure.


github-actions bot commented Nov 5, 2024

This issue has not had any activity in the past 30 days, so the stale label has been added to it.

  • The stale label will be removed if there is new activity
  • The issue will be closed in 7 days if there is no new activity
  • Add the keepalive label to exempt this issue from the stale check action

Thank you for your contributions!

github-actions bot added the stale label Nov 5, 2024
@skl self-assigned this Nov 5, 2024
github-actions bot removed the stale label Nov 6, 2024