-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AKS Templates] Fix Prometheus and Grafana on templates #204
Comments
Regarding item 3 (missing Prometheus metrics), seems there is a major difference in how Prometheus is configured out of the box (be it the Azure managed one or from a fresh Helm setup) and how it is configured right now in the release clusters. This distinction is also encoded in the grafana dashboards we saved in dapr/dapr, which refer to metrics by names that only exists in the release longaul prometheus setup. As an example, I am pasting a diff of what one would find in a helm-installed grafana and what we have in release longhaul: --- fresh-from-helm-prometheus.yaml 2023-10-01 14:33:45.782910959 -0700
+++ release-prometheus.yaml 2023-10-01 14:33:45.793744284 -0700
@@ -1,4 +1,4 @@
-issue6946-prometheus.yml
+release-prometheus.yml
global:
evaluation_interval: 1m
scrape_interval: 1m
@@ -64,8 +64,7 @@
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
-- honor_labels: true
- job_name: kubernetes-service-endpoints
+- job_name: kubernetes-service-endpoints
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
@@ -73,10 +72,6 @@
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape
- - action: drop
- regex: true
- source_labels:
- - __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
- action: replace
regex: (https?)
source_labels:
@@ -88,7 +83,7 @@
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
- regex: (.+?)(?::\d+)?;(\d+)
+ regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
@@ -102,17 +97,16 @@
- action: replace
source_labels:
- __meta_kubernetes_namespace
- target_label: namespace
+ target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_service_name
- target_label: service
+ target_label: kubernetes_name
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
- target_label: node
-- honor_labels: true
- job_name: kubernetes-service-endpoints-slow
+ target_label: kubernetes_node
+- job_name: kubernetes-service-endpoints-slow
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
@@ -131,7 +125,7 @@
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
- regex: (.+?)(?::\d+)?;(\d+)
+ regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
@@ -145,15 +139,15 @@
- action: replace
source_labels:
- __meta_kubernetes_namespace
- target_label: namespace
+ target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_service_name
- target_label: service
+ target_label: kubernetes_name
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
- target_label: node
+ target_label: kubernetes_node
scrape_interval: 5m
scrape_timeout: 30s
- honor_labels: true
@@ -165,8 +159,7 @@
regex: pushgateway
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_probe
-- honor_labels: true
- job_name: kubernetes-services
+- job_name: kubernetes-services
kubernetes_sd_configs:
- role: service
metrics_path: /probe
@@ -190,12 +183,11 @@
regex: __meta_kubernetes_service_label_(.+)
- source_labels:
- __meta_kubernetes_namespace
- target_label: namespace
+ target_label: kubernetes_namespace
- source_labels:
- __meta_kubernetes_service_name
- target_label: service
-- honor_labels: true
- job_name: kubernetes-pods
+ target_label: kubernetes_name
+- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
@@ -203,10 +195,6 @@
regex: true
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scrape
- - action: drop
- regex: true
- source_labels:
- - __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
- action: replace
regex: (https?)
source_labels:
@@ -218,18 +206,11 @@
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
- regex: (\d+);(([A-Fa-f0-9]{1,4}::?){1,7}[A-Fa-f0-9]{1,4})
- replacement: '[$2]:$1'
- source_labels:
- - __meta_kubernetes_pod_annotation_prometheus_io_port
- - __meta_kubernetes_pod_ip
- target_label: __address__
- - action: replace
- regex: (\d+);((([0-9]+?)(\.|$)){4})
- replacement: $2:$1
+ regex: ([^:]+)(?::\d+)?;(\d+)
+ replacement: $1:$2
source_labels:
+ - __address__
- __meta_kubernetes_pod_annotation_prometheus_io_port
- - __meta_kubernetes_pod_ip
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
@@ -239,21 +220,16 @@
- action: replace
source_labels:
- __meta_kubernetes_namespace
- target_label: namespace
+ target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
- target_label: pod
+ target_label: kubernetes_pod_name
- action: drop
regex: Pending|Succeeded|Failed|Completed
source_labels:
- __meta_kubernetes_pod_phase
- - action: replace
- source_labels:
- - __meta_kubernetes_pod_node_name
- target_label: node
-- honor_labels: true
- job_name: kubernetes-pods-slow
+- job_name: kubernetes-pods-slow
kubernetes_sd_configs:
- role: pod
relabel_configs:
@@ -272,18 +248,11 @@
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
- regex: (\d+);(([A-Fa-f0-9]{1,4}::?){1,7}[A-Fa-f0-9]{1,4})
- replacement: '[$2]:$1'
- source_labels:
- - __meta_kubernetes_pod_annotation_prometheus_io_port
- - __meta_kubernetes_pod_ip
- target_label: __address__
- - action: replace
- regex: (\d+);((([0-9]+?)(\.|$)){4})
- replacement: $2:$1
+ regex: ([^:]+)(?::\d+)?;(\d+)
+ replacement: $1:$2
source_labels:
+ - __address__
- __meta_kubernetes_pod_annotation_prometheus_io_port
- - __meta_kubernetes_pod_ip
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
@@ -293,19 +262,15 @@
- action: replace
source_labels:
- __meta_kubernetes_namespace
- target_label: namespace
+ target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
- target_label: pod
+ target_label: kubernetes_pod_name
- action: drop
regex: Pending|Succeeded|Failed|Completed
source_labels:
- __meta_kubernetes_pod_phase
- - action: replace
- source_labels:
- - __meta_kubernetes_pod_node_name
- target_label: node
scrape_interval: 5m
scrape_timeout: 30s
alerting:
@@ -319,12 +284,15 @@
- source_labels: [__meta_kubernetes_namespace]
regex: dapr-monitoring
action: keep
- - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
- regex: dapr-prom
+ - source_labels: [__meta_kubernetes_pod_label_app]
+ regex: prometheus
action: keep
- - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
+ - source_labels: [__meta_kubernetes_pod_label_component]
regex: alertmanager
action: keep
+ - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
+ regex: .*
+ action: keep
- source_labels: [__meta_kubernetes_pod_container_port_number]
regex: "9093"
action: keep
|
The current grafana dashboards do not work in a fresh cluster where prometheus and grafana are installed using helm following Dapr Docs (see [1], [2]). They refer to metrics that are not available in such install. In short, based on bug-report from dapr/test-infra#204, the proposed fix can be summed by: ```bash sed -i \ -e 's/\bkubernetes_name\b/service/g' \ -e 's/\bkubernetes_namespace\b/namespace/g' \ -e 's/\bkubernetes_node\b/node/g' \ -e 's/\bkubernetes_pod_name\b/pod/g' \ *.json ``` Additionally: * Removes refresh rates smaller than 1 minute. * Sets default interval range to 14 days in the past to now * Sets default template values to match the longhaul clusters. Fixes dapr#7120 [1]: https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes [2]: https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes Signed-off-by: Tiago Alves Macambira <[email protected]>
The current grafana dashboards do not work in a fresh cluster where prometheus and grafana are installed using helm following Dapr Docs (see [1], [2]). They refer to metrics that are not available in such install. In short, based on bug-report from dapr/test-infra#204, the proposed fix can be summed by: ```bash sed -i \ -e 's/\bkubernetes_name\b/service/g' \ -e 's/\bkubernetes_namespace\b/namespace/g' \ -e 's/\bkubernetes_node\b/node/g' \ -e 's/\bkubernetes_pod_name\b/pod/g' \ *.json ``` Additionally: * Removes refresh rates smaller than 1 minute. * Sets default interval range to 14 days in the past to now * Sets default template values to match the longhaul clusters. Fixes dapr#7120 [1]: https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes [2]: https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes Signed-off-by: Tiago Alves Macambira <[email protected]>
* Fix Grafana dashboards. The current grafana dashboards do not work in a fresh cluster where prometheus and grafana are installed using helm following Dapr Docs (see [1], [2]). They refer to metrics that are not available in such install. In short, based on bug-report from dapr/test-infra#204, the proposed fix can be summed by: ```bash sed -i \ -e 's/\bkubernetes_name\b/service/g' \ -e 's/\bkubernetes_namespace\b/namespace/g' \ -e 's/\bkubernetes_node\b/node/g' \ -e 's/\bkubernetes_pod_name\b/pod/g' \ *.json ``` Additionally: * Removes refresh rates smaller than 1 minute. * Sets default interval range to 14 days in the past to now * Sets default template values to match the longhaul clusters. Fixes #7120 [1]: https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes [2]: https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes Signed-off-by: Tiago Alves Macambira <[email protected]> * Remove longhaul related settings. Signed-off-by: Tiago Alves Macambira <[email protected]> --------- Signed-off-by: Tiago Alves Macambira <[email protected]>
* Fix Grafana dashboards. The current grafana dashboards do not work in a fresh cluster where prometheus and grafana are installed using helm following Dapr Docs (see [1], [2]). They refer to metrics that are not available in such install. In short, based on bug-report from dapr/test-infra#204, the proposed fix can be summed by: ```bash sed -i \ -e 's/\bkubernetes_name\b/service/g' \ -e 's/\bkubernetes_namespace\b/namespace/g' \ -e 's/\bkubernetes_node\b/node/g' \ -e 's/\bkubernetes_pod_name\b/pod/g' \ *.json ``` Additionally: * Removes refresh rates smaller than 1 minute. * Sets default interval range to 14 days in the past to now * Sets default template values to match the longhaul clusters. Fixes dapr#7120 [1]: https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes [2]: https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes Signed-off-by: Tiago Alves Macambira <[email protected]> * Remove longhaul related settings. Signed-off-by: Tiago Alves Macambira <[email protected]> --------- Signed-off-by: Tiago Alves Macambira <[email protected]> Signed-off-by: Cassandra Coyle <[email protected]>
Steps to Reproduce the Problem
Install a new AKS cluster as described by README.md
Expected Behavior
The managed grafana should present a dashboard similar to the one existing on the current release longhaul and similar to what is shown on https://docs.dapr.io/operations/observability/metrics/grafana/
Actual Behavior
Dapr
prometheus datasource installed in managed Grafana by defaultThe text was updated successfully, but these errors were encountered: