[AKS Templates] Fix Prometheus and Grafana on templates #204

Open
tmacam opened this issue Oct 3, 2023 · 1 comment

tmacam commented Oct 3, 2023

Steps to Reproduce the Problem

Install a new AKS cluster as described in README.md.

Expected Behavior

The managed Grafana instance should present a dashboard similar to the one in the current release longhaul cluster and to what is shown at https://docs.dapr.io/operations/observability/metrics/grafana/

Actual Behavior

  1. No default dashboard is installed.
  2. No Dapr Prometheus data source is installed in the managed Grafana by default.
  3. Installing the dashboards available at https://github.com/dapr/dapr/tree/master/grafana does not produce the expected result, as most of the metrics they depend on are not available in the managed Prometheus.

tmacam commented Oct 3, 2023

Regarding item 3 (missing Prometheus metrics), there seems to be a major difference between how Prometheus is configured out of the box (whether the Azure managed one or a fresh Helm setup) and how it is currently configured in the release clusters. This distinction is also encoded in the Grafana dashboards saved in dapr/dapr, which refer to metrics by names that only exist in the release longhaul Prometheus setup.

As an example, here is a diff between the Prometheus configuration found in a fresh Helm install and the one we have in release longhaul:

--- fresh-from-helm-prometheus.yaml	2023-10-01 14:33:45.782910959 -0700
+++ release-prometheus.yaml	2023-10-01 14:33:45.793744284 -0700
@@ -1,4 +1,4 @@
-issue6946-prometheus.yml
+release-prometheus.yml
 global:
   evaluation_interval: 1m
   scrape_interval: 1m
@@ -64,8 +64,7 @@
   tls_config:
     ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     insecure_skip_verify: true
-- honor_labels: true
-  job_name: kubernetes-service-endpoints
+- job_name: kubernetes-service-endpoints
   kubernetes_sd_configs:
   - role: endpoints
   relabel_configs:
@@ -73,10 +72,6 @@
     regex: true
     source_labels:
     - __meta_kubernetes_service_annotation_prometheus_io_scrape
-  - action: drop
-    regex: true
-    source_labels:
-    - __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
   - action: replace
     regex: (https?)
     source_labels:
@@ -88,7 +83,7 @@
     - __meta_kubernetes_service_annotation_prometheus_io_path
     target_label: __metrics_path__
   - action: replace
-    regex: (.+?)(?::\d+)?;(\d+)
+    regex: ([^:]+)(?::\d+)?;(\d+)
     replacement: $1:$2
     source_labels:
     - __address__
@@ -102,17 +97,16 @@
   - action: replace
     source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - action: replace
     source_labels:
     - __meta_kubernetes_service_name
-    target_label: service
+    target_label: kubernetes_name
   - action: replace
     source_labels:
     - __meta_kubernetes_pod_node_name
-    target_label: node
-- honor_labels: true
-  job_name: kubernetes-service-endpoints-slow
+    target_label: kubernetes_node
+- job_name: kubernetes-service-endpoints-slow
   kubernetes_sd_configs:
   - role: endpoints
   relabel_configs:
@@ -131,7 +125,7 @@
     - __meta_kubernetes_service_annotation_prometheus_io_path
     target_label: __metrics_path__
   - action: replace
-    regex: (.+?)(?::\d+)?;(\d+)
+    regex: ([^:]+)(?::\d+)?;(\d+)
     replacement: $1:$2
     source_labels:
     - __address__
@@ -145,15 +139,15 @@
   - action: replace
     source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - action: replace
     source_labels:
     - __meta_kubernetes_service_name
-    target_label: service
+    target_label: kubernetes_name
   - action: replace
     source_labels:
     - __meta_kubernetes_pod_node_name
-    target_label: node
+    target_label: kubernetes_node
   scrape_interval: 5m
   scrape_timeout: 30s
 - honor_labels: true
@@ -165,8 +159,7 @@
     regex: pushgateway
     source_labels:
     - __meta_kubernetes_service_annotation_prometheus_io_probe
-- honor_labels: true
-  job_name: kubernetes-services
+- job_name: kubernetes-services
   kubernetes_sd_configs:
   - role: service
   metrics_path: /probe
@@ -190,12 +183,11 @@
     regex: __meta_kubernetes_service_label_(.+)
   - source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - source_labels:
     - __meta_kubernetes_service_name
-    target_label: service
-- honor_labels: true
-  job_name: kubernetes-pods
+    target_label: kubernetes_name
+- job_name: kubernetes-pods
   kubernetes_sd_configs:
   - role: pod
   relabel_configs:
@@ -203,10 +195,6 @@
     regex: true
     source_labels:
     - __meta_kubernetes_pod_annotation_prometheus_io_scrape
-  - action: drop
-    regex: true
-    source_labels:
-    - __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
   - action: replace
     regex: (https?)
     source_labels:
@@ -218,18 +206,11 @@
     - __meta_kubernetes_pod_annotation_prometheus_io_path
     target_label: __metrics_path__
   - action: replace
-    regex: (\d+);(([A-Fa-f0-9]{1,4}::?){1,7}[A-Fa-f0-9]{1,4})
-    replacement: '[$2]:$1'
-    source_labels:
-    - __meta_kubernetes_pod_annotation_prometheus_io_port
-    - __meta_kubernetes_pod_ip
-    target_label: __address__
-  - action: replace
-    regex: (\d+);((([0-9]+?)(\.|$)){4})
-    replacement: $2:$1
+    regex: ([^:]+)(?::\d+)?;(\d+)
+    replacement: $1:$2
     source_labels:
+    - __address__
     - __meta_kubernetes_pod_annotation_prometheus_io_port
-    - __meta_kubernetes_pod_ip
     target_label: __address__
   - action: labelmap
     regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
@@ -239,21 +220,16 @@
   - action: replace
     source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - action: replace
     source_labels:
     - __meta_kubernetes_pod_name
-    target_label: pod
+    target_label: kubernetes_pod_name
   - action: drop
     regex: Pending|Succeeded|Failed|Completed
     source_labels:
     - __meta_kubernetes_pod_phase
-  - action: replace
-    source_labels:
-    - __meta_kubernetes_pod_node_name
-    target_label: node
-- honor_labels: true
-  job_name: kubernetes-pods-slow
+- job_name: kubernetes-pods-slow
   kubernetes_sd_configs:
   - role: pod
   relabel_configs:
@@ -272,18 +248,11 @@
     - __meta_kubernetes_pod_annotation_prometheus_io_path
     target_label: __metrics_path__
   - action: replace
-    regex: (\d+);(([A-Fa-f0-9]{1,4}::?){1,7}[A-Fa-f0-9]{1,4})
-    replacement: '[$2]:$1'
-    source_labels:
-    - __meta_kubernetes_pod_annotation_prometheus_io_port
-    - __meta_kubernetes_pod_ip
-    target_label: __address__
-  - action: replace
-    regex: (\d+);((([0-9]+?)(\.|$)){4})
-    replacement: $2:$1
+    regex: ([^:]+)(?::\d+)?;(\d+)
+    replacement: $1:$2
     source_labels:
+    - __address__
     - __meta_kubernetes_pod_annotation_prometheus_io_port
-    - __meta_kubernetes_pod_ip
     target_label: __address__
   - action: labelmap
     regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
@@ -293,19 +262,15 @@
   - action: replace
     source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - action: replace
     source_labels:
     - __meta_kubernetes_pod_name
-    target_label: pod
+    target_label: kubernetes_pod_name
   - action: drop
     regex: Pending|Succeeded|Failed|Completed
     source_labels:
     - __meta_kubernetes_pod_phase
-  - action: replace
-    source_labels:
-    - __meta_kubernetes_pod_node_name
-    target_label: node
   scrape_interval: 5m
   scrape_timeout: 30s
 alerting:
@@ -319,12 +284,15 @@
     - source_labels: [__meta_kubernetes_namespace]
       regex: dapr-monitoring
       action: keep
-    - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
-      regex: dapr-prom
+    - source_labels: [__meta_kubernetes_pod_label_app]
+      regex: prometheus
       action: keep
-    - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
+    - source_labels: [__meta_kubernetes_pod_label_component]
       regex: alertmanager
       action: keep
+    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
+      regex: .*
+      action: keep
     - source_labels: [__meta_kubernetes_pod_container_port_number]
       regex: "9093"
       action: keep
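
Besides the label renames, the diff also changes the `__address__` rewrite rule of the service-endpoint jobs: the fresh Helm config uses `(.+?)(?::\d+)?;(\d+)`, while the release config uses `([^:]+)(?::\d+)?;(\d+)`. The following is a minimal Python sketch of how the two patterns behave. It assumes that Prometheus joins the source labels with `;` and anchors the regex against the whole joined value (emulated here with `re.fullmatch`), and that a non-matching `replace` rule leaves the label unchanged. The function name `rewrite_address` and the sample addresses are made up for illustration.

```python
import re

FRESH_HELM = r"(.+?)(?::\d+)?;(\d+)"   # pattern in the fresh Helm config
RELEASE = r"([^:]+)(?::\d+)?;(\d+)"    # pattern in the release longhaul config


def rewrite_address(pattern: str, address: str, port: str) -> str:
    """Emulate a `replace` relabel rule whose replacement is `$1:$2`.

    Returns the rewritten address, or the original address when the
    pattern does not match the joined source labels.
    """
    joined = f"{address};{port}"  # source labels joined with ';'
    match = re.fullmatch(pattern, joined)
    if match is None:
        return address
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    # Hypothetical sample addresses; 9090 stands in for the annotated port.
    for addr in ("10.0.0.1:8080", "10.0.0.1", "[fd00::1]:8080"):
        print(
            addr,
            rewrite_address(FRESH_HELM, addr, "9090"),
            rewrite_address(RELEASE, addr, "9090"),
        )
```

Under these assumptions both patterns rewrite `10.0.0.1:8080` to `10.0.0.1:9090`, but only the fresh Helm pattern matches a bracketed IPv6 address such as `[fd00::1]:8080`, since `[^:]+` cannot consume the colons inside the brackets.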

tmacam added a commit to tmacam/dapr that referenced this issue Oct 31, 2023
The current Grafana dashboards do not work in a fresh cluster where
Prometheus and Grafana are installed using Helm following the Dapr docs
(see [1], [2]). They refer to metrics that are not available in
such an install.

In short, based on the bug report in dapr/test-infra#204, the proposed
fix can be summed up by:

```bash
sed -i \
    -e 's/\bkubernetes_name\b/service/g' \
    -e 's/\bkubernetes_namespace\b/namespace/g' \
    -e 's/\bkubernetes_node\b/node/g' \
    -e 's/\bkubernetes_pod_name\b/pod/g' \
    *.json
```
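
For readers without GNU `sed`, the same substitution can be sketched in Python. This is only an illustration of the word-boundary renames above, not the code used in the commit; the names `LABEL_RENAMES` and `rename_labels` and the sample query are hypothetical.

```python
import re

# Old label name (release longhaul config) -> new label name (fresh Helm
# config), taken from the sed expressions above.
LABEL_RENAMES = {
    "kubernetes_name": "service",
    "kubernetes_namespace": "namespace",
    "kubernetes_node": "node",
    "kubernetes_pod_name": "pod",
}

# A single alternation wrapped in word boundaries, so `kubernetes_name`
# is never replaced inside `kubernetes_namespace`.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in LABEL_RENAMES) + r")\b"
)


def rename_labels(text: str) -> str:
    """Replace every whole-word occurrence of an old label name."""
    return _PATTERN.sub(lambda m: LABEL_RENAMES[m.group(1)], text)


if __name__ == "__main__":
    # Hypothetical query string, as it might appear in a dashboard JSON file.
    query = 'sum(some_metric{kubernetes_namespace="$ns"}) by (kubernetes_pod_name)'
    print(rename_labels(query))
```

Applying `rename_labels` to the text of each dashboard `*.json` file would mirror the `sed -i` invocation.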

Additionally:

* Removes refresh rates smaller than 1 minute.
* Sets the default time range to the last 14 days.
* Sets default template values to match the longhaul clusters.
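
The first two adjustments can be sketched in Python as follows. This assumes the usual Grafana dashboard JSON layout, with the offered refresh rates under `timepicker.refresh_intervals` and the default range under `time`; the exact fields touched by the commit are not shown here, and the helper names `interval_seconds` and `apply_defaults` are hypothetical.

```python
import copy
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def interval_seconds(interval: str):
    """Convert an interval such as '30s' or '5m' to seconds.

    Returns None for values that do not look like <number><unit>.
    """
    match = re.fullmatch(r"(\d+)([smhd])", interval)
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def apply_defaults(dashboard: dict) -> dict:
    """Return a copy without sub-minute refresh rates and with a 14-day range."""
    result = copy.deepcopy(dashboard)
    picker = result.setdefault("timepicker", {})
    kept = []
    for interval in picker.get("refresh_intervals", []):
        seconds = interval_seconds(interval)
        # Keep unparseable entries untouched; drop anything under 1 minute.
        if seconds is None or seconds >= 60:
            kept.append(interval)
    picker["refresh_intervals"] = kept
    result["time"] = {"from": "now-14d", "to": "now"}
    return result
```

The function returns a modified copy, so the original dashboard dictionary can still be compared against the result before anything is written back to disk.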

Fixes dapr#7120

[1]: https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes
[2]: https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes

Signed-off-by: Tiago Alves Macambira <[email protected]>
tmacam added a commit to tmacam/dapr that referenced this issue Nov 3, 2023
mukundansundar pushed a commit to dapr/dapr that referenced this issue Nov 4, 2023
* Fix Grafana dashboards.

The current Grafana dashboards do not work in a fresh cluster where
Prometheus and Grafana are installed using Helm following the Dapr docs
(see [1], [2]). They refer to metrics that are not available in
such an install.

In short, based on the bug report in dapr/test-infra#204, the proposed
fix can be summed up by:

```bash
sed -i \
    -e 's/\bkubernetes_name\b/service/g' \
    -e 's/\bkubernetes_namespace\b/namespace/g' \
    -e 's/\bkubernetes_node\b/node/g' \
    -e 's/\bkubernetes_pod_name\b/pod/g' \
    *.json
```

Additionally:

* Removes refresh rates smaller than 1 minute.
* Sets the default time range to the last 14 days.
* Sets default template values to match the longhaul clusters.

Fixes #7120

[1]: https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes
[2]: https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes

Signed-off-by: Tiago Alves Macambira <[email protected]>

* Remove longhaul related settings.

Signed-off-by: Tiago Alves Macambira <[email protected]>

---------

Signed-off-by: Tiago Alves Macambira <[email protected]>
cicoyle pushed a commit to cicoyle/dapr that referenced this issue May 24, 2024
Signed-off-by: Cassandra Coyle <[email protected]>