Observability
Introduction
The goal of observability is to understand what’s happening across all of your environments and technologies, so you can detect and resolve issues and keep your systems efficient and reliable.
Observability is a measure of how well the system’s internal states can be inferred from knowledge of its external outputs. It uses the data and insights that monitoring produces to provide a holistic understanding of your system, including its health and performance. The observability of your system, then, depends partly on how well your monitoring metrics reflect your system's performance indicators.
Prerequisites
To complete this section you will need:
- Helm installed as explained in the Installing required tools section.
- A Kubernetes cluster (DOKS) up and running as explained in the Set up DOKS section.
- The online boutique sample application deployed to your cluster as explained in the Deploying the app section.
- A DO Spaces bucket for Loki storage. Please follow the official DigitalOcean tutorial to create one. Make sure that it is set to restrict file listing for security reasons.
Installing the Prometheus Monitoring Stack
- Add the Helm repository and list the available charts (see the example commands at the end of this section):

- Install the `kube-prometheus-stack` chart, using Helm:

    ```shell
    HELM_CHART_VERSION="35.5.1"

    helm install kube-prom-stack prometheus-community/kube-prometheus-stack --version "${HELM_CHART_VERSION}" \
      --namespace monitoring \
      --create-namespace \
      -f "docs/04-production/assets/manifests/prom-stack-values-v${HELM_CHART_VERSION}.yaml"
    ```

    Note

    A specific version of the Helm chart is used. In this case `35.5.1` was picked, which maps to version `0.56.3` of the application. To check if the installation was successful, run the `helm ls -n monitoring` command and confirm the deployment status.
- Connect to Grafana (using default credentials: `admin/prom-operator`) by port forwarding to your local machine.

    Info

    You should NOT expose Grafana to the public network (e.g. via an ingress mapping or a load balancer service).

- Open a web browser and point to `localhost:3000`. You should see the Grafana login page.

    Info

    The Grafana installation comes with a number of dashboards. Once logged in, you can go to Dashboards -> Browse and choose different dashboards. As an example, you can open the General / Kubernetes / Compute Resources / Node (Pods) dashboard and view the resource metrics for a node and its related pods.
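The repository commands for the first step of the list above are not reproduced in this guide. A minimal sketch, assuming the upstream prometheus-community chart repository, looks like this:

```shell
# Add the upstream chart repository (assumed to be the public prometheus-community repo)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# Refresh the local chart index
helm repo update

# List the available charts in the repository
helm search repo prometheus-community
```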
Configuring Persistent Storage for Prometheus
In this section, you will learn how to enable persistent storage for Prometheus, so that metrics data is persisted across server restarts or in case of cluster failures. For the production environment, you will define a 10 Gi Persistent Volume Claim (PVC), using DigitalOcean Block Storage.
- Open the `docs/04-production/assets/manifests/prom-stack-values-v35.5.1.yaml` file provided and uncomment the `storageSpec` section. The definition should look like this:

    ```yaml
    prometheusSpec:
      replicas: 2
      retention: 20d
      storageSpec:
        volumeClaimTemplate:
          spec:
            storageClassName: do-block-storage
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 10Gi
    ```
Note

The default retention time for metrics is `10d` in the `kube-prometheus-stack` Helm chart. In production the retention time is set to `20d`. After 20 days the metrics will be deleted from the volume.
- Apply the new settings using Helm:

    ```shell
    HELM_CHART_VERSION="35.5.1"

    helm upgrade kube-prom-stack prometheus-community/kube-prometheus-stack --version "${HELM_CHART_VERSION}" \
      --namespace monitoring \
      -f "docs/04-production/assets/manifests/prom-stack-values-v${HELM_CHART_VERSION}.yaml"
    ```
Note

Check the PVC status by running `kubectl get pvc -n monitoring`. A new volume should appear in the Volumes web page of your DigitalOcean account panel.
Configuring Persistent Storage for Grafana
In this section, you will learn how to enable persistent storage for Grafana, so that its data is persisted across server restarts or in case of cluster failures. For the production environment, you will define a 10 Gi Persistent Volume Claim (PVC), using DigitalOcean Block Storage.
- Open the `docs/04-production/assets/manifests/prom-stack-values-v35.5.1.yaml` file provided and uncomment the Grafana storage section (a sketch of a typical definition is shown at the end of this section).

- Apply the new settings using Helm:
    ```shell
    HELM_CHART_VERSION="35.5.1"

    helm upgrade kube-prom-stack prometheus-community/kube-prometheus-stack --version "${HELM_CHART_VERSION}" \
      --namespace monitoring \
      -f "docs/04-production/assets/manifests/prom-stack-values-v${HELM_CHART_VERSION}.yaml"
    ```
Note

Check the PVC status by running `kubectl get pvc -n monitoring`. A new volume should appear in the Volumes web page of your DigitalOcean account panel.
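The exact values snippet for Grafana persistence is not reproduced above. As a rough reference only, the Grafana subchart bundled with `kube-prometheus-stack` exposes a `persistence` block; the storage class and size below are assumptions mirroring the rest of this guide, not the authoritative contents of the provided file:

```yaml
grafana:
  persistence:
    enabled: true
    storageClassName: do-block-storage   # DigitalOcean Block Storage class, as used for Prometheus above
    accessModes: ["ReadWriteOnce"]
    size: 10Gi                           # assumed from the 10 Gi PVC mentioned in this section
```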
Installing the Loki Stack
In this section you will learn about Loki, which is a log aggregation system inspired by Prometheus. Loki uses Promtail to fetch logs from all Pods running in your cluster. Then, logs are aggregated and compressed, and sent to the configured storage. Next, you can connect the Loki data source to Grafana and view the logs.
- Add the Grafana Helm repository and list the available charts (see the example commands at the end of this section):

- Install the Loki stack using Helm:

    ```shell
    HELM_CHART_VERSION="2.6.4"

    helm install loki grafana/loki-stack --version "${HELM_CHART_VERSION}" \
      --namespace=loki-stack \
      --create-namespace \
      -f "docs/04-production/assets/manifests/loki-stack-values-v${HELM_CHART_VERSION}.yaml"
    ```
Note

The above values file enables Loki and Promtail for you, so no other input is required. The Prometheus and Grafana installation is disabled, because the Installing the Prometheus Monitoring Stack section took care of it already. The `2.6.4` Helm chart version is picked for `loki-stack`, which maps to application version `2.4.2`. To check if the installation was successful, run the `helm ls -n loki-stack` command and confirm the deployment status.
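The repository commands for the first step of the list above are likewise not included here. A minimal sketch, assuming the upstream Grafana chart repository, looks like this:

```shell
# Add the upstream Grafana chart repository (assumed to be the public grafana repo)
helm repo add grafana https://grafana.github.io/helm-charts

# Refresh the local chart index
helm repo update

# List the available charts (loki-stack should be among them)
helm search repo grafana
```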
Configuring Grafana with Loki
In this section, you will add the Loki data source to Grafana. First, you need to expose the Grafana web interface on your local machine (default credentials: `admin/prom-operator`):
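The port-forward command itself is not shown in this guide. A minimal sketch, assuming the Grafana service created by the kube-prom-stack release is named `kube-prom-stack-grafana` and listens on port 80, looks like this:

```shell
# Forward the Grafana service to localhost:3000 (service name is an assumption based on the release name)
kubectl --namespace monitoring port-forward svc/kube-prom-stack-grafana 3000:80
```

If the service name differs in your cluster, you can look it up with `kubectl get svc -n monitoring`.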
Next, open a web browser on `localhost:3000` and follow the steps below:

- Click the Configuration gear from the left panel.
- Select Data sources.
- Click the blue Add data source button.
- Select Loki from the list and add the Loki URL `http://loki.loki-stack:3100`.
- Save and test.
Info

If everything goes well, a green label message will appear, saying Data source connected and labels found.

You can access logs from the Explore tab of Grafana. Make sure to select Loki as the data source. Use the Help button for a log search cheat sheet.

Info

As an example query, to retrieve all the logs for the `microservices-demo-prod` namespace you can run: `{namespace="microservices-demo-prod"}`.
Configuring Persistent Storage for Loki
In this step, you will learn how to enable persistent storage for Loki. You're going to use the DO Spaces bucket created in the Prerequisites section.
- Open the `docs/04-production/assets/manifests/loki-stack-values-v2.6.4.yaml` file provided and remove the comments surrounding the `schema_config` and `storage_config` keys. The definition should look like this:

    Click to expand the Loki config

    ```yaml
    loki:
      enabled: true
      config:
        schema_config:
          configs:
            - from: "2020-10-24"
              store: boltdb-shipper
              object_store: aws
              schema: v11
              index:
                prefix: index_
                period: 24h
        storage_config:
          boltdb_shipper:
            active_index_directory: /data/loki/boltdb-shipper-active
            cache_location: /data/loki/boltdb-shipper-cache
            cache_ttl: 24h
            shared_store: aws
          aws:
            bucketnames: <YOUR_DO_SPACES_BUCKET_NAME_HERE>
            endpoint: <YOUR_DO_SPACES_BUCKET_ENDPOINT_HERE> # in the following format: <region>.digitaloceanspaces.com
            region: <YOUR_DO_SPACES_BUCKET_REGION_HERE> # short region name (e.g.: fra1)
            access_key_id: <YOUR_DO_SPACES_ACCESS_KEY_HERE>
            secret_access_key: <YOUR_DO_SPACES_SECRET_KEY_HERE>
            s3forcepathstyle: true
    ```
- Apply the new settings using Helm:

    ```shell
    HELM_CHART_VERSION="2.6.4"

    helm upgrade loki grafana/loki-stack --version "${HELM_CHART_VERSION}" \
      --namespace=loki-stack \
      -f "docs/04-production/assets/manifests/loki-stack-values-v${HELM_CHART_VERSION}.yaml"
    ```
Note

Check if the main Loki application pod is up and running with the following kubectl command: `kubectl get pods -n loki-stack -l app=loki`. If everything goes well, you should see the DO Spaces bucket containing the `index` and `chunks` folders (the `chunks` folder is called `fake`, which is a strange name, but this is by design when not running in multi-tenant mode).
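If you already have s3cmd configured for your Spaces endpoint (it is set up in the next section), one way to verify the bucket contents mentioned in the note is a simple listing; the bucket name placeholder is yours to fill in:

```shell
# List the top-level folders of the Loki bucket
s3cmd ls s3://<YOUR_DO_SPACES_BUCKET_NAME_HERE>

# Inspect the chunks folder, which Loki names "fake" in single-tenant mode
s3cmd ls s3://<YOUR_DO_SPACES_BUCKET_NAME_HERE>/fake/
```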
Setting up a retention policy
In this step you will set up a retention policy for your DO Spaces bucket. s3cmd is a good utility to have in order to set up retention policies. Please follow the DigitalOcean guide for installing and setting up s3cmd.
- Configure the Loki bucket lifecycle, using s3cmd (the commands are sketched at the end of this section):

    Click to expand the Loki bucket lifecycle

    ```xml
    <LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
      <Rule>
        <ID>Expire old fake data</ID>
        <Prefix>fake/</Prefix>
        <Status>Enabled</Status>
        <Expiration>
          <Days>10</Days>
        </Expiration>
      </Rule>
      <Rule>
        <ID>Expire old index data</ID>
        <Prefix>index/</Prefix>
        <Status>Enabled</Status>
        <Expiration>
          <Days>10</Days>
        </Expiration>
      </Rule>
    </LifecycleConfiguration>
    ```
- Check that the policy was set (please replace the `<>` placeholders accordingly).

    Note

    The DO Spaces backend implementation will clean the objects for you automatically, based on the expiration date. You can always go back and edit the policy later on if needed, by uploading a new one.
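The s3cmd invocations for the two steps above are not reproduced in this guide. A minimal sketch, assuming the lifecycle XML from step 1 is saved to a local file (the file name below is just an example), looks like this:

```shell
# Apply the lifecycle policy from a local XML file to the Loki bucket
s3cmd setlifecycle loki_do_spaces_lifecycle.xml s3://<YOUR_DO_SPACES_BUCKET_NAME_HERE>

# Verify that the policy was set
s3cmd getlifecycle s3://<YOUR_DO_SPACES_BUCKET_NAME_HERE>
```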
Setting up Alert Manager
Alertmanager is deployed alongside Prometheus and forms the alerting layer of the `kube-prom-stack`. It handles alerts generated by Prometheus by deduplicating, grouping, and routing them to integrations such as email, Slack, or PagerDuty. Alerts and notifications are a critical part of your workflow: when things go wrong (e.g. a service is down or a pod is crashing), you want to be notified in real time so you can handle critical situations as soon as possible.

To create a new alert, you need to add a new definition to the `additionalPrometheusRulesMap` section of the kube-prom-stack Helm values file. You will be creating a sample alert that triggers if the `microservices-demo-prod` namespace does not have the expected number of instances. The expected number of pods for the online boutique application is 10.
- Open the `docs/04-production/assets/manifests/prom-stack-values-v35.5.1.yaml` file provided and uncomment the `additionalPrometheusRulesMap` block. The definition should look like this:

    ```yaml
    additionalPrometheusRulesMap:
      rule-name:
        groups:
          - name: online-boutique-instance-down
            rules:
              - alert: OnlineBoutiqueInstanceDown
                expr: sum(kube_pod_owner{namespace="microservices-demo-prod"}) by (namespace) < 10
                for: 1m
                labels:
                  severity: 'critical'
                annotations:
                  description: 'The number of pods from the namespace {{ $labels.namespace }} is lower than the expected 10.'
                  summary: 'Pod {{ $labels.pod }} down'
    ```
- Apply the new settings using Helm:

    ```shell
    HELM_CHART_VERSION="35.5.1"

    helm upgrade kube-prom-stack prometheus-community/kube-prometheus-stack --version "${HELM_CHART_VERSION}" \
      --namespace monitoring \
      -f "docs/04-production/assets/manifests/prom-stack-values-v${HELM_CHART_VERSION}.yaml"
    ```
Info

To check that the alert has been created successfully, first port-forward to your local machine by running this command: `kubectl --namespace monitoring port-forward service/kube-prom-stack-kube-prome-prometheus 9090:9090`. Navigate to the Prometheus console, click on the Alerts menu item and identify the OnlineBoutiqueInstanceDown alert. It should be visible at the bottom of the list.
Configuring Alertmanager to Send Notifications to Slack
To complete this section you need to have administrative rights over a Slack workspace. This will enable you to create the incoming webhook you will need in the next steps. You will also need to create a channel where you would like to receive notifications from Alertmanager.
Steps to follow:
- Open a web browser, navigate to https://api.slack.com/apps and click on the Create New App button.
- In the Create an app window select the From scratch option. Then, give your application a name and select the appropriate workspace.
- From the Basic Information page click on the Incoming Webhooks option, turn it on and click on the Add New Webhook to Workspace button at the bottom.
- On the next page, use the Search for a channel... drop-down list to select the desired channel where you want to send notifications. When ready, click on the Allow button.
- Take note of the Webhook URL value displayed on the page. You will be using it in the next section.
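Optionally, you can verify that the incoming webhook works before wiring it into Alertmanager. A quick check using the standard Slack incoming-webhook payload format (the message text is arbitrary) looks like this:

```shell
# Send a test message to the Slack channel behind the incoming webhook
curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Test message from the Alertmanager setup"}' \
  <YOUR_SLACK_APP_INCOMING_WEBHOOK_URL_HERE>
```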
Next, you will tell Alertmanager how to send Slack notifications.

- Open the `docs/04-production/assets/manifests/prom-stack-values-v35.5.1.yaml` file provided and uncomment the `alertmanager.config` block. Make sure to update the `<>` placeholders accordingly. The definition should look like this:

    Click to expand the alertmanager config

    ```yaml
    alertmanager:
      enabled: true
      config:
        global:
          resolve_timeout: 5m
          slack_api_url: "<YOUR_SLACK_APP_INCOMING_WEBHOOK_URL_HERE>"
        route:
          receiver: "null"
          repeat_interval: 12h
          routes:
            - receiver: "slack-notifications"
              matchers:
                - alertname="OnlineBoutiqueInstanceDown"
              continue: false
        receivers:
          - name: "null"
          - name: "slack-notifications"
            slack_configs:
              - channel: "#<YOUR_SLACK_CHANNEL_NAME_HERE>"
                send_resolved: true
                title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
                text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
    ```
- Apply the new settings using Helm (the upgrade command is shown right after this list).
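The upgrade command is the same one used earlier for the kube-prom-stack release:

```shell
HELM_CHART_VERSION="35.5.1"

helm upgrade kube-prom-stack prometheus-community/kube-prometheus-stack --version "${HELM_CHART_VERSION}" \
  --namespace monitoring \
  -f "docs/04-production/assets/manifests/prom-stack-values-v${HELM_CHART_VERSION}.yaml"
```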
Info

At this point you should only receive alerts matching the OnlineBoutiqueInstanceDown alertname. Since `continue` is set to `false`, Alertmanager will only send notifications for this alert and stop sending for others. Clicking on the notification name in Slack will open a web browser to an unreachable web page with the internal Kubernetes DNS of the Alertmanager pod. This is expected. For more information you can check out this article. For additional information about the configuration parameters for Alertmanager you can check out this doc. You can also look at some notification examples in this article.
Setting up Event Exporter for events retention
A Kubernetes event is an object that shows what’s happening inside a cluster, node, pod, or container. These objects are usually generated in response to changes that occur inside your K8s system. The Kubernetes API Server enables all core components to create these events. Generally, each event is accompanied by a log message as well. Event objects are not regular log events, therefore the Kubernetes logs do not include them. Kubernetes has no built-in support to store or forward these events long term, and they are cleaned up after a short retention time, defaulting to just 1 hour. In this section you will learn how to configure the Kubernetes Events Exporter and collect and persist those events using Loki.
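Before setting up the exporter, you can inspect these short-lived events directly with kubectl; for example, listing recent events across all namespaces (the sort flag is optional):

```shell
# Show events from all namespaces, ordered by creation time (newest last)
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp
```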
- Create the Namespace, ServiceAccount and ClusterRoleBinding using kubectl. The manifest file looks like the following:

    Click to expand the manifest file

    ```yaml
    apiVersion: v1
    kind: Namespace
    metadata:
      name: event-exporter
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      namespace: event-exporter
      name: event-exporter
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: event-exporter
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: view
    subjects:
      - kind: ServiceAccount
        namespace: event-exporter
        name: event-exporter
    ```
- Create the event exporter config using kubectl. The manifest file for the config looks like the following (a sketch of a possible config is shown right after this list):

    Click to expand the config manifest file
- Finally, create the event exporter deployment using kubectl. The manifest file for the deployment looks like the following:

    Click to expand the deployment manifest file

    ```yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: event-exporter
      namespace: event-exporter
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: event-exporter
          version: v1
      template:
        metadata:
          labels:
            app: event-exporter
            version: v1
        spec:
          serviceAccountName: event-exporter
          containers:
            - name: event-exporter
              image: ghcr.io/resmoio/kubernetes-event-exporter:latest
              imagePullPolicy: IfNotPresent
              args:
                - -conf=/data/config.yaml
              volumeMounts:
                - mountPath: /data
                  name: cfg
          volumes:
            - name: cfg
              configMap:
                name: event-exporter-cfg
    ```
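The config manifest referenced in step 2 above is not reproduced in this guide. As a rough sketch only: the deployment mounts a ConfigMap named `event-exporter-cfg` as `/data/config.yaml`, so a minimal configuration that dumps events to stdout (from where Promtail ships the pod logs to Loki) could look like the following; the receiver layout follows the kubernetes-event-exporter config format and is an assumption, not the contents of the provided file:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg        # name expected by the deployment's volume definition
  namespace: event-exporter
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"
    receivers:
      - name: "dump"
        stdout: {}                # events go to stdout; Promtail forwards pod logs to Loki
```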
Viewing events in Grafana
If the installation went well and no errors were reported, you should start seeing events flow into Grafana.
- Connect to Grafana (using default credentials: `admin/prom-operator`) by port forwarding to your local machine.

- Open a web browser on `localhost:3000`. Once in, go to the Explore menu and select Loki as the data source.

- In the Log browser input, enter a query for the event exporter logs, e.g. the `{app="event-exporter"}` selector used in the info box below.
input enter the following:Info
You should see events in the
Logs
section. Any of the fields in theDetected fields
section of a log detail view can be used to query. For example you can perform a query using the pod name and view specific logs for a certain pod:{app="event-exporter"} |= "shippingservice-79bdd5f858-gqm6g"
.
For a more in-depth explanation of the Observability topic, you can check out the Kubernetes Starter Kit Tutorial.
Next, you will learn how to configure the CI/CD process and associated GitHub workflows for all project components used in this guide.