Commit graph

13 commits

Author SHA1 Message Date
Matthias Sohn 5423672a21 Make gerrit_monitoring.py executable
Change-Id: Id9ab768dc5d1f38e18079f01e381a10a629e627e
2023-02-22 14:41:27 +01:00
Matthias Sohn 09eccc6e78 Fix name of gerrit_monitoring.py script in README.md
Change-Id: I1bf67dd6dcf54114db2796fdc8d32693ce684874
2023-02-21 20:16:35 +01:00
Thomas Draebing fad4eba966 Support a federated Prometheus setup
Gerrit instances that are load-balanced cannot easily be scraped by
an external Prometheus, since the request won't end up at a specified
Gerrit instance. A typical setup to solve this issue is to install a
local Prometheus and scrape the local Prometheus from the central
Prometheus. This is a so-called federated setup.

Now such a setup is supported and can be configured.

Change-Id: I0119d3c1d846cd8e975e5732f4d59cf863c6d2b8
2021-12-16 19:05:00 +01:00
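The federated setup described in fad4eba966 can be illustrated with a minimal sketch of the scrape job a central Prometheus would use to pull series from the local Prometheus servers. The job name, target host and match selector are hypothetical; only `/federate`, `honor_labels` and the `match[]` parameter are standard Prometheus federation settings. This is not the configuration the project generates.

```python
import json


def federation_scrape_job(local_prometheus_targets, match_selectors):
    """Build a scrape job that pulls series from local Prometheus servers."""
    return {
        "job_name": "gerrit-federation",  # hypothetical job name
        "honor_labels": True,  # keep the labels set by the local Prometheus
        "metrics_path": "/federate",  # Prometheus federation endpoint
        "params": {"match[]": list(match_selectors)},
        "static_configs": [{"targets": list(local_prometheus_targets)}],
    }


# Hypothetical host and selector, for illustration only.
job = federation_scrape_job(["gerrit-1.example.com:9090"], ['{job="gerrit"}'])
print(json.dumps({"scrape_configs": [job]}, indent=2))
```

The result is printed as JSON to avoid a third-party YAML dependency; since YAML is a superset of JSON, the output could be pasted into a Prometheus configuration file.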
Thomas Draebing 8e8a55e650 Add healthcheck ping and dashboard for Gerrit
The healthcheck plugin for Gerrit provides a convenient way to determine
the health of different functionalities and components of Gerrit. If
the endpoint provided by the plugin is pinged, it will execute a set
of checks and return either 200 if all checks passed or 500 if at least
one failed. It will also provide metrics that can be scraped by
Prometheus.

This change adds the option for Gerrit installations outside of Kubernetes
to install a sidecar container in the Prometheus deployment that pings
the healthcheck plugin's endpoint every 30 s, thereby triggering the
checks. This is not provided for Kubernetes, since there the ping should
be the task of the Kubernetes liveness probes.

The change additionally adds a dashboard displaying the status of the
healthcheck for each Gerrit instance over time.

Change-Id: Ieeedc4406b642e542c89679a8314d771ca0928af
2021-02-12 13:47:16 +01:00
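The periodic ping described in 8e8a55e650 could look roughly like the following sketch. It assumes only what the commit message states: the endpoint returns 200 if all checks passed and 500 if at least one failed. The function names and the injectable `opener` parameter are hypothetical, and the actual sidecar container may be implemented differently.

```python
import time
import urllib.error
import urllib.request


def healthcheck_passed(url, opener=urllib.request.urlopen):
    """Return True if the endpoint answers 200, False on an HTTP error such as 500.

    Connection problems (urllib.error.URLError) are not caught here and
    propagate to the caller.
    """
    try:
        with opener(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # urlopen raises HTTPError for 4xx/5xx responses, e.g. a failed check.
        return False


def ping_forever(url, interval_seconds=30):
    """Trigger the healthcheck every interval_seconds (30 s in the commit)."""
    while True:
        healthcheck_passed(url)
        time.sleep(interval_seconds)
```

A sidecar would call `ping_forever` with the URL of the healthcheck plugin's endpoint on the monitored Gerrit instance.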
Thomas Draebing ce5b8300f1 Start using Grafonnet to create Grafana dashboards
Versioning the pure JSON files representing the Grafana dashboards
had some disadvantages. It was hard to review them, they were very
cluttered and a lot was duplicated.

There are some tools that deal with that. One of them is Grafonnet,
a library for Jsonnet, which is a domain-specific language for
creating JSON files.

This change implements the Gerrit Process dashboard in Grafonnet.
It also extends the installer to be able to install dashboards in
the Jsonnet format.

Change-Id: I6235fb7d045bd71557678a4e3b0d4ad4515f4615
2020-12-04 08:31:21 +01:00
Thomas Draebing 3b4005a047 Sort monitoring and logging components into sub-maps in the config
This is done in preparation to allow multiple logging stacks.

Change-Id: I950200805ec01851bfdf6ccc3a5243893a947616
2020-05-27 16:30:33 +02:00
Thomas Draebing de8fee4f68 Add promtail chart to collect logs from cluster
This adds the promtail chart to the installation. Promtail collects
the logs of the applications in the cluster, which are written to
stdout of the containers.

This will only collect logs from pods in the same namespace as the
monitoring setup. Logs from Gerrit instances in Kubernetes will be
added in a later change.

Change-Id: I86c5c5470eaa31191fb5ac339ee21dee85106097
2020-05-27 16:30:31 +02:00
Thomas Draebing 451882b7e9 Allow to monitor Gerrit on Kubernetes
So far it was only possible to monitor single-instance Gerrit servers.
This was due to the fact that a URL had to be used that pointed to
a dedicated instance: if multiple replicas were behind that URL, the
metrics of a random replica would be scraped and not those of all
replicas.

Prometheus has a service discovery functionality for deployments running
in Kubernetes. This is now used when monitoring a Gerrit instance in
Kubernetes. It allows a variable number of replicas to run, which
will be discovered automatically by Prometheus.

The dashboards were adapted accordingly and now allow selecting the
replica to be observed. For now, no summary of all replicas can be
displayed in the dashboards, but that feature is planned to be added
in the future.

Change-Id: I96efc63a192cd90f5e3e91a53dace8e1ae83132e
2020-05-14 15:55:35 +02:00
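As a rough illustration of the service discovery mentioned in 451882b7e9, the sketch below builds a Prometheus scrape job that discovers pods in a namespace and keeps only those carrying a given label. The job name, the namespace, the label `app=gerrit` and the `replica` target label are hypothetical; `kubernetes_sd_configs` and the `__meta_kubernetes_pod_*` labels are standard Prometheus features. The project's actual configuration may differ.

```python
import json


def kubernetes_gerrit_job(namespace, pod_label, pod_label_value):
    """Build a scrape job that discovers Gerrit pods via the Kubernetes API."""
    return {
        "job_name": "gerrit-kubernetes",  # hypothetical job name
        "kubernetes_sd_configs": [
            {"role": "pod", "namespaces": {"names": [namespace]}}
        ],
        "relabel_configs": [
            {
                # Keep only pods whose label matches, drop all others.
                "source_labels": [f"__meta_kubernetes_pod_label_{pod_label}"],
                "regex": pod_label_value,
                "action": "keep",
            },
            {
                # Expose the pod name so a dashboard can select one replica.
                "source_labels": ["__meta_kubernetes_pod_name"],
                "target_label": "replica",  # hypothetical label name
            },
        ],
    }


# Hypothetical namespace and pod label, for illustration only.
k8s_job = kubernetes_gerrit_job("gerrit", "app", "gerrit")
print(json.dumps(k8s_job, indent=2))
```

Because every discovered pod becomes its own scrape target, the number of replicas can change without touching the configuration.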
Thomas Draebing 0bdb1d02e0 Create promtail config per Gerrit host
So far the install-script could only create a single promtail config.
Since the monitoring setup is able to monitor multiple Gerrit servers,
this caused manual work to create a promtail config per Gerrit server.

Now ytt will create a configuration for each Gerrit host configured
in the config.yaml. Ytt is only able to do that in a single file. Thus,
csplit is used to split this file into separate files that can then
be used to configure promtail on the respective hosts. The config
files can then be found under
$OUTPUT/promtail/promtail-$GERRIT_HOSTNAME.yaml.

Change-Id: Ib09fba83d8a8fbd45b42e9e5388a85a37ab1a952
2020-04-16 14:25:53 +02:00
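The splitting step described in 0bdb1d02e0 uses csplit. The same idea can be sketched in Python as follows, assuming the rendered output is a multi-document YAML stream separated by `---` lines and that the documents appear in the same order as the hostnames. Both assumptions and all function names are hypothetical; only the output path pattern is taken from the commit message.

```python
import re
from pathlib import Path


def split_yaml_documents(stream_text):
    """Split a multi-document YAML string on '---' separator lines."""
    parts = re.split(r"(?m)^---[ \t]*$\n?", stream_text)
    # Drop empty chunks, e.g. the one before a leading separator.
    return [part for part in parts if part.strip()]


def write_promtail_configs(stream_text, hostnames, output_dir):
    """Write one promtail-<hostname>.yaml per document below output_dir/promtail."""
    documents = split_yaml_documents(stream_text)
    if len(documents) != len(hostnames):
        raise ValueError("expected exactly one YAML document per hostname")
    target_dir = Path(output_dir) / "promtail"
    target_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for hostname, document in zip(hostnames, documents):
        path = target_dir / f"promtail-{hostname}.yaml"
        path.write_text(document)
        paths.append(path)
    return paths
```

Matching documents to hosts by position is the simplest choice for a sketch; a more robust variant would read the hostname from each document.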
Thomas Draebing 6b75c12831 Rewrite the scripts in python
The scripts were written in bash. Using bash became quite unwieldy.

Python can deal well with yaml and is thus better suited to
dealing with the yaml-based configuration files. This change
rewrites the original scripts, staying as close as possible to the
original ones.

Right now, the python scripts call subprocesses a lot to work with
the tools that were already used before. At least for yaml
templating there may be better tools that have a python integration,
which could be used in the future.

Change-Id: Ida16318445a05dcfdada9c7a56a391e4827f02e7
2020-04-16 14:25:50 +02:00
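Calling the existing tools from Python, as described in 6b75c12831, typically goes through the standard `subprocess` module. The helper below is a minimal sketch with a hypothetical name (`run_tool`); the command shown is a stand-in that runs the Python interpreter itself, not one of the tools used by the project.

```python
import subprocess
import sys


def run_tool(command):
    """Run an external tool and return its stdout; raise if it exits non-zero."""
    completed = subprocess.run(
        command, check=True, capture_output=True, text=True
    )
    return completed.stdout


# Stand-in command so the sketch runs anywhere; a real script would pass
# the command line of the external tool instead.
output = run_tool([sys.executable, "-c", "print('rendered')"])
print(output, end="")
```

With `check=True`, a failing tool raises `subprocess.CalledProcessError`, so errors are not silently ignored.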
Thomas Draebing aa0c5252f0 Describe infrastructure dependencies
Change-Id: I1ba3967a10e5cd35aff60579eff388252c81874b
2020-03-24 16:01:36 +01:00
Thomas Draebing eb4e6ea191 Use object store to store chunks created by Loki
The chunks created by Loki were stored in a persistent volume. This
does not scale well, since volumes cannot easily be resized in
Kubernetes. Also, at least the ext4 filesystem had issues when large
numbers of logs were saved. These issues are due to the dir_index as
discussed in [1].

An object store provides a more scalable and cheaper solution. Loki
supports S3 as object storage, as well as other object stores that
understand the S3 API, such as Ceph or OpenStack Swift.

[1] https://github.com/grafana/loki/issues/1502

Change-Id: Id55095c3b6659f40708712c1a494753dbcab7686
2020-03-24 16:01:34 +01:00
Thomas Draebing be862d863e Move internal project to open source
This change adds the current status of a project that aims to create
a simple monitoring setup to monitor Gerrit servers, which was developed
internally at SAP.

The project provides an opinionated and basic configuration for helm
charts that can be used to install Loki, Prometheus and Grafana on a
Kubernetes cluster. Scripts to easily apply the configuration and
install the whole setup are provided as well.

The contributions so far were done by (with number of commits)

  80  Thomas Draebing
  11  Matthias Sohn
   2  Saša Živkov

Change-Id: I8045780446edfb3c0dc8287b8f494505e338e066
2020-03-11 15:23:19 +01:00