Commit graph

71 commits

Author SHA1 Message Date
Thomas Draebing
517332653f Disable PodSecurityPolicies by default to support Kubernetes 1.25+
PodSecurityPolicies were removed in favour of Pod security standards
that are configured on a cluster or namespace level [1].

[1] https://kubernetes.io/blog/2022/08/25/pod-security-admission-stable/

Change-Id: Ia1e55c09bfad30fd209e96b3eddbda339edc31aa
2023-07-12 12:58:29 +00:00
Matthias Sohn
5423672a21 Make gerrit_monitoring.py executable
Change-Id: Id9ab768dc5d1f38e18079f01e381a10a629e627e
2023-02-22 14:41:27 +01:00
Matthias Sohn
2ea9735067 Add shebang to gerrit_monitoring.py script
This allows running the script without explicitly specifying the
interpreter to use.

Change-Id: I900b2dae90a87fb6bae65c6d1549ad9d5d29cd48
2023-02-21 20:18:25 +01:00
Matthias Sohn
09eccc6e78 Fix name of gerrit_monitoring.py script in README.md
Change-Id: I1bf67dd6dcf54114db2796fdc8d32693ce684874
2023-02-21 20:16:35 +01:00
Thomas Draebing
7088daaa31 Add option to use vault to manage key used for encryption
Using a local PGP-key for encryption of the secrets in the configuration
is not very secure and makes it hard to rotate and distribute the
key. Sops provides the option to use managed services for this
purpose, e.g. HashiCorp Vault.

This change adds the option to use HashiCorp Vault, when using the
provided python scripts to encrypt the config file.

Change-Id: I7683fbfdbed00506c3bca264ac8565f48bc5ea73
2022-05-09 06:59:40 +00:00
Thomas Draebing
fad4eba966 Support a federated Prometheus setup
Gerrit instances that are load-balanced cannot easily be scraped by
an external Prometheus, since the request won't end up at a specified
Gerrit instance. A typical setup to solve this issue is to install a
local Prometheus and scrape the local Prometheus from the central
Prometheus. This is a so-called federated setup.

Now such a setup is supported and can be configured.

Change-Id: I0119d3c1d846cd8e975e5732f4d59cf863c6d2b8
2021-12-16 19:05:00 +01:00
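
The federated setup described above can be sketched as a scrape job on the central Prometheus. This is only an illustration, not the configuration shipped by this change: the keys follow Prometheus' documented federation setup (`/federate` endpoint, `match[]` parameter, `honor_labels`), while the function name, its arguments, the target address and the selector are hypothetical.

```python
def federation_scrape_job(local_prometheus, match_selectors, job_name="federate"):
    """Build a scrape job that pulls series from a local Prometheus.

    The function name, argument names and default job name are
    hypothetical; only the dictionary keys are Prometheus settings.
    """
    return {
        "job_name": job_name,
        # Keep the labels assigned by the local Prometheus, e.g. `instance`.
        "honor_labels": True,
        "metrics_path": "/federate",
        # Only series matching at least one selector are exposed.
        "params": {"match[]": list(match_selectors)},
        "static_configs": [{"targets": [local_prometheus]}],
    }


# Hypothetical target address and selector, for illustration only.
job = federation_scrape_job(
    "prometheus.gerrit.example.com:9090", ['{job="gerrit"}']
)
```

Serialized to YAML, such a dictionary would be one entry under `scrape_configs` of the central Prometheus.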
Thomas Draebing
4a9d167637 Adapt receivecommits metrics
The metric name was changed in I1aae3bc0c0fe430086221503b8e529fa06967517.

Change-Id: I466b01f05a2f679ef49437998992f5aa678bd58c
2021-10-29 10:14:56 +02:00
Pat Long
37dc340371 Ensure that all dashboards are using 'defaults.datasource'
Some dashboards were still explicitly specifying 'Prometheus' as the
datasource, which leads to issues when trying to import the dashboards
into a grafana instance where the prometheus datasource has a different
name.

Change-Id: I13135af32a6f312a4feb32ab828f906f7b13edfe
2021-06-28 11:13:35 -04:00
Thomas Draebing
8e8a55e650 Add healthcheck ping and dashboard for Gerrit
The healthcheck plugin for Gerrit provides a convenient way to determine
the health of different functionalities and components of Gerrit. If
the endpoint provided by the plugin is pinged, it will execute a set
of checks and return either 200 if all checks passed or 500 if at least
one failed. It will also provide metrics that can be scraped by
Prometheus.

This change adds the option for Gerrit installations outside of Kubernetes
to install a sidecar container in the Prometheus deployment that pings
the healthcheck plugin's endpoint every 30 s, thereby triggering the
checks. This is not provided for Kubernetes, since there the ping should
be the task of the Kubernetes liveness probes.

The change additionally adds a dashboard displaying the status of the
healthcheck for each Gerrit instance over time.

Change-Id: Ieeedc4406b642e542c89679a8314d771ca0928af
2021-02-12 13:47:16 +01:00
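
The periodic ping described above could look roughly like the following. It is a minimal sketch and not the sidecar added by this change: all function names are hypothetical, and the endpoint URL must be supplied by the caller. Only the behaviour stated in the message is assumed, namely that 200 means all checks passed and 500 means at least one failed.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses; the code is still available.
        return error.code
    except urllib.error.URLError:
        return None


def ping_once(url, fetch=fetch_status):
    """Trigger the checks once; True means every check passed (HTTP 200)."""
    return fetch(url) == 200


def ping_forever(url, interval_seconds=30, fetch=fetch_status):
    """Ping the endpoint every `interval_seconds`, as a sidecar would."""
    while True:
        healthy = ping_once(url, fetch)
        print(f"{url}: {'healthy' if healthy else 'unhealthy'}")
        time.sleep(interval_seconds)
```

Passing `fetch` as a parameter keeps the status handling testable without a running Gerrit.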
Thomas Draebing
6813b84a99 Update Grafana helm-chart to 6.2.2 (Grafana 7.3.5)
Change-Id: Iec16e455cbdea3bc83bb7970dd6cdfbfaf701ffb
2021-02-09 15:09:26 +01:00
Matthias Sohn
73d4326206 gc panel: use for loop to add prometheus targets
Grafonnet doesn't yet provide addSeriesOverrides() accepting an array.
Also use a different color for each gc so that switching to another
gc shows up in the graph.

Change-Id: I4e424280d44a63f57ad7196dfdb7e77ba2f13f24
2021-02-06 00:47:10 +01:00
Matthias Sohn
c6a7a985cd Fix yAxis label of gc-time panel
Change-Id: Ib330398a5a9034ed34a07df50930aab2202b27d5
2021-02-05 01:04:06 +01:00
Matthias Sohn
7d9aff0488 Fix series override alias for G1 old gen gc metric
Change-Id: Ice4908d335214749989966219fad410c783652af
2021-02-05 01:01:32 +01:00
Matthias Sohn
b72d83f48e Add gc metrics for ZGC and ShenandoahGC
Change-Id: I518a655f4c8080a8b5b23e67d6a518b503000949
2021-02-05 00:59:13 +01:00
Thomas Draebing
f839c376af Convert overview dashboard to grafonnet
In addition this updates Grafonnet to include bar gauges.

Change-Id: I538bd965d52f841b24c9607fc97d5ac748b9d68b
2020-12-04 08:31:27 +01:00
Thomas Draebing
7e3e4b76c5 Update Grafana chart
This updates the Grafana chart to the new repository, since the old
repository is now deprecated. This also updates the container images
and Grafana version.

Change-Id: I29e38d7c23bfa95992537efae7b8b3967d71ffd0
2020-12-04 08:31:26 +01:00
Thomas Draebing
893b0c4f36 Convert replication dashboard to grafonnet
Change-Id: Icffb8ffbec8541e5b956487e5ce9ec54b3c8b617
2020-12-04 08:31:26 +01:00
Thomas Draebing
c7c17679e9 Divide latency dashboard
There are a lot of latency metrics. This change splits up the existing
dashboard for latencies. For REST API latencies, it also allows to
select the REST API calls to look at. This change also adds latency
dashboards for the NoteDB and UI Actions.

Change-Id: Idb9631cc1bc838d06e626d58f163e71fb78b30c5
2020-12-04 08:31:26 +01:00
Thomas Draebing
0b4c16e881 Convert latency dashboard to grafonnet
Change-Id: Id97759996259eea802c80c2ef3261ba1883d92d3
2020-12-04 08:31:25 +01:00
Thomas Draebing
3e811f272b Convert git fetch/clone dashboard to grafonnet
Change-Id: I735f94599199ae2d0f304030fa023c55359e9a47
2020-12-04 08:31:25 +01:00
Thomas Draebing
12aba901e4 Extract yAxis object
Change-Id: I98c0708e521c0122beb53869242a3a1df8db3f3d
2020-12-04 08:31:24 +01:00
Thomas Draebing
82d9ead576 Convert caches dashboard to grafonnet
Change-Id: I42f10428bb5f85991cef2abbcdfab9424b8bb48d
2020-12-04 08:31:23 +01:00
Thomas Draebing
72391ac5e5 Convert queues dashboard to grafonnet
Change-Id: Ia3307a923b99ecacaaa8c803aa2af0c9bf4eabcb
2020-12-04 08:31:22 +01:00
Thomas Draebing
ce5b8300f1 Start using Grafonnet to create Grafana dashboards
Versioning the pure JSON files representing the Grafana dashboards
had some disadvantages. It was hard to review them, they were very
cluttered and a lot was duplicated.

There are some tools that deal with that. One of them is Grafonnet,
which is a superset of Jsonnet, a tool to create JSON files using a
domain specific language.

This change implements the Gerrit Process dashboard in Grafonnet.
It also extends the installer to be able to install dashboards in
the Jsonnet format.

Change-Id: I6235fb7d045bd71557678a4e3b0d4ad4515f4615
2020-12-04 08:31:21 +01:00
Thomas Draebing
baa386bd98 Update Prometheus chart to 12.0.0.
This also changes the helm chart repository, since the old one was
deprecated. Further, the new version adapts the resources to not
contain deprecated APIs.

Change-Id: Idd3f1ed48e22da303fd62d9c2ee63ccb959ed948
2020-12-01 07:14:29 +00:00
Thomas Draebing
f9867a49ef Update helm chart stable repository URL
The stable repository for helm charts was moved to a new URL. The
old one will be unavailable soon.

Change-Id: I34300992764bab012e8dd602d75f19817dcdd7ba
2020-11-27 10:40:11 +01:00
Thomas Draebing
bec7bf7897 Adapt dashboards to be accepted by Grafana dashboard repository
Grafana provides a repository for dashboards that can be used to easily
import dashboards. Providing these dashboards there would make it easier
for users not using the full setup provided here to still use the
dashboards. To be able to upload however, the datasource reference in the
dashboards has to be a template.

This is, however, not compatible with the way the dashboards are imported
in the Grafana of the stack provided here. Thus, the variables are
removed during the installation.

Change-Id: I99f127882a6f7594ca1c40fbe1e299378e89f4e9
2020-11-27 10:40:09 +01:00
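
How the variables are removed is not stated in the message. One possible approach, assuming the dashboards follow Grafana's export convention (an `__inputs` list plus `${DS_...}` placeholders as datasource values), is sketched below. The function name and the sample dashboard are hypothetical.

```python
def resolve_datasource(node, datasource):
    """Recursively replace templated datasource references like "${DS_...}"."""
    if isinstance(node, dict):
        return {
            key: resolve_datasource(value, datasource)
            for key, value in node.items()
            if key != "__inputs"  # drop the import-time input declarations
        }
    if isinstance(node, list):
        return [resolve_datasource(item, datasource) for item in node]
    if isinstance(node, str) and node.startswith("${DS_") and node.endswith("}"):
        return datasource
    return node


# Hypothetical, heavily shortened dashboard for illustration.
dashboard = {
    "__inputs": [{"name": "DS_PROMETHEUS", "type": "datasource"}],
    "panels": [{"datasource": "${DS_PROMETHEUS}", "title": "Threads"}],
}
resolved = resolve_datasource(dashboard, "Prometheus")
```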
Thomas Draebing
65582f2deb Also monitor parallel GC
This change

- adds metrics for parallel GC to the GC panel in the Gerrit Process
  dashboard
- configures the GC panel to only show queries with values other than
  null
- changes the interval to one minute, which fits the scrape interval
- changes the default time frame to the last 24h, which is used for
  most other dashboards

Change-Id: I3b6587e769ae7486a02e26b8d7f2822319eb94e6
2020-08-25 13:20:11 +02:00
Thomas Draebing
f5c4885e67 Remove basic auth between promtail chart and loki
The promtail chart is anyway configured to use the Loki service for
pushing logs. The service itself is not password protected and this
was thus not required.

Change-Id: I886b76ca7e5d6e8af370a2cd0f527892008c7600
2020-08-19 13:28:44 +02:00
Thomas Dräbing
50c3a5aac8 Merge changes I574c3b05,I95020080,I894e47f3,I86c5c547
* changes:
  Adapt to ytt 0.28.0
  Sort monitoring and logging components into sub-maps in the config
  Collect logs from Gerrit in Kubernetes
  Add promtail chart to collect logs from cluster
2020-06-30 12:51:50 +00:00
Thomas Draebing
ad0b8c71ee Add alert on Gerrit threads in deadlock
This adds an alert that fires if one or more threads of a Gerrit
instance are in a deadlock.

Change-Id: Ie2e14e81381e07de2559b42b91d6e483639831ef
2020-06-25 09:00:06 +02:00
Thomas Draebing
89ee46a081 Adapt to ytt 0.28.0
Ytt 0.28.0 introduced a breaking change. The --output-directory
option was removed. This was done because this option implicitly
emptied the directory, which could be dangerous. While this option
still exists under a different name, the --output-files option is
now recommended.

The installer now uses the --output-files option, but to ensure a
clean installation, it checks whether the directory already exists
and, if it does, asks the user whether it can empty it. If it is
not allowed to do so, the installation will abort.

Change-Id: I574c3b054e9293c0534d609c062946cd39890793
2020-06-19 17:40:09 +02:00
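
The directory check described above can be sketched as follows. This is an illustration under assumptions, not the installer's actual code: the function name, the prompt text and the accepted answers are hypothetical.

```python
import os
import shutil


def prepare_output_dir(path, confirm=input):
    """Return True once `path` is an empty directory, False if the user declines."""
    if os.path.isdir(path):
        # The directory already exists: only empty it with explicit consent.
        answer = confirm(f"Output directory {path} exists. Empty it? [y/N] ")
        if answer.strip().lower() not in ("y", "yes"):
            return False
        shutil.rmtree(path)
    os.makedirs(path, exist_ok=True)
    return True
```

A caller would abort the installation when the function returns False and otherwise continue with the ytt run.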
Thomas Draebing
3b4005a047 Sort monitoring and logging components into sub-maps in the config
This is done in preparation to allow multiple logging stacks.

Change-Id: I950200805ec01851bfdf6ccc3a5243893a947616
2020-05-27 16:30:33 +02:00
Thomas Draebing
3887f2b53c Collect logs from Gerrit in Kubernetes
This adds a service discovery configuration for promtail to also
collect logs for Gerrit installations in Kubernetes. The installations
will be discovered by namespace and a given label.

Change-Id: I894e47f37428add9b44df6596950d314ee2a3ed0
2020-05-27 16:30:33 +02:00
Thomas Draebing
de8fee4f68 Add promtail chart to collect logs from cluster
This adds the promtail chart to the installation, which allows
collecting the logs of the applications in the cluster that are written
to stdout of the containers.

This will only collect logs from pods in the same namespace as the
monitoring setup. In a later change also logs from Gerrit instances
in Kubernetes will be added.

Change-Id: I86c5c5470eaa31191fb5ac339ee21dee85106097
2020-05-27 16:30:31 +02:00
Thomas Draebing
aab93a806b Fix error if output directory didn't exist
Change-Id: Ib1fecac1433bf20d4c6c45a4f13b17ee8c864e73
2020-05-26 14:29:26 +02:00
Thomas Draebing
451882b7e9 Allow to monitor Gerrit on Kubernetes
So far it was only possible to monitor single instance Gerrit servers.
This was due to the fact that a URL had to be used that pointed to
a dedicated instance, since if multiple replicas would be behind the
instance, the metrics of a random replica would be scraped and not of
all.

Prometheus has a service discovery functionality for deployments running
in Kubernetes. This is now used when monitoring a Gerrit instance in
Kubernetes. This allows a variable number of replicas to run,
which will be automatically discovered by Prometheus.

The dashboards were adapted accordingly and allow now to select the
replica to be observed. For now, no summary of all replicas can be
displayed in the dashboards, but that feature is planned to be added
in the future.

Change-Id: I96efc63a192cd90f5e3e91a53dace8e1ae83132e
2020-05-14 15:55:35 +02:00
Thomas Draebing
7663baf7be Use gerrit_build_info metric to display Gerrit version
This replaces the hacky graph showing the Gerrit version with a table
showing the current Gerrit version information.

Change-Id: Idfbdc85e376953aead40fea06544e5c84fb777e7
2020-05-14 15:33:14 +02:00
Matthias Sohn
e8b2651af2 Add latency dashboard
Add graphs for the following latency metrics
- receive-commit
- query total
- query changes
- REST total
- REST change list comments
- REST change list robot comments
- REST change post review
- REST get change detail
- REST get change diff
- REST get change
- REST get commit
- REST get change revision actions

Change-Id: Id782e12335ae76820cac4e4e8c80450671bf8216
2020-05-05 18:30:18 +02:00
Thomas Draebing
dc60bd1654 Fix installation if TLS verification is skipped
The installation failed if TLS verification was disabled and no CA
certificate was given in the configuration. This happened because the
installation script always expected the CA certificate.

The installation now only expects the certificate if TLS verification
is enabled.

Change-Id: I5429fc1ee0d230c74cc0689607cf2736d6520030
2020-04-29 17:36:08 +02:00
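
The corrected rule, requiring a CA certificate only when TLS verification is enabled, can be sketched as below. The function name and the key names `skip_verify` and `ca_cert` are hypothetical and need not match the actual configuration schema.

```python
def validate_tls(tls_config):
    """Raise ValueError if a CA certificate is needed but missing.

    `skip_verify` and `ca_cert` are hypothetical key names.
    """
    if tls_config.get("skip_verify", False):
        # Verification is disabled, so no CA certificate is needed.
        return
    if not tls_config.get("ca_cert"):
        raise ValueError(
            "TLS verification is enabled but no CA certificate was given"
        )
```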
Thomas Draebing
d0b53a0970 Create CA-certificate file for promtail during installation
For TLS-verification promtail requires a CA-certificate, which had to
be created manually.

Change-Id: Ia1fe191bad7f3d1ca4a1568921ad67d22c47efd7
2020-04-16 14:25:53 +02:00
Thomas Draebing
2ead0f0a05 Version promtail version
This adds the promtail version used in the setup to a file and adds
an installation step downloading promtail, if the installation is not
run in `dryrun`-mode.

Change-Id: I1127220a57b2610b5c4458ce2205077706a860e6
2020-04-16 14:25:53 +02:00
Thomas Draebing
0bdb1d02e0 Create promtail config per Gerrit host
So far the install-script could only create a single promtail config.
Since the monitoring setup is able to monitor multiple Gerrit servers,
this caused manual work to create a promtail config per Gerrit server.

Now ytt will create a configuration for each Gerrit host configured
in the config.yaml. Ytt is only able to do that in a single file. Thus,
csplit is used to split the files into separate files that can then
be used to configure promtail on the respective hosts. The config-
files can then be found under
$OUTPUT/promtail/promtail-$GERRIT_HOSTNAME.yaml.

Change-Id: Ib09fba83d8a8fbd45b42e9e5388a85a37ab1a952
2020-04-16 14:25:53 +02:00
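
The splitting step described above is done with csplit in this change. The same idea is sketched here in plain Python as a stand-in. The function names are hypothetical, and the sketch assumes that the combined file contains one YAML document per host, in the order of the host list.

```python
import os


def split_yaml_documents(text):
    """Split a multi-document YAML string on lines consisting of '---'."""
    documents, current = [], []
    for line in text.splitlines():
        if line.rstrip() == "---":
            if current:
                documents.append("\n".join(current) + "\n")
            current = []
        else:
            current.append(line)
    if current:
        documents.append("\n".join(current) + "\n")
    return documents


def write_promtail_configs(text, hostnames, output_dir):
    """Write one promtail-<hostname>.yaml per document, in host order."""
    documents = split_yaml_documents(text)
    if len(documents) != len(hostnames):
        raise ValueError("expected one YAML document per Gerrit host")
    target_dir = os.path.join(output_dir, "promtail")
    os.makedirs(target_dir, exist_ok=True)
    paths = []
    for hostname, document in zip(hostnames, documents):
        path = os.path.join(target_dir, f"promtail-{hostname}.yaml")
        with open(path, "w") as handle:
            handle.write(document)
        paths.append(path)
    return paths
```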
Thomas Draebing
6b75c12831 Rewrite the scripts in python
The scripts were written in bash. Using bash became quite unwieldy.

Python by nature can deal well with yaml and is thus better suited
in dealing with the yaml-based configuration files. This change
rewrites the original scripts staying as close as possible to the
original ones.

Right now, the python scripts call subprocesses a lot to work with
the tools, which were already used before. At least for yaml-
templating there may be better tools that have a python integration,
which could be used in the future.

Change-Id: Ida16318445a05dcfdada9c7a56a391e4827f02e7
2020-04-16 14:25:50 +02:00
Thomas Draebing
3f8594c3cb Fix typo in install.sh script
Change-Id: Ib4529df6924d80032a24387db26719a8105b5496
2020-04-15 14:03:45 +02:00
Thomas Dräbing
81ab4f166a Merge changes I1ba3967a,Id55095c3
* changes:
  Describe infrastructure dependencies
  Use object store to store chunks created by Loki
2020-04-08 13:18:16 +00:00
Thomas Dräbing
b34c47f817 Merge changes I1efdc490,I220d90d3,I405f09f7,I392b2ddf,I84062d6e
* changes:
  Relabel the instance label for prometheus and loki metrics
  Add dashboard for Loki metrics
  Add dashboard to monitor Prometheus data
  Only show Gerrit instances in the instance dropdowns
  Create a configmap per dashboard
2020-04-08 13:17:58 +00:00
Thomas Draebing
a8135ce8c4 Relabel the instance label for prometheus and loki metrics
The instance label for Prometheus had the value localhost:9090, which
was misleading.

Now the label is relabeled to prometheus-<namespace> or loki-<namespace>.
This is still not ideal for cases, where multiple replicas are deployed,
but until then, it is already a slight improvement.

Change-Id: I1efdc49071b1d3bf99d21315ca03821e9d58c906
2020-04-03 13:36:34 +02:00
Thomas Dräbing
e2a5902494 Merge "Show more lines in log queries in Grafana"
2020-04-03 09:58:31 +00:00
Thomas Draebing
f960eb5eab Add dashboard for Loki metrics
Change-Id: I220d90d33be3ed292402f3adb7386953cad7b0de
2020-04-03 11:56:24 +02:00