45 commits

Author SHA1 Message Date
Thomas Draebing
bec7bf7897 Adapt dashboards to be accepted by Grafana dashboard repository
Grafana provides a repository for dashboards that can be used to easily
import dashboards. Providing these dashboards there would make it easier
for users not using the full setup provided here to still use the
dashboards. To be able to upload however, the datasource reference in the
dashboards has to be a template.

This is, however, not compatible with the way the dashboards are imported
into the Grafana instance of the stack provided here. Thus, the variables
are removed during the installation.

Change-Id: I99f127882a6f7594ca1c40fbe1e299378e89f4e9
2020-11-27 10:40:09 +01:00
Thomas Draebing
65582f2deb Also monitor parallel GC
This change

- adds metrics for parallel GC to the GC panel in the Gerrit Process
  dashboard
- configures the GC panel to only show queries with values other than
  null
- changes the interval to one minute, which fits the scrape interval
- changes the default time frame to the last 24h, which is used for
  most other dashboards

Change-Id: I3b6587e769ae7486a02e26b8d7f2822319eb94e6
2020-08-25 13:20:11 +02:00
Thomas Draebing
f5c4885e67 Remove basic auth between promtail chart and loki
The promtail chart is configured to push logs through the Loki service
anyway. The service itself is not password protected, so basic auth
was not required.

Change-Id: I886b76ca7e5d6e8af370a2cd0f527892008c7600
2020-08-19 13:28:44 +02:00
Thomas Dräbing
50c3a5aac8 Merge changes I574c3b05,I95020080,I894e47f3,I86c5c547
* changes:
  Adapt to ytt 0.28.0
  Sort monitoring and logging components into sub-maps in the config
  Collect logs from Gerrit in Kubernetes
  Add promtail chart to collect logs from cluster
2020-06-30 12:51:50 +00:00
Thomas Draebing
ad0b8c71ee Add alert on Gerrit threads in deadlock
This adds an alert that fires if one or more threads of a Gerrit
instance are in a deadlock.

Change-Id: Ie2e14e81381e07de2559b42b91d6e483639831ef
2020-06-25 09:00:06 +02:00
Thomas Draebing
89ee46a081 Adapt to ytt 0.28.0
Ytt 0.28.0 introduced a breaking change: the --output-directory
option was removed. This was done because the option implicitly
emptied the directory, which could be dangerous. While the option
still exists under a different name, the --output-files option is
now recommended.

The installer now uses the --output-files option. To ensure a clean
installation, it checks whether the directory already exists and, if
it does, asks the user whether it may empty it. If it is not allowed
to do so, the installation aborts.

Change-Id: I574c3b054e9293c0534d609c062946cd39890793
2020-06-19 17:40:09 +02:00
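
The directory check described in commit 89ee46a081 could look roughly like the following Python sketch. It is not the installer's actual code: the function name `prepare_output_dir` and the injectable `confirm` callback are hypothetical and only illustrate the "ask before emptying, abort otherwise" behaviour.

```python
import os
import shutil


def prepare_output_dir(path, confirm=input):
    """Make sure `path` exists and is empty.

    Returns False if the directory already exists and the user declines
    to have it emptied; the caller is then expected to abort.
    `confirm` is a hypothetical hook so the prompt can be replaced in tests.
    """
    if os.path.isdir(path):
        answer = confirm(f"Output directory '{path}' already exists. Empty it? [y/N] ")
        if answer.strip().lower() not in ("y", "yes"):
            return False
        # Remove the directory with all its content; it is recreated below.
        shutil.rmtree(path)
    os.makedirs(path, exist_ok=True)
    return True
```

A caller would stop the installation when the function returns False and only then hand the (now empty) directory to ytt.
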
Thomas Draebing
3b4005a047 Sort monitoring and logging components into sub-maps in the config
This is done in preparation to allow multiple logging stacks.

Change-Id: I950200805ec01851bfdf6ccc3a5243893a947616
2020-05-27 16:30:33 +02:00
Thomas Draebing
3887f2b53c Collect logs from Gerrit in Kubernetes
This adds a service discovery configuration for promtail to also
collect logs for Gerrit installations in Kubernetes. The installations
will be discovered by namespace and a given label.

Change-Id: I894e47f37428add9b44df6596950d314ee2a3ed0
2020-05-27 16:30:33 +02:00
Thomas Draebing
de8fee4f68 Add promtail chart to collect logs from cluster
This adds the promtail chart to the installation, which allows
collecting the logs that the applications in the cluster write to
stdout of their containers.

This will only collect logs from pods in the same namespace as the
monitoring setup. A later change will also add logs from Gerrit
instances in Kubernetes.

Change-Id: I86c5c5470eaa31191fb5ac339ee21dee85106097
2020-05-27 16:30:31 +02:00
Thomas Draebing
aab93a806b Fix error if output directory didn't exist
Change-Id: Ib1fecac1433bf20d4c6c45a4f13b17ee8c864e73
2020-05-26 14:29:26 +02:00
Thomas Draebing
451882b7e9 Allow to monitor Gerrit on Kubernetes
So far it was only possible to monitor single-instance Gerrit servers.
This was due to the fact that a URL had to be used that pointed to
a dedicated instance. If multiple replicas were behind that URL, the
metrics of a random replica would be scraped instead of those of all
replicas.

Prometheus has a service discovery functionality for deployments running
in Kubernetes. This is now used when monitoring a Gerrit instance in
Kubernetes. It allows a variable number of replicas to run, which are
automatically discovered by Prometheus.

The dashboards were adapted accordingly and now allow selecting the
replica to be observed. For now, no summary of all replicas can be
displayed in the dashboards, but that feature is planned for the
future.

Change-Id: I96efc63a192cd90f5e3e91a53dace8e1ae83132e
2020-05-14 15:55:35 +02:00
Thomas Draebing
7663baf7be Use gerrit_build_info metric to display Gerrit version
This replaces the hacky graph showing the Gerrit version with a table
showing the current Gerrit version information.

Change-Id: Idfbdc85e376953aead40fea06544e5c84fb777e7
2020-05-14 15:33:14 +02:00
Matthias Sohn
e8b2651af2 Add latency dashboard
Add graphs for the following latency metrics:
- receive-commit
- query total
- query changes
- REST total
- REST change list comments
- REST change list robot comments
- REST change post review
- REST get change detail
- REST get change diff
- REST get change
- REST get commit
- REST get change revision actions

Change-Id: Id782e12335ae76820cac4e4e8c80450671bf8216
2020-05-05 18:30:18 +02:00
Thomas Draebing
dc60bd1654 Fix installation if TLS verification is skipped
The installation failed if TLS verification was disabled and no CA
certificate was given in the configuration. This happened because the
installation script always expected the CA certificate.

The installation now only expects the certificate if TLS verification
is enabled.

Change-Id: I5429fc1ee0d230c74cc0689607cf2736d6520030
2020-04-29 17:36:08 +02:00
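
The fix in commit dc60bd1654 boils down to making the CA certificate conditional on TLS verification. A minimal Python sketch of that rule follows; the function name and the configuration keys `skipVerify` and `caCert` are assumptions made for illustration and may not match the keys used in the real config.yaml.

```python
def ca_certificate_for(tls_config):
    """Return the CA certificate to use, or None if verification is skipped.

    `tls_config` is a plain dict; the keys "skipVerify" and "caCert" are
    hypothetical names chosen for this sketch.
    """
    if tls_config.get("skipVerify", False):
        # Verification is disabled, so a missing certificate is not an error.
        return None
    ca_cert = tls_config.get("caCert")
    if not ca_cert:
        raise ValueError(
            "TLS verification is enabled, but no CA certificate was configured"
        )
    return ca_cert
```
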
Thomas Draebing
d0b53a0970 Create CA-certificate file for promtail during installation
For TLS verification, promtail requires a CA certificate file, which so
far had to be created manually.

Change-Id: Ia1fe191bad7f3d1ca4a1568921ad67d22c47efd7
2020-04-16 14:25:53 +02:00
Thomas Draebing
2ead0f0a05 Version promtail version
This adds the promtail version used in the setup to a file and adds
an installation step that downloads promtail unless the installation is
run in `dryrun` mode.

Change-Id: I1127220a57b2610b5c4458ce2205077706a860e6
2020-04-16 14:25:53 +02:00
Thomas Draebing
0bdb1d02e0 Create promtail config per Gerrit host
So far the install script could only create a single promtail config.
Since the monitoring setup is able to monitor multiple Gerrit servers,
a promtail config had to be created manually for each Gerrit server.

Now ytt creates a configuration for each Gerrit host configured
in the config.yaml. Ytt is only able to do that in a single file. Thus,
csplit is used to split the output into separate files that can then
be used to configure promtail on the respective hosts. The config
files can then be found under
$OUTPUT/promtail/promtail-$GERRIT_HOSTNAME.yaml.

Change-Id: Ib09fba83d8a8fbd45b42e9e5388a85a37ab1a952
2020-04-16 14:25:53 +02:00
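
The splitting step from commit 0bdb1d02e0 can be illustrated in Python. Note that the commit itself uses csplit; the sketch below swaps in a plain regular-expression split on YAML document separators instead. The function name and the assumption that the rendered documents appear in the same order as the `hostnames` list are hypothetical.

```python
import os
import re


def split_promtail_configs(rendered, hostnames, output_dir):
    """Write one promtail config per Gerrit host and return the file paths.

    `rendered` is a multi-document YAML string (documents separated by
    lines containing only '---'). Assumption of this sketch: the documents
    are in the same order as `hostnames`.
    """
    # Split on separator lines and drop empty fragments, e.g. the one
    # before a leading '---'.
    parts = re.split(r"^---[ \t]*$", rendered, flags=re.MULTILINE)
    docs = [part.strip() for part in parts if part.strip()]
    if len(docs) != len(hostnames):
        raise ValueError(
            f"expected {len(hostnames)} documents, found {len(docs)}"
        )

    target_dir = os.path.join(output_dir, "promtail")
    os.makedirs(target_dir, exist_ok=True)

    paths = []
    for hostname, doc in zip(hostnames, docs):
        path = os.path.join(target_dir, f"promtail-{hostname}.yaml")
        with open(path, "w") as config_file:
            config_file.write(doc + "\n")
        paths.append(path)
    return paths
```
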
Thomas Draebing
6b75c12831 Rewrite the scripts in python
The scripts were written in bash, which became quite unwieldy.

Python can deal well with yaml and is thus better suited to
handling the yaml-based configuration files. This change
rewrites the original scripts, staying as close as possible to the
original ones.

Right now, the python scripts call subprocesses a lot to work with
the tools that were already used before. At least for yaml
templating there may be better tools with a python integration,
which could be used in the future.

Change-Id: Ida16318445a05dcfdada9c7a56a391e4827f02e7
2020-04-16 14:25:50 +02:00
Thomas Draebing
3f8594c3cb Fix typo in install.sh script
Change-Id: Ib4529df6924d80032a24387db26719a8105b5496
2020-04-15 14:03:45 +02:00
Thomas Dräbing
81ab4f166a Merge changes I1ba3967a,Id55095c3
* changes:
  Describe infrastructure dependencies
  Use object store to store chunks created by Loki
2020-04-08 13:18:16 +00:00
Thomas Dräbing
b34c47f817 Merge changes I1efdc490,I220d90d3,I405f09f7,I392b2ddf,I84062d6e
* changes:
  Relabel the instance label for prometheus and loki metrics
  Add dashboard for Loki metrics
  Add dashboard to monitor Prometheus data
  Only show Gerrit instances in the instance dropdowns
  Create a configmap per dashboard
2020-04-08 13:17:58 +00:00
Thomas Draebing
a8135ce8c4 Relabel the instance label for prometheus and loki metrics
The instance label for Prometheus had the value localhost:9090, which
was misleading.

Now the label is relabeled to prometheus-<namespace> or loki-<namespace>.
This is still not ideal for cases where multiple replicas are deployed,
but it is already a slight improvement.

Change-Id: I1efdc49071b1d3bf99d21315ca03821e9d58c906
2020-04-03 13:36:34 +02:00
Thomas Dräbing
e2a5902494 Merge "Show more lines in log queries in Grafana" 2020-04-03 09:58:31 +00:00
Thomas Draebing
f960eb5eab Add dashboard for Loki metrics
Change-Id: I220d90d33be3ed292402f3adb7386953cad7b0de
2020-04-03 11:56:24 +02:00
Thomas Draebing
ff7fd22ca2 Add dashboard to monitor Prometheus data
This is an adapted version of this dashboard:
https://grafana.com/grafana/dashboards/3681

Change-Id: I405f09f75698b940becd6994a7fc457853603756
2020-04-03 11:56:24 +02:00
Thomas Draebing
442bf6fb98 Only show Gerrit instances in the instance dropdowns
A variable was used to select the Gerrit instance to observe in the
dashboards. Since the instance label is set for all targets that
prometheus scrapes, the variable would also contain e.g. the prometheus
instance.

Now only Gerrit instances are displayed by additionally filtering for a
metric specific to Gerrit.

Change-Id: I392b2ddf53a0ea49db25018dc5d37d269365812a
2020-04-03 11:37:27 +02:00
Thomas Draebing
623332e4b3 Create a configmap per dashboard
If the dashboard files got too large (>2Mb), Kubernetes rejected
the configmap.

Now each dashboard is installed with its own configmap. A sidecar container
is used to register these dashboards with Grafana.

Change-Id: I84062d6e2ac7dc2669945b54575bf239a25900a4
2020-03-26 09:55:39 +01:00
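
The "one configmap per dashboard" approach from commit 623332e4b3 can be sketched in Python as follows. This is an illustration, not the project's code: the function name, the `grafana-dashboard-` name prefix and the label `grafana_dashboard` (by which a sidecar could find the configmaps) are assumptions.

```python
import re


def dashboard_configmaps(dashboards, namespace, label="grafana_dashboard"):
    """Build one ConfigMap manifest (as a dict) per dashboard file.

    `dashboards` maps a file name to the dashboard JSON as a string.
    The name prefix and the label key are hypothetical choices.
    """
    manifests = []
    for filename, content in sorted(dashboards.items()):
        # Derive a lowercase, dash-separated name from the file name.
        stem = filename.rsplit(".", 1)[0].lower()
        name = re.sub(r"[^a-z0-9]+", "-", stem).strip("-")
        manifests.append(
            {
                "apiVersion": "v1",
                "kind": "ConfigMap",
                "metadata": {
                    "name": f"grafana-dashboard-{name}",
                    "namespace": namespace,
                    "labels": {label: "1"},
                },
                # Each configmap carries exactly one dashboard, which keeps
                # every object well below the size limit.
                "data": {filename: content},
            }
        )
    return manifests
```
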
Thomas Dräbing
6d3c31e50c Merge "Update Grafana to 6.7.1" 2020-03-26 08:06:30 +00:00
Matthias Sohn
d7d1703c44 Merge changes I2cd9c872,I26cfd395
* changes:
  Scrape Loki metrics
  Monitor Prometheus itself
2020-03-24 22:41:09 +00:00
Thomas Draebing
202a3168ce Show more lines in log queries in Grafana
The default maximum number of log lines shown in Grafana is 1000. This
barely covers a few minutes of the httpd logs.

The value of 10,000 can still be handled by the browser. More log
entries will cause the browser to cache as long as Grafana does not
provide pagination, which is planned for the future.

Change-Id: Ife84d161cd022300ff6f440920021e4176b770b9
2020-03-24 16:21:01 +01:00
Thomas Draebing
10a0a54069 Update Grafana to 6.7.1
The most interesting new features are:
- proper limits for queried logs
- query history for logs (still a beta feature)

Change-Id: Ibd8b76b0e1e16d4bd3c74382fa3fd5a24c1bba45
2020-03-24 16:20:54 +01:00
Thomas Draebing
aa0c5252f0 Describe infrastructure dependencies
Change-Id: I1ba3967a10e5cd35aff60579eff388252c81874b
2020-03-24 16:01:36 +01:00
Thomas Draebing
eb4e6ea191 Use object store to store chunks created by Loki
The chunks created by Loki were stored in a persistent volume. This
does not scale well, since volumes cannot easily be resized in
Kubernetes. Also, at least the ext4 filesystem had issues when large
numbers of logs were saved. These issues are due to the dir_index, as
discussed in [1].

An object store provides a more scalable and cheaper solution. Loki
supports S3 as object storage as well as other object stores that
understand the S3 API, like Ceph or OpenStack Swift.

[1] https://github.com/grafana/loki/issues/1502

Change-Id: Id55095c3b6659f40708712c1a494753dbcab7686
2020-03-24 16:01:34 +01:00
Thomas Dräbing
5d4c32212e Merge changes Icaada525,Ifbf13edb
* changes:
  Process dashboard: add panel showing system load
  Process dashboard: show number of available CPUs
2020-03-24 14:50:12 +00:00
Thomas Draebing
b1be26012b Scrape Loki metrics
Change-Id: I2cd9c872882cd760fc2ff10028b7e03a31f8fba5
2020-03-23 16:09:54 +01:00
Thomas Draebing
ead4e7d5cc Monitor Prometheus itself
Monitoring Prometheus itself will help to identify issues with the
monitoring setup itself.

Change-Id: I26cfd395831aebffe9f32922c8e795f8df928b9e
2020-03-23 15:39:29 +01:00
Thomas Draebing
1d6a3dcc5e Remove custom labels added to logs during parsing
Promtail was configured to create labels for nearly every key in the
logs. This was done to support easier label-based querying. Loki,
however, is not optimized to work with labels having a high cardinality.
This led to failures in Loki if it had to handle a high number of
logs. In addition, the high number of labels led to a huge number of
chunks being created, most of them containing just a single log entry,
making querying and storage very inefficient.

This change removes all custom-made labels except for the
gerrit_version label. Logs should rather be queried using the
grep-like syntax of LogQL, for which Loki is optimized.

Change-Id: I70e2a3ff4f640bc6f5d08d50212958a7bca2eae1
2020-03-23 11:53:13 +01:00
Thomas Draebing
ab26ebb833 Increase the chunk_retain_period to 15 minutes
This increases the time a chunk has to be filled before being flushed.
With shorter times, it could happen that during times of low traffic
chunks will not be filled completely before being flushed. This would
lead to small chunk objects, which is inefficient.

Change-Id: I74b2af1a053c8d4298b9e9d7ffca04cb9d8926bd
2020-03-23 11:41:01 +01:00
Thomas Draebing
8b308e2973 Set resource limit for Loki pods
So far, there were no limits to the resources the Loki pod was allowed
to use. This change sets limits that, in my observation, seem to work
for now. As more and more logs are handled, these limits will probably
have to be increased.

Change-Id: I7313488a60da8a1fff28666870549f748400735a
2020-03-17 14:48:52 +01:00
Thomas Draebing
8ab8153f8e Increase number of allowed requests per log parser
The default limit of requests accepted by Loki from a single host was
set to 10000, which is not enough for a large Gerrit instance to push
all httpd/sshd-logs to Loki.

Change-Id: I94cb56e00102170ae4ed10e90123a8885e3aad00
2020-03-17 09:09:51 +01:00
Matthias Sohn
14e7530aab Process dashboard: add panel showing system load
- Rearrange the other panels so that we show system load over cpu usage
  over threads in the left column.
- Reduce height of memory panel a bit

Change-Id: Icaada525f87d0df503f67cf688b94d15a4119034
2020-03-13 17:41:01 +01:00
Matthias Sohn
4a96ed4947 Process dashboard: show number of available CPUs
Change-Id: Ifbf13edb2dfa8f5cee64aea3f9dca006d419ef20
2020-03-13 17:40:53 +01:00
Thomas Draebing
8daaa2695f Add basic dev documentation
Change-Id: I6de025c38fa87d4b70bdd4d8eaf261ced97716f2
2020-03-11 15:23:19 +01:00
Thomas Draebing
be862d863e Move internal project to open source
This change adds the current status of a project that aims to create
a simple monitoring setup to monitor Gerrit servers, which was developed
internally at SAP.

The project provides an opinionated and basic configuration for helm
charts that can be used to install Loki, Prometheus and Grafana on a
Kubernetes cluster. Scripts to easily apply the configuration and
install the whole setup are provided as well.

The contributions so far were done by (with number of commits)

  80  Thomas Draebing
  11  Matthias Sohn
   2  Saša Živkov

Change-Id: I8045780446edfb3c0dc8287b8f494505e338e066
2020-03-11 15:23:19 +01:00
David Pursehouse
4314ca0fbc Initial empty repository 2020-03-06 09:09:23 +00:00