the-distro/gerrit-monitoring

Author	SHA1	Message	Date
Thomas Draebing	ce5b8300f1	Start using Grafonnet to create Grafana dashboards Versioning the pure JSON files representing the Grafana dashboards had some disadvantages. It was hard to review them, they were very cluttered and a lot was duplicated. There are some tools that deal with that. One of them is Grafonnet, which is a superset of Jsonnet, a tool to create JSON files using a domain specific language. This change implements the Gerrit Process dashboard in Grafonnet. It also extends the installer to be able to install dashboards in the Jsonnet format. Change-Id: `I6235fb7d045bd71557678a4e3b0d4ad4515f4615`	2020-12-04 08:31:21 +01:00
Thomas Draebing	baa386bd98	Update Prometheus chart to 12.0.0. This also changes the helm chart repository, since the old one was deprecated. Further, the new version adapts the resources to not contain deprecated APIs. Change-Id: `Idd3f1ed48e22da303fd62d9c2ee63ccb959ed948`	2020-12-01 07:14:29 +00:00
Thomas Draebing	f9867a49ef	Update helm chart stable repository URL The stable repository for helm charts was moved to a new URL. The old one will be unavailable soon. Change-Id: `I34300992764bab012e8dd602d75f19817dcdd7ba`	2020-11-27 10:40:11 +01:00
Thomas Draebing	bec7bf7897	Adapt dashboards to be accepted by Grafana dashboard repository Grafana provides a repository for dashboards that can be used to easily import dashboards. Providing these dashboards there would make it easier for users not using the full setup provided here to still use the dashboards. To be able to upload however, the datasource reference in the dashboards has to be a template. This is however not compatible with the way how the dashboards are imported in the Grafana of the stack provided here. Thus, the variables are removed during the installation. Change-Id: `I99f127882a6f7594ca1c40fbe1e299378e89f4e9`	2020-11-27 10:40:09 +01:00
Thomas Draebing	65582f2deb	Also monitor parallel GC This change - adds metrics for parallel GC to the GC panel in the Gerrit Process dashboard - configures the GC panel to only show queries with values other than null - changes the interval to one minute, which fits the scrape interval - changes the default time frame to the last 24h, which is used for most other dashboards Change-Id: `I3b6587e769ae7486a02e26b8d7f2822319eb94e6`	2020-08-25 13:20:11 +02:00
Thomas Draebing	f5c4885e67	Remove basic auth between promtail chart and loki The promtail chart is anyway configured to use the Loki service for pushing logs. The service itself is not password protected and this was thus not required. Change-Id: `I886b76ca7e5d6e8af370a2cd0f527892008c7600`	2020-08-19 13:28:44 +02:00
Thomas Dräbing	50c3a5aac8	Merge changes I574c3b05,I95020080,I894e47f3,I86c5c547 * changes: Adapt to ytt 0.28.0 Sort monitoring and logging components into sub-maps in the config Collect logs from Gerrit in Kubernetes Add promtail chart to collect logs from cluster	2020-06-30 12:51:50 +00:00
Thomas Draebing	ad0b8c71ee	Add alert on Gerrit threads in deadlock This adds an alert that is firing, if 1 or more threads of a Gerrit instance are in a deadlock. Change-Id: `Ie2e14e81381e07de2559b42b91d6e483639831ef`	2020-06-25 09:00:06 +02:00
Thomas Draebing	89ee46a081	Adapt to ytt 0.28.0 Ytt 0.28.0 introduced a breaking change. The --output-directory option was removed. This was done because this option implicitly emptied the directory, which could be dangerous. While this option still exists under a different name, the --output-files option is now recommended. The installer now uses the --output-files option, but to ensure a clean installation, it checks whether the directory already exists and, if it does, asks the user whether it can empty it. If it is not allowed to do so, the installation will abort. Change-Id: `I574c3b054e9293c0534d609c062946cd39890793`	2020-06-19 17:40:09 +02:00
Thomas Draebing	3b4005a047	Sort monitoring and logging components into sub-maps in the config This is done in preparation to allow multiple logging stacks. Change-Id: `I950200805ec01851bfdf6ccc3a5243893a947616`	2020-05-27 16:30:33 +02:00
Thomas Draebing	3887f2b53c	Collect logs from Gerrit in Kubernetes This adds a service discovery configuration for promtail to also collect logs for Gerrit installations in Kubernetes. The installations will be discovered by namespace and a given label. Change-Id: `I894e47f37428add9b44df6596950d314ee2a3ed0`	2020-05-27 16:30:33 +02:00
Thomas Draebing	de8fee4f68	Add promtail chart to collect logs from cluster This adds the promtail chart to the installation that allows to collect the logs of the applications in the cluster, which are written to stdout of the containers. This will only collect logs from pods in the same namespace as the monitoring setup. In a later change also logs from Gerrit instances in Kubernetes will be added. Change-Id: `I86c5c5470eaa31191fb5ac339ee21dee85106097`	2020-05-27 16:30:31 +02:00
Thomas Draebing	aab93a806b	Fix error if output directory didn't exist Change-Id: `Ib1fecac1433bf20d4c6c45a4f13b17ee8c864e73`	2020-05-26 14:29:26 +02:00
Thomas Draebing	451882b7e9	Allow to monitor Gerrit on Kubernetes So far it was only possible to monitor single instance Gerrit servers. This was due to the fact that a URL had to be used that pointed to a dedicated instance, since if multiple replicas would be behind the instance, the metrics of a random replica would be scraped and not of all. Prometheus has a service discovery functionality for deployments running in Kubernetes. This is now used, when monitoring a Gerrit instance in Kubernetes. This allows to have a variable number of replicas running, which will be automatically discovered by Prometheus. The dashboards were adapted accordingly and allow now to select the replica to be observed. For now, no summary of all replicas can be displayed in the dashboards, but that feature is planned to be added in the future. Change-Id: `I96efc63a192cd90f5e3e91a53dace8e1ae83132e`	2020-05-14 15:55:35 +02:00
Thomas Draebing	7663baf7be	Use gerrit_build_info metric to display Gerrit version This replaces the hacky graph showing the Gerrit version with a table showing the current Gerrit version information. Change-Id: `Idfbdc85e376953aead40fea06544e5c84fb777e7`	2020-05-14 15:33:14 +02:00
Matthias Sohn	e8b2651af2	Add latency dashboard Add graphs for the following latency metrics - receive-commit - query total - query changes - REST total - REST change list comments - REST change list robot comments - REST change post review - REST get change detail - REST get change diff - REST get change - REST get commit - REST get change revision actions Change-Id: `Id782e12335ae76820cac4e4e8c80450671bf8216`	2020-05-05 18:30:18 +02:00
Thomas Draebing	dc60bd1654	Fix installation if TLS verification is skipped The installation failed, if TLS verification was disabled and no CA certificate was given in the configuration. This happened because the installation script always expected the CA certificate. The installation now only expects the certificate, if TLS verification is enabled. Change-Id: `I5429fc1ee0d230c74cc0689607cf2736d6520030`	2020-04-29 17:36:08 +02:00
Thomas Draebing	d0b53a0970	Create CA-certificate file for promtail during installation For TLS-verification promtail requires a CA-certificate, which had to be created manually. Change-Id: `Ia1fe191bad7f3d1ca4a1568921ad67d22c47efd7`	2020-04-16 14:25:53 +02:00
Thomas Draebing	2ead0f0a05	Version promtail version This adds the promtail version used in the setup to a file and adds an installation step downloading promtail, if the installation is not run in `dryrun`-mode. Change-Id: `I1127220a57b2610b5c4458ce2205077706a860e6`	2020-04-16 14:25:53 +02:00
Thomas Draebing	0bdb1d02e0	Create promtail config per Gerrit host So far the install-script could only create a single promtail config. Since the monitoring setup is able to monitor multiple Gerrit servers, this caused manual work to create a promtail config per Gerrit server. Now ytt will create a configuration for each Gerrit host configured in the config.yaml. Ytt is only able to do that in a single file. Thus, csplit is used to split the files into separate files that can then be used to configure promtail on the respective hosts. The config files can then be found under $OUTPUT/promtail/promtail-$GERRIT_HOSTNAME.yaml. Change-Id: `Ib09fba83d8a8fbd45b42e9e5388a85a37ab1a952`	2020-04-16 14:25:53 +02:00
Thomas Draebing	6b75c12831	Rewrite the scripts in python The scripts were written in bash. Using bash became quite unwieldy. Python by nature can deal well with yaml and is thus better suited in dealing with the yaml-based configuration files. This change rewrites the original scripts staying as close as possible to the original ones. Right now, the python scripts call subprocesses a lot to work with the tools, which were already used before. At least for yaml-templating there may be better tools that have a python integration, which could be used in the future. Change-Id: `Ida16318445a05dcfdada9c7a56a391e4827f02e7`	2020-04-16 14:25:50 +02:00
Thomas Draebing	3f8594c3cb	Fix typo in install.sh script Change-Id: `Ib4529df6924d80032a24387db26719a8105b5496`	2020-04-15 14:03:45 +02:00
Thomas Dräbing	81ab4f166a	Merge changes I1ba3967a,Id55095c3 * changes: Describe infrastructure dependencies Use object store to store chunks created by Loki	2020-04-08 13:18:16 +00:00
Thomas Dräbing	b34c47f817	Merge changes I1efdc490,I220d90d3,I405f09f7,I392b2ddf,I84062d6e * changes: Relabel the instance label for prometheus and loki metrics Add dashboard for Loki metrics Add dashboard to monitor Prometheus data Only show Gerrit instances in the instance dropdowns Create a configmap per dashboard	2020-04-08 13:17:58 +00:00
Thomas Draebing	a8135ce8c4	Relabel the instance label for prometheus and loki metrics The instance label for Prometheus had the value localhost:9090, which was misleading. Now the label is relabeled to prometheus-<namespace> or loki-<namespace>. This is still not ideal for cases, where multiple replicas are deployed, but until then, it is already a slight improvement. Change-Id: `I1efdc49071b1d3bf99d21315ca03821e9d58c906`	2020-04-03 13:36:34 +02:00
Thomas Dräbing	e2a5902494	Merge "Show more lines in log queries in Grafana"	2020-04-03 09:58:31 +00:00
Thomas Draebing	f960eb5eab	Add dashboard for Loki metrics Change-Id: `I220d90d33be3ed292402f3adb7386953cad7b0de`	2020-04-03 11:56:24 +02:00
Thomas Draebing	ff7fd22ca2	Add dashboard to monitor Prometheus data This is an adapted version of this dashboard: https://grafana.com/grafana/dashboards/3681 Change-Id: `I405f09f75698b940becd6994a7fc457853603756`	2020-04-03 11:56:24 +02:00
Thomas Draebing	442bf6fb98	Only show Gerrit instances in the instance dropdowns A variable was used to select the Gerrit instance to observe in the dashboards. Since the instance label is set for all targets that prometheus scrapes, the variable would also contain e.g. the prometheus instance. Now only Gerrit instances are displayed by further filtering for a metric specific for Gerrit. Change-Id: `I392b2ddf53a0ea49db25018dc5d37d269365812a`	2020-04-03 11:37:27 +02:00
Thomas Draebing	623332e4b3	Create a configmap per dashboard If the dashboard files got too large (>2Mb) Kubernetes was rejecting the configmap. Now each dashboard is installed with an own configmap. A sidecar container is used to register these dashboards with Grafana. Change-Id: `I84062d6e2ac7dc2669945b54575bf239a25900a4`	2020-03-26 09:55:39 +01:00
Thomas Dräbing	6d3c31e50c	Merge "Update Grafana to 6.7.1"	2020-03-26 08:06:30 +00:00
Matthias Sohn	d7d1703c44	Merge changes I2cd9c872,I26cfd395 * changes: Scrape Loki metrics Monitor Prometheus itself	2020-03-24 22:41:09 +00:00
Thomas Draebing	202a3168ce	Show more lines in log queries in Grafana The default maximum log lines shown in Grafana are 1000. This is barely covering a few minutes in the httpd-logs. The value of 10,000 can still be handled by the browser. More log entries will cause the browser to cache as long as Grafana does not provide pagination, which is planned for the future. Change-Id: `Ife84d161cd022300ff6f440920021e4176b770b9`	2020-03-24 16:21:01 +01:00
Thomas Draebing	10a0a54069	Update Grafana to 6.7.1 The most interesting new features are: - proper limits for queried logs - query history for logs (still a beta feature) Change-Id: `Ibd8b76b0e1e16d4bd3c74382fa3fd5a24c1bba45`	2020-03-24 16:20:54 +01:00
Thomas Draebing	aa0c5252f0	Describe infrastructure dependencies Change-Id: `I1ba3967a10e5cd35aff60579eff388252c81874b`	2020-03-24 16:01:36 +01:00
Thomas Draebing	eb4e6ea191	Use object store to store chunks created by Loki The chunks created by Loki were stored in a persistent volume. This does not scale well, since volumes cannot easily be resized in Kubernetes. Also, at least the ext4-filesystem had issues, when large numbers of logs were saved. These issues are due to the dir_index as discussed in [1]. An object store provides a more scalable and cheaper solution. Loki supports S3 as an object storage and also other object stores that understand the S3 API like Ceph or OpenStack Swift. [1] https://github.com/grafana/loki/issues/1502 Change-Id: `Id55095c3b6659f40708712c1a494753dbcab7686`	2020-03-24 16:01:34 +01:00
Thomas Dräbing	5d4c32212e	Merge changes Icaada525,Ifbf13edb * changes: Process dashboard: add panel showing system load Process dashboard: show number of available CPUs	2020-03-24 14:50:12 +00:00
Thomas Draebing	b1be26012b	Scrape Loki metrics Change-Id: `I2cd9c872882cd760fc2ff10028b7e03a31f8fba5`	2020-03-23 16:09:54 +01:00
Thomas Draebing	ead4e7d5cc	Monitor Prometheus itself Monitoring Prometheus itself will help to identify issues with the monitoring setup itself. Change-Id: `I26cfd395831aebffe9f32922c8e795f8df928b9e`	2020-03-23 15:39:29 +01:00
Thomas Draebing	1d6a3dcc5e	Remove custom labels added to logs during parsing Promtail was configured to create labels for nearly every key in the logs. This was done to support easier label-based querying. Loki however is not optimized to work with labels having a high cardinality. This led to failures in Loki, if it had to handle a high number of logs. In addition, the high number of labels led to a huge number of chunks being created, mostly just containing a single log entry, making querying and storage very inefficient. This change removes all custom made labels, except for the gerrit_version label. Logs should rather be queried using the grep-like syntax of LogQL for which Loki is optimized. Change-Id: `I70e2a3ff4f640bc6f5d08d50212958a7bca2eae1`	2020-03-23 11:53:13 +01:00
Thomas Draebing	ab26ebb833	Increase the chunk_retain_period to 15 minutes This increases the time a chunk has to be filled before being flushed. With shorter times, it could happen that during times of low traffic chunks will not be filled completely before being flushed. This would lead to small chunk objects, which is inefficient. Change-Id: `I74b2af1a053c8d4298b9e9d7ffca04cb9d8926bd`	2020-03-23 11:41:01 +01:00
Thomas Draebing	8b308e2973	Set resource limit for Loki pods So far, there were no limits to the resources the Loki pod was allowed to use. This now sets limits that in my observation for now seem to work. With handling more and more logs, these limits will probably have to be increased. Change-Id: `I7313488a60da8a1fff28666870549f748400735a`	2020-03-17 14:48:52 +01:00
Thomas Draebing	8ab8153f8e	Increase number of allowed requests per log parser The default limit of requests accepted by Loki from a single host was set to 10000, which is not enough for a large Gerrit instance to push all httpd/sshd-logs to Loki. Change-Id: `I94cb56e00102170ae4ed10e90123a8885e3aad00`	2020-03-17 09:09:51 +01:00
Matthias Sohn	14e7530aab	Process dashboard: add panel showing system load - Rearrange the other panels so that we show system load over cpu usage over threads in the left column. - Reduce height of memory panel a bit Change-Id: `Icaada525f87d0df503f67cf688b94d15a4119034`	2020-03-13 17:41:01 +01:00
Matthias Sohn	4a96ed4947	Process dashboard: show number of available CPUs Change-Id: `Ifbf13edb2dfa8f5cee64aea3f9dca006d419ef20`	2020-03-13 17:40:53 +01:00
Thomas Draebing	8daaa2695f	Add basic dev documentation Change-Id: `I6de025c38fa87d4b70bdd4d8eaf261ced97716f2`	2020-03-11 15:23:19 +01:00
Thomas Draebing	be862d863e	Move internal project to open source This change adds the current status of a project that aims to create a simple monitoring setup to monitor Gerrit servers, which was developed internally at SAP. The project provides an opinionated and basic configuration for helm charts that can be used to install Loki, Prometheus and Grafana on a Kubernetes cluster. Scripts to easily apply the configuration and install the whole setup are provided as well. The contributions so far were done by (with number of commits) 80 Thomas Draebing 11 Matthias Sohn 2 Saša Živkov Change-Id: `I8045780446edfb3c0dc8287b8f494505e338e066`	2020-03-11 15:23:19 +01:00
David Pursehouse	4314ca0fbc	Initial empty repository	2020-03-06 09:09:23 +00:00

48 commits