Commit graph

15 commits

Author SHA1 Message Date
Thomas Dräbing
81ab4f166a Merge changes I1ba3967a,Id55095c3
* changes:
  Describe infrastructure dependencies
  Use object store to store chunks created by Loki
2020-04-08 13:18:16 +00:00
Thomas Dräbing
b34c47f817 Merge changes I1efdc490,I220d90d3,I405f09f7,I392b2ddf,I84062d6e
* changes:
  Relabel the instance label for prometheus and loki metrics
  Add dashboard for Loki metrics
  Add dashboard to monitor Prometheus data
  Only show Gerrit instances in the instance dropdowns
  Create a configmap per dashboard
2020-04-08 13:17:58 +00:00
Thomas Draebing
a8135ce8c4 Relabel the instance label for prometheus and loki metrics
The instance label for Prometheus had the value localhost:9090, which
was misleading.

Now the label is relabeled to prometheus-<namespace> or loki-<namespace>.
This is still not ideal when multiple replicas are deployed, but it is
already a slight improvement.

Change-Id: I1efdc49071b1d3bf99d21315ca03821e9d58c906
2020-04-03 13:36:34 +02:00
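
The naming scheme introduced in a8135ce8c4 can be illustrated with a small sketch. It only shows how the new label value is composed; it is not the chart's actual relabel configuration, and the function name and example namespace are hypothetical.

```python
def relabel_instance(component: str, namespace: str) -> str:
    """Return the instance label value in the form <component>-<namespace>."""
    # Only the two components named in the commit message are handled here.
    if component not in ("prometheus", "loki"):
        raise ValueError(f"unsupported component: {component}")
    return f"{component}-{namespace}"


# Replaces a misleading value such as "localhost:9090".
print(relabel_instance("prometheus", "monitoring"))  # prints "prometheus-monitoring"
```

As the commit message notes, a value built this way is identical for every replica in a namespace, so it cannot tell replicas apart.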
Thomas Dräbing
e2a5902494 Merge "Show more lines in log queries in Grafana"
2020-04-03 09:58:31 +00:00
Thomas Draebing
623332e4b3 Create a configmap per dashboard
If the dashboard files got too large (>2 MB), Kubernetes rejected
the configmap.

Now each dashboard is installed with its own configmap. A sidecar container
is used to register these dashboards with Grafana.

Change-Id: I84062d6e2ac7dc2669945b54575bf239a25900a4
2020-03-26 09:55:39 +01:00
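
The approach from 623332e4b3 can be sketched as follows: one ConfigMap manifest is generated per dashboard, so no single object grows with the number of dashboards. This is a minimal illustration and not the repository's installer code. The function name, the ConfigMap naming pattern, and the `grafana_dashboard` label that the sidecar is assumed to watch are all assumptions.

```python
import json

# Assumed label; the sidecar in the deployed chart may be configured differently.
SIDECAR_LABEL = "grafana_dashboard"


def configmaps_per_dashboard(dashboards: dict, namespace: str) -> list:
    """Build one ConfigMap manifest per dashboard instead of a shared one."""
    manifests = []
    for name, dashboard in sorted(dashboards.items()):
        manifests.append({
            "apiVersion": "v1",
            "kind": "ConfigMap",
            "metadata": {
                # Hypothetical naming pattern, one object per dashboard.
                "name": f"grafana-dashboard-{name}",
                "namespace": namespace,
                "labels": {SIDECAR_LABEL: "1"},
            },
            # Each ConfigMap carries exactly one dashboard JSON file.
            "data": {f"{name}.json": json.dumps(dashboard)},
        })
    return manifests
```

Calling `configmaps_per_dashboard({"gerrit": {...}, "loki": {...}}, "monitoring")` returns two manifests, each small enough to be checked against the size limit on its own.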
Thomas Dräbing
6d3c31e50c Merge "Update Grafana to 6.7.1"
2020-03-26 08:06:30 +00:00
Thomas Draebing
202a3168ce Show more lines in log queries in Grafana
The default maximum number of log lines shown in Grafana is 1000, which
barely covers a few minutes of the httpd logs.

The value of 10,000 can still be handled by the browser. More log
entries will cause the browser to cache as long as Grafana does not
provide pagination, which is planned for the future.

Change-Id: Ife84d161cd022300ff6f440920021e4176b770b9
2020-03-24 16:21:01 +01:00
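
The commit message for 202a3168ce does not say where the limit is set. One plausible place is the `maxLines` option in the `jsonData` of a provisioned Loki data source, and the sketch below assumes that. The function name and the example URL are hypothetical.

```python
def loki_datasource(url: str, max_lines: int = 10000) -> dict:
    """Build a Grafana data source provisioning entry for Loki."""
    if max_lines <= 0:
        raise ValueError("max_lines must be positive")
    return {
        "name": "Loki",
        "type": "loki",
        "access": "proxy",
        "url": url,
        # Assumed location of the limit: raises the default of 1000 lines.
        "jsonData": {"maxLines": max_lines},
    }


entry = loki_datasource("http://loki.example.invalid:3100")
print(entry["jsonData"]["maxLines"])  # prints 10000
```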
Thomas Draebing
10a0a54069 Update Grafana to 6.7.1
The most interesting new features are:
- proper limits for queried logs
- query history for logs (still a beta feature)

Change-Id: Ibd8b76b0e1e16d4bd3c74382fa3fd5a24c1bba45
2020-03-24 16:20:54 +01:00
Thomas Draebing
eb4e6ea191 Use object store to store chunks created by Loki
The chunks created by Loki were stored in a persistent volume. This
does not scale well, since volumes cannot easily be resized in
Kubernetes. Also, at least the ext4 filesystem had issues when large
numbers of logs were saved. These issues are due to dir_index, as
discussed in [1].

An object store provides a more scalable and cheaper solution. Loki
supports S3 as well as other object stores that understand the S3 API,
such as Ceph or OpenStack Swift.

[1] https://github.com/grafana/loki/issues/1502

Change-Id: Id55095c3b6659f40708712c1a494753dbcab7686
2020-03-24 16:01:34 +01:00
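
For eb4e6ea191, a rough sketch of what pointing Loki at an S3-compatible store could look like is given below. It assumes Loki's `storage_config.aws` section with an `s3://` URL and path-style addressing for a non-AWS endpoint. The helper name, endpoint, bucket, and credentials are placeholders, and the keys actually used by this chart may differ.

```python
from urllib.parse import quote


def loki_s3_storage_config(access_key: str, secret_key: str,
                           endpoint: str, bucket: str) -> dict:
    """Build a storage_config fragment for an S3-compatible object store."""
    # Credentials are percent-encoded so characters like "/" do not break the URL.
    url = "s3://{}:{}@{}/{}".format(
        quote(access_key, safe=""), quote(secret_key, safe=""), endpoint, bucket
    )
    return {
        "aws": {
            "s3": url,
            # Path-style addressing is commonly needed for Ceph or Swift endpoints.
            "s3forcepathstyle": True,
        }
    }


config = loki_s3_storage_config("key", "s/cret", "ceph.example.invalid", "loki-chunks")
print(config["aws"]["s3"])  # prints "s3://key:s%2Fcret@ceph.example.invalid/loki-chunks"
```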
Thomas Draebing
b1be26012b Scrape Loki metrics
Change-Id: I2cd9c872882cd760fc2ff10028b7e03a31f8fba5
2020-03-23 16:09:54 +01:00
Thomas Draebing
ead4e7d5cc Monitor Prometheus itself
Monitoring Prometheus itself helps to identify issues with the
monitoring setup.

Change-Id: I26cfd395831aebffe9f32922c8e795f8df928b9e
2020-03-23 15:39:29 +01:00
Thomas Draebing
ab26ebb833 Increase the chunk_retain_period to 15 minutes
This increases the time a chunk has to be filled before being flushed.
With shorter times, it could happen that during times of low traffic
chunks will not be filled completely before being flushed. This would
lead to small chunk objects, which is inefficient.

Change-Id: I74b2af1a053c8d4298b9e9d7ffca04cb9d8926bd
2020-03-23 11:41:01 +01:00
Thomas Draebing
8b308e2973 Set resource limit for Loki pods
So far, there were no limits on the resources the Loki pod was allowed
to use. This change sets limits that have worked in my observations so
far. As more and more logs are handled, these limits will probably have
to be increased.

Change-Id: I7313488a60da8a1fff28666870549f748400735a
2020-03-17 14:48:52 +01:00
Thomas Draebing
8ab8153f8e Increase number of allowed requests per log parser
The default limit of requests accepted by Loki from a single host was
set to 10000, which is not enough for a large Gerrit instance to push
all httpd/sshd-logs to Loki.

Change-Id: I94cb56e00102170ae4ed10e90123a8885e3aad00
2020-03-17 09:09:51 +01:00
Thomas Draebing
be862d863e Move internal project to open source
This change adds the current status of a project that aims to create
a simple monitoring setup to monitor Gerrit servers, which was developed
internally at SAP.

The project provides an opinionated and basic configuration for helm
charts that can be used to install Loki, Prometheus and Grafana on a
Kubernetes cluster. Scripts to easily apply the configuration and
install the whole setup are provided as well.

The contributions so far were made by (with number of commits):

  80  Thomas Draebing
  11  Matthias Sohn
   2  Saša Živkov

Change-Id: I8045780446edfb3c0dc8287b8f494505e338e066
2020-03-11 15:23:19 +01:00