Monitoring and other tech updates

2020-08-21

Briefly: one more brick has been laid in the reliable foundation of the system – monitoring. More details below.

A huge amount of work has been done to improve monitoring:

updated alerta and alerta-urlmon
Sensu has finally been evaluated: we do not need it
but we learned to use sensu-plugins, a great set of free checks that can be run from our cmd_check_alert
the heartbeat_mesh component has been added to the utilities; it checks the liveness of a resource by the presence of heartbeats
completely redesigned the notification model to match the alerta model
heavily redesigned notify_devilry to work with recipient chains, sending to alerta, on-the-fly message transformation, a new message format, mute, filtering, etc.
complete refactoring of cmd_check_alert – threads, yaml, code-style, parallel processing
added positive events to disk_alert, so that incidents in alerta are closed automatically when the resource returns to normal
inodes monitoring added to disk_alert
added a random pause before launch to disk_alert – to reduce the load on alerta
default filters in disk_alert updated to exclude many unnecessary partition types
heartbeat_mesh has been rolled out to production and its configs are filled in, manually for now
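
To illustrate the disk_alert items above (inode monitoring, positive events, the random pause), here is a minimal sketch in Python. It is not the actual disk_alert code: the function names, the 80/90 percent thresholds and the result format are assumptions made for the example. The "ok" severity plays the role of the positive event that lets an open incident be closed.

```python
import os
import random
import time


def usage_percent(total: int, free: int) -> float:
    """Return the used share of a resource in percent (0.0 when total is 0)."""
    if total <= 0:
        # Some filesystems report zero inodes; treat that as "nothing used".
        return 0.0
    return 100.0 * (total - free) / total


def classify(percent: float, warn: float = 80.0, crit: float = 90.0) -> str:
    """Map a usage percentage to a severity; "ok" is the positive event."""
    if percent >= crit:
        return "critical"
    if percent >= warn:
        return "warning"
    return "ok"


def check_mount(path: str) -> dict:
    """Check both block usage and inode usage of one mount point (Unix only)."""
    st = os.statvfs(path)
    # f_bavail / f_favail are what an unprivileged user can still allocate,
    # so reserved blocks count as used here.
    space = usage_percent(st.f_blocks, st.f_bavail)
    inodes = usage_percent(st.f_files, st.f_favail)
    return {
        "resource": path,
        "space": classify(space),
        "inodes": classify(inodes),
    }


def random_pause(max_seconds: float) -> float:
    """Sleep for a random time so that many hosts do not report at once."""
    delay = random.uniform(0.0, max_seconds)
    time.sleep(delay)
    return delay
```

A run would call random_pause() once at start and then check_mount() for every mount point that passes the partition-type filters; sending the result to alerta is left out on purpose.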

All this led to a sharp increase in the number of messages in Telegram, but at the same time, all notifications that previously arrived only in Telegram are now also displayed on the alerta dashboard.

Finally, it became possible to assess the overall monitoring situation in one place.

During initial operation, the sharp increase in the number of messages made alerta start to buckle under the load, which required tuning uwsgi and postgresql.

Monitoring still needs a lot of improvements:

learn to ignore heartbeat dropouts when they are a mass event caused by the receiver rather than by the senders
fill receiver configs from accounting
make an extended notify_devilry config for hosts like alerta, so that messages arrive in alerta with the required client token
master many sensu-plugins
possibly switch from alerta-urlmon to the sensu-plugins + cmd_check_alert components
catch emails from servers in alerts
catch pipeline fails in alerts
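
The receiver-versus-sender item in the list above could be approached roughly as in the sketch below. This is only an illustration, not heartbeat_mesh code: the function names, the dictionary of last-seen timestamps and the 0.5 ratio are all assumptions. The idea is that when most senders go silent at the same moment, the receiver is the more likely culprit, so per-sender alerts are suppressed.

```python
import time
from typing import Dict, List, Optional


def stale_senders(last_seen: Dict[str, float], timeout: float,
                  now: Optional[float] = None) -> List[str]:
    """Return senders whose last heartbeat is older than `timeout` seconds."""
    if now is None:
        now = time.time()
    return sorted(s for s, ts in last_seen.items() if now - ts > timeout)


def diagnose(last_seen: Dict[str, float], timeout: float,
             mass_ratio: float = 0.5,
             now: Optional[float] = None) -> dict:
    """Separate individual dropouts from a likely receiver-side outage.

    If more than `mass_ratio` of all senders went silent at once, the
    receiver is suspected and no per-sender alerts are produced.
    """
    stale = stale_senders(last_seen, timeout, now)
    if last_seen and len(stale) / len(last_seen) > mass_ratio:
        return {"receiver_suspect": True, "alerts": []}
    return {"receiver_suspect": False, "alerts": stale}
```

The ratio is a blunt heuristic; a real setup would probably also want an independent check of the receiver itself before muting anything.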

In accounting/services.py, the ability to run states on all servers of all clients has been added.

The lxd states in sysadmws-formula have been completely refactored and no longer use the LXD API. They are now fast and reliable.

Clients from the travel segment began to close their debts.
