In brief: monitoring, one more brick in the reliable foundation of the system, has been laid. Details below.

A huge amount of work has been done to improve monitoring:

  • updated alerta and urlmon
  • finally investigated Sensu: we do not need it
  • but we did learn to use sensu-plugins, a great set of free checks that can be run from our cmd_check_alert
  • added heartbeat_mesh to the utilities: a component that checks whether a resource is alive by the presence of its heartbeats
  • completely redesigned the notification model to match the alerta model
  • heavily reworked notify_devilry: recipient chains, sending to alerta, message transformation in progress, a new message format, mute, filtering, etc.
  • completely refactored cmd_check_alert: threads, YAML, code style, parallel processing
  • added positive events to disk_alert, so that incidents in alerta close automatically when a resource returns to normal
  • added inode monitoring to disk_alert
  • added a random pause before launch to disk_alert, to reduce the load on alerta
  • updated the default filters in disk_alert to exclude many unneeded partition types
  • rolled heartbeat_mesh out to production and filled in its configs, manually for now
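
To make the disk_alert items above concrete, here is a minimal sketch of a disk check with inode monitoring, a random pre-launch pause, and a positive event. It is not the actual disk_alert code: the function names, the event fields, the 90% threshold and the "critical"/"ok" severity names are all assumptions made for illustration.

```python
import os
import random
import time


def usage_percent(path):
    """Return (space_used_pct, inodes_used_pct) for the filesystem holding path."""
    st = os.statvfs(path)
    space = 100.0 * (1 - st.f_bavail / st.f_blocks) if st.f_blocks else 0.0
    # Some filesystems report zero inodes; treat that as 0% used.
    inodes = 100.0 * (1 - st.f_favail / st.f_files) if st.f_files else 0.0
    return space, inodes


def build_event(resource, used_pct, threshold=90.0):
    """Build a hypothetical event dict.

    An "ok" severity plays the role of the positive event: the receiver can
    use it to close an incident that an earlier "critical" event opened.
    """
    severity = "critical" if used_pct >= threshold else "ok"
    return {
        "resource": resource,
        "event": "disk_usage",
        "severity": severity,
        "value": f"{used_pct:.1f}%",
    }


def check(path, max_jitter=0.0, threshold=90.0):
    """Sleep a random 0..max_jitter seconds, then check space and inodes."""
    # The random pause spreads checks from many hosts over time, so they
    # do not all hit the receiver at the same moment.
    time.sleep(random.uniform(0, max_jitter))
    space, inodes = usage_percent(path)
    return [
        build_event(f"{path}:space", space, threshold),
        build_event(f"{path}:inodes", inodes, threshold),
    ]
```

Calling `check("/", max_jitter=30)` from cron on every host would return two events per partition; sending them to the receiver is left out of the sketch.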

All this led to a sharp increase in the number of Telegram messages, but at the same time, all notifications that previously arrived only in Telegram now also appear on the alerta dashboard.

At last, it became possible to assess the current monitoring situation as a whole.

During initial operation, the sharp increase in the number of messages made alerta start to buckle under the load, which required tuning uwsgi and postgresql.

Monitoring still needs a lot of improvements:

  • learn to ignore missing heartbeats when the dropout is widespread and caused by the receiver rather than by the senders
  • fill receiver configs from accounting
  • make an extended notify_devilry config for hosts like alerta, so that messages reach alerta with the required client token
  • master more of the sensu-plugins
  • possibly switch from alerta-urlmon to the sensu-plugins + cmd_check_alert combination
  • catch emails from servers in alerts
  • catch pipeline failures in alerts
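
One possible approach to the first item, ignoring widespread heartbeat dropouts, is sketched below. This is not how heartbeat_mesh works today: the function names, the `last_seen` mapping of resource name to last heartbeat timestamp, and the 50% `mass_ratio` cutoff are hypothetical.

```python
import time


def find_stale(last_seen, timeout, now=None):
    """Return the sorted names of resources whose last heartbeat is older than timeout seconds."""
    now = time.time() if now is None else now
    return sorted(name for name, ts in last_seen.items() if now - ts > timeout)


def classify(last_seen, timeout, mass_ratio=0.5, now=None):
    """Decide whether missing heartbeats point at the senders or at the receiver.

    If at least mass_ratio of all known resources went silent at once, the
    receiver (or the network in front of it) is the more likely culprit, so
    per-resource alerts can be suppressed in favor of a single one.
    """
    stale = find_stale(last_seen, timeout, now)
    if last_seen and len(stale) / len(last_seen) >= mass_ratio:
        return "receiver_suspect", stale
    return "senders", stale
```

With three resources and a 60-second timeout, one silent resource is classified as `"senders"`, while two silent resources out of three cross the 50% cutoff and are classified as `"receiver_suspect"`.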

In accounting/services.py, added the ability to run states on all servers of all clients.

Completely refactored the lxd states in sysadmws-formula so that they no longer use the LXD API. They are now fast and reliable.

Clients from the travel segment have begun paying off their debts.
