[icinga-users] Icinga2 cluster connection fails and does not reconnect
Per von Zweigbergk
pvz at itassistans.se
Mon Nov 2 10:29:05 CET 2015
I’m having an odd issue with one particular server at one of our customers. We have Icinga2 set up in the "command execution bridge" scenario: no hosts or services are configured on the satellites. Instead, hosts and services are configured only on the central "master" node, which uses command_endpoint to execute the remote checks.
The satellite Icinga2 instances predominantly run on Windows Server 2008 R2, like this one, and they all work fine (including other machines on the same site!) except this one, where the cluster connection fails and is never re-established. The master instance is running Ubuntu Linux 14.04.
The infuriating thing is that there's *nothing* useful in the log files to go on. On the master side, everything works until it suddenly doesn't, with no intervening errors. I see checks being sent and results being received successfully. I also see events like this every 10 seconds, until they suddenly stop coming:
[2015-11-02 09:22:48 +0100] notice/ApiClient: Received 'event::Heartbeat' message from 'srv03.example.com'
And then after a bit over a minute:
[2015-11-02 09:23:59 +0100] information/ApiClient: No messages for identity 'srv03.example.com' have been received in the last 60 seconds.
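To pin down exactly when the heartbeats stop, the master's log can be scanned for gaps between consecutive heartbeat lines from one identity. The following is only a minimal sketch, assuming the log format matches the two lines quoted above; the function name heartbeat_gaps and the 30-second threshold are hypothetical choices, and the timezone offset is ignored on the assumption that the whole log uses the same one.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# [2015-11-02 09:22:48 +0100] notice/ApiClient: Received 'event::Heartbeat' message from 'srv03.example.com'
# The text between the time and the closing bracket (timezone) is skipped.
HEARTBEAT_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) [^\]]*\] "
    r"notice/ApiClient: Received 'event::Heartbeat' message from '([^']+)'"
)


def heartbeat_gaps(lines, identity, max_gap=timedelta(seconds=30)):
    """Return (previous, current) timestamp pairs for heartbeats from
    `identity` that were further apart than `max_gap`.

    `max_gap` is a hypothetical threshold: three missed 10-second beats.
    """
    gaps = []
    previous = None
    for line in lines:
        match = HEARTBEAT_RE.match(line)
        if not match or match.group(2) != identity:
            continue  # not a heartbeat line, or a different endpoint
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if previous is not None and current - previous > max_gap:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Feeding it the lines of the master's log file together with 'srv03.example.com' lists every interruption that was followed by a later heartbeat. A final outage with no subsequent heartbeat does not show up as a gap, so the timestamp of the last matching line still has to be checked separately.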
The log files on the satellite side are equally unhelpful. All I can see is:
[2015-11-02 09:23:50 Västeuropa, normaltid] information/ApiClient: No messages for identity 'icinga.example.com' have been received in the last 60 seconds.
[2015-11-02 09:23:50 Västeuropa, normaltid] warning/ApiClient: API client disconnected for identity 'icinga.example.com'
[2015-11-02 09:23:50 Västeuropa, normaltid] warning/ApiListener: Removing API client for endpoint 'icinga.example.com'. 0 API clients left.
[2015-11-02 09:23:55 Västeuropa, normaltid] information/ApiClient: Reconnecting to API endpoint 'icinga.example.com' via host '192.0.2.237' and port '5665'
After that, it never appears to actually reconnect, and no failures or retries are logged.
The failure occurs intermittently: once as little as 10 minutes after a restart, other times only after hours.
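Since the satellite logs neither a failure nor a retry, one thing that could be checked from the affected machine is whether a plain TCP connection to the master's API port can be opened at all while the cluster link is down. This is a rough sketch using only the Python standard library; the function name probe_endpoint and the 10-second timeout are assumptions, and it only tests the TCP connect, not the TLS handshake or anything Icinga2 does afterwards.

```python
import socket


def probe_endpoint(host, port, timeout=10.0):
    """Attempt a plain TCP connect and report how the attempt ended."""
    try:
        # create_connection resolves the host and connects within `timeout`.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No answer at all within the timeout (e.g. packets silently dropped).
        return "timed out"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else: unreachable network, name resolution failure, etc.
        return f"error: {exc}"
```

Run on the satellite as print(probe_endpoint("192.0.2.237", 5665)), using the host and port from the reconnect log line. A result of "connected" would suggest the network path is fine and the problem sits higher up, while "timed out" would point towards something between the two machines.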
I'm running Icinga 2.3.11 on both the satellite and master.
Any insight into this problem (which right now looks like a black box to me), or at least ideas about what I can look at, would be appreciated.
Per von Zweigbergk
IT-assistans Sverige AB