TCP Keepalive & Dead Connection Detection

  • 17 November 2021


Hi

One of the changes from LogPoint 5 to 6 that I was excited to see implemented was support for session keepalive in the syslog collector.

Most people do not think much about it, but I would say it is part of ensuring a stable operating environment.

Running 'netstat -ano | grep 514' in the CLI will probably give you something like the listing below. (I have pasted in the header lines as well, since they do not show when using '| grep'.)

--------------
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       Timer
tcp6       0      0 :::514                  :::*                    LISTEN      off (0.00/0/0)
tcp6       0      0 172.20.20.20:514        172.20.10.107:50554    ESTABLISHED keepalive (7126.58/0/0)
tcp6       0      0 172.20.20.20:514        172.20.10.100:51662    ESTABLISHED keepalive (7126.58/0/0)
tcp6       0      0 172.20.20.20:514        172.25.20.42:50053      ESTABLISHED keepalive (7135.92/0/0)

---------------

This shows the TCP syslog connections and that keepalive is enabled on them.

7126.58 is the number of seconds left before the keepalive timer for that session fires, that is, before the first keepalive probe is sent if the connection stays idle. This is where I realized that LogPoint may have introduced keepalive but kept the standard configuration. Then again, tailoring these values to the specific installation is a job for each site.
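To pull just the peer address and the remaining keepalive timer out of that listing, a small filter can help. This is a minimal sketch, assuming the column layout of 'netstat -ano' shown above (no PID column); the function name keepalive_timers is made up for illustration.

```shell
# Print "peer_address seconds_until_keepalive_timer" for every
# connection whose timer column reads "keepalive".
# Assumes the 'netstat -ano' layout: $5 = foreign address,
# $7 = timer name, $8 = "(remaining/retransmits/probes)".
keepalive_timers() {
    awk '$7 == "keepalive" {
        split($8, t, "/")        # t[1] looks like "(7126.58"
        sub(/^[(]/, "", t[1])    # drop the leading parenthesis
        print $5, t[1]
    }'
}

# Example with one line from the listing above:
echo 'tcp6 0 0 172.20.20.20:514 172.20.10.107:50554 ESTABLISHED keepalive (7126.58/0/0)' \
    | keepalive_timers
```

On a live system the same filter can be fed directly, for example 'netstat -ano | grep 514 | keepalive_timers'.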

To understand a bit more about this, try pasting the following command sequence into the CLI.

sysctl \
net.ipv4.tcp_keepalive_time \
net.ipv4.tcp_keepalive_intvl \
net.ipv4.tcp_keepalive_probes

And you will now get something like the below.

-----------
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 25
net.ipv4.tcp_keepalive_probes = 9

-----------

7200 seconds is the default idle time before TCP keepalive kicks in; it is not a session length. Once a connection has been idle for 7200 seconds, the first keepalive probe is sent. If a probe goes unanswered, it is retried at intervals of 25 seconds. After 9 successive unanswered probes, the connection is brought down. With these values, a dead peer is only detected after roughly 7200 + 9 x 25 = 7425 seconds, a little over two hours.
If you want to know more about TCP keepalive and DCD (Dead Connection Detection), 'https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/index.html' is a good place to visit.
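The arithmetic behind dead connection detection can be put in a tiny helper. This is only a sketch: the function name keepalive_detect_seconds is hypothetical, and the result is an approximation of the worst case (idle time plus all unanswered probes).

```shell
# Approximate seconds until an idle connection to a dead peer is dropped:
# idle time before the first probe + probe interval * number of probes.
# Arguments: keepalive_time keepalive_intvl keepalive_probes
keepalive_detect_seconds() {
    echo $(( $1 + $2 * $3 ))
}

keepalive_detect_seconds 7200 25 9    # prints 7425
keepalive_detect_seconds 1500 60 10   # prints 2100
```

To evaluate the values on a running system, the three settings can be passed straight in: 'keepalive_detect_seconds $(sysctl -n net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes)'.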

Knowing a bit about networks, I suspect that in most modern networks the communication between a log source and the LogPoint Back-End/Collector/LPC server traverses one or more firewalls or load balancers. On these devices concurrent sessions are a scarce resource. Depending on the firewall vendor, the default inactivity timeout for a session can be anything from 30 minutes to 1 hour, and load balancers might be even more aggressive.
This typically results in sessions being torn down by the firewall or load balancer, leaving both the initiating and the receiving end unaware that their session has been terminated.

You might recognize some of the symptoms in the snippet below from an 'nxlog.log' file:

---------------
2021-08-23 11:22:23 ERROR couldn't connect to tcp socket on 10.9.9.9:514; A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  
.
.
2021-10-14 09:50:12 ERROR couldn't connect to tcp socket on 10.9.9.9:514; No connection could be made because the target machine actively refused it.  
.
.
.
2021-11-09 21:02:35 ERROR om_tcp send failed; An existing connection was forcibly closed by the remote host.  
2021-11-09 21:02:36 INFO connecting to 10.9.9.9:514
2021-11-09 21:02:57 INFO reconnecting in 2 seconds
2021-11-09 21:02:57 ERROR couldn't connect to tcp socket on 10.9.9.9:514; A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  
2021-11-09 21:02:59 INFO connecting to 10.9.9.9:514

---------------

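If you decide to do something about these issues, start by investigating the communication path between your LogPoint servers and your log sources and mapping the inactivity timeouts along it. Then decide on an optimal configuration of the TCP stack on your LogPoint server. The keepalive time has to be shorter than the shortest inactivity timeout on the path, so that the probes refresh the session in the intermediate devices before it expires.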
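Once the timeouts on the path are mapped, checking a candidate keepalive time against them is simple. The sketch below is an illustration only: the function names are made up, and the timeouts of 3600 and 1800 seconds are hypothetical examples taken from the 30 minutes to 1 hour range mentioned earlier, not measured values.

```shell
# Smallest of the given idle timeouts (seconds).
min_timeout() {
    min=$1
    for t in "$@"; do
        if [ "$t" -lt "$min" ]; then
            min=$t
        fi
    done
    echo "$min"
}

# Succeeds if the first keepalive probe is sent before the tightest
# idle timeout on the path expires.
# Arguments: keepalive_time timeout1 [timeout2 ...]
keepalive_fits() {
    keepalive_time=$1
    shift
    [ "$keepalive_time" -lt "$(min_timeout "$@")" ]
}

# Hypothetical path: one firewall at 3600 s, one at 1800 s.
for candidate in 7200 1500; do
    if keepalive_fits "$candidate" 3600 1800; then
        echo "$candidate: probes arrive before the session is dropped"
    else
        echo "$candidate: session is dropped before the first probe"
    fi
done
```

With these example timeouts the default of 7200 seconds fails the check, while 1500 seconds passes.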

Changing these values is not difficult at all.

Paste the sequence below into the CLI:
----------
sysctl -w \
net.ipv4.tcp_keepalive_time=1500 \
net.ipv4.tcp_keepalive_intvl=60 \
net.ipv4.tcp_keepalive_probes=10
-----------

The commands above only change the running configuration, so the change disappears at reboot.
To make the change permanent, edit '/etc/sysctl.conf' and paste the lines below at the end of the file.
---------
net.ipv4.tcp_keepalive_time = 1500
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 10
--------
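As an alternative to editing the file by hand, the same lines can be generated and written to a drop-in file. This is a sketch under assumptions: the helper name render_keepalive_conf and the file name 99-tcp-keepalive.conf are made up, and it presumes the system reads '/etc/sysctl.d/', which is common on current Linux distributions but worth verifying on your LogPoint server.

```shell
# Print the three keepalive settings in sysctl.conf syntax.
# Arguments: keepalive_time keepalive_intvl keepalive_probes
render_keepalive_conf() {
    printf 'net.ipv4.tcp_keepalive_time = %s\n' "$1"
    printf 'net.ipv4.tcp_keepalive_intvl = %s\n' "$2"
    printf 'net.ipv4.tcp_keepalive_probes = %s\n' "$3"
}

# Show what would be written for the values used in this post.
render_keepalive_conf 1500 60 10
```

To persist and load the result as root, something like 'render_keepalive_conf 1500 60 10 > /etc/sysctl.d/99-tcp-keepalive.conf' followed by 'sysctl --system' should do; 'sysctl -p' reloads '/etc/sysctl.conf' if you edited that file instead.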

The steps in this article do not apply only to your LogPoint installation.
For my part, I realized this years back, when I was troubleshooting intermittent failures in applications communicating with database servers.


Regards
Hans

