TCP Keepalive and timeout

So, I had an issue at work (well, at a previous job): when the link between client and server was lost, neither side would time out for a pretty long while. In this use case that was a real problem, especially since they also wouldn't resume talking to each other once the link came back up.

Looking through the various kernel settings available, I found this article, which matched what I could observe on one side: a timeout of about 15 minutes.

On the other side, which was solely waiting for data to arrive, I never saw the timeout. By default it should be around 2 hours, but that keepalive timer only applies to sockets that have SO_KEEPALIVE enabled.
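To make those two numbers less magical, here is a rough back-of-the-envelope sketch of where they come from. It assumes stock Linux defaults (tcp_retries2 = 15, a retransmission timeout that starts at 200 ms and is capped at 120 s, tcp_keepalive_time = 7200, tcp_keepalive_intvl = 75, tcp_keepalive_probes = 9). The function names are mine, and the real retransmission timeout depends on the measured round-trip time, so treat the results as approximations that should land around 15.4 minutes and roughly 2.2 hours.

```python
# Rough estimate of Linux TCP timeouts from default sysctl values.
# All defaults below are assumptions; check `sysctl net.ipv4` on your machine.

def retransmit_timeout(retries=15, rto_min=0.2, rto_max=120.0):
    """Seconds before a sender with unacknowledged data gives up.

    The retransmission timeout doubles after each attempt (exponential
    backoff) until it hits rto_max. The connection is dropped once the
    timer following the last retry expires, hence retries + 1 waits.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2
    return total


def keepalive_timeout(idle=7200, interval=75, probes=9):
    """Seconds before an idle socket with SO_KEEPALIVE is declared dead."""
    return idle + interval * probes


if __name__ == "__main__":
    print(f"sender side:   {retransmit_timeout() / 60:.1f} minutes")
    print(f"receiver side: {keepalive_timeout() / 3600:.1f} hours")
```

The first figure is the one the sending side hits, since it has data in flight to retransmit. The second only ever applies to a socket that asked for keepalive.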

So I tried fiddling with TCP keepalive, and found libdontdie:

"This is a complete rewrite of the libkeepalive."

It is indeed pretty neat: simply compile it, set the options when you LD_PRELOAD it, and you're all set. If you enable debug, the setsockopt() calls are logged to syslog.

Unfortunately, in my case this would simply put the socket in FIN_WAIT1, where it would still take a pretty long time to time out. The missing part was the TCP_USER_TIMEOUT socket option, which wasn't part of the lib, so I forked it and added a user timeout option so I could do my testing quickly.

I haven't opened a pull request for it yet, for two reasons: first, I haven't updated the documentation; second, used like this it behaves more like libpleasedie, and I'm not sure that actually fits the original purpose.

Using the right combination of keepalive time, probes and interval, I was finally able to trigger FIN_WAIT1, and thanks to the user timeout the socket timed out quickly once it reached that state.
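If you control the application code and don't need the LD_PRELOAD trick, the same combination can be set directly on the socket. This is a minimal, Linux-only sketch and not the code from my fork: the helper name and the values (10 s idle, 5 s interval, 3 probes, 20 s user timeout) are illustrative assumptions, not the ones I ended up using.

```python
import socket

# Linux value of TCP_USER_TIMEOUT from <linux/tcp.h>, used as a fallback
# in case this Python build does not expose the constant.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def enable_fast_dead_peer_detection(sock, idle_s=10, interval_s=5,
                                    probes=3, user_timeout_ms=20000):
    """Enable keepalive plus a user timeout on a TCP socket (Linux only).

    All default values here are hypothetical; tune them for your link.
    """
    # Turn keepalive on; it is off by default for every socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idleness before the first keepalive probe is sent.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    # Seconds between two consecutive probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    # Number of unanswered probes before the peer is considered dead.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    # Milliseconds that transmitted data may stay unacknowledged before
    # the kernel forcibly closes the connection.
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, user_timeout_ms)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_fast_dead_peer_detection(s)
        print(s.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT))
```

Note that the keepalive values are in seconds while the user timeout is in milliseconds. According to the tcp(7) man page, when both are set the user timeout takes precedence over the keepalive settings in deciding when to close the connection.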

Hope that can come in handy to someone at some point!