Added dial/handshake timeout for non-responsive connections#273
Added dial/handshake timeout for non-responsive connections#273nmezei wants to merge 2 commits intonanomsg:mainfrom
Conversation
Change ipc solaris to use SyscallConn which enables deadlines/close/abort to work
|
Hey Nick. I'll have to look over this a bit to fully understand what you're doing. Thanks for this. |
| // Dial implements the Dialer Dial method | ||
| func (d *dialer) Dial() (transport.Pipe, error) { | ||
|
|
||
| // TODO: It might be good to pass a context here to abort the dial after |
There was a problem hiding this comment.
I agree that that would be a better approach.
| @@ -1,3 +1,4 @@ | |||
| //go:build solaris && cgo | |||
| // This change was necessary to support SetDeadline and Close | ||
| // which aborts pending reads/writes. | ||
| // The prior code was calling c.File().Fd(). It was not closing | ||
| // the file. Additionally the docs say Fd() causes deadlines to not work. |
There was a problem hiding this comment.
When I originally wrote this, the Control method didn't exist. I need to review the docs to understand what is happening here.
There was a problem hiding this comment.
Here was an attempt to call Close on the connection without this change. Close blocked forever instead of aborting the read.
1 @ 0x43ce36 0x44d0be 0x44d095 0x4654c5 0x4b27cd 0x50ea18 0x519ae5 0x528177 0x4696c1
# 0x4654c4 internal/poll.runtime_Semacquire+0x24 c:/bsros/toolchain/3.4.1/lib/go/src/runtime/sema.go:67
# 0x4b27cc internal/poll.(*FD).Close+0x6c c:/bsros/toolchain/3.4.1/lib/go/src/internal/poll/fd_unix.go:116
# 0x50ea17 net.(*netFD).Close+0x37 c:/bsros/toolchain/3.4.1/lib/go/src/net/fd_posix.go:37
# 0x519ae4 net.(*conn).Close+0x44 c:/bsros/toolchain/3.4.1/lib/go/src/net/net.go:207
| h.cv.Broadcast() | ||
| for conn := range h.workq { | ||
| // This does not do anything because conn.open is | ||
| // false and conn.Close is a no-op |
There was a problem hiding this comment.
I don't think this comment is necessarily true. But I need to review ... it may be that I've missed something here. The intent is to be able to abort the handshaker via Close, and that should close any connections that might not be fully established yet. This code was "recently" (in terms of release numbers, but not necessarily dates) refactored to better deal with stalled handshakes, and I may have messed this up.
| // Also, various properties are initialized. | ||
| func (p *conn) handshake() error { | ||
|
|
||
| // Rational for a timeout mechanism (which may need to move elsewhere) |
There was a problem hiding this comment.
| // Rational for a timeout mechanism (which may need to move elsewhere) | |
| // Rationale for a timeout mechanism (which may need to move elsewhere) |
| // if the socket remains open. Closing the dialer does not work | ||
| // because the handshake is not complete (check for p.open). | ||
|
|
||
| // TODO: Timeout should be configurable however that results in a |
There was a problem hiding this comment.
I don't think this should be configurable. The handshake should go quickly, and 5 seconds is more than enough time to deal with even the crappiest networks.
There was a problem hiding this comment.
HOWEVER, I don't think we want to keep this deadline in place once the handshake is completed. Because we could have large BULK transfers (think a 4 GB message, which is perfectly legal) that might take quite a while to send. Or the connection might be backed up due to other large messages in the pipe ahead of it. Therefore we need to ensure that the timeout does not persist once the handshake is complete.
There was a problem hiding this comment.
Right. That is why I defer SetDeadline to 0. That should disable the current deadline per docs "The deadline applies to all future and pending I/O". I figure the deadline set during handshake is done right after dialer connection creation which is prior to the conn being available for transport usage. After the dial/handshake completes other deadlines could be set on it specific to how mangos wants to use the connection.
gdamore
left a comment
There was a problem hiding this comment.
I'm not completely certain that the changes requested here are sufficient -- I need to spend more time thinking about them -- but I believe that at a minimum we need to prevent the connection deadline from outliving the handshake.
| // if the socket remains open. Closing the dialer does not work | ||
| // because the handshake is not complete (check for p.open). | ||
|
|
||
| // TODO: Timeout should be configurable however that results in a |
There was a problem hiding this comment.
HOWEVER, I don't think we want to keep this deadline in place once the handshake is completed. Because we could have large BULK transfers (think a 4 GB message, which is perfectly legal) that might take quite a while to send. Or the connection might be backed up due to other large messages in the pipe ahead of it. Therefore we need to ensure that the timeout does not persist once the handshake is complete.
|
Thanks for taking a look. Mangos has been great. |
Also change ipc solaris to use SyscallConn which enables deadlines/close/abort to work