Skip to content

Failed to connect to bluechi-controller due to error "Connection reset by peer". #1041

@dofmind

Description

@dofmind

Describe the bug

Sometimes, when bluechi-agent disconnects from bluechi-controller and bluechi-agent tries to connect to another bluechi-controller, bluechi-agent may not be able to register to bluechi-controller.

In the case, bluechi-controller prints the following error:

Failed to start peer bus server: Connection reset by peer

Also, when this happens, there will be a lot of CLOSE_WAIT state on port 842 of the host of the newly connected bluechi-controller.

root@vc-s32g2:~# netstat -anv | grep 842             
tcp        0      1 192.168.16.114:36072    192.168.16.114:842      SYN_SENT   
tcp6    4097      0 :::842                  :::*                    LISTEN     
tcp6      30      0 192.168.16.114:842      192.168.16.105:51876    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.106:46946    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.103:34736    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.111:53606    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.114:54694    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.103:34426    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.106:50000    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.106:36524    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.106:33140    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.111:52734    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.106:51806    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.103:34468    CLOSE_WAIT 
tcp6      30      0 192.168.16.114:842      192.168.16.105:51866    CLOSE_WAIT 
...

To Reproduce

It's not easy to reproduce because this case occurs rarely.

Prerequisite

  • Bluechi version is 0.9.0.
  • There are many hosts (about 3 ~ 7) on local network. All hosts are running bluechi-controller and bluechi-agent.

Step

  1. The bluechi-agent of all hosts have connected to bluechi-controller of one host.
  2. Shutdown bluechi-controller of the connected host.```systemctl stop bluechi-controller
systemctl stop bluechi-controller
  1. Change controller address of bluechi-agent of all hosts to another host.
dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.111,port=842'
  1. Make sure that the bluechi-agent on all hosts is connected to the bluechi-controller on another host.
  2. Repeat the above steps until this error case happens.

Logs when the error case occurs,

  • Host A cannot connect to bluechi-controller of another host.
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Disconnected from controller
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 1)
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 2)
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 3)
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:31 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:31 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 4)
Feb 03 11:02:31 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Disconnected from controller
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 5)
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 6)
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:33 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:33 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 7)
Feb 03 11:02:33 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
...
  • Host B can connect to bluechi-controller of new host.
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Disconnected from controller
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 1)
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 2)
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 3)
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 4)
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 5)
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Disconnected from controller
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Register call response received
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Connected to controller as 'hpc-ad-qc2'
  • Host B bluechi-controller doesn't permit that bluechi-agent of host A connects
root@hpc-ad-qc:~# journalctl -f -u bluechi-controller
Jul 01 09:00:03 hpc-ad-qc systemd[1]: Started BlueChi Controller systemd service.
Jul 01 09:00:03 hpc-ad-qc bluechi-controller[998]: Starting bluechi-controller 0.9.0-1
Feb 03 11:02:32 hpc-ad-qc bluechi-controller[998]: Registered managed node from fd 10 as 'hpc-ad-qc2'
Feb 03 11:02:32 hpc-ad-qc bluechi-controller[998]: Failed to start peer bus server: Connection reset by peer

Expected behavior

When bluechi-agent disconnects from bluechi-controller and bluechi-agent tries to connect to another bluechi-controller, the bluechi-agent of all hosts should be connected to bluechi-controller of other host without any error.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions