-
Notifications
You must be signed in to change notification settings - Fork 41
Open
Labels
bugSomething isn't workingSomething isn't working
Milestone
Description
Describe the bug
Sometimes, when bluechi-agent disconnects from bluechi-controller and bluechi-agent tries to connect to another bluechi-controller, bluechi-agent may not be able to register to bluechi-controller.
In the case, bluechi-controller prints the following error:
Failed to start peer bus server: Connection reset by peer
Also, when this happens, there will be a lot of CLOSE_WAIT state on port 842 of the host of the newly connected bluechi-controller.
root@vc-s32g2:~# netstat -anv | grep 842
tcp 0 1 192.168.16.114:36072 192.168.16.114:842 SYN_SENT
tcp6 4097 0 :::842 :::* LISTEN
tcp6 30 0 192.168.16.114:842 192.168.16.105:51876 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.106:46946 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.103:34736 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.111:53606 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.114:54694 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.103:34426 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.106:50000 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.106:36524 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.106:33140 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.111:52734 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.106:51806 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.103:34468 CLOSE_WAIT
tcp6 30 0 192.168.16.114:842 192.168.16.105:51866 CLOSE_WAIT
...
To Reproduce
It's not easy to reproduce because this case occurs rarely.
Prerequisite
- Bluechi version is 0.9.0.
- There are many hosts (about 3 ~ 7) on local network. All hosts are running bluechi-controller and bluechi-agent.
Step
- The bluechi-agent of all hosts have connected to bluechi-controller of one host.
- Shutdown bluechi-controller of the connected host.```systemctl stop bluechi-controller
systemctl stop bluechi-controller
- Change controller address of bluechi-agent of all hosts to another host.
dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.111,port=842'
- Make sure that the bluechi-agent on all hosts is connected to the bluechi-controller on another host.
- Repeat the above steps until this error case happens.
Logs when the error case occurs,
- Host A cannot connect to bluechi-controller of another host.
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Disconnected from controller
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 1)
Feb 03 11:02:29 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 2)
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 3)
Feb 03 11:02:30 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:31 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:31 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 4)
Feb 03 11:02:31 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Disconnected from controller
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 5)
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 6)
Feb 03 11:02:32 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:33 vc-s32g2 bluechi-agent[1066]: Agent connection attempt failed, retrying
Feb 03 11:02:33 vc-s32g2 bluechi-agent[1066]: Trying to connect to controller (try 7)
Feb 03 11:02:33 vc-s32g2 bluechi-agent[1066]: Connecting to controller on tcp:host=192.168.16.111,port=842
...
- Host B can connect to bluechi-controller of new host.
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Disconnected from controller
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 1)
Feb 03 11:02:29 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 2)
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 3)
Feb 03 11:02:30 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 4)
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Agent connection attempt failed, retrying
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Trying to connect to controller (try 5)
Feb 03 11:02:31 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.110,port=842
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Disconnected from controller
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Connecting to controller on tcp:host=192.168.16.111,port=842
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Register call response received
Feb 03 11:02:32 hpc-ad-qc bluechi-agent[1173]: Connected to controller as 'hpc-ad-qc2'
- Host B bluechi-controller doesn't permit that bluechi-agent of host A connects
root@hpc-ad-qc:~# journalctl -f -u bluechi-controller
Jul 01 09:00:03 hpc-ad-qc systemd[1]: Started BlueChi Controller systemd service.
Jul 01 09:00:03 hpc-ad-qc bluechi-controller[998]: Starting bluechi-controller 0.9.0-1
Feb 03 11:02:32 hpc-ad-qc bluechi-controller[998]: Registered managed node from fd 10 as 'hpc-ad-qc2'
Feb 03 11:02:32 hpc-ad-qc bluechi-controller[998]: Failed to start peer bus server: Connection reset by peer
Expected behavior
When bluechi-agent disconnects from bluechi-controller and bluechi-agent tries to connect to another bluechi-controller, the bluechi-agent of all hosts should be connected to bluechi-controller of other host without any error.
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working