Description
GlassFish Version (and build number)
7.1.0-SNAPSHOT
JDK version
any
OS
any
Database
No response
Problem Description
When I was running the Faces TCK tests locally, in a phase where they deployed hundreds of WAR files, the domain start took over a minute and timed out, which should not happen. The PID file should be created as soon as possible.
The start-domain command waits for the PID file, then for the process to become active. Should it really also wait for the ports to open? All of them, as it does now, or just the admin port? And what about the start-local-instance command?
In #25517 I removed the default one-minute timeout. I believe we should make sure that start/stop either just works or escalates errors to the user. However, that doesn't answer the question of what the correct behavior is:
- If we don't wait for the ports to be listening, will subsequent asadmin commands fail?
- What does it mean that the DAS or a server instance has started? That a process is running, or that the server is responding on its ports? On all of them, or just on the admin port?
- If just on the admin port, and the user's applications communicate over other ports, how do they react when the server is not listening yet?
My suspicion is that we should wait for all ports. In that case the change in #25517 is the real fix of the problem, and there is more: we should wait for the PID file as the first phase and for the ports as the second. And until all ports are listening, the process must still be alive. That would ensure that the instance did not collapse because it could not allocate its ports. Hmmm, it starts making sense now ...
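The two-phase wait suggested above could look roughly like this. This is a minimal sketch, not GlassFish's actual implementation: the class and method names are hypothetical, and the polling interval and connect timeout are arbitrary assumptions.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch: phase 2 of the start, waiting until a still-alive
// server process listens on all of its ports.
public final class StartupWait {

    /** Returns true if something accepts TCP connections on localhost:port. */
    static boolean isListening(int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("localhost", port), 500);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /**
     * The process (pid) must stay alive until every port is listening.
     * If it dies first, the server collapsed, e.g. because it could not
     * allocate one of its ports, and we escalate that to the user.
     */
    static void waitForPorts(long pid, int... ports) throws InterruptedException {
        for (int port : ports) {
            while (!isListening(port)) {
                boolean alive = ProcessHandle.of(pid)
                        .map(ProcessHandle::isAlive).orElse(false);
                if (!alive) {
                    throw new IllegalStateException(
                        "Server process " + pid + " died before listening on port " + port);
                }
                Thread.sleep(100);
            }
        }
    }
}
```

Note there is deliberately no timeout here: the loop ends either with all ports listening or with a clear error that the process died.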
Steps to reproduce
Not easy to reproduce, but I have noticed random failures here and there in the GlassFish project tests. Most were caused by throttling of the CI, but not always. Now I have seen it when running the Faces TCK, which deployed perhaps 100 WAR files to the autodeploy directory and then started the domain.
Impact of Issue
Random failures and timeouts when starting the domain.
Idea
- Any problem causing the server to crash while starting should result in a user-facing error that the server failed to start.
- Timeouts make it more complicated.
- If the server cannot open any of its ports, it should die complaining.
What restart should do
- Restart is specific: it has to avoid a collision with the original process.
- The original process starts itself again (with the same or an updated command).
- The new process waits for the parent's death; the parent's PID is received via the AS_RESTART system property.
- Then it continues. If it fails, nobody will see it (the project is evolving towards a state where logging works from the very beginning).
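The restart handshake described above could be sketched as follows. The AS_RESTART property name follows the description in this issue; the class and method names are hypothetical and this is not the actual GlassFish code.

```java
// Hypothetical sketch of the restart handshake: the freshly launched process
// reads the old process's PID from the AS_RESTART system property and blocks
// until that process has exited, so the two never collide on the same ports.
public final class RestartWait {

    /** Blocks until the process with the given pid exits (or is already gone). */
    static void awaitDeath(long pid) throws InterruptedException {
        ProcessHandle parent = ProcessHandle.of(pid).orElse(null);
        if (parent == null) {
            return; // already dead, nothing to wait for
        }
        try {
            parent.onExit().get(); // completes when the process terminates
        } catch (java.util.concurrent.ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String asRestart = System.getProperty("AS_RESTART");
        if (asRestart != null) {
            awaitDeath(Long.parseLong(asRestart));
        }
        // ... continue with normal startup; the old process's ports are now free.
    }
}
```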
What start should do:
- Start the process
- Remember PID1
- Watch for the PID file; while it does not exist, PID1 (GfLauncher) must stay alive. Once the file contains PID2 (the server), watch that PID2 is alive and wait until it starts listening.
Then:
- If PID2 is alive AND never starts listening on all specified ports, it is a bug.
- If PID2 died, we should get something in the output. If not, the user should check server.log (until now it could happen that the server died before reaching the phase where logging is enabled; however, as we have already fixed many bugs, that is no longer frequent).
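The PID1/PID2 watch described above could be sketched like this. Again a hypothetical sketch only: the file format, polling interval, and names are assumptions, not GlassFish's actual implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the start watch: while waiting for the PID file,
// the launcher (PID1) must stay alive; once the file appears, its content
// is the server's own PID (PID2), which is watched from then on.
public final class PidFileWatch {

    static boolean isAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    /**
     * Waits until pidFile exists, failing fast if the launcher (pid1)
     * dies first. Returns PID2 read from the file.
     */
    static long waitForPidFile(Path pidFile, long pid1)
            throws IOException, InterruptedException {
        while (!Files.exists(pidFile)) {
            if (!isAlive(pid1)) {
                throw new IllegalStateException(
                    "Launcher process " + pid1 + " died before writing " + pidFile);
            }
            Thread.sleep(100);
        }
        return Long.parseLong(Files.readString(pidFile).trim());
    }
}
```

After this returns PID2, the same "alive while waiting" pattern would repeat for the ports, as described earlier in this issue.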
Possible error scenarios
- Firewall blocks requests/connections to server ports.
- That can happen at any time, even when the server is already running. A timeout is not a solution to the problem. In contrast, waiting until the server is OK at least tells the user that the server was able to start listening. Whatever happens later is not the start command's problem.
- Can be tested using TestContainers.
- The server was killed by the OS (out of memory) while starting.
- Then we should detect it died and escalate it to the user.
- Maybe this can be tested with TestContainers; simulating SIGKILL on the same system where the test runs would probably have timing problems.