The start-domain command randomly times out #25615

@dmatej

GlassFish Version (and build number)

7.1.0-SNAPSHOT

JDK version

any

OS

any

Database

No response

Problem Description

When I was running the Faces TCK tests locally, in a phase where they deployed hundreds of war files, the domain start took over a minute and timed out, which should not happen. The pid file should be created as soon as possible.

The start-domain command waits for the PID file, then for the process to become active. Should it really wait for open ports? For any port, as it does now, or just for the admin port? What about the start-local-instance command?

In #25517 I removed the default one-minute timeout. I believe start/stop should either just work or escalate errors to the user. However, that doesn't answer the question of what the correct behavior is:

  • If we don't wait for the ports to be listening, will subsequent asadmin commands fail?
  • What does it mean that the DAS or a server instance has started? That a process has been started, or that the server is responding on its ports? On all of them or just on the admin port?
    • If just on the admin port, and the user's applications communicate over other ports, how will they react if the server is not listening there yet?

My suspicion is that we should wait for all ports. But then the change in #25517 is the real fix for the problem, and moreover, we should wait for the pid as the first phase and for the ports as the second. And until all ports are listening, the process must still be alive. That would ensure that the instance did not collapse because it could not allocate ports. Hmmm, it starts making sense now ...

Steps to reproduce

Not easy to reproduce, but I have noticed random failures here and there in GlassFish project tests. Most were caused by throttling of the CI, but not all. Now I have seen it when running the Faces TCK, which deployed perhaps 100 war files to the autodeploy directory and started the domain.

Impact of Issue

Random failures and timeouts when starting the domain.

Idea

  • Any problem that causes the server to crash while starting should result in a user-facing error saying that the server failed to start.
  • Timeouts make it more complicated.
  • If the server cannot open any one of its ports, it should die complaining.

What restart should do:

  • Restart is specific: it has to avoid a collision with the original process.
  1. The original process starts itself again (with the same or an updated command).
  2. The new process waits for the parent's death; it receives the parent's pid as the AS_RESTART system property.
  3. Then it continues. If it fails, nobody will see it (the project is evolving toward a state where logging works from the very beginning).
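
Step 2 of the restart could look roughly like the following. This is only a minimal sketch, not the actual GlassFish code: the class and method names are made up for illustration, and the only thing taken from the description above is that the parent's pid arrives in the AS_RESTART system property. It uses the standard `ProcessHandle` API and waits without any timeout.

```java
import java.util.Optional;

// Hypothetical sketch; not the real GlassFish restart implementation.
final class RestartWaitSketch {

    /** Parent pid from the AS_RESTART system property; empty if unset or not a number. */
    static Optional<Long> parentPid() {
        String value = System.getProperty("AS_RESTART");
        if (value == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(Long.parseLong(value.trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    /** Blocks until the given process has exited; returns at once if it is already gone. */
    static void awaitDeath(long pid) {
        // onExit() completes when the process terminates; join() waits without a timeout.
        ProcessHandle.of(pid).ifPresent(handle -> handle.onExit().join());
    }

    public static void main(String[] args) {
        // A normal start has no AS_RESTART property, so nothing is awaited.
        parentPid().ifPresent(RestartWaitSketch::awaitDeath);
        System.out.println("Parent is gone (or there was none), continuing startup.");
    }
}
```

Running it with `-DAS_RESTART=<pid>` blocks until that process exits, which is how the collision with the original process would be avoided.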

What start should do:

  1. Start the process.
  2. Remember PID1.
  3. Watch for the pid file while PID1 (GfLauncher) is alive; once the file contains PID2 (the server), watch whether PID2 is alive and wait until it starts listening.

Then:

  • If PID2 is alive AND never starts listening on all the specified ports, it is a bug.
  • If PID2 died, we should get something in the output. If not, the user should check the server.log (it can still happen that the server dies before reaching the phase where logging is enabled; however, as we have already fixed many bugs, it is not so frequent).
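
The two phases above can be sketched as follows. This is a rough illustration under several assumptions, not the existing start-domain implementation: all names are hypothetical, the pid file is assumed to contain just the server pid as text, "listening" is approximated by a successful TCP connect, and a stale pid file left over from an earlier run is not handled. There is deliberately no overall timeout; the loop ends only when all ports are listening or one of the processes has died.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the proposed waiting logic; not GlassFish code.
final class StartWaitSketch {

    private static final long POLL_MILLIS = 100;

    /** Reads a pid from the file; empty if the file is missing or does not hold a number yet. */
    static Optional<Long> readPid(Path pidFile) {
        try {
            return Optional.of(Long.parseLong(Files.readString(pidFile).trim()));
        } catch (IOException | NumberFormatException e) {
            return Optional.empty();
        }
    }

    static boolean isAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    /** True if something accepts TCP connections on host:port (it cannot tell which process). */
    static boolean isListening(String host, int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 500);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Returns the server pid (PID2) once the server is alive and listening on all ports. */
    static long awaitServer(long launcherPid, Path pidFile, String host, List<Integer> ports)
            throws InterruptedException {
        // Phase 1: wait for PID2 in the pid file for as long as the launcher (PID1) lives.
        Optional<Long> serverPid = readPid(pidFile);
        while (serverPid.isEmpty()) {
            boolean launcherAlive = isAlive(launcherPid);
            Thread.sleep(POLL_MILLIS);
            // Read once more after the liveness check so a late write is not missed.
            serverPid = readPid(pidFile);
            if (serverPid.isEmpty() && !launcherAlive) {
                throw new IllegalStateException(
                    "Launcher " + launcherPid + " died before the pid file was written.");
            }
        }
        // Phase 2: the server must stay alive until every port is listening.
        long pid = serverPid.get();
        while (true) {
            if (!isAlive(pid)) {
                throw new IllegalStateException(
                    "Server " + pid + " died before listening on all ports, check server.log.");
            }
            if (ports.stream().allMatch(port -> isListening(host, port))) {
                return pid;
            }
            Thread.sleep(POLL_MILLIS);
        }
    }
}
```

Checking liveness before the ports in every round is a deliberate choice here: if the server collapsed because another process already holds one of its ports, the connect test alone would report success.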

Possible error scenarios

  • A firewall blocks requests/connections to the server ports.
    • That can happen at any time, even when the server is running. A timeout does not solve the problem. In contrast, waiting until the server is ok at least tells the user that the server was able to start listening. What happens later is not the start command's problem.
    • Can be tested using TestContainers.
  • The server was killed by the OS (OOM) while starting.
    • Then we should detect that it died and escalate it to the user.
    • Maybe this can be tested with TestContainers; simulating SIGKILL on the same system where the test runs would probably have timing problems.

Labels

bug (Something isn't working)
