Service stopped running #
There can be many app- or service-specific reasons for this. Most commonly, the app crashes, which kills the service; the service is then restarted.
By default, Docker Swarm keeps trying to restart failing services forever; check the service’s deploy.restart_policy. Also check whether the service simply fails to come up again for some reason (see “Service doesn’t start” above).
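As a rough illustration, a restart policy in a stack file could look like the fragment below. The service name `app` and the image name are hypothetical placeholders, and the values are only examples, not recommendations for your workload.

```yaml
version: "3.8"
services:
  app:                            # hypothetical service name
    image: example/app:latest     # hypothetical image name
    deploy:
      restart_policy:
        condition: on-failure     # one of: none, on-failure, any (default: any)
        delay: 5s                 # pause between restart attempts
        max_attempts: 3           # if omitted, Swarm retries forever
        window: 120s              # how long to wait before deciding a restart succeeded
```

With `max_attempts` set, a service that keeps failing eventually stays down instead of cycling, which makes the failure easier to spot.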
In extreme cases of resource exhaustion (especially if your endpoint runs out of RAM), Planetary Quantum will sacrifice containers first. This ensures that the base system stays up and remains controllable via the Quantum Console.
Depending on other services #
Docker containers serve a single purpose, which is why it’s common for one service to depend on others: for example, a (Ruby on) Rails application usually depends on a database service (PostgreSQL) being available.
Docker Swarm does not support the depends_on option, which is why we recommend a more resilient approach:
Include wait-for-it.sh in your container’s Dockerfile:
COPY . /app
RUN chmod +x wait-for-it.sh
In an entrypoint script, this is how you use wait-for-it.sh:
./wait-for-it.sh $DB_HOST:$DB_PORT -- echo "database is available"
rails s -b 0.0.0.0
Re-use the environment variables used to configure the application in the entrypoint script.
This allows the container to wait for the other service to become available. It is the most direct replacement for depends_on, and it is the approach to use when migrations need to run as the container boots.
Build for resilience #
A better and more scalable way is to allow an application to respond to failure. We realize that this may require more work initially, but the goal is to not let the application crash because the database is not available (yet).
We recommend this together with wait-for-it.sh, as it makes the system more responsive when maintenance is required or when (sometimes unpredictable) higher traffic patterns occur. It allows for communication with the customer at all times, which is a very important part of managing the unpredictable.
Application crashes can lead to a never-ending cycle of start-crash-start-crash-start… Because every restart uses resources, this eventually has an impact on the system, and in turn on other stacks and services.
- Handle the framework exception
- Display a message (to the user)
- Log the exception for further inspection
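The three steps above can be sketched in plain Ruby as a small Rack-style middleware. This is a minimal illustration, not a definitive implementation: `DatabaseUnavailable` and `DatabaseOutageHandler` are hypothetical names, and in a real Rails application you would rescue the exception your framework actually raises (for example `ActiveRecord::ConnectionNotEstablished`) instead.

```ruby
require "logger"

# Hypothetical stand-in for the framework's "database not available" exception.
class DatabaseUnavailable < StandardError; end

# Hypothetical Rack-style middleware: wraps an app and turns a database
# outage into a friendly 503 response instead of letting the process crash.
class DatabaseOutageHandler
  def initialize(app, logger: Logger.new($stdout))
    @app = app
    @logger = logger
  end

  def call(env)
    @app.call(env)
  rescue DatabaseUnavailable => e
    # Handle the framework exception, then log it for further inspection.
    @logger.error("database unavailable: #{e.class}: #{e.message}")

    # Display a message to the user; "retry-after" hints when to try again.
    [
      503,
      { "content-type" => "text/plain", "retry-after" => "30" },
      ["We are performing maintenance. Please try again in a moment.\n"]
    ]
  end
end
```

Because the exception is handled inside the request, the process keeps running, so the container does not enter the restart cycle, and the outage still shows up in the logs.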