Hi list,
I’m running Cassandra (C*, a clustered database) as a systemd service.
Currently this is just a “Type=simple” service; as such, dependent units will
start as soon as the C* process starts rather than when C* is accepting client
connections.
I’d like to transition to something more complex so I can start to write
additional units that depend on C*.
I’ve successfully managed to set the service type to “notify” and modify C* to
call sd_notify() when it is ready to accept client connections.
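For illustration only, the readiness notification amounts to sending a datagram
to the socket named in $NOTIFY_SOCKET. A minimal Python sketch of that protocol
(this is not the actual C* change, and the function name is hypothetical):

```python
import os
import socket


def sd_notify(state="READY=1"):
    """Send a state string to systemd's notify socket.

    Returns False when NOTIFY_SOCKET is unset, i.e. the process was not
    started by a unit that provides a notification socket.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True
```

The call would sit at the point where the client ports start listening, which
is exactly the point that can be hours after process start.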
Further experimentation reveals that this is not an ideal solution. C* can take
a long time (minutes to _hours_) to reach the point where it will accept client
connections/queries. The default startup timeout is 90s, which causes the
service to be marked failed if exceeded, hence C*, with its long startup times,
will often never get the chance to transition to “active”.
Part of the issue for me is trying to define what “active” means. The man
page, for “Type=forking” services, says: “The parent process is expected to
exit when start-up is complete and all communication channels are set up”. I’m
assuming that for “notify” services, sd_notify() should be called when “start-up
is complete and all communication channels are set up”. Even if this takes hours?
Cassandra exposes a number of inet ports of interest:
- Client connection ports for running queries via Cassandra Query Language
(CQL)/Thrift (RPC): this is what most clients use to query the database (i.e.,
to run `SELECT * FROM …` style queries).
- JMX (Java Management Extensions) for performing management operations: C*'s
own tools and third-party management tools use this to call management
functions and to collect statistics/metrics about the JVM and C*.
The JMX socket is available a few seconds after the process is running.
The CQL/Thrift ports can take far longer to become available — sometimes hours
after the process starts. Cassandra only starts listening on these ports once
it has joined the cluster of nodes & has synchronised its state. State
synchronisation may require bootstrapping & copying large amounts of data
across the network and hence take a long time to complete.
Currently my dependent C* client units simply spin-wait, attempting to
establish a connection to C*. This seems like duplicated effort and makes these
services more complex than they need to be.
My original thought was to just disable the startup timeout on the C* unit, but
that means the unit will stay “activating” for a long time. It also means that
JMX clients, which can establish connections almost immediately, would have
their startup deferred unnecessarily.
Ideally I’d like to be able to write units that can depend on individual ports
being available from a process, i.e., when the CQL port is available, start the
client unit(s), and when JMX is available, start a monitoring service. Is this
possible with systemd?
Alternatively, I was thinking that I could write some kind of simple
process/script that attempts a connection, and exits with failure if the
connection cannot be established, or success if it can. I’d then write a unit
file, e.g. `cassandra-cql-port.service`:
[Unit]
# not really sure what combo of
# Wants/Requires/Requisite/BindsTo/PartOf/Before/After is needed
Requisite=cassandra.service
[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/opt/bin/watch-port 9042
Restart=on-failure
RestartSec=1min
StartLimitInterval=0
My client units could then want/require this unit. Is this a valid approach?
Or am I walking down the wrong path to use systemd to manage this?
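To make the idea concrete, here is a rough Python sketch of what the
hypothetical watch-port helper could look like. It assumes C* listens on
127.0.0.1 and that a successful TCP connect is a good enough signal that the
port is ready; both are assumptions, and the function names are illustrative:

```python
import socket
import sys


def port_is_open(host, port, timeout_s=5.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Refused, timed out, unreachable: all count as "not ready yet".
        return False


def main(argv):
    """Exit status: 0 = port accepts connections, 1 = it does not, 2 = bad usage."""
    if len(argv) != 2 or not argv[1].isdigit():
        print("usage: watch-port PORT", file=sys.stderr)
        return 2
    # Assumption: C* is bound to the loopback address on this host.
    return 0 if port_is_open("127.0.0.1", int(argv[1])) else 1
```

Saved as /opt/bin/watch-port with a final `sys.exit(main(sys.argv))`, it makes a
single attempt per invocation and leaves the retrying to the unit's
Restart=/RestartSec= settings.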
Regards,
Adam