Hi, I've been playing with journald and journalctl, and have a few thoughts about making the logs more reliable.
First problem: it would be nice to tie more messages from systemd to units, i.e. use more structured logging. I'll follow up with a patch which does that, and I'm wondering if this is the right approach. Unfortunately, currently not all messages can be usefully attached by journald to processes, because the system unit information is lost after the last process exits. This sucks, because sometimes the most important messages are the last ones. E.g. http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=665782 is about a process failing to start. 'systemctl status' will probably not be able to show this message, because by the time journald tries to query cgroup information, it is too late. But systemd has/had this information and could be queried for it. So I propose to add a cache of recently dead PIDs in systemd, that could be queried by journald. I'm not sure about the details, but probably keeping the information for a few seconds should be enough. This would lead to more reliable logging for messages close to the end of the program. Second problem: some messages are lost because of the small depth of the unix datagram queue length. By default net.unix.max_dgram_qlen appears to be 10, which is really really low. Messages get dropped silently once this queue is full. I think that the goal should be almost never loose messages. In extreme cases, if the logging system if full or broken, messages must be lost, but otherwise this should not happen. Otherwise things become very confusing for the user: didn't this service start without giving any reason, or were just the messages dropped? Overflowing the queue can also be done on purpose by a rogue program: send enough messages and break logging for legitimate processes. Journald does rate-limiting, but only after receiving the messages, so this does not help. TODO contains an entry for this, suggesting (a) bumping sysctl, (b) adding sockopt. (a) would certainly help, but has the downside that this might negatively influence other daemons. Also, it does not really solve the problem, because if journald is not keeping up, the queue will overflow anyway. (b) would be nice, but will take time to get the kernel modified. And again, increasing the queue depth can only be a part of a solution. As an alternative to in-kernel queuing, I tried to implement a message queue in the application. Messages are queued when the socket is busy, and are resent using sendmmsg on next log statement. This adds queueing that is under control of the application and helps if the socket is _temporarily_ overloaded. To provide reliable logging however, the application must block (or wait) at some point, until the socket is free again. I'll post the patch as RFC, I'm not sure if it is really useful. I guess that some applications could simply block, but in case of e.g. systemd itself it is not possible. journald should process the messages faster. Should journald run reniced to higher priority? I think that the problem is worse with journald than with syslog, because journald tries to do more things: parse the message, query cgroup information, forward the message to syslog, store it on disk. syslog only does the last one, so necessarily must be more efficient. Also, it seems that journald could be more efficient too: running test-log (which queues sends two messages to joural before quitting) show the following activity by journald: epoll_wait(7, {?} 0x7fff4348ad40, 1, -1) = 1 ioctl(4, FIONREAD, [164]) = 0 recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"PRIORITY=6\nSYSLOG_FACILITY=3\nCOD"..., 2315}], msg_controllen=136, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 164 open("/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=19981, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = -1 ESRCH (No such process) sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=350, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = 56 open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/proc/19981/comm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) readlink("/proc/19981/exe", 0x1008e80, 99) = -1 ENOENT (No such file or directory) open("/proc/19981/cmdline", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0 capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0 open("/proc/19981/sessionid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0 capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0 open("/proc/19981/loginuid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/proc/500/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) uname({sys="Linux", node="fedora-15", ...}) = 0 clock_gettime(CLOCK_REALTIME, {1350560576, 588798333}) = 0 clock_gettime(CLOCK_MONOTONIC, {184373, 816089357}) = 0 ftruncate(11, 29777920) = 0 ioctl(4, FIONREAD, [237]) = 0 recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"PRIORITY=6\nSYSLOG_FACILITY=3\nCOD"..., 2315}], msg_controllen=136, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 237 open("/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=19981, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = -1 ESRCH (No such process) sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=350, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = 56 open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/proc/19981/comm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) readlink("/proc/19981/exe", 0x1008e80, 99) = -1 ENOENT (No such file or directory) open("/proc/19981/cmdline", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0 capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0 open("/proc/19981/sessionid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0 capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0 open("/proc/19981/loginuid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/proc/500/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) uname({sys="Linux", node="fedora-15", ...}) = 0 clock_gettime(CLOCK_REALTIME, {1350560576, 594694729}) = 0 clock_gettime(CLOCK_MONOTONIC, {184373, 821970285}) = 0 ftruncate(11, 29777920) = 0 ioctl(4, FIONREAD, [0]) = 0 recvmsg(4, 0x7fff4348ab60, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) epoll_wait(7, It seems that since the first sendmsg returning -ESRCH journald can expect the the process it is trying to query is gone, and there is no need to check /proc/19981/cgroup, /proc/19981/comm, etc. Would be nice to short-circuit this. Is there a legitimate reason why some of those files in /proc would be available, but not others? Zbyszek _______________________________________________ systemd-devel mailing list [email protected] http://lists.freedesktop.org/mailman/listinfo/systemd-devel
