OpenShift Investigate Source of Signal
This procedure helps find the source of a kill signal such as SIGQUIT:
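The commands below use shell variables such as $NAMESPACE and $PODNAME as placeholders for your own values; for example (hypothetical values):

    NAMESPACE=myproject                # hypothetical namespace
    PODNAME=mypod-7d57d6599f-tq7vt     # pod receiving the signal (from step 2 below)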
1. Ensure you're logged in with oc as a user with cluster-admin permissions.
2. Find the relevant pod receiving the signal:
    $ oc get pods --namespace $NAMESPACE
    NAME                     READY   STATUS    RESTARTS   AGE
    mypod-7d57d6599f-tq7vt   1/1     Running   0          12m
3. Find the worker node of the pod:
    oc get pod --namespace $NAMESPACE --output "jsonpath={.spec.nodeName}{'\n'}" $PODNAME
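    For convenience, the node name can be captured into a variable for use with oc debug in the next step (a small sketch using the same jsonpath query; the NODE variable name is just an example):
        NODE=$(oc get pod --namespace $NAMESPACE --output "jsonpath={.spec.nodeName}" $PODNAME)
        echo $NODE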
4. Start a debug pod on the worker node with the containerdiag image:
    oc debug node/$NODE -t --image=quay.io/ibm/containerdiag
5. Find the worker node PID of the pod container (we'll use this later); for example:
    $ podinfo.sh -p mypod-7d57d6599f-tq7vt
    3636617
6. Change to the root filesystem:
    chroot /host
7. Run this command to append to the audit rules file:
    cat >> /etc/audit/rules.d/audit.rules
8. Paste this line and press ENTER:
    -a always,exit -F arch=b64 -S kill -k watchkill
9. Type Ctrl+D to finish the append.
10. Confirm the line is there:
    $ tail -1 /etc/audit/rules.d/audit.rules
    -a always,exit -F arch=b64 -S kill -k watchkill
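    Roughly, this rule asks auditd to record every exit of the kill syscall on the 64-bit syscall ABI and to tag matching events with the key watchkill so they can be found later with ausearch -k. An informal annotation follows (not part of the original rules file); note that 32-bit callers would need a separate arch=b32 rule:
        # -a always,exit : always write an audit record when the syscall exits
        # -F arch=b64    : match the 64-bit syscall ABI
        # -S kill        : match the kill syscall
        # -k watchkill   : tag matching events with the key "watchkill"
        -a always,exit -F arch=b64 -S kill -k watchkill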
11. Regenerate the audit rules:
    augenrules --load
12. Kill auditd (there is no graceful way of doing this):
    systemctl kill auditd
13. Start auditd:
    systemctl start auditd
14. Double-check the status and make sure it's running (active (running)):
    $ systemctl status auditd
    ● auditd.service - Security Auditing Service
       Loaded: loaded (/usr/lib/systemd/system/auditd.service; enabled; vendor preset: enabled)
       Active: active (running) since Wed 2022-10-05 13:26:04 UTC; 9min ago
    [...]
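    Optionally, the audit rules now loaded in the kernel can also be listed to confirm the new rule is active (output format may vary by audit version):
        auditctl -l | grep watchkill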
15. Wait for the signal to occur.
16. After the issue is reproduced, search for the signal in the audit logs (replace SIGQUIT with the signal name):
    ausearch -k watchkill -i | grep -A 5 -B 5 --group-separator========= SIGQUIT
17. Find the relevant audit event; for example:
    type=PROCTITLE msg=audit(10/05/22 08:47:31.523:278210) : proctitle=java -Dsdjagent.loadjnilibrary=false -Dsun.jvmstat.perdata.syncWaitMs=5000 -Dsdjagent.managementAgentConnectDelayMs=0 -jar /tmp/
    type=OBJ_PID msg=audit(10/05/22 08:47:31.523:278210) : opid=230677 oauid=unset ouid=unknown(1000680000) oses=-1 obj=system_u:system_r:container_t:s0:c15,c26 ocomm=java
    type=SYSCALL msg=audit(10/05/22 08:47:31.523:278210) : arch=x86_64 syscall=kill success=yes exit=0 a0=0x1 a1=SIGQUIT a2=0x1 a3=0x7 items=0 ppid=149339 pid=218261 auid=unset uid=unknown(1000680000) gid=root euid=unknown(1000680000) suid=unknown(1000680000) fsuid=unknown(1000680000) egid=root sgid=root fsgid=root tty=(none) ses=unset comm=main exe=/opt/java/openjdk/jre/bin/java subj=system_u:system_r:spc_t:s0 key=watchkill
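    The sender's pid= and ppid= fields (interpreted in the next steps) can be pulled out of the SYSCALL record with a rough one-liner like the following (not part of the original procedure; adjust the signal name as needed):
        ausearch -k watchkill -i | grep SYSCALL | grep SIGQUIT | grep -o -E '(ppid|pid)=[0-9]+'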
18. In the OBJ_PID line, the opid= is the PID of the program receiving the signal. Confirm this matches the worker node PID of the pod container from step 5 above.
19. In the PROCTITLE line, the proctitle= is the command line of the program sending the signal. In the SYSCALL line, the pid= is the PID of the program sending the signal and the ppid= is the parent PID of that program.
20. Search for the pid= in ps; for example:
    ps -elf | grep 218261
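    If the sending process is still running, its command line can also be read directly from /proc; a small sketch (the PID is the example value from above):
        tr '\0' ' ' < /proc/218261/cmdline; echo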
21. If nothing is found (i.e. the process sending the signal quickly went away), search for the ppid= in ps; for example:
    $ ps -elf | grep 149339
    0 S root  149339  146443  0  80   0 - 642951 futex_ Sep21 ? 01:23:32 java -Xmx256m -Djava.library.path=/opt/draios/lib -Dsun.rmi.transport.connectionTimeout=2000 -Dsun.rmi.transport.tcp.handshakeTimeout=2000 -Dsun.rmi.transport.tcp.responseTimeout=2000 -Dsun.rmi.transport.tcp.readTimeout=2000 -jar /opt/draios/share/sdjagent.jar
22. This process will most likely be driven by some container. The parent PID is the 5th column, so just keep running ps -elf up that chain until you find conmon; for example:
    $ ps -elf | grep 146441 | grep -v grep
    4 S root  146441  146404  0  80   0 -  2977 do_wai Sep21 ? 00:00:00 /bin/bash /var/tmp/sclXDwWEb
    4 S root  146443  146441  0  80   0 - 15984 hrtime Sep21 ? 00:01:14 /opt/draios/bin/dragent --noipcns
    $ ps -elf | grep 146404 | grep -v grep
    4 S root  146404  146391  0  80   0 - 13837 do_wai Sep21 ? 00:00:00 /usr/bin/scl enable llvm-toolset-7.0 -- /docker-entrypoint.sh
    4 S root  146441  146404  0  80   0 -  2977 do_wai Sep21 ? 00:00:00 /bin/bash /var/tmp/sclXDwWEb
    $ ps -elf | grep 146391 | grep -v grep
    1 S root  146391       1  0  80   0 - 30958 poll_s Sep21 ? 00:05:20 /usr/bin/conmon -b /var/data/crioruntimestorage/overlay-containers/681b13596d8c31f8e60e8b0a0973382fe73094f37ec13ff2fa32918996af06e7/userdata [...]
    4 S root  146404  146391  0  80   0 - 13837 do_wai Sep21 ? 00:00:00 /usr/bin/scl enable llvm-toolset-7.0 -- /docker-entrypoint.sh
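    If preferred, the walk up the parent chain can be scripted; a rough sketch (not from the original procedure) that prints each ancestor until PID 1 is reached:
        # Example: start from the ppid= value found in the audit event and print each
        # ancestor, so a conmon ancestor (if any) is easy to spot.
        PID=149339
        while [ -n "$PID" ] && [ "$PID" -gt 1 ]; do
          ps -o pid=,ppid=,comm=,args= -p "$PID"
          PID=$(ps -o ppid= -p "$PID" | tr -d ' ')
        done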
23. Take the hexadecimal string in the conmon command line to get container information; for example:
    $ runc state 681b13596d8c31f8e60e8b0a0973382fe73094f37ec13ff2fa32918996af06e7
    [...]
        "io.kubernetes.container.name": "sysdig-agent",
        "io.kubernetes.pod.name": "sysdig-agent-l49j6",
        "io.kubernetes.pod.namespace": "ibm-observe",
    [...]
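    Depending on what is installed on the node, crictl may show similar pod labels for the same container ID; a rough sketch (not part of the original procedure):
        crictl inspect 681b13596d8c31f8e60e8b0a0973382fe73094f37ec13ff2fa32918996af06e7 | grep "io.kubernetes.pod"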
24. Therefore, the ultimate cause of this signal was the sysdig-agent container in the sysdig-agent-l49j6 pod in the ibm-observe namespace.
25. If the signal audit rule is no longer needed, remove it from /etc/audit/rules.d/audit.rules, re-generate the rules, and restart auditd.
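    As a sketch, the cleanup might look like the following (the sed pattern assumes no other rules in the file mention watchkill; adjust to your rules file):
        sed -i '/-k watchkill/d' /etc/audit/rules.d/audit.rules   # remove the rule added above
        augenrules --load                                         # regenerate the rules
        systemctl kill auditd && systemctl start auditd           # restart auditd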