OpenShift: Investigate the Source of a Signal

This procedure helps find the source of a kill signal such as SIGQUIT:

  1. Ensure you're logged in with oc as a user with cluster-admin permissions.
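    For example, one way to verify this (a sketch; the second command should print yes for a user with cluster-admin permissions):
    oc whoami
    oc auth can-i '*' '*' --all-namespaces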
  2. Find the relevant pod receiving the signal:
    $ oc get pods --namespace $NAMESPACE
    NAME                            READY   STATUS    RESTARTS   AGE
    mypod-7d57d6599f-tq7vt          1/1     Running   0          12m
  3. Find the worker node of the pod:
    oc get pod --namespace $NAMESPACE --output "jsonpath={.spec.nodeName}{'\n'}" $PODNAME
  4. Start a debug pod on the worker node with the containerdiag image:
    oc debug node/$NODE -t --image=quay.io/ibm/containerdiag
  5. Find the worker node PID of the pod container (we'll use this later); for example:
    $ podinfo.sh -p mypod-7d57d6599f-tq7vt
    3636617
  7. Change to the host root filesystem:
    chroot /host
  7. Run this command to append to the audit rules file:
    cat >> /etc/audit/rules.d/audit.rules
  8. Paste this line and press ENTER:
    -a always,exit -F arch=b64 -S kill -k watchkill
  9. Press Ctrl+D to finish the append.
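    Alternatively, steps 7 through 9 can be combined into a single non-interactive command; for example:
    echo "-a always,exit -F arch=b64 -S kill -k watchkill" >> /etc/audit/rules.d/audit.rules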
  10. Confirm the line is there:
    $ tail -1 /etc/audit/rules.d/audit.rules
    -a always,exit -F arch=b64 -S kill -k watchkill
  11. Regenerate the audit rules:
    augenrules --load
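    To confirm the kernel loaded the rule, you can list the active rules; for example (auditctl may display the key as -F key=watchkill):
    auditctl -l | grep watchkill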
  12. Kill auditd (the auditd unit refuses a normal systemctl stop, so there is no graceful way of doing this):
    systemctl kill auditd
  13. Start auditd:
    systemctl start auditd
  14. Double-check the status and make sure it shows active (running):
    $ systemctl status auditd
    ● auditd.service - Security Auditing Service
       Loaded: loaded (/usr/lib/systemd/system/auditd.service; enabled; vendor preset: enabled)
       Active: active (running) since Wed 2022-10-05 13:26:04 UTC; 9min ago [...]
  15. Wait for the signal to occur.
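    One way to notice when it occurs is to watch the pod and its logs from another terminal; for example (a sketch; depending on the signal, the pod may restart or the process may only write diagnostics such as a thread dump):
    oc get pods --namespace $NAMESPACE --watch
    oc logs --follow --namespace $NAMESPACE $PODNAME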
  16. After the issue is reproduced, search for the signal in the audit logs (replacing SIGQUIT in the command with the name of the signal you're investigating):
    ausearch -k watchkill -i | grep -A 5 -B 5 --group-separator========= SIGQUIT
  17. Find the relevant audit event; for example:
    type=PROCTITLE msg=audit(10/05/22 08:47:31.523:278210) : proctitle=java -Dsdjagent.loadjnilibrary=false -Dsun.jvmstat.perdata.syncWaitMs=5000 -Dsdjagent.managementAgentConnectDelayMs=0 -jar /tmp/
    type=OBJ_PID msg=audit(10/05/22 08:47:31.523:278210) : opid=230677 oauid=unset ouid=unknown(1000680000) oses=-1 obj=system_u:system_r:container_t:s0:c15,c26 ocomm=java
    type=SYSCALL msg=audit(10/05/22 08:47:31.523:278210) : arch=x86_64 syscall=kill success=yes exit=0 a0=0x1 a1=SIGQUIT a2=0x1 a3=0x7 items=0 ppid=149339 pid=218261 auid=unset uid=unknown(1000680000) gid=root euid=unknown(1000680000) suid=unknown(1000680000) fsuid=unknown(1000680000) egid=root sgid=root fsgid=root tty=(none) ses=unset comm=main exe=/opt/java/openjdk/jre/bin/java subj=system_u:system_r:spc_t:s0 key=watchkill
  18. In the OBJ_PID line, the opid= is the PID of the program receiving the signal. Confirm this matches the worker node PID of the pod container from step 5 above.
  19. In the PROCTITLE line, the proctitle= is the command line of the program sending the signal. In the SYSCALL line, the pid= is the PID of the program sending the signal and the ppid= is the parent PID of that program.
  20. Search for the pid= in ps; for example:
    ps -elf | grep 218261
  21. If nothing is found (i.e. the process sending the signal quickly went away), search for the ppid= in ps; for example:
    $ ps -elf | grep 149339
    0 S root     149339 146443  0  80   0 - 642951 futex_ Sep21 ?       01:23:32 java -Xmx256m -Djava.library.path=/opt/draios/lib -Dsun.rmi.transport.connectionTimeout=2000 -Dsun.rmi.transport.tcp.handshakeTimeout=2000 -Dsun.rmi.transport.tcp.responseTimeout=2000 -Dsun.rmi.transport.tcp.readTimeout=2000 -jar /opt/draios/share/sdjagent.jar
  22. This process is most likely driven by some container. In the ps -elf output, the parent PID (PPID) is the 5th column, so keep running ps -elf against each parent PID up the chain until you find conmon; for example:
    $ ps -elf | grep 146441 | grep -v grep
    4 S root     146441 146404  0  80   0 -  2977 do_wai Sep21 ?        00:00:00 /bin/bash /var/tmp/sclXDwWEb
    4 S root     146443 146441  0  80   0 - 15984 hrtime Sep21 ?        00:01:14 /opt/draios/bin/dragent --noipcns
    $ ps -elf | grep 146404 | grep -v grep
    4 S root     146404 146391  0  80   0 - 13837 do_wai Sep21 ?        00:00:00 /usr/bin/scl enable llvm-toolset-7.0 -- /docker-entrypoint.sh
    4 S root     146441 146404  0  80   0 -  2977 do_wai Sep21 ?        00:00:00 /bin/bash /var/tmp/sclXDwWEb
    $ ps -elf | grep 146391 | grep -v grep
    1 S root     146391      1  0  80   0 - 30958 poll_s Sep21 ?        00:05:20 /usr/bin/conmon -b /var/data/crioruntimestorage/overlay-containers/681b13596d8c31f8e60e8b0a0973382fe73094f37ec13ff2fa32918996af06e7/userdata [...]
    4 S root     146404 146391  0  80   0 - 13837 do_wai Sep21 ?        00:00:00 /usr/bin/scl enable llvm-toolset-7.0 -- /docker-entrypoint.sh
  23. Take the hexadecimal container ID from the conmon command line (the long string after overlay-containers/ in the -b argument) and pass it to runc state to get container information; for example:
    $ runc state 681b13596d8c31f8e60e8b0a0973382fe73094f37ec13ff2fa32918996af06e7
     [...]
     "io.kubernetes.container.name": "sysdig-agent",
     "io.kubernetes.pod.name": "sysdig-agent-l49j6",
     "io.kubernetes.pod.namespace": "ibm-observe",
     [...]
  24. Therefore, the ultimate cause of this signal was the sysdig-agent container in the sysdig-agent-l49j6 pod in the ibm-observe namespace.
  25. If the signal audit rule is no longer needed, remove it from /etc/audit/rules.d/audit.rules, re-generate the rules, and restart auditd.
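    For example (a sketch; adjust the sed pattern if you used a key other than watchkill):
    sed -i "/watchkill/d" /etc/audit/rules.d/audit.rules
    augenrules --load
    systemctl kill auditd
    systemctl start auditd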