OpenShift Investigate Source of Signal
This procedure helps find the source of a kill signal such as SIGQUIT:
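The commands below use shell variables such as $NAMESPACE and $PODNAME as placeholders for your own values; for example (hypothetical values):

    NAMESPACE=myproject                # hypothetical namespace
    PODNAME=mypod-7d57d6599f-tq7vt     # pod receiving the signal (from step 2 below)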
1. Ensure you're logged in with oc as a user with cluster-admin permissions.
2. Find the relevant pod receiving the signal:
    $ oc get pods --namespace $NAMESPACE
    NAME                     READY   STATUS    RESTARTS   AGE
    mypod-7d57d6599f-tq7vt   1/1     Running   0          12m
3. Find the worker node of the pod:
    oc get pod --namespace $NAMESPACE --output "jsonpath={.spec.nodeName}{'\n'}" $PODNAME
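    For convenience, the node name can be captured into a variable for use with oc debug in the next step (a small sketch using the same jsonpath query; the NODE variable name is just an example):
        NODE=$(oc get pod --namespace $NAMESPACE --output "jsonpath={.spec.nodeName}" $PODNAME)
        echo $NODE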
4. Start a debug pod on the worker node with the containerdiag image:
    oc debug node/$NODE -t --image=quay.io/ibm/containerdiag
5. Find the worker node PID of the pod container (we'll use this later); for example:
    $ podinfo.sh -p mypod-7d57d6599f-tq7vt
    3636617
6. Change to the root filesystem:
    chroot /host
7. Run this command to append to the audit rules file:
    cat >> /etc/audit/rules.d/audit.rules
8. Paste this line and press ENTER:
    -a always,exit -F arch=b64 -S kill -k watchkill
9. Type Ctrl+D to finish the append.
10. Confirm the line is there:
    $ tail -1 /etc/audit/rules.d/audit.rules
    -a always,exit -F arch=b64 -S kill -k watchkill
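    Roughly, this rule asks auditd to record every exit of the kill syscall on the 64-bit syscall ABI and to tag matching events with the key watchkill so they can be found later with ausearch -k. An informal annotation follows (not part of the original rules file); note that 32-bit callers would need a separate arch=b32 rule:
        # -a always,exit : always write an audit record when the syscall exits
        # -F arch=b64    : match the 64-bit syscall ABI
        # -S kill        : match the kill syscall
        # -k watchkill   : tag matching events with the key "watchkill"
        -a always,exit -F arch=b64 -S kill -k watchkill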
11. Regenerate the audit rules:
    augenrules --load
12. Kill auditd (there is no graceful way of doing this):
    systemctl kill auditd
13. Start auditd:
    systemctl start auditd
14. Double-check the status and make sure it's running (active (running)):
    $ systemctl status auditd
    ● auditd.service - Security Auditing Service
       Loaded: loaded (/usr/lib/systemd/system/auditd.service; enabled; vendor preset: enabled)
       Active: active (running) since Wed 2022-10-05 13:26:04 UTC; 9min ago
    [...]
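    Optionally, the audit rules now loaded in the kernel can also be listed to confirm the new rule is active (output format may vary by audit version):
        auditctl -l | grep watchkill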
15. Wait for the signal to occur.
16. After the issue is reproduced, search for the signal in the audit logs (replace SIGQUIT with the signal name):
    ausearch -k watchkill -i | grep -A 5 -B 5 --group-separator========= SIGQUIT
17. Find the relevant audit event; for example:
    type=PROCTITLE msg=audit(10/05/22 08:47:31.523:278210) : proctitle=java -Dsdjagent.loadjnilibrary=false -Dsun.jvmstat.perdata.syncWaitMs=5000 -Dsdjagent.managementAgentConnectDelayMs=0 -jar /tmp/
    type=OBJ_PID msg=audit(10/05/22 08:47:31.523:278210) : opid=230677 oauid=unset ouid=unknown(1000680000) oses=-1 obj=system_u:system_r:container_t:s0:c15,c26 ocomm=java
    type=SYSCALL msg=audit(10/05/22 08:47:31.523:278210) : arch=x86_64 syscall=kill success=yes exit=0 a0=0x1 a1=SIGQUIT a2=0x1 a3=0x7 items=0 ppid=149339 pid=218261 auid=unset uid=unknown(1000680000) gid=root euid=unknown(1000680000) suid=unknown(1000680000) fsuid=unknown(1000680000) egid=root sgid=root fsgid=root tty=(none) ses=unset comm=main exe=/opt/java/openjdk/jre/bin/java subj=system_u:system_r:spc_t:s0 key=watchkill
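    The sender's pid= and ppid= fields (interpreted in the next steps) can be pulled out of the SYSCALL record with a rough one-liner like the following (not part of the original procedure; adjust the signal name as needed):
        ausearch -k watchkill -i | grep SYSCALL | grep SIGQUIT | grep -o -E '(ppid|pid)=[0-9]+'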
18. In the OBJ_PID line, the opid= is the PID of the program receiving the signal. Confirm this matches the worker node PID of the pod container from step 5 above.
19. In the PROCTITLE line, the proctitle= is the command line of the program sending the signal. In the SYSCALL line, the pid= is the PID of the program sending the signal and the ppid= is the parent PID of that program.
20. Search for the pid= in ps; for example:
    ps -elf | grep 218261
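    If the sending process is still running, its command line can also be read directly from /proc; a small sketch (the PID is the example value from above):
        tr '\0' ' ' < /proc/218261/cmdline; echo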
21. If nothing is found (i.e. the process sending the signal quickly went away), search for the ppid= in ps; for example:
    $ ps -elf | grep 149339
    0 S root  149339  146443  0  80   0 - 642951 futex_ Sep21 ? 01:23:32 java -Xmx256m -Djava.library.path=/opt/draios/lib -Dsun.rmi.transport.connectionTimeout=2000 -Dsun.rmi.transport.tcp.handshakeTimeout=2000 -Dsun.rmi.transport.tcp.responseTimeout=2000 -Dsun.rmi.transport.tcp.readTimeout=2000 -jar /opt/draios/share/sdjagent.jar
22. This process will most likely be driven by some container. The parent PID is the 5th column, so just keep running ps -elf up that chain until you find conmon; for example:
    $ ps -elf | grep 146441 | grep -v grep
    4 S root  146441  146404  0  80   0 -  2977 do_wai Sep21 ? 00:00:00 /bin/bash /var/tmp/sclXDwWEb
    4 S root  146443  146441  0  80   0 - 15984 hrtime Sep21 ? 00:01:14 /opt/draios/bin/dragent --noipcns
    $ ps -elf | grep 146404 | grep -v grep
    4 S root  146404  146391  0  80   0 - 13837 do_wai Sep21 ? 00:00:00 /usr/bin/scl enable llvm-toolset-7.0 -- /docker-entrypoint.sh
    4 S root  146441  146404  0  80   0 -  2977 do_wai Sep21 ? 00:00:00 /bin/bash /var/tmp/sclXDwWEb
    $ ps -elf | grep 146391 | grep -v grep
    1 S root  146391       1  0  80   0 - 30958 poll_s Sep21 ? 00:05:20 /usr/bin/conmon -b /var/data/crioruntimestorage/overlay-containers/681b13596d8c31f8e60e8b0a0973382fe73094f37ec13ff2fa32918996af06e7/userdata [...]
    4 S root  146404  146391  0  80   0 - 13837 do_wai Sep21 ? 00:00:00 /usr/bin/scl enable llvm-toolset-7.0 -- /docker-entrypoint.sh
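    If preferred, the walk up the parent chain can be scripted; a rough sketch (not from the original procedure) that prints each ancestor until PID 1 is reached:
        # Example: start from the ppid= value found in the audit event and print each
        # ancestor, so a conmon ancestor (if any) is easy to spot.
        PID=149339
        while [ -n "$PID" ] && [ "$PID" -gt 1 ]; do
          ps -o pid=,ppid=,comm=,args= -p "$PID"
          PID=$(ps -o ppid= -p "$PID" | tr -d ' ')
        done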
23. Take the hexadecimal string in the conmon command line to get container information; for example:
    $ runc state 681b13596d8c31f8e60e8b0a0973382fe73094f37ec13ff2fa32918996af06e7
    [...]
        "io.kubernetes.container.name": "sysdig-agent",
        "io.kubernetes.pod.name": "sysdig-agent-l49j6",
        "io.kubernetes.pod.namespace": "ibm-observe",
    [...]
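    Depending on what is installed on the node, crictl may show similar pod labels for the same container ID; a rough sketch (not part of the original procedure):
        crictl inspect 681b13596d8c31f8e60e8b0a0973382fe73094f37ec13ff2fa32918996af06e7 | grep "io.kubernetes.pod"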
24. Therefore, the ultimate cause of this signal was the sysdig-agent container in the sysdig-agent-l49j6 pod in the ibm-observe namespace.
25. If the signal audit rule is no longer needed, remove it from /etc/audit/rules.d/audit.rules, re-generate the rules, and restart auditd.
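    As a sketch, the cleanup might look like the following (the sed pattern assumes no other rules in the file mention watchkill; adjust to your rules file):
        sed -i '/-k watchkill/d' /etc/audit/rules.d/audit.rules   # remove the rule added above
        augenrules --load                                         # regenerate the rules
        systemctl kill auditd && systemctl start auditd           # restart auditd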