Troubleshooting Linux

General Troubleshooting Commands

  • Print system page size: getconf PAGESIZE

  • The ausyscall command converts a syscall number to the syscall name. Example:

    $ ausyscall 221  
    fadvise64

Kernel symbol table

Gather the kernel symbol table:

$ sudo su -
$ cat /proc/kallsyms &> kallsyms_$(hostname)_$(date +"%Y%m%d_%H%M%S").txt
$ cat /boot/System.map-$(uname -r) &> systemmap_$(hostname)_$(date +"%Y%m%d_%H%M%S").txt

Upload kallsyms_*txt and systemmap_*txt

pgrep/pkill

pgrep finds process IDs based on various search options. It is a more formalized alternative to common commands like ps -elf | grep something: https://www.kernel.org/doc/man-pages/online/pages/man1/pgrep.1.html

Examples:

  • Search by simple program name: pgrep java
  • Search by something in the full program name or full command line: pgrep -f server1

pidof is a similar program to pgrep: https://www.kernel.org/doc/man-pages/online/pages/man1/pidof.1.html

pkill combines pgrep and kill into one command: https://www.kernel.org/doc/man-pages/online/pages/man1/pkill.1.html

Examples:

  • Send SIGQUIT to all Java programs: pkill -3 java
  • Send SIGQUIT to all Java programs with server1 in the command line: pkill -3 -f server1
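
To illustrate what pgrep -f does, here is a minimal Python sketch that scans /proc for command lines matching a regular expression. The function name pgrep_full and its proc_root parameter are hypothetical and exist only for this example. Unlike the real pgrep, this sketch does not exclude the calling process.

```python
import os
import re

def pgrep_full(pattern, proc_root="/proc"):
    """Return sorted PIDs whose full command line matches the regex pattern."""
    regex = re.compile(pattern)
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited or is not readable
        # Arguments are separated by NUL bytes in /proc/PID/cmdline
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if cmdline and regex.search(cmdline):
            pids.append(int(entry))
    return sorted(pids)
```

For example, pgrep_full("server1") would return a list of PIDs comparable to pgrep -f server1.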

kill

The kill command sends a signal to a process, most commonly to terminate it:

kill $PID

Without a signal argument, kill sends SIGTERM (15), which is equivalent to kill -15 $PID.

To specify a signal, use the number or name of the signal. For example, to send the equivalent of Ctrl+C to a process, use either one of the following commands:

$ kill -2 $PID
$ kill -INT $PID

To list all available signals:

$ kill -l
 1) SIGHUP        2) SIGINT        3) SIGQUIT       4) SIGILL        5) SIGTRAP
 6) SIGABRT       7) SIGBUS        8) SIGFPE        9) SIGKILL      10) SIGUSR1
11) SIGSEGV      12) SIGUSR2      13) SIGPIPE      14) SIGALRM      15) SIGTERM
16) SIGSTKFLT    17) SIGCHLD      18) SIGCONT      19) SIGSTOP      20) SIGTSTP
21) SIGTTIN      22) SIGTTOU      23) SIGURG       24) SIGXCPU      25) SIGXFSZ
26) SIGVTALRM    27) SIGPROF      28) SIGWINCH     29) SIGIO        30) SIGPWR
31) SIGSYS       34) SIGRTMIN     35) SIGRTMIN+1   36) SIGRTMIN+2   37) SIGRTMIN+3
38) SIGRTMIN+4   39) SIGRTMIN+5   40) SIGRTMIN+6   41) SIGRTMIN+7   42) SIGRTMIN+8
43) SIGRTMIN+9   44) SIGRTMIN+10  45) SIGRTMIN+11  46) SIGRTMIN+12  47) SIGRTMIN+13
48) SIGRTMIN+14  49) SIGRTMIN+15  50) SIGRTMAX-14  51) SIGRTMAX-13  52) SIGRTMAX-12
53) SIGRTMAX-11  54) SIGRTMAX-10  55) SIGRTMAX-9   56) SIGRTMAX-8   57) SIGRTMAX-7
58) SIGRTMAX-6   59) SIGRTMAX-5   60) SIGRTMAX-4   61) SIGRTMAX-3   62) SIGRTMAX-2
63) SIGRTMAX-1   64) SIGRTMAX 
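
The same number-to-name mapping is available from scripts. The following Python sketch uses the standard signal module; the helper names signal_name and signal_number are hypothetical. Note that Python's signal.Signals enum only names the signals it knows about, so most real-time signals (e.g. SIGRTMIN+1) raise ValueError.

```python
import signal

def signal_name(number):
    """Return the symbolic name for a signal number, e.g. 15 -> 'SIGTERM'."""
    return signal.Signals(number).name

def signal_number(name):
    """Return the number for a signal name given with or without the SIG prefix."""
    name = name.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    return signal.Signals[name].value
```

A script can then send a signal by name with os.kill(pid, signal_number("INT")), which corresponds to kill -INT $PID.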

SIGSTOP may be used to completely pause a process so that the operating system does not schedule it. SIGCONT may be used to continue a stopped process. This can be useful for things such as simulating a hung database.
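
As a minimal sketch of pausing and resuming a process from a script, the following Python helpers send SIGSTOP and SIGCONT and read the process state from /proc. The function names are hypothetical, and the state lookup assumes a Linux /proc filesystem.

```python
import os
import signal

def process_state(pid):
    """Return the one-letter state from /proc/PID/stat (e.g. 'S' sleeping, 'T' stopped)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The state is the first field after the parenthesized command name
    return stat.rsplit(")", 1)[1].split()[0]

def pause(pid):
    """Stop the process so that it is no longer scheduled."""
    os.kill(pid, signal.SIGSTOP)

def resume(pid):
    """Continue a previously stopped process."""
    os.kill(pid, signal.SIGCONT)
```

The state change is asynchronous, so a caller that needs to confirm the pause should poll process_state until it reports T.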

Find who killed a process

There are two main ways a process is killed:

  1. It kills itself using a call to java/lang/System.exit, java/lang/Runtime.halt, exit, raise, etc.
  2. It is killed by the kernel or another process using the kill system call, the kill command, etc.

These are diagnosed differently; each has its own section below.

Find why a process killed itself

  1. If using IBM Java or Semeru/OpenJ9, restart the process with the following Java options and review the resulting javacore:
    -Xdump:java:events=vmstop,request=exclusive+preempt
  2. If using HotSpot Java, some builds include User Statically-Defined Tracing (USDT) probes that SystemTap or eBPF (on newer kernels) can use to trace calls to System.exit.

Find who killed another process

  1. Check the native stdout and stderr logs of the process for any suspicious activity

  2. Check the kernel log around the time of the kill for things like the OOM Killer and other potentially related messages (e.g. SSH login by some user)

  3. Consider using bcc-tools and killsnoop.py

  4. If an auditing or keylogging system is in place, review if anyone used the kill command.

  5. For systems that support it, use auditd with a rule to watch for kill system calls, although test the performance overhead.

  6. For kernels that support SystemTap, combine scripts such as https://github.com/jav/systemtap/blob/master/testsuite/systemtap.examples/process/sigmon.stp and https://github.com/jav/systemtap/blob/master/testsuite/systemtap.examples/process/proc_snoop.stp to capture the signal and map to the source PID with details.

  7. For some signals like SIGTERM (but not SIGKILL), attach strace to the process and watch for signal notifications although the overhead may be massive even with the -e filter:

    $ nohup strace -f -tt -e signal -o strace_trace.txt -p $PID &>> strace_stdouterr.txt &
    $ tail -f strace_trace.txt | grep " SIG"
    2406  18:50:39.769367 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=678, si_uid=0} ---

    The si_pid integer is the sending PID. A script in the background that periodically writes ps output may capture this process. Create psbg.sh:

    #!/bin/sh
    outputfile="diag_ps_$(hostname)_$(date +"%Y%m%d_%H%M%S").log"
    while true; do
      # Use POSIX redirection; &>> is a bash extension that /bin/sh may not support
      echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") iteration" >> "${outputfile}" 2>&1
      ps -elf >> "${outputfile}" 2>&1
      sleep 15
    done

    Then start before the strace:

    $ chmod a+x psbg.sh
    $ nohup ./psbg.sh &

  8. For some signals, attach gdb to the process, configure gdb to stop only on the signal of interest, and continue. When that signal hits the process, gdb will break execution and leave you at a prompt. Then print $_siginfo._sifields._kill.si_pid and either continue or detach. Use the same ps script as above to track potential source PIDs.

    $ java HelloWorld
    Hello World. Waiting indefinitely...
    
    $ ps -elf | grep HelloWorld | grep -v grep
    0 S kevin    23947 ...
    
    $ gdb java 23947
    ...
    (gdb) handle all nostop noprint noignore
    (gdb) handle SIGABRT stop print noignore
    (gdb) continue
    
    # ... Reproduce the problem ...
    
    Program received signal SIGABRT, Aborted.
    [Switching to Thread 0x7f232df12700 (LWP 23949)]
    0x00000033a400d720 in sem_wait () from /lib64/libpthread.so.0
    (gdb) ptype $_siginfo
    type = struct {
        int si_signo;
        int si_errno;
        int si_code;
        union {
            int _pad[28];
            struct {...} _kill;
            struct {...} _timer;
            struct {...} _rt;
            struct {...} _sigchld;
            struct {...} _sigfault;
            struct {...} _sigpoll;
        } _sifields;
    }
    (gdb) ptype $_siginfo._sifields._kill
    type = struct {
        __pid_t si_pid;
        __uid_t si_uid;
    }
    (gdb) p $_siginfo._sifields._kill.si_pid
    $1 = 22691
    
    (gdb) continue

    In the above example, we print _sifields._kill because we know the signal was sent with kill, but strictly speaking, that assumption cannot always be made. _sifields is a union, so only one of its members holds meaningful values for a given signal. You must first consult the signal number to know which union member to print. As the sigaction(2) man page puts it:

    "The rest of the struct may be a union, so that one should read only the fields that are meaningful for the given signal"
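
As a sketch of post-processing the strace output from step 7, the following Python function extracts the siginfo fields (including si_pid, the sending PID) from one signal line. The function name and regular expression are assumptions based on the sample line shown above; other strace versions may format the line differently.

```python
import re

# Matches lines such as: --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, ...} ---
SIGNAL_LINE = re.compile(r"--- (?P<signal>SIG\w+) \{(?P<fields>[^}]*)\} ---")

def parse_signal_line(line):
    """Return a dict of siginfo fields from one strace signal line, or None."""
    match = SIGNAL_LINE.search(line)
    if match is None:
        return None
    info = {"signal": match.group("signal")}
    for field in match.group("fields").split(", "):
        key, _, value = field.partition("=")
        info[key] = value
    return info
```

The si_pid value can then be looked up in the periodic ps output captured by psbg.sh.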

File I/O

fsync

fsync is a system call used to attempt to flush pending I/O writes to disk; however, there are various potential issues with fsync (Rebello et al., 2021). One way to reduce such risks is to use a copy-on-write file system such as Btrfs instead of journaling file systems such as ext4 and XFS.

sosreport

sosreport is a utility to gather system-wide diagnostics on Fedora, Red Hat, and CentOS distributions:

  1. sudo dnf install -y sos
  2. sudo sosreport --batch
  3. This will take a few minutes to run.
  4. A compressed file will be produced such as /var/tmp/sosreport-7ce62b94e928-2020-09-01-itqojtr.tar.xz. To use an alternative directory, specify --tmp-dir $dir.

Analysis tips:

  1. Uncompress with tar -xf sosreport*tar.xz

systemd

Killing nohup processes

Recent versions of systemd terminate user processes that are part of the user session scope unit (session-XX.scope) when the user logs out, even if they were started with nohup. Either use systemd-run instead of nohup, or set KillUserProcesses to no in logind.conf.

Signal handlers

Show the pending, blocked, ignored, and caught signal masks of a process (SigCgt is the mask of signals with registered handlers):

grep Sig /proc/$pid/status
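
The masks are printed as hexadecimal bit fields in which bit N-1 represents signal N. The following Python sketch decodes them into signal numbers and names; the function names are hypothetical. The SigQ line that the grep also matches is a queue counter rather than a mask, so the sketch skips it.

```python
import signal

MASK_FIELDS = ("SigPnd", "ShdPnd", "SigBlk", "SigIgn", "SigCgt")

def decode_signal_mask(hex_mask):
    """Return the signal numbers whose bits are set; bit N-1 represents signal N."""
    mask = int(hex_mask, 16)
    return [n for n in range(1, 65) if mask & (1 << (n - 1))]

def describe(numbers):
    """Translate signal numbers to names where Python knows them."""
    names = []
    for n in numbers:
        try:
            names.append(signal.Signals(n).name)
        except ValueError:
            names.append(str(n))  # e.g. real-time signals without an enum member
    return names

def signal_masks(status_text):
    """Map each mask field in /proc/PID/status text to its signal numbers."""
    masks = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in MASK_FIELDS:
            masks[key] = decode_signal_mask(value.strip())
    return masks
```

For example, a SigCgt value of 0000000000004002 decodes to signals 2 (SIGINT) and 15 (SIGTERM).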

Process core dumps

Core dumps are normally written in the ELF file format. Therefore, use the readelf program to list the LOAD segments and review the virtual memory regions that were dumped to the core:

$ readelf --program-headers core
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x00000000000003f8 0x0000000000000000 0x0000000000000000
                 0x00000000000008ac 0x0000000000000000  R      1
  LOAD           0x0000000000000ca4 0x0000000000400000 0x0000000000000000
                 0x0000000000001000 0x0000000000001000  R E    1
  LOAD           0x0000000000001ca4 0x0000000000600000 0x0000000000000000
                 0x0000000000001000 0x0000000000001000  RW     1...
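
To total the dumped regions, the readelf output can be post-processed. The following Python sketch assumes the two-line-per-entry layout shown above (not the single-line --wide layout), where the second line of each entry holds FileSiz followed by MemSiz. The function name is hypothetical.

```python
def sum_load_segments(readelf_output):
    """Return (count, total MemSiz in bytes) of the LOAD segments in readelf text."""
    lines = readelf_output.splitlines()
    count = 0
    total = 0
    for index, line in enumerate(lines):
        fields = line.split()
        if fields and fields[0] == "LOAD" and index + 1 < len(lines):
            # The second line of each entry holds FileSiz, then MemSiz
            sizes = lines[index + 1].split()
            total += int(sizes[1], 16)
            count += 1
    return count, total
```

The text passed in would be the captured output of readelf --program-headers core.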

Request core dump (also known as a "system dump" for IBM Java)

Additional methods of requesting system dumps for IBM Java are documented in the Troubleshooting IBM Java and Troubleshooting WAS chapters.

  1. The gcore command pauses the process while the core is generated, after which the process should continue. Replace ${PID} in the following example with the process ID. You must have permissions to the process (i.e. run either as the owner of the process or as root). The size of the core file will be the virtual size of the process (ps VSZ). If there is sufficient free space in physical RAM and the filecache, the core file will be written to RAM and then asynchronously written out to the filesystem, which can dramatically improve the speed of generating a core and reduce the time the process is paused. In general, core dumps compress very well (often up to 75%) for transfer. Normally, the gcore command is provided as part of the gdb package; it is a shell script that attaches gdb to the process, runs the gdb gcore command, and then detaches.

    gcore ${PID} core.$(date +%Y%m%d.%H%M%S).dmp  

    There is some evidence that the gcore command in gdb writes less information than the kernel would write in the case of a crash (this probably has to do with the two implementations being different code bases).

  2. The process may be crashed using kill -6 ${PID} or kill -11 ${PID}, which will usually produce a core dump.

  3. On OutOfMemoryError using the J9 option:

    "-Xdump:tool:events=systhrow,filter=java/lang/OutOfMemoryError,range=1..1,request=exclusive+prepwalk,exec=gcore %p"

IBM proposed a kernel API to create a core dump, but it was rejected for security reasons, with the suggestion to do it in user space instead.

Core dumps from crashes

When a crash occurs, the kernel may create a core dump of the process. How much is written is controlled by coredump_filter:

Since kernel 2.6.23, the Linux-specific /proc/PID/coredump_filter file can be used to control which memory segments are written to the core dump file in the event that a core dump is performed for the process with the corresponding process ID. The value in the file is a bit mask of memory mapping types (see mmap(2)). (https://www.kernel.org/doc/man-pages/online/pages/man5/core.5.html)

When a process is dumped, all anonymous memory is written to a core file as long as the size of the core file isn't limited. But sometimes we don't want to dump some memory segments, for example, huge shared memory. Conversely, sometimes we want to save file-backed memory segments into a core file, not only the individual files. /proc/PID/coredump_filter allows you to customize which memory segments will be dumped when the PID process is dumped. coredump_filter is a bitmask of memory types. If a bit of the bitmask is set, memory segments of the corresponding memory type are dumped, otherwise they are not dumped. The following 7 memory types are supported:

- (bit 0) anonymous private memory
- (bit 1) anonymous shared memory
- (bit 2) file-backed private memory
- (bit 3) file-backed shared memory
- (bit 4) ELF header pages in file-backed private memory areas (it is effective only if the bit 2 is cleared)
- (bit 5) hugetlb private memory
- (bit 6) hugetlb shared memory

Note that MMIO pages such as frame buffer are never dumped and vDSO pages are always dumped regardless of the bitmask status. When a new process is created, the process inherits the bitmask status from its parent. It is useful to set up coredump_filter before the program runs.

For example:

$ echo 0x7 > /proc/self/coredump_filter
$ ./some_program

https://www.kernel.org/doc/Documentation/filesystems/proc.txt
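
To check which memory types a given coredump_filter value selects, the bitmask can be decoded. The following Python sketch uses the seven memory types listed above; the function and constant names are hypothetical.

```python
# Index in this list is the bit number in coredump_filter
COREDUMP_FILTER_BITS = [
    "anonymous private memory",
    "anonymous shared memory",
    "file-backed private memory",
    "file-backed shared memory",
    "ELF header pages in file-backed private memory areas",
    "hugetlb private memory",
    "hugetlb shared memory",
]

def decode_coredump_filter(value):
    """Return the memory types enabled by a hexadecimal coredump_filter value."""
    mask = int(value, 16)
    return [name for bit, name in enumerate(COREDUMP_FILTER_BITS) if mask & (1 << bit)]
```

For example, decode_coredump_filter("0x7") returns the first three types, and the value may also be passed as read from /proc/PID/coredump_filter (hexadecimal without a prefix).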

systemd-coredump

This section has been moved to Java best practices for core piping.

Process Virtual Address Space

The total virtual and resident address space sizes of a process may be queried with ps:

$ ps -o pid,vsz,rss -p 14062
  PID    VSZ   RSS
14062  44648 42508

Details of the virtual address space of a process may be queried with (https://www.kernel.org/doc/Documentation/filesystems/proc.txt):

$ cat /proc/${PID}/maps

This will produce a line of output for each virtual memory area (VMA):

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fd:02 22151273    /bin/cat...

The first column is the address range of the VMA. The second column is the set of permissions (read, write, execute, private). The third column is the offset if the VMA is a file, device, etc. The fourth column is the device (major:minor) if the VMA is a file, device, etc. The fifth column is the inode if the VMA is a file, device, etc. The final column is the pathname if the VMA is a file, etc.

The sum of these address ranges will equal the ps VSZ number.
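
As a sketch of that calculation, the following Python function sums the address ranges in the text of a maps file. The function name is hypothetical.

```python
def total_mapped_bytes(maps_text):
    """Sum the sizes in bytes of all address ranges in /proc/PID/maps text."""
    total = 0
    for line in maps_text.splitlines():
        if not line.strip():
            continue
        # The first column is START-END in hexadecimal
        start, end = line.split()[0].split("-")
        total += int(end, 16) - int(start, 16)
    return total
```

For the single /bin/cat line shown above, the result is 45056 bytes (44 kB). Dividing the total for a whole maps file by 1024 gives a number comparable to the ps VSZ column.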

In recent versions of Linux, smaps is a superset of maps and additionally includes details for each VMA:

$ cat /proc/self/smaps
00400000-0040b000 r-xp 00000000 fd:02 22151273    /bin/cat
Size:                 44 kB
Rss:                  20 kB
Pss:                  12 kB...

The Rss and Pss values are particularly interesting, showing how much of the VMA is resident in memory (some pages may be shared with other processes) and the proportional set size of a shared VMA where the size is divided by the number of processes sharing it, respectively.

smaps

The total virtual size of the process (VSZ):

$ grep ^Size smaps | awk '{print $2}' | paste -sd+ | bc | sed 's/$/*1024/' | bc
3597316096

The total resident size of the process (RSS):

$ grep "^Rss:" smaps | awk '{print $2}' | paste -sd+ | bc | sed 's/$/*1024/' | bc
897622016

The total proportional resident set size of the process (PSS):

$ grep "^Pss:" smaps | awk '{print $2}' | paste -sd+ | bc | sed 's/$/*1024/' | bc
891611136

In general, PSS is used for sizing physical memory.

Print the sum of VMA sizes of roughly 60MB to 70MB (the grep pattern matches eight-digit byte counts that start with 6):

$ grep -v -E "^[a-zA-Z_]+:" smaps | awk '{print $1}' | sed 's/\-/,/g' | perl -F, -lane 'print hex($F[1])-hex($F[0]);' | sort -n | grep "^[6].......$" | paste -sd+ | bc
840073216

Sort VMAs by RSS (the loop assumes that each smaps entry spans exactly 14 lines, which varies by kernel version):

$ cat smaps | while read line; do read line2; read line3; read line4; read line5; read line6; read line7; read line8; read line9; read line10; read line11; read line12; read line13; read line14; echo $line, $line3; done | awk '{print $(NF-1)*1024, $1}' | sort -n

pmap

The pmap command prints the same information as smaps but in a column-based format: https://www.kernel.org/doc/man-pages/online/pages/man1/pmap.1.html

    $ pmap -XX $(pgrep -f defaultServer)  
    22:   java -javaagent:/opt/ibm/wlp/bin/tools/ws-javaagent.jar -Djava.awt.headless=true -Djdk.attach.allowAttachSelf=true -Xshareclasses:name=liberty,nonfatal,cacheDir=/output/.classCache/ -XX:+UseContainerSupport -jar /opt/ibm/wlp/bin/tools/ws-server.jar defaultServer  
             Address Perm   Offset Device   Inode    Size KernelPageSize MMUPageSize    Rss    Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous LazyFree AnonHugePages ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked                 VmFlags Mapping
            00400000 r-xp 00000000  08:01 5384796       4              4           4      4      4            0            0             0             4          4         4        0             0              0              0               0    0       0      0       rd ex mr mw me dw java  
            00600000 r--p 00000000  08:01 5384796       4              4           4      4      4            0            0             0             4          4         4        0             0              0              0               0    0       0      0       rd mr mw me dw ac java  
            00601000 rw-p 00001000  08:01 5384796       4              4           4      4      4            0            0             0             4          0         4        0             0              0              0               0    0       0      0    rd wr mr mw me dw ac java  
            00d96000 rw-p 00000000  00:00       0     132              4           4    124    124            0            0             0           124        124       124        0             0              0              0               0    0       0      0       rd wr mr mw me ac [heap]
            00db7000 rw-p 00000000  00:00       0   51200              4           4  39124  39124            0            0             0         39124      39124     39124        0         38912              0              0               0    0       0      0    rd wr mr mw me nr hg [heap]
            03fb7000 ---p 00000000  00:00       0  153600              4           4      0      0            0            0             0             0          0         0        0             0              0              0               0    0       0      0          mr mw me nr hg  
            dfff1000 ---p 00000000  00:00       0      60              4           4      0      0            0            0             0             0          0         0        0             0              0              0               0    0       0      0          mr mw me nr hg  
            e0000000 rw-p 00000000  00:00       0   69184              4           4  47856  47856            0            0             0         47856      47856     47856        0         24576              0              0               0    0       0      0    rd wr mr mw me nr hg

gdb

Loading a core dump

A core dump is loaded by passing the paths to the executable and the core dump to gdb:

$ gdb ${PATH_TO_EXECUTABLE} ${PATH_TO_CORE}

To load matching symbols from particular paths (e.g. if the core is from another machine):

  1. Run gdb without any parameters
  2. set solib-absolute-prefix ${ABS_PATH_TO_SO_LIBS}
    • This is the simulated root; for example, if the .so was originally loaded from /lib64/libc.so.6, and you have it on your host in /tmp/dump/lib64/libc.so.6, then ${ABS_PATH_TO_SO_LIBS} would be /tmp/dump
  3. Optional if you have *.debug files: set debug-file-directory ${ABS_PATHS_TO_DIR_WITH_DEBUG_FILES}
  4. file ${PATH_TO_JAVA_EXECUTABLE_THAT_CREATED_THE_CORE}
  5. core-file ${PATH_TO_CORE}

Batch execute some gdb commands:

$ gdb --batch --quiet -ex "thread apply all bt" -ex "quit" $EXE $CORE

Common Commands

  • Disable pagination: set pagination off
    • Add to ~/.gdbinit to always run this on load
  • Print current thread stack: bt
  • Print thread stacks: thread apply all bt
  • Review what's loaded in memory:
    • info proc mappings
    • maintenance info sections
    • info files
  • List all threads: info threads
  • Switch to a different thread: thread N
  • Print loaded shared libraries: info sharedlibrary
  • Print register: p $rax
  • Print some number of words starting at a register; for example, 16 words starting from the stack pointer: x/16wx $sp
  • Print current instruction: x/i $pc
  • If available, print source of current function: list
  • Disassemble function at address: disas 0xff
  • Print structure definition: ptype struct malloc_state
  • Log output to a file (gdb.txt by default): set logging on
  • Print data type of variable: ptype var
  • Print symbol information: info symbol 0xff
  • Add a source directory to the source path: directory $DIR

Virtual memory may be printed with the x command:

(gdb) x/32xc 0x00007f3498000000
0x7f3498000000:    32 ' '      0 '\000'    0 '\000'    28 '\034'    54 '6'    127 '\177'    0 '\000'    0 '\000'
0x7f3498000008:    0 '\000'    0 '\000'    0 '\000'   -92 '\244'    52 '4'    127 '\177'    0 '\000'    0 '\000'
0x7f3498000010:    0 '\000'    0 '\000'    0 '\000'    4 '\004'    0 '\000'    0 '\000'    0 '\000'    0 '\000'
0x7f3498000018:    0 '\000'    0 '\000'    0 '\000'    4 '\004'    0 '\000'    0 '\000'    0 '\000'    0 '\000'

Another option is to dump memory to a file and then spawn an xxd process from within gdb to dump that file which is easier to read:

(gdb) define xxd
Type commands for definition of "xxd".
End with a line saying just "end".
>dump binary memory dump.bin $arg0 $arg0+$arg1
>shell xxd dump.bin
>shell rm -f dump.bin
>end
(gdb) xxd 0x00007f3498000000 32
0000000: 2000 001c 367f 0000 0000 00a4 347f 0000   ...6.......4...
0000010: 0000 0004 0000 0000 0000 0004 0000 0000  ................

For large areas, these may be dumped to a file directly:

(gdb) dump binary memory dump.bin 0x00007f3498000000 0x00007f34a0000000

Large VMAs often have a lot of zero'd memory. A simple trick to filter those out is to remove all zero lines:

$ xxd dump.bin | grep -v "0000 0000 0000 0000 0000 0000 0000 0000" | less
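
The same filtering can be done in a script when the dump needs further processing. The following Python sketch yields only the 16-byte rows that contain a non-zero byte; the function name and parameters are hypothetical, and bytes.hex with a separator requires Python 3.8 or later.

```python
def nonzero_rows(data, base=0, width=16):
    """Yield (address, hex_string) for each row that is not entirely zero bytes."""
    for offset in range(0, len(data), width):
        row = data[offset:offset + width]
        if any(row):
            yield base + offset, row.hex(" ")
```

For example, passing the contents of dump.bin with base set to the starting address of the dumped range yields rows labeled with their original virtual addresses.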

Process Virtual Address Space

Gdb can query a core file and produce output about the virtual address space which is similar to /proc/${PID}/smaps, although it is normally a subset of all of the VMAs:

(gdb) info files
Local core dump file:
    `core.16721.dmp', file type elf64-x86-64.
    ...
    0x00007f3498000000 - 0x00007f34a0000000 is load51...

The "Local core dump file" stanza of gdb "info files" seems like the best place to look to approximate the virtual address size of the process at the time of the dump. This will not account for everything, especially if coredump_filter is the default value (and even if it has all flags set).

A GDB python script may be used to sum all of these address ranges: https://raw.githubusercontent.com/kgibm/problemdetermination/master/scripts/gdb/gdbinfofiles.py
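
As an illustration of the idea (not the linked script itself), the following Python sketch sums the address ranges in the "Local core dump file" stanza from saved info files output. The function name is hypothetical, and the stanza headings are assumed to match the output shown above plus a "Local exec file:" heading for the executable's sections.

```python
import re

# Matches lines such as: 0x00007f3498000000 - 0x00007f34a0000000 is load51
RANGE = re.compile(r"(0x[0-9a-fA-F]+) - (0x[0-9a-fA-F]+) is (\S+)")

def sum_core_sections(info_files_text):
    """Sum the sizes in bytes of the ranges listed under the core dump stanza."""
    total = 0
    in_core = False
    for line in info_files_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Local core dump file:"):
            in_core = True
            continue
        if stripped.startswith("Local exec file:"):
            in_core = False
            continue
        match = RANGE.search(stripped)
        if in_core and match:
            total += int(match.group(2), 16) - int(match.group(1), 16)
    return total
```

The load51 range shown above contributes 134217728 bytes (128MB) to the total.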

Debug a Running Process

You may attach gdb to a running process:

$ gdb ${PATH_TO_EXECUTABLE} ${PID}

This may be useful to set breakpoints. For example, to break on a SIGABRT signal:

(gdb) handle all nostop noprint noignore
(gdb) handle SIGABRT stop print noignore
(gdb) continue

# ... Reproduce the problem ...

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7f232df12700 (LWP 23949)]
0x00000033a400d720 in sem_wait () from /lib64/libpthread.so.0
(gdb) ptype $_siginfo
type = struct {
    int si_signo;
    int si_errno;
    int si_code;
    union {
        int _pad[28];
        struct {...} _kill;...
    } _sifields;
}
(gdb) ptype $_siginfo._sifields._kill
type = struct {
    __pid_t si_pid;
    __uid_t si_uid;
}
(gdb) p $_siginfo._sifields._kill.si_pid
$1 = 22691

(gdb) continue

Next we can search for this PID 22691 and we'll find out who it is (in the following example, we see bash and the user name). If the PID is gone, then it is presumably some sort of script that already finished (you could create a background process that writes ps output to a file periodically to capture this):

$ ps -elf | grep 22691 | grep -v grep
0 S kevin    22691 20866  0  80   0 - 27657 wait   08:16 pts/2    00:00:00 bash

Strictly speaking, you must first consult the signal number to know which union member to print above in $_siginfo._sifields._kill: https://www.kernel.org/doc/man-pages/online/pages/man2/sigaction.2.html

Process glibc malloc free lists

Use the following Python script to sum the total size of free malloc chunks on glibc malloc free lists in a core dump:

# glibc_malloc_info.py is a gdb automation script to count the total size of free malloc chunks in all arenas.
# It requires gdb compiled with Python scripting support, a core dump, the process from which the core dump came,
# and glibc symbols (e.g. glibc-debuginfo) that match those used by that process.
# Background: https://sourceware.org/glibc/wiki/MallocInternals
#
# usage:
# Basic: CORE=path_to_core EXE=path_to_process gdb --batch --command glibc_malloc_info.py
# J9 jextract: JEXTRACTED=true gdb --batch --command glibc_malloc_info.py
# glibc symbols extracted into current directory: JEXTRACTED=true CWDSYMBOLS=true gdb --batch --command glibc_malloc_info.py
# 
# Example output:
#
# Total malloced: 25948160
# Total malloced not through mmap: 6696960
# Total malloced through mmap: 19251200
# mmap threshold: 2170880
# Number of arenas: 32
# Processing arena 1 @ 0x155190f4b9e0
# [0]: binaddr: 0x155190f4ba50
# [0]: chunkaddr: 0x155157cfe050
# [0,0]: size: 39952
# [0]: total free in bin: 160784, num: 10, max: 39952, avg: 16078
# [...]
# Total malloced: 25948160
# Total free: 12915264

import os
import gdb  # provided by gdb's embedded Python interpreter; imported explicitly for clarity

if os.environ.get("JEXTRACTED") == "true":
  gdb.execute("set solib-absolute-prefix " + os.getcwd() + "/")
  gdb.execute("set solib-search-path " + os.getcwd() + "/")

if os.environ.get("CWDSYMBOLS") == "true":
  print("Setting debug-file-directory to " + os.getcwd() + "/usr/lib/debug/")
  gdb.execute("set debug-file-directory " + os.getcwd() + "/usr/lib/debug/")

exe = os.environ.get("EXE")
core = os.environ.get("CORE")
for root, dirnames, filenames in os.walk('.'):
  for filename in filenames:
    fullpath = root + "/" + filename
    if core is None and filename.startswith("core") and not filename.endswith("zip") and not filename.endswith(".gz") and not filename.endswith(".tar"):
      print("Found core file: " + fullpath)
      core = fullpath
    elif exe is None and filename == "java":
      print("Found executable: " + fullpath)
      exe = fullpath

if exe is None:
  raise Exception("Could not find executable in current working directory")
if core is None:
  raise Exception("Could not find corefile in current working directory")

gdb.execute("file " + exe)
gdb.execute("core-file " + core)
gdb.execute("set pagination off")

OPTION_DEBUG = os.environ.get("DEBUG") == "true"
OPTION_BIN_ADDR_OFFSET = 16 # see offsetof in bin_at in malloc.c
OPTION_BIN_ADDR_OFFSET2 = 24
OPTION_BIN_COUNT = 253 # see struct malloc_state in malloc.c or `ptype struct malloc_state`-1
OPTION_FASTBIN_COUNT = 9 # see struct malloc_state in malloc.c or `ptype struct malloc_state`-1
OPTION_START = 0 # For debug

def value_to_addr(v):
  return clean_addr(v.address)

def clean_addr(addr):
  addr = str(addr)
  x = addr.find(" ")
  if x != -1:
    addr = addr[0:x]
  return addr

def process_bins(bins, count):
  total_free = 0
  for i in range(OPTION_START, count):
    iteration_free = 0
    binaddr = value_to_addr(bins[i])
    binaddrHexstring = hex(int(binaddr, 16))
    print("[" + str(i) + "]: binaddr: " + binaddr)
    chunkaddr = value_to_addr(bins[i].dereference())
    print("[" + str(i) + "]: chunkaddr: " + str(chunkaddr))

    if chunkaddr == "0x0":
      continue

    fwd = gdb.parse_and_eval("((struct malloc_chunk *)" + chunkaddr + ")->fd")
    if OPTION_DEBUG:
      print("[" + str(i) + "]: fwd: " + str(fwd))

    x = 0
    max = 0

    # If the address of the first chunk equals the fd pointer, then it's an unused bin. See malloc_init_state
    if binaddr != str(fwd):
      firstaddr = chunkaddr
      firstiteration = True
      while chunkaddr != "0x0" and (firstiteration or (str(chunkaddr) != str(firstaddr))):
        if OPTION_DEBUG:
          print("[" + str(i) + "," + str(x) + "]: chunkaddr: " + chunkaddr)

        checkchunk = hex(int(chunkaddr, 16) + OPTION_BIN_ADDR_OFFSET)
        checkchunk2 = hex(int(chunkaddr, 16) + OPTION_BIN_ADDR_OFFSET2)
        if OPTION_DEBUG:
          print("[" + str(i) + "," + str(x) + "]: checkchunk : " + checkchunk)
          print("[" + str(i) + "," + str(x) + "]: checkchunk2: " + checkchunk2)
        if checkchunk == binaddrHexstring or checkchunk2 == binaddrHexstring:
          break

        try:
          size = gdb.parse_and_eval("((struct malloc_chunk *)" + chunkaddr + ")->size & ~0x7")
        except gdb.error:
          size = gdb.parse_and_eval("((struct malloc_chunk *)" + chunkaddr + ")->mchunk_size & ~0x7")

        if size > max:
          max = size

        if OPTION_DEBUG or firstiteration:
          print("[" + str(i) + "," + str(x) + "]: size: " + str(size))
        total_free = total_free + size
        iteration_free = iteration_free + size

        chunkaddr = value_to_addr(gdb.parse_and_eval("((struct malloc_chunk *)" + chunkaddr + ")->fd").dereference())

        x = x + 1
        firstiteration = False

    if x > 0:
      print("[" + str(i) + "]: total free in bin: " + str(iteration_free) + ", num: " + str(x) + ", max: " + str(max) + ", avg: " + str(iteration_free / x))
    else:
      print("[" + str(i) + "]: total free in bin: " + str(iteration_free))

  return total_free

total_malloced = gdb.parse_and_eval("mp_.mmapped_mem") + gdb.parse_and_eval("main_arena.system_mem")

print("Total malloced: " + str(total_malloced))
print("Total malloced not through mmap: " + str(gdb.parse_and_eval("main_arena.system_mem")))
print("Total malloced through mmap: " + str(gdb.parse_and_eval("mp_.mmapped_mem")))
print("mmap threshold: " + str(gdb.parse_and_eval("mp_.mmap_threshold")))
print("Number of arenas: " + str(gdb.parse_and_eval("narenas")))

total_free = 0

arena = value_to_addr(gdb.parse_and_eval("main_arena"))
main_arena = arena
process_arena = True
arena_count = 1

while process_arena:
  print("Processing arena " + str(arena_count) + " @ " + str(arena))
  total_free = total_free + process_bins(gdb.parse_and_eval("((struct malloc_state *)" + str(arena) + ")->bins"), OPTION_BIN_COUNT)
  total_free = total_free + process_bins(gdb.parse_and_eval("((struct malloc_state *)" + str(arena) + ")->fastbinsY"), OPTION_FASTBIN_COUNT)
  arena = clean_addr(gdb.parse_and_eval("((struct malloc_state *)" + str(arena) + ")->next"))
  print("Next arena: " + str(arena))
  process_arena = str(arena) != str(main_arena)
  arena_count = arena_count + 1

print("")
print("Total malloced: " + str(total_malloced))
print("Total free: " + str(total_free))

# Example manual analysis:
#
# There was a question about how to determine glibc malloc free chunks:
# Install glibc-debuginfo and load the core dump in gdb. First, we see that the total outstanding malloc'ed at the time of the dump is about 4GB:
# (gdb) print main_arena.system_mem + mp_.mmapped_mem
# $1 = 4338761728
# Start with the starting chunk for a bin:
# (gdb) print main_arena.bins[250]
# $2 = (mchunkptr) 0x7f24c6232000
# Cast to a malloc_chunk, get the size (or mchunk_size, depending on version), and clear the low three flag bits with & ~0x7 to see that it's about 1MB (see https://sourceware.org/glibc/wiki/MallocInternals):
# (gdb) print ((struct malloc_chunk *)0x7f24c6232000)->size & ~0x7
# $3 = 1048545
# Then follow the linked list through the fd (forward) pointer:
# (gdb) print ((struct malloc_chunk *)0x7f24c6232000)->fd
# $4 = (struct malloc_chunk *) 0x7f24a7632000
# Continuing:
# (gdb) print ((struct malloc_chunk *)0x7f24a7632000)->fd
# $5 = (struct malloc_chunk *) 0x7f24a3cfb000
# You know you're at the end of the list when the forward pointer plus either 16 or 24 equals the address of the bin:
# (gdb) print &(main_arena.bins[250])
# $6 = (mchunkptr *) 0x7f2525c97f98 <main_arena+2104>
# After about 3,345 of these chunks, you'll see:
# (gdb) print ((struct malloc_chunk *)0x7f24bcb723d0)->fd
# $7 = (struct malloc_chunk *) 0x7f2525c97f88 <main_arena+2088>
# (gdb) print/x 0x7f2525c97f88 + 16
# $8 = 0x7f2525c97f98
# Therefore, 3,345 * ~1MB = ~3.3GB.
# In this case, the workaround is export MALLOC_MMAP_THRESHOLD_=1000000. This fixes the glibc malloc mmap threshold so
# that when someone calls malloc requesting more than that many bytes, malloc actually calls mmap to allocate and then
# when free is called, it calls munmap so that the memory goes back to the OS instead of getting added to the malloc
# free lists.

gcore

Although it's preferable to use Java's built-in methods of requesting a core dump, gcore may also be used. It attaches gdb to the process, dumps most of the memory segments into a core file, and then allows the process to continue:

usage: gcore [-o filename] pid

Here is a shell script to help capture more details:

#!/bin/sh

# This script automates taking a core dump using gcore.
# It also updates coredump_filter to maximize core dump contents.
# Usage: ./ibmgcore.sh PID [SEQ] [PREFIX]
#        PID - The process ID. You must have permissions (owner or sudo/root).
#        SEQ - Optional sequence number. Defaults to 1.
#        PREFIX - Optional prefix (e.g. directory and file name). Defaults to ./

PID=$1
SEQ=$2
PREFIX=$3
if [ -z "$PREFIX" ]; then
  PREFIX="./"
fi
if [ -z "$SEQ" ]; then
  SEQ=1
fi
DT=`date +%Y%m%d.%H%M%S`
LOG="${PREFIX}core.${DT}.$PID.000$SEQ.dmp.log.txt"
COREFILE="${PREFIX}core.${DT}.$PID.000$SEQ.dmp"
echo 0x7f > /proc/$PID/coredump_filter
date > ${LOG}
echo $PID >> ${LOG} 2>&1
cat /proc/$PID/coredump_filter >> ${LOG} 2>&1
echo "maps" >> ${LOG} 2>&1
cat /proc/$PID/maps >> ${LOG} 2>&1
echo "smaps" >> ${LOG} 2>&1
cat /proc/$PID/smaps >> ${LOG} 2>&1
echo "limits" >> ${LOG} 2>&1
cat /proc/$PID/limits >> ${LOG} 2>&1
echo "gcore start" >> ${LOG} 2>&1
date >> ${LOG}
gcore -o $COREFILE $PID >> ${LOG} 2>&1
echo "gcore finish" >> ${LOG} 2>&1
date >> ${LOG}
echo "Gcore complete. Now renaming. This may take a few moments, but your process has now continued running."
# gcore adds the PID to the end of the file, so just remove that
mv $COREFILE.$PID $COREFILE
date >> ${LOG}
echo "Completely finished." >> ${LOG} 2>&1
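The script above writes 0x7f to coredump_filter. As a rough illustration of what that bitmask selects, the following sketch decodes a filter value using the bit meanings documented in the core(5) manual page. The function name decode_coredump_filter is hypothetical and only for illustration.

```python
# Bit meanings as documented in core(5); bits 7 and 8 (DAX pages) require Linux 4.4+.
COREDUMP_FILTER_BITS = {
    0: "anonymous private mappings",
    1: "anonymous shared mappings",
    2: "file-backed private mappings",
    3: "file-backed shared mappings",
    4: "ELF headers",
    5: "private huge pages",
    6: "shared huge pages",
    7: "private DAX pages",
    8: "shared DAX pages",
}


def decode_coredump_filter(value):
    """Return the descriptions of the bits set in a coredump_filter value."""
    return [name for bit, name in sorted(COREDUMP_FILTER_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # 0x7f is the value written by the gcore helper script.
    for description in decode_coredump_filter(0x7F):
        print(description)
```

The current value for a process can be read from /proc/$PID/coredump_filter and passed in as int(text, 16).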

Shared Libraries

Check if a shared library is stripped of symbols:

$ file $LIBRARY.so

Check the output for "stripped" or "not stripped."

glibc

malloc

The default Linux native memory allocator on most distributions is Glibc malloc (which is based on ptmalloc and dlmalloc). Glibc malloc either allocates like a classic heap allocator (from sbrk or mmap'ed arenas) or directly using mmap, depending on a sliding threshold (M_MMAP_THRESHOLD). In the former case, the basic idea of a heap allocator is to request a large block of memory from the operating system and dole out chunks of it to the program. When the program frees these chunks, the memory is not returned to the operating system, but instead is saved for future allocations. This generally improves performance by avoiding operating system overhead, including system call time. Techniques such as binning allow the allocator to quickly find a "right sized" chunk for a new memory request.

The major downside of all heap allocators is fragmentation (compaction is not possible because pointer addresses in the program could not be changed). While heap allocators can coalesce adjacent free chunks, program allocation patterns, malloc configuration, and malloc heap allocator design limitations mean that there are likely to be free chunks of memory that are unlikely to be used in the future. These free chunks are essentially "wasted" space, yet from the operating system point of view, they are still active virtual memory requests ("held" by glibc malloc instead of by the program directly). If no free chunk is available for a new allocation, then the heap must grow to satisfy it.

In the worst case, with certain allocation patterns and enough time, resident memory will grow unbounded. Unlike certain Java garbage collectors, glibc malloc does not have a feature of heap compaction. Glibc malloc does have a feature of trimming (M_TRIM_THRESHOLD); however, this only occurs with contiguous free space at the top of a heap, which is unlikely when a heap is fragmented.

Starting with glibc 2.10 (for example, RHEL 6), the default behavior was changed to be less memory efficient but more performant by creating per-thread arenas to reduce cross-thread malloc contention:

Red Hat Enterprise Linux 6 features version 2.11 of glibc, providing many features and enhancements, including... An enhanced dynamic memory allocation (malloc) behaviour enabling higher scalability across many sockets and cores. This is achieved by assigning threads their own memory pools and by avoiding locking in some situations. The amount of additional memory used for the memory pools (if any) can be controlled using the environment variables MALLOC_ARENA_TEST and MALLOC_ARENA_MAX. MALLOC_ARENA_TEST specifies that a test for the number of cores is performed once the number of memory pools reaches this value. MALLOC_ARENA_MAX sets the maximum number of memory pools used, regardless of the number of cores. (https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.0_Release_Notes/compiler.html)

After a certain number of arenas have already been created (2 on 32-bit and 8 on 64-bit, or the value explicitly set through the environment variable MALLOC_ARENA_TEST), the maximum number of arenas will be set to NUMBER_OF_CPU_CORES multiplied by 2 for 32-bit and 8 for 64-bit. A thread will be assigned to a particular arena. These arenas will also increase virtual memory usage compared to a single heap; however, virtual memory increase by itself is not an issue. There is some evidence that certain workloads combined with these per-thread arenas may cause additional fragmentation, which could be an issue. This behavior may be reverted with the environment variable MALLOC_ARENA_MAX=1.
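As a small worked example of the arena limit described above, the following sketch computes the limit from a core count. It only restates the multipliers given in the text (2 for 32-bit, 8 for 64-bit); the function name default_arena_max is hypothetical.

```python
import os


def default_arena_max(cpu_cores, is_64bit=True):
    """Arena limit as described above: cores * 8 on 64-bit, cores * 2 on 32-bit."""
    return cpu_cores * (8 if is_64bit else 2)


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print("Cores:", cores)
    print("Arena limit (64-bit):", default_arena_max(cores))
    print("Arena limit (32-bit):", default_arena_max(cores, is_64bit=False))
```

On a machine with many cores, this shows why the number of arenas (and the virtual memory they reserve) can be much larger than with a single heap.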

Glibc malloc does not make it easy to tell if fragmentation is the cause of process size growth, versus program demands or a leak. The malloc_stats function can be called in the running process to print free statistics to stderr. It wouldn't be too hard to write a JVMTI shared library which called this function through a static method or MBean (and this could even be loaded dynamically through Java Surgery). More commonly, you'll have a core dump (whether manually taken or from a crash), and the malloc structures don't track total free space in each arena, so the only way would be to write a gdb python script that walks the arenas and memory chunks and calculates free space (in the same way as malloc_stats). Both of these techniques, while not terribly difficult, are not currently available. In general, native heap fragmentation in Java programs is much less likely than native memory program demands or a leak, so I always investigate those first (using techniques described elsewhere).

If you have determined that native heap fragmentation is causing unbounded process size growth, then you have a few options. First, you can change the application by reducing its native memory demands. Second, you can tune glibc malloc to immediately free certain sized allocations back to the operating system. As discussed above, if the requested size of a malloc is greater than M_MMAP_THRESHOLD, then the allocation skips the heaps and is directly allocated from the operating system using mmap. When the program frees this allocation, the chunk is un-mmap'ed and thus given back to the operating system. Beyond the additional cost of system calls and the operating system needing to allocate and free these chunks, mmap has additional costs because it must be zero-filled by the operating system, and it must be sized to the boundary of the page size (e.g. 4KB). This can cause worse performance and more memory waste (ceteris paribus).

If you decide to change the mmap threshold, the first step is to determine the allocation pattern. This can be done through tools such as ltrace (on malloc) or SystemTap, or if you know what is causing most of the allocations (e.g. Java DirectByteBuffers), then you can trace just those allocations. Next, create a histogram of these sizes and choose a threshold just under the smallest yet most frequent allocation. For example, let's say you've found that most allocations are larger than 8KB. In this case, you can set the threshold to 8192:

MALLOC_MMAP_THRESHOLD_=8192
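The histogram step can be sketched as follows, assuming the requested sizes have already been extracted from a trace into a list of integers. The power-of-two bucketing and the function name size_histogram are illustrative choices, not something prescribed by glibc or the tracing tools.

```python
from collections import Counter


def size_histogram(sizes):
    """Count allocations per power-of-two bucket, keyed by the bucket's lower bound."""
    histogram = Counter()
    for size in sizes:
        if size > 0:
            # Largest power of two that is <= size.
            histogram[1 << (size.bit_length() - 1)] += 1
    return dict(sorted(histogram.items()))


if __name__ == "__main__":
    # Hypothetical request sizes in bytes, e.g. extracted from a malloc trace.
    sample = [24, 64, 8192, 8200, 9000, 16384, 12000, 100, 8192]
    for lower_bound, count in size_histogram(sample).items():
        print(f"{lower_bound:>8} <= size < {lower_bound * 2:<8} {count}")
```

The bucket with the highest count suggests where most allocations fall; the threshold would then be chosen just under the smallest frequently requested size.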

Additionally, glibc malloc has a limit on the number of direct mmaps that it will make, which is 65536 by default. With a smaller threshold and many allocations, this may need to be increased. You can set this to something like 5 million:

MALLOC_MMAP_MAX_=5000000

These are set as environment variables in each Java process. Note that there is a trailing underscore on these variable names.
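If the process is launched from a wrapper rather than a shell profile, the variables can be added to the child environment programmatically. This is a minimal sketch; the helper name malloc_tuning_env is hypothetical, and the values are the examples from the text.

```python
import os


def malloc_tuning_env(mmap_threshold, mmap_max, base_env=None):
    """Return a copy of the environment with the glibc malloc tunables set.

    Note the trailing underscore on both variable names.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MALLOC_MMAP_THRESHOLD_"] = str(mmap_threshold)
    env["MALLOC_MMAP_MAX_"] = str(mmap_max)
    return env


if __name__ == "__main__":
    env = malloc_tuning_env(8192, 5000000)
    print(env["MALLOC_MMAP_THRESHOLD_"], env["MALLOC_MMAP_MAX_"])
    # The dictionary could then be passed as env= to subprocess.run() when starting the process.
```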

You can verify these settings and the number and total size of mmaps using a core dump, gdb, and glibc symbols:

(gdb) p mp_
$1 = {trim_threshold = 131072, top_pad = 131072, mmap_threshold = 4096,
      arena_test = 0, arena_max = 1, n_mmaps = 1907812, n_mmaps_max = 5000000,
      max_n_mmaps = 2093622, no_dyn_threshold = 1, pagesize = 4096,
      mmapped_mem = 15744507904, max_mmapped_mem = 17279684608, max_total_mem = 0,
      sbrk_base = 0x1e1a000 ""}

In this example, the threshold was set to 4KB (mmap_threshold), there are about 1.9 million active mmaps (n_mmaps), the maximum number is 5 million (n_mmaps_max), and the total amount of memory currently mmap'ed is about 14.7GiB (mmapped_mem).
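The figures in the mp_ output can be sanity-checked with a little arithmetic. The sketch below converts mmapped_mem to GiB and derives the average size per active mmap; the helper names are hypothetical and the input values are copied from the gdb output above.

```python
# Values copied from the mp_ output above.
N_MMAPS = 1907812
MMAPPED_MEM = 15744507904


def to_gib(num_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


def average_mmap_size(total_bytes, count):
    """Average bytes per active mmap, or 0 if there are none."""
    return total_bytes / count if count else 0


if __name__ == "__main__":
    print(f"mmapped_mem: {to_gib(MMAPPED_MEM):.2f} GiB")
    print(f"average mmap size: {average_mmap_size(MMAPPED_MEM, N_MMAPS):.0f} bytes")
```

An average close to a small multiple of the page size is what one would expect when many allocations just above the threshold are each rounded up to page boundaries.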

There is also some evidence that the number of arenas can contribute to fragmentation.

Investigating core dumps

Notes:

  1. The kernel does not dump everything from the running process (even if 0x7f is set in coredump_filter).
  2. Testing has shown that gcore dumps less memory content than when the kernel dumps a core during crash processing.

Ideas for dealing with fragmentation:
  1. Reduce the number and/or frequency of direct or indirect mallocs.
  2. Create thread local caches of whatever is using the mallocs (e.g. DirectByteBuffers). Set the minimum size of the thread pool doing this equal to the maximum size to avoid thread local destruction.
  3. Experiment with a limited MALLOC_ARENA_MAX (try 1 to see if there's any effect at first).
  4. Experiment with MALLOC_MMAP_THRESHOLD_ and MALLOC_MMAP_MAX_, carefully monitoring the performance difference.
  5. Batch the frees together (e.g. with a stop-the-world type of mechanism) to increase the probability of free chunks coalescing.
  6. Experiment with M_MXFAST
  7. malloc_trim is rarely useful outside of academic scenarios. It only trims from the top of the main arena. First, most fragmentation is within an arena not at the top, and second, most programs heavily (and even predominantly) use the non-main arenas.

How much is malloc'ed?

Add mp_.mmapped_mem plus system_mem for each arena starting at main_arena and following the next pointer until next==&main_arena

(gdb) p mp_.mmapped_mem
$1 = 0
(gdb) p &main_arena
$2 = (struct malloc_state *) 0x3c95b8ee80
(gdb) p main_arena.system_mem
$3 = 413696
(gdb) p main_arena.next
$4 = (struct malloc_state *) 0x3c95b8ee80
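The summation described above can be sketched as a walk over a circular list. Here the arenas are modeled as a plain dictionary of values read by hand from gdb (address to system_mem and next pointer); this structure and the function name total_malloced are hypothetical stand-ins, not a gdb API.

```python
def total_malloced(mmapped_mem, arenas, main_arena_addr):
    """Sum mmapped_mem and each arena's system_mem, following next until it wraps.

    arenas maps an arena address to a (system_mem, next_address) tuple.
    """
    total = mmapped_mem
    addr = main_arena_addr
    while True:
        system_mem, next_addr = arenas[addr]
        total += system_mem
        if next_addr == main_arena_addr:
            return total
        addr = next_addr


if __name__ == "__main__":
    # Values from the gdb session above: a single arena whose next points to itself.
    arenas = {0x3C95B8EE80: (413696, 0x3C95B8EE80)}
    print(total_malloced(0, arenas, 0x3C95B8EE80))
```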
Exploring Arenas

glibc provides malloc statistics at runtime through a few methods: mallinfo, malloc_info, and malloc_stats. mallinfo is old and not designed for 64-bit and malloc_info is the new version which returns an XML blob of information. malloc_stats doesn't return anything, but instead prints out total statistics to stderr (http://udrepper.livejournal.com/20948.html).
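To see what malloc_stats prints, it can be called from any process linked against glibc. The following sketch calls it from Python through ctypes, so the statistics describe the Python interpreter itself. Loading the symbols with ctypes.CDLL(None) is an assumption that holds on typical Linux/glibc systems; the wrapper returns False where the function is unavailable.

```python
import ctypes


def print_malloc_stats():
    """Call glibc's malloc_stats(), which prints to stderr; return False if unavailable."""
    try:
        libc = ctypes.CDLL(None)  # symbols already loaded into this process
        func = libc.malloc_stats
    except (OSError, AttributeError, TypeError):
        return False
    func.argtypes = []
    func.restype = None  # malloc_stats returns void
    func()
    return True


if __name__ == "__main__":
    if not print_malloc_stats():
        print("malloc_stats is not available in this C library")
```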

malloc trace

Malloc supports rudimentary allocation tracing: http://www.gnu.org/software/libc/manual/html_node/Tracing-malloc.html

strace

On how to use strace and ltrace to investigate mmap and malloc calls with callstacks, see the main Linux chapter.

Native Memory Leaks

eBPF

eBPF is an in-kernel virtual machine, available on Linux kernel versions >= 4.1, that runs programs which access kernel information. eBPF is fully supported starting with, for example, RHEL 8.

Install

  • Modern Fedora/RHEL/CentOS/ubi/ubi-init:
    dnf install -y kernel-devel bcc-tools bpftool
    Then add to PATH:
    export PATH=/usr/share/bcc/tools/:${PATH}

Alternatively, manually install:

  1. Install dependencies: https://github.com/iovisor/bcc/blob/master/INSTALL.md#packages
  2. Clone bcc tools:
    git clone https://github.com/iovisor/bcc

bpftool

List running eBPF programs

bpftool prog list

Tracking native memory leaks

Periodically dump any stacks that do not have matching frees:

  1. Run the memleak script, specifying the process to watch, the interval in seconds, and, optionally, the number of iterations:
    memleak.py -p $PID 30 10 > memleak_$PID.txt
  2. Analyze the stack output. For example:
    [19:41:11] Top 10 stacks with outstanding allocations:
    225144 bytes in 159 allocations from stack
    func1+0x16 [process]
    main+0x81 [process]
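When the output file grows large, a short script can help rank the stacks. This is a minimal sketch that assumes the output looks like the example above (a header line of the form "N bytes in M allocations from stack" followed by frame lines); the function name parse_memleak is hypothetical, and real output may need a looser pattern.

```python
import re

# Matches header lines such as "225144 bytes in 159 allocations from stack".
HEADER = re.compile(r"^\s*(\d+) bytes in (\d+) allocations from stack")


def parse_memleak(text):
    """Return (bytes, allocations, frames) tuples, largest outstanding bytes first."""
    stacks = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            stacks.append((int(match.group(1)), int(match.group(2)), []))
        elif stacks and line.strip() and not line.lstrip().startswith("["):
            # A frame line belonging to the most recent header.
            stacks[-1][2].append(line.strip())
    return sorted(stacks, key=lambda stack: stack[0], reverse=True)


if __name__ == "__main__":
    sample = """[19:41:11] Top 10 stacks with outstanding allocations:
    225144 bytes in 159 allocations from stack
    func1+0x16 [process]
    main+0x81 [process]
"""
    for size, count, frames in parse_memleak(sample):
        print(size, count, " <- ".join(frames))
```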

LinuxNativeTracker

Recent versions of IBM Java include an optional feature to enable advanced native memory tracking: https://www.ibm.com/support/pages/ibm-java-linux-howto-tracking-native-memory-java-8-linux

HotSpot Java has -XX:NativeMemoryTracking

Debug Symbols

In general, it is recommended to compile all executables and libraries with debug symbols (-g):

GCC, the GNU C/C++ compiler, supports '-g' with or without '-O', making it possible to debug optimized code. We recommend that you always use '-g' whenever you compile a program.

Alternatively, symbols may be output into separate files and made available for download to support engineers: http://www.sourceware.org/gdb/current/onlinedocs/gdb/Separate-Debug-Files.html

See instructions for each distribution.

Frame pointer omission

Frame pointer omission (FPO) is a common compiler optimization that makes it more difficult for diagnostic tools to walk stack traces. When compiling with GCC, use -fno-omit-frame-pointer to ensure that frame pointers are not omitted so that backtraces are intact, and test its relative performance impact.

To check if an executable uses FPO, dump its assembly and check if there are instructions to copy the address of the stack pointer into the frame pointer. If there are no such instructions, then FPO is active. For example, on x86 (with objdump using AT&T syntax by default), you might search as follows:

$ objdump -d libzip.so | grep -e "mov.*%esp,.*%ebp" -e "mov.*%rsp,.*%rbp"

Note that some executables have a mix of FPO and no-FPO code, so the presence of such instructions alone may not be sufficient to conclude that frame pointers are kept everywhere.
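Because a mix is possible, counting the matches can be more informative than checking whether any exist. The sketch below applies the same patterns as the grep command to disassembly text; the function name count_frame_setups and the sample excerpt are hypothetical.

```python
import re

# Same patterns as the grep above: a copy of the stack pointer into the frame pointer (AT&T syntax).
FRAME_SETUP = re.compile(r"mov.*%esp,.*%ebp|mov.*%rsp,.*%rbp")


def count_frame_setups(disassembly):
    """Count disassembly lines that copy the stack pointer into the frame pointer."""
    return sum(1 for line in disassembly.splitlines() if FRAME_SETUP.search(line))


if __name__ == "__main__":
    # Hypothetical objdump -d excerpt for illustration.
    sample = (
        "  1139:\t55                   \tpush   %rbp\n"
        "  113a:\t48 89 e5             \tmov    %rsp,%rbp\n"
        "  113d:\t5d                   \tpop    %rbp\n"
    )
    print(count_frame_setups(sample))
```

Comparing this count with the number of functions in the disassembly gives a rough idea of how much of the binary keeps frame pointers.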

SystemTap (stap)

SystemTap is largely superseded by eBPF on newer kernels. However, it does still work. Examples:

ltrace equivalent: https://sourceware.org/git/?p=systemtap.git;a=blob;f=testsuite/systemtap.examples/process/ltrace.stp;h=151cdb545432b9001bf2416f098b097418d2ccff;hb=refs/heads/master

Network

On Linux, once a socket is listening, there are two queues: a SYN queue and an accept queue (controlled by the backlog passed to listen). Once the handshake is complete, a connection is put on the accept queue, if the current number of connections on the accept queue is less than the backlog. The backlog does not affect the SYN queue because if a SYN gets to the server when the accept queue is full, it is still possible that by the time the full handshake completes, the accept queue will have space. If the handshake completes and the accept queue is full, then the server's socket information is dropped but nothing sent to the client; when the client tries to send data, the server would send a RST. If syn cookies are enabled and the SYN queue reaches a high watermark, after the SYN/ACK is sent, the SYN is removed from the queue. When the ACK comes back, the SYN is rebuilt from the information in the ACK and then the handshake is completed.

Process CPU Deep Dive

  1. Create linuxstat.sh:
    #!/bin/bash
    outputfile="linuxstat_$(date +"%Y%m%d_%H%M%S").log"
    echo "linuxstat: $(date +"%Y%m%d %H%M%S %N %Z") : PIDs: ${*}" | tee -a ${outputfile}
    while true; do
      cat /proc/stat &>> ${outputfile}
      for PID in ${*}; do
        echo "linuxstat: $(date +"%Y%m%d %H%M%S %N %Z") : iteration for PID: ${PID}" | tee -a ${outputfile}
        cat /proc/${PID}/stat &>> ${outputfile}
        (for i in /proc/${PID}/task/*; do echo -en "$i="; cat $i/stat; done) &>> ${outputfile}
      done
      sleep 15
    done
  2. chmod +x linuxstat.sh
  3. Start:
    nohup ./linuxstat.sh PID1 PID2...
  4. Reproduce the problem
  5. Press Ctrl+C to stop linuxstat.sh and gather linuxstat*log
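To analyze the collected stat lines, the CPU counters of interest are utime and stime (fields 14 and 15 according to proc(5), measured in clock ticks). The following sketch extracts them and computes the delta between two samples; the function names and the sample lines are hypothetical.

```python
def parse_stat_cpu(stat_line):
    """Return (pid, comm, utime, stime) from one /proc/<pid>/stat line.

    The command name may contain spaces or parentheses, so split on the last ')'.
    """
    head, _, tail = stat_line.rpartition(")")
    pid, _, comm = head.partition(" (")
    fields = tail.split()  # fields[0] is field 3 (state)
    return int(pid), comm, int(fields[11]), int(fields[12])


def cpu_ticks_delta(before, after):
    """Total user+system ticks consumed between two samples of the same task."""
    return (after[2] + after[3]) - (before[2] + before[3])


if __name__ == "__main__":
    # Hypothetical samples of the same process taken 15 seconds apart.
    first = "1234 (java) S 1 1234 1234 0 -1 4194560 100 0 0 0 500 200 0 0 20 0 50 0 1000 0 0"
    second = "1234 (java) S 1 1234 1234 0 -1 4194560 100 0 0 0 1700 500 0 0 20 0 50 0 1000 0 0"
    print(cpu_ticks_delta(parse_stat_cpu(first), parse_stat_cpu(second)))
```

Dividing the delta by the clock tick rate (getconf CLK_TCK) and the sampling interval gives the approximate number of CPUs' worth of time used by that task.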

Hung Processes

Gather and review (particularly the output of each kernel stack in /stack):

PID=$1
outputfile="linuxhang_$(date +"%Y%m%d_%H%M%S").log"
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : status" | tee -a ${outputfile}
cat /proc/${PID}/status &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : sched" | tee -a ${outputfile}
cat /proc/${PID}/sched &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : schedstat" | tee -a ${outputfile}
cat /proc/${PID}/schedstat &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : syscall" | tee -a ${outputfile}
cat /proc/${PID}/syscall &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : wchan" | tee -a ${outputfile}
echo -en "/proc/${PID}/wchan=" &>> ${outputfile}
cat /proc/${PID}/wchan &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : task wchan" | tee -a ${outputfile}
(for i in /proc/${PID}/task/*; do echo -en "$i="; cat $i/wchan; echo ""; done) &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : stack" | tee -a ${outputfile}
echo -en "/proc/${PID}/stack=" &>> ${outputfile}
cat /proc/${PID}/stack &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : task stack" | tee -a ${outputfile}
(for i in /proc/${PID}/task/*; do echo -en "$i="; cat $i/stack; echo ""; done) &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : syscall" | tee -a ${outputfile}
echo -en "/proc/${PID}/syscall=" &>> ${outputfile}
cat /proc/${PID}/syscall &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : task syscall" | tee -a ${outputfile}
(for i in /proc/${PID}/task/*; do echo -en "$i="; cat $i/syscall; echo ""; done) &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : task sched" | tee -a ${outputfile}
(for i in /proc/${PID}/task/*; do echo -en "$i="; cat $i/sched; done) &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : task status" | tee -a ${outputfile}
(for i in /proc/${PID}/task/*; do echo -en "$i="; cat $i/status; done) &>> ${outputfile}
echo "Wrote to ${outputfile}"

Check whether the number of context switches (nr_switches) is increasing:

PID=8939; PROC=sched; for i in /proc/${PID} /proc/${PID}/task/*; do echo -en "$i/${PROC}="; echo ""; cat $i/${PROC}; echo ""; done | grep -e ${PROC}= -e nr_switches
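The same comparison can be automated across two samples. This is a minimal sketch that assumes the sched file contains a line of the form "nr_switches : N" as matched by the grep above; the function names are hypothetical.

```python
import re


def nr_switches(sched_text):
    """Extract the nr_switches counter from the contents of a /proc/<pid>/sched file."""
    match = re.search(r"^nr_switches\s*:\s*(\d+)", sched_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def switches_increased(before_text, after_text):
    """True if nr_switches grew between two samples of the same task."""
    before, after = nr_switches(before_text), nr_switches(after_text)
    return before is not None and after is not None and after > before


if __name__ == "__main__":
    # Hypothetical excerpts of two samples of the same thread.
    first = "se.exec_start : 1.0\nnr_switches : 42\nnr_voluntary_switches : 40\n"
    second = "se.exec_start : 2.0\nnr_switches : 45\nnr_voluntary_switches : 43\n"
    print(switches_increased(first, second))
```

A thread whose counter does not move between samples has not been scheduled in that interval.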

A simpler script:

PID=...
date >> kernelstacks.txt
for i in /proc/${PID}/task/*; do
  echo -en "$i stack=" &>> kernelstacks.txt
  cat $i/stack &>> kernelstacks.txt
  echo "" &>> kernelstacks.txt
  echo -en "$i wchan=" &>> kernelstacks.txt
  cat $i/wchan &>> kernelstacks.txt
  echo "" &>> kernelstacks.txt
  echo -en "$i syscall=" &>> kernelstacks.txt
  cat $i/syscall &>> kernelstacks.txt
  echo "" &>> kernelstacks.txt
  echo -en "$i sched=" &>> kernelstacks.txt
  cat $i/sched &>> kernelstacks.txt
  echo "" &>> kernelstacks.txt
  echo -en "$i status=" &>> kernelstacks.txt
  cat $i/status &>> kernelstacks.txt
  echo "" &>> kernelstacks.txt
done

Kernel Dumps

crash /var/crash/<timestamp>/vmcore /usr/lib/debug/lib/modules/<kernel>/vmlinux

Note that the <kernel> version should be the same that was captured by kdump. To find out which kernel you are currently running, use the uname -r command.

To display the kernel message buffer, type the log command at the interactive prompt.

To display the kernel stack trace, type the bt command at the interactive prompt. You can use bt <pid> to display the backtrace of a single process.

To display status of processes in the system, type the ps command at the interactive prompt. You can use ps <pid> to display the status of a single process.

To display basic virtual memory information, type the vm command at the interactive prompt. You can use vm <pid> to display information on a single process.

To display information about open files, type the files command at the interactive prompt. You can use files <pid> to display files opened by only one selected process.

kernel object file: A vmlinux kernel object file, often referred to as the namelist in this document, which must have been built with the -g C flag so that it will contain the debug data required for symbolic debugging.

When using the fbt provider, it helps to run through the syscall once with all to see what the call stack is and then hone in.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/pdf/Kernel_Crash_Dump_Guide/Red_Hat_Enterprise_Linux-7-Kernel_Crash_Dump_Guide-en-US.pdf

Change root password

  1. Type e on the boot menu.
  2. Add rd.break enforcing=0 to the line that starts with linux (another option is systemd.debug-shell and hit Ctrl+Alt+F9)
  3. Continue booting: Ctrl+X
  4. Make the filesystem writable: mount -o remount,rw /sysroot
  5. Enter root filesystem: chroot /sysroot
  6. Change root password: passwd
  7. Continue booting: Ctrl+D

Fix non-working disk

  1. Type e on the boot menu.
  2. Add systemd.unit=emergency.target to the line that starts with linux
  3. Make all filesystems writeable: mount -o remount,rw /
  4. Try re-mount: mount -a
  5. Fix any errors in /etc/fstab
  6. Run systemctl daemon-reload
  7. Try re-mount: mount -a
  8. Continue booting: Ctrl+D

journald

Persist all logs

  1. Set Storage=persistent in /etc/systemd/journald.conf
  2. Run systemctl reload systemd-journald
  3. Logs in /var/log/journal

Battery Status

Example:

$ sudo acpi -V | grep ^Battery
Battery 0: Unknown, 79%
Battery 0: design capacity 1886 mAh, last full capacity 1002 mAh = 53%
Battery 1: Charging, 46%, 01:01:20 until charged
Battery 1: design capacity 6166 mAh, last full capacity 5567 mAh = 90%

Administration

Create New Superuser

  1. Create a user with a home directory: adduser -m $user
  2. Set the password for the new user: passwd $user
  3. Add the user to the superuser wheel group: usermod -a -G wheel $user

Basic Diagnostics

outputfile="linuxdiag_$(date +"%Y%m%d_%H%M%S").log"
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : uptime" | tee -a ${outputfile}
uptime &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : hostname" | tee -a ${outputfile}
hostname &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : w" | tee -a ${outputfile}
w &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : lscpu" | tee -a ${outputfile}
lscpu &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : dmesg" | tee -a ${outputfile}
(dmesg | tail -50) &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : df" | tee -a ${outputfile}
df -h &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : free" | tee -a ${outputfile}
free -m &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : ps memory" | tee -a ${outputfile}
ps -eo pid,vsz,rss,cmd &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : vmstat" | tee -a ${outputfile}
vmstat 1 5 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : top all" | tee -a ${outputfile}
top -b -d 2 -n 2 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : top threads" | tee -a ${outputfile}
top -b -H -d 2 -n 2 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : pidstat" | tee -a ${outputfile}
pidstat -d -h --human -l -r -u -v -w 2 2 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : iostat" | tee -a ${outputfile}
iostat -xm 1 5 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : ss summary" | tee -a ${outputfile}
ss --summary &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : ss all" | tee -a ${outputfile}
ss -amponet &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : nstat" | tee -a ${outputfile}
nstat -asz &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : sar network" | tee -a ${outputfile}
sar -n DEV 1 5 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : sar tcp" | tee -a ${outputfile}
sar -n TCP,ETCP 1 5 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : lnstat" | tee -a ${outputfile}
lnstat -c 1 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : sysctl" | tee -a ${outputfile}
sysctl -a &>> ${outputfile}
echo "Wrote to ${outputfile}"

Sending a kernel patch

  1. Review the documentation on submitting patches

  2. Find the repository of your target subsystem in the MAINTAINERS file. For example, for perf, the repository is:

    SCM: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core

  3. Clone this repository. For example, for perf, from above:

    git clone git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
  4. Checkout the appropriate branch from the SCM line in the MAINTAINERS file. For example, for perf, from above:

    git checkout perf/core
  5. Make your changes and commit them with a prefix of the subsystem. For example, for perf:

    git commit -sam "perf: DESCRIPTION OF CHANGES"
  6. Subscribe to the mailing list in the MAINTAINERS file. For example, for perf, it's linux-perf-users@vger.kernel.org and subscription instructions may be found at http://vger.kernel.org/vger-lists.html#linux-perf-users

  7. Send the patch to the mailing list. For example, for perf:

    git send-email --from "First Last <email@example.com>" --to linux-perf-users@vger.kernel.org --suppress-cc all -1
  8. After some time, validate the email was successfully sent to the mailing list by reviewing the archives. For example, for perf, see https://lore.kernel.org/linux-perf-users/

Error Codes (errno.h)

An errno code is used throughout Linux to detail errors when calling functions. Names and numeric values for a particular instance of Linux may be listed with errno -l from the moreutils package. Additional definitions are available in asm-generic/errno-base.h, asm-generic/errno.h, and linux/errno.h. Example list:

# errno -l | sort -n -k 2 | awk '{printf("%-16s %3s ", $1, $2); for (i=3;i<=NF;i++) printf(" %s", $i); printf("\n");}'
EPERM              1  Operation not permitted
ENOENT             2  No such file or directory
ESRCH              3  No such process
EINTR              4  Interrupted system call
EIO                5  Input/output error
ENXIO              6  No such device or address
E2BIG              7  Argument list too long
ENOEXEC            8  Exec format error
EBADF              9  Bad file descriptor
ECHILD            10  No child processes
EAGAIN            11  Resource temporarily unavailable
EWOULDBLOCK       11  Resource temporarily unavailable
ENOMEM            12  Cannot allocate memory
EACCES            13  Permission denied
EFAULT            14  Bad address
ENOTBLK           15  Block device required
EBUSY             16  Device or resource busy
EEXIST            17  File exists
EXDEV             18  Invalid cross-device link
ENODEV            19  No such device
ENOTDIR           20  Not a directory
EISDIR            21  Is a directory
EINVAL            22  Invalid argument
ENFILE            23  Too many open files in system
EMFILE            24  Too many open files
ENOTTY            25  Inappropriate ioctl for device
ETXTBSY           26  Text file busy
EFBIG             27  File too large
ENOSPC            28  No space left on device
ESPIPE            29  Illegal seek
EROFS             30  Read-only file system
EMLINK            31  Too many links
EPIPE             32  Broken pipe
EDOM              33  Numerical argument out of domain
ERANGE            34  Numerical result out of range
EDEADLK           35  Resource deadlock avoided
EDEADLOCK         35  Resource deadlock avoided
ENAMETOOLONG      36  File name too long
ENOLCK            37  No locks available
ENOSYS            38  Function not implemented
ENOTEMPTY         39  Directory not empty
ELOOP             40  Too many levels of symbolic links
ENOMSG            42  No message of desired type
EIDRM             43  Identifier removed
ECHRNG            44  Channel number out of range
EL2NSYNC          45  Level 2 not synchronized
EL3HLT            46  Level 3 halted
EL3RST            47  Level 3 reset
ELNRNG            48  Link number out of range
EUNATCH           49  Protocol driver not attached
ENOCSI            50  No CSI structure available
EL2HLT            51  Level 2 halted
EBADE             52  Invalid exchange
EBADR             53  Invalid request descriptor
EXFULL            54  Exchange full
ENOANO            55  No anode
EBADRQC           56  Invalid request code
EBADSLT           57  Invalid slot
EBFONT            59  Bad font file format
ENOSTR            60  Device not a stream
ENODATA           61  No data available
ETIME             62  Timer expired
ENOSR             63  Out of streams resources
ENONET            64  Machine is not on the network
ENOPKG            65  Package not installed
EREMOTE           66  Object is remote
ENOLINK           67  Link has been severed
EADV              68  Advertise error
ESRMNT            69  Srmount error
ECOMM             70  Communication error on send
EPROTO            71  Protocol error
EMULTIHOP         72  Multihop attempted
EDOTDOT           73  RFS specific error
EBADMSG           74  Bad message
EOVERFLOW         75  Value too large for defined data type
ENOTUNIQ          76  Name not unique on network
EBADFD            77  File descriptor in bad state
EREMCHG           78  Remote address changed
ELIBACC           79  Can not access a needed shared library
ELIBBAD           80  Accessing a corrupted shared library
ELIBSCN           81  .lib section in a.out corrupted
ELIBMAX           82  Attempting to link in too many shared libraries
ELIBEXEC          83  Cannot exec a shared library directly
EILSEQ            84  Invalid or incomplete multibyte or wide character
ERESTART          85  Interrupted system call should be restarted
ESTRPIPE          86  Streams pipe error
EUSERS            87  Too many users
ENOTSOCK          88  Socket operation on non-socket
EDESTADDRREQ      89  Destination address required
EMSGSIZE          90  Message too long
EPROTOTYPE        91  Protocol wrong type for socket
ENOPROTOOPT       92  Protocol not available
EPROTONOSUPPORT   93  Protocol not supported
ESOCKTNOSUPPORT   94  Socket type not supported
ENOTSUP           95  Operation not supported
EOPNOTSUPP        95  Operation not supported
EPFNOSUPPORT      96  Protocol family not supported
EAFNOSUPPORT      97  Address family not supported by protocol
EADDRINUSE        98  Address already in use
EADDRNOTAVAIL     99  Cannot assign requested address
ENETDOWN         100  Network is down
ENETUNREACH      101  Network is unreachable
ENETRESET        102  Network dropped connection on reset
ECONNABORTED     103  Software caused connection abort
ECONNRESET       104  Connection reset by peer
ENOBUFS          105  No buffer space available
EISCONN          106  Transport endpoint is already connected
ENOTCONN         107  Transport endpoint is not connected
ESHUTDOWN        108  Cannot send after transport endpoint shutdown
ETOOMANYREFS     109  Too many references: cannot splice
ETIMEDOUT        110  Connection timed out
ECONNREFUSED     111  Connection refused
EHOSTDOWN        112  Host is down
EHOSTUNREACH     113  No route to host
EALREADY         114  Operation already in progress
EINPROGRESS      115  Operation now in progress
ESTALE           116  Stale file handle
EUCLEAN          117  Structure needs cleaning
ENOTNAM          118  Not a XENIX named type file
ENAVAIL          119  No XENIX semaphores available
EISNAM           120  Is a named type file
EREMOTEIO        121  Remote I/O error
EDQUOT           122  Disk quota exceeded
ENOMEDIUM        123  No medium found
EMEDIUMTYPE      124  Wrong medium type
ECANCELED        125  Operation canceled
ENOKEY           126  Required key not available
EKEYEXPIRED      127  Key has expired
EKEYREVOKED      128  Key has been revoked
EKEYREJECTED     129  Key was rejected by service
EOWNERDEAD       130  Owner died
ENOTRECOVERABLE  131  State not recoverable
ERFKILL          132  Operation not possible due to RF-kill
EHWPOISON        133  Memory page has hardware error
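If the errno utility from moreutils is not installed, the same lookup is available from Python's standard errno and os modules. This is a minimal sketch; the helper name describe_errno is hypothetical, and the message text comes from the local C library.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this system."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    for code in (errno.ENOENT, errno.ENOMEM, errno.ECONNREFUSED):
        name, message = describe_errno(code)
        print(f"{name:<16} {code:>3}  {message}")
```

Where two names share one value (for example EAGAIN and EWOULDBLOCK), errno.errorcode reports only one of them.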

Sysrq Keys

Check if sysrq enabled

Show if sysrq is enabled (1):

$ sysctl kernel.sysrq
kernel.sysrq = 1

Enable sysrq

Enable:

  1. Method 1 (temporary):
    sysctl -w kernel.sysrq=1
  2. Method 2 (permanent):
    1. Add kernel.sysrq=1 to /etc/sysctl.conf
    2. Apply with sysctl -p

sysrq characters

Commonly used characters:

  • f: Run the OOM Killer. This will kill the process using the most RAM (even if it's not using much).
  • r: Take control of keyboard from X.
  • e: Send SIGTERM to all processes. Wait for graceful termination.
  • i: Send SIGKILL to all processes for forceful termination.
  • s: Sync disks.
  • u: Remount all filesystems as read-only.
  • b: Reboot.
  • g: Switch to the kernel console. Otherwise, switch to a console with, e.g. Ctrl+Alt+F3
  • l: Show backtrace of all CPUs.
  • 0-9: Change the kernel log level.
  • d: Display kernel locks.
  • m: Show memory information.
  • t: Show a list of all processes.
  • w: Show a list of blocked processes.
  • c: Perform a kernel crash.

A "controlled" reboot is often done with the character sequence reisub.
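To make the sequence easier to remember, the following sketch spells out what each character does, using the descriptions from the list above. The names SYSRQ_ACTIONS and explain_sequence are hypothetical, and the code only prints text; it does not trigger any sysrq action.

```python
# Descriptions taken from the list of commonly used characters above.
SYSRQ_ACTIONS = {
    "r": "Take control of keyboard from X",
    "e": "Send SIGTERM to all processes",
    "i": "Send SIGKILL to all processes",
    "s": "Sync disks",
    "u": "Remount all filesystems as read-only",
    "b": "Reboot",
}


def explain_sequence(sequence):
    """Return the action for each character of a sysrq sequence, in order."""
    return [SYSRQ_ACTIONS.get(char, "not described here") for char in sequence]


if __name__ == "__main__":
    for char, action in zip("reisub", explain_sequence("reisub")):
        print(f"{char}: {action}")
```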

Execute sysrq

Execute:

  1. Method 1 (using keyboard):
    • Ctrl + Alt + SysRq (usually PrintScreen) + $CHARACTER
    • All of these keys must be held down at the same time and then released
    • On some keyboards, this only works with the right-side Ctrl/Alt keys
    • On some keyboards, a function (Fn) key must be held for PrintScreen
  2. Method 2 (as root):
    echo $CHARACTER > /proc/sysrq-trigger
  3. Method 3 (with sudo):
    echo $CHARACTER | sudo tee /proc/sysrq-trigger