Troubleshooting Linux
General Troubleshooting Commands
Print system page size: getconf PAGESIZE
The ausyscall command converts a syscall number to the syscall name. Example:
$ ausyscall 221
fadvise64
Kernel symbol table
Gather the kernel symbol table:
$ sudo su -
$ cat /proc/kallsyms &> kallsyms_$(hostname)_$(date +"%Y%m%d_%H%M%S").txt
$ cat /boot/System.map-$(uname -r) &> systemmap_$(hostname)_$(date +"%Y%m%d_%H%M%S").txt
Upload kallsyms_*txt and systemmap_*txt
pgrep/pkill
pgrep finds process IDs based on various search options. It is a more formalized alternative to common pipelines such as ps -elf | grep something: https://www.kernel.org/doc/man-pages/online/pages/man1/pgrep.1.html
Examples:
- Search by simple program name: pgrep java
- Search by something in the full program name or full command line: pgrep -f server1
pidof is a similar program to pgrep: https://www.kernel.org/doc/man-pages/online/pages/man1/pidof.1.html
pkill combines pgrep and kill into one command: https://www.kernel.org/doc/man-pages/online/pages/man1/pkill.1.html
Examples:
- Send SIGQUIT to all Java programs: pkill -3 java
- Send SIGQUIT to all Java programs with server1 in the command line: pkill -3 -f server1
kill
The kill command is used to send a signal to a process or to terminate it:
kill $PID
Without arguments, the SIGTERM (15) signal is sent, which is equivalent to kill -15 $PID.
To specify a signal, use the number or name of the signal. For example, to send the equivalent of Ctrl+C to a process, use either of the following commands:
$ kill -2 $PID
$ kill -INT $PID
To list all available signals:
$ kill -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP
6) SIGABRT 7) SIGBUS 8) SIGFPE 9) SIGKILL 10) SIGUSR1
11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM
16) SIGSTKFLT 17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO 30) SIGPWR
31) SIGSYS 34) SIGRTMIN 35) SIGRTMIN+1 36) SIGRTMIN+2 37) SIGRTMIN+3
38) SIGRTMIN+4 39) SIGRTMIN+5 40) SIGRTMIN+6 41) SIGRTMIN+7 42) SIGRTMIN+8
43) SIGRTMIN+9 44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9 56) SIGRTMAX-8 57) SIGRTMAX-7
58) SIGRTMAX-6 59) SIGRTMAX-5 60) SIGRTMAX-4 61) SIGRTMAX-3 62) SIGRTMAX-2
63) SIGRTMAX-1 64) SIGRTMAX
SIGSTOP may be used to completely pause a process so that the operating system does not schedule it. SIGCONT may be used to continue a stopped process. This can be useful for things such as simulating a hung database.
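For example, a quick sketch of pausing and resuming a process (a stopped process shows state T in ps):
$ kill -STOP $PID
$ ps -o pid,stat,cmd -p $PID
$ kill -CONT $PID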
Find who killed a process
There are two main ways a process is killed:
- It kills itself using a call to java/lang/System.exit, java/lang/Runtime.halt, exit, raise, etc.
- It is killed by the kernel or another process using the kill system call, the kill command, etc.
These are diagnosed differently, each with its own section below.
Find why a process killed itself
- If using IBM Java or Semeru/OpenJ9, restart the process with the following Java option and review the resulting javacore: -Xdump:java:events=vmstop,request=exclusive+preempt
- If using HotSpot Java, some builds have User Statically-Defined Tracing (USDT) probes which SystemTap or eBPF (on newer kernels) can use to trace calls to System.exit.
Find who killed another process
- Check the native stdout and stderr logs of the process for any suspicious activity.
- Check the kernel log around the time of the kill for things like the OOM Killer and other potentially related messages (e.g. SSH login by some user).
- Consider using bcc-tools and killsnoop.py.
- If an auditing or keylogging system is in place, review if anyone used the kill command.
- For systems that support it, use auditd with a rule to watch for kill system calls, although test the performance overhead.
- For kernels that support SystemTap, combine scripts such as https://github.com/jav/systemtap/blob/master/testsuite/systemtap.examples/process/sigmon.stp and https://github.com/jav/systemtap/blob/master/testsuite/systemtap.examples/process/proc_snoop.stp to capture the signal and map it to the source PID with details.
- For some signals like SIGTERM (but not SIGKILL), attach strace to the process and watch for signal notifications, although the overhead may be massive even with the -e filter:
$ nohup strace -f -tt -e signal -o strace_trace.txt -p $PID &>> strace_stdouterr.txt &
$ tail -f strace_trace.txt | grep " SIG"
2406 18:50:39.769367 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=678, si_uid=0} ---
The si_pid integer is the sending PID. A script in the background that periodically writes ps output may capture this process. Create psbg.sh:
#!/bin/sh
outputfile="diag_ps_$(hostname)_$(date +"%Y%m%d_%H%M%S").log"
while true; do
  echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") iteration" &>> "${outputfile}"
  ps -elf 2>&1 &>> "${outputfile}"
  sleep 15
done
Then start it before the strace:
$ chmod a+x psbg.sh
$ nohup ./psbg.sh &
For some signals, attach to the process using gdb and immediately continue. When the signal hits the process, gdb will break execution and leave you at a prompt. Then, handle the particular signal you want, print $_siginfo._sifields._kill.si_pid, and detach. Use the same ps script as above to track potential source PIDs.
$ java HelloWorld
Hello World. Waiting indefinitely...
$ ps -elf | grep HelloWorld | grep -v grep
0 S kevin 23947 ...
$ gdb java 23947
...
(gdb) handle all nostop noprint noignore
(gdb) handle SIGABRT stop print noignore
(gdb) continue
# ... Reproduce the problem ...
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7f232df12700 (LWP 23949)]
0x00000033a400d720 in sem_wait () from /lib64/libpthread.so.0
(gdb) ptype $_siginfo
type = struct {
    int si_signo;
    int si_errno;
    int si_code;
    union {
        int _pad[28];
        struct {...} _kill;
        struct {...} _timer;
        struct {...} _rt;
        struct {...} _sigchld;
        struct {...} _sigfault;
        struct {...} _sigpoll;
    } _sifields;
}
(gdb) ptype $_siginfo._sifields._kill
type = struct {
    __pid_t si_pid;
    __uid_t si_uid;
}
(gdb) p $_siginfo._sifields._kill.si_pid
$1 = 22691
(gdb) continue
In the above example, we print _sifields._kill because we know we sent a kill, but strictly speaking, that assumption cannot always be made. _sifields is a union, so only one of its members will have correct values; you must first consult the signal number to know which union member to print, and read only the fields that are meaningful for the given signal.
File I/O
fsync
fsync is a system call used to attempt to flush pending I/O writes to disk; however, there are various potential issues with fsync (Rebello et al., 2021). One way to reduce such risks is to use a copy-on-write file system such as Btrfs instead of journaling file systems such as ext4 and XFS.
sosreport
sosreport is a utility to gather system-wide diagnostics on Fedora, Red Hat Enterprise Linux, and CentOS distributions:
sudo dnf install -y sos
sudo sosreport --batch
- This will take a few minutes to run.
- A compressed file will be produced such as /var/tmp/sosreport-7ce62b94e928-2020-09-01-itqojtr.tar.xz. To use an alternative directory, specify --tmp-dir $dir.
Analysis tips:
- Uncompress with tar -xf sosreport*tar.xz
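To peek at the archive contents before extracting (inner paths vary by sos version):
$ tar -tf sosreport*tar.xz | head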
systemd
Killing nohup processes
When using systemd, it's recommended to start long-running processes in their own systemd service or using systemd-run (see the sketch after the following list) rather than in the context of a user's session scope unit (session-*.scope) because such processes may be terminated when the user logs out even if they were backgrounded (e.g. nohup ... &) due to:
- If KillUserProcesses=yes (the default changed to yes starting with systemd v230).
- If StopIdleSessionSec is set to a non-infinity value (the default is infinity). Available since systemd v252.
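A minimal sketch of systemd-run for a long-lived command (the unit name and command path are hypothetical):
$ sudo systemd-run --unit=myjob /opt/myapp/run.sh
$ systemctl status myjob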
Signal handlers
Show signal handlers registered for a process:
grep Sig /proc/$pid/status
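For example (illustrative values; SigCgt is a hexadecimal bitmask of signals with registered handlers, SigIgn of ignored signals, and SigBlk of blocked signals):
$ grep Sig /proc/$$/status
SigQ:   0/63444
SigPnd: 0000000000000000
SigBlk: 0000000000010000
SigIgn: 0000000000384004
SigCgt: 000000004b813efb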
Process core dumps
Core dumps are normally written in the ELF file format. Therefore, use the readelf program to find all of the LOAD sections to review the virtual memory regions that were dumped to the core:
$ readelf --program-headers core
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
NOTE 0x00000000000003f8 0x0000000000000000 0x0000000000000000
0x00000000000008ac 0x0000000000000000 R 1
LOAD 0x0000000000000ca4 0x0000000000400000 0x0000000000000000
0x0000000000001000 0x0000000000001000 R E 1
LOAD 0x0000000000001ca4 0x0000000000600000 0x0000000000000000
0x0000000000001000 0x0000000000001000 RW 1...
Request core dump (also known as a "system dump" for IBM Java)
Additional methods of requesting system dumps for IBM Java are documented in the Troubleshooting IBM Java and Troubleshooting WAS chapters.
The gcore command pauses the process while the core is generated and then the process should continue. Replace ${PID} in the following example with the process ID. You must have permissions to the process (i.e. either run as the owner of the process or as root).

The size of the core file will be the virtual size of the process (ps VSZ). If there is sufficient free space in physical RAM and the filecache, the core file will be written to RAM and then asynchronously written out to the filesystem, which can dramatically improve the speed of generating a core and reduce the time the process is paused. In general, core dumps compress very well (often up to 75%) for transfer.

Normally, the gcore command is provided as part of the gdb package. In fact, the gcore command is actually a shell script which attaches gdb to the process, runs the gdb gcore command, and then detaches.
gcore ${PID} core.$(date +%Y%m%d.%H%M%S).dmp
There is some evidence that the gcore command in gdb writes less information than the kernel would write in the case of a crash (this probably has to do with the two implementations being different code bases).
The process may be crashed using kill -6 ${PID} or kill -11 ${PID}, which will usually produce a core dump.
On OutOfMemoryError, a core may be requested using the J9 option:
"-Xdump:tool:events=systhrow,filter=java/lang/OutOfMemoryError,range=1..1,request=exclusive+prepwalk,exec=gcore %p"
IBM proposed a kernel API to create a core dump but it was rejected for security reasons and it was proposed to do it in user space.
Core dumps from crashes
When a crash occurs, the kernel may create a core dump of the process. How much is written is controlled by coredump_filter:
Since kernel 2.6.23, the Linux-specific /proc/PID/coredump_filter file can be used to control which memory segments are written to the core dump file in the event that a core dump is performed for the process with the corresponding process ID. The value in the file is a bit mask of memory mapping types (see mmap(2)). (https://www.kernel.org/doc/man-pages/online/pages/man5/core.5.html)
When a process is dumped, all anonymous memory is written to a core file as long as the size of the core file isn't limited. But sometimes we don't want to dump some memory segments, for example, huge shared memory. Conversely, sometimes we want to save file-backed memory segments into a core file, not only the individual files. /proc/PID/coredump_filter allows you to customize which memory segments will be dumped when the PID process is dumped. coredump_filter is a bitmask of memory types. If a bit of the bitmask is set, memory segments of the corresponding memory type are dumped, otherwise they are not dumped. The following 7 memory types are supported:
- (bit 0) anonymous private memory
- (bit 1) anonymous shared memory
- (bit 2) file-backed private memory
- (bit 3) file-backed shared memory
- (bit 4) ELF header pages in file-backed private memory areas (it is effective only if the bit 2 is cleared)
- (bit 5) hugetlb private memory
- (bit 6) hugetlb shared memory
Note that MMIO pages such as frame buffer are never dumped and vDSO pages are always dumped regardless of the bitmask status. When a new process is created, the process inherits the bitmask status from its parent. It is useful to set up coredump_filter before the program runs.
For example:
$ echo 0x7 > /proc/self/coredump_filter
$ ./some_program
https://www.kernel.org/doc/Documentation/filesystems/proc.txt
systemd-coredump
This section has been moved to Java best practices for core piping.
Process Virtual Address Space
The total virtual and resident address space sizes of a process may be queried with ps:
$ ps -o pid,vsz,rss -p 14062
PID VSZ RSS
14062 44648 42508
Details of the virtual address space of a process may be queried with (https://www.kernel.org/doc/Documentation/filesystems/proc.txt):
$ cat /proc/${PID}/maps
This will produce a line of output for each virtual memory area (VMA):
$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fd:02 22151273 /bin/cat...
The first column is the address range of the VMA. The second column is the set of permissions (read, write, execute, private). The third column is the offset if the VMA is a file, device, etc. The fourth column is the device (major:minor) if the VMA is a file, device, etc. The fifth column is the inode if the VMA is a file, device, etc. The final column is the pathname if the VMA is a file, etc.
The sum of these address ranges will equal the ps VSZ number.
In recent versions of Linux, smaps is a superset of maps and additionally includes details for each VMA:
$ cat /proc/self/smaps
00400000-0040b000 r-xp 00000000 fd:02 22151273 /bin/cat
Size: 44 kB
Rss: 20 kB
Pss: 12 kB...
The Rss and Pss values are particularly interesting, showing how much of the VMA is resident in memory (some pages may be shared with other processes) and the proportional set size of a shared VMA where the size is divided by the number of processes sharing it, respectively.
smaps
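The following commands assume the smaps contents were saved to a file in the current directory, for example:
$ cat /proc/${PID}/smaps > smaps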
The total virtual size of the process (VSZ):
$ grep ^Size smaps | awk '{print $2}' | paste -sd+ | bc | sed 's/$/*1024/' | bc
3597316096
The total resident size of the process (RSS):
$ grep ^Rss smaps | awk '{print $2}' | paste -sd+ | bc | sed 's/$/*1024/' | bc
897622016
The total proportional resident set size of the process (PSS):
$ grep ^Pss: smaps | awk '{print $2}' | paste -sd+ | bc | sed 's/$/*1024/' | bc
891611136
In general, PSS is used for sizing physical memory.
Print the sum of VMA sizes in the 60-70MB range (i.e. 8-digit decimal byte sizes starting with 6):
$ grep -v -E "^[a-zA-Z_]+:" smaps | awk '{print $1}' | sed 's/\-/,/g' | perl -F, -lane 'print hex($F[1])-hex($F[0]);' | sort -n | grep "^[6].......$" | paste -sd+ | bc
840073216
Sort VMAs by RSS:
$ cat smaps | while read line; do read line2; read line3; read line4; read line5; read line6; read line7; read line8; read line9; read line10; read line11; read line12; read line13; read line14; echo $line, $line3; done | awk '{print $(NF-1)*1024, $1}' | sort -n
pmap
The pmap command prints the same information as smaps but in a column-based format: https://www.kernel.org/doc/man-pages/online/pages/man1/pmap.1.html
$ pmap -XX $(pgrep -f defaultServer)
22: java -javaagent:/opt/ibm/wlp/bin/tools/ws-javaagent.jar -Djava.awt.headless=true -Djdk.attach.allowAttachSelf=true -Xshareclasses:name=liberty,nonfatal,cacheDir=/output/.classCache/ -XX:+UseContainerSupport -jar /opt/ibm/wlp/bin/tools/ws-server.jar defaultServer
Address Perm Offset Device Inode Size KernelPageSize MMUPageSize Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous LazyFree AnonHugePages ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked VmFlags Mapping
00400000 r-xp 00000000 08:01 5384796 4 4 4 4 4 0 0 0 4 4 4 0 0 0 0 0 0 0 0 rd ex mr mw me dw java
00600000 r--p 00000000 08:01 5384796 4 4 4 4 4 0 0 0 4 4 4 0 0 0 0 0 0 0 0 rd mr mw me dw ac java
00601000 rw-p 00001000 08:01 5384796 4 4 4 4 4 0 0 0 4 0 4 0 0 0 0 0 0 0 0 rd wr mr mw me dw ac java
00d96000 rw-p 00000000 00:00 0 132 4 4 124 124 0 0 0 124 124 124 0 0 0 0 0 0 0 0 rd wr mr mw me ac [heap]
00db7000 rw-p 00000000 00:00 0 51200 4 4 39124 39124 0 0 0 39124 39124 39124 0 38912 0 0 0 0 0 0 rd wr mr mw me nr hg [heap]
03fb7000 ---p 00000000 00:00 0 153600 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 mr mw me nr hg
dfff1000 ---p 00000000 00:00 0 60 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 mr mw me nr hg
e0000000 rw-p 00000000 00:00 0 69184 4 4 47856 47856 0 0 0 47856 47856 47856 0 24576 0 0 0 0 0 0 rd wr mr mw me nr hg
gdb
Loading a core dump
A core dump is loaded by passing the paths to the executable and the core dump to gdb:
$ gdb ${PATH_TO_EXECUTABLE} ${PATH_TO_CORE}
To load matching symbols from particular paths (e.g. if the core is from another machine):
- Run gdb without any parameters
- set solib-absolute-prefix ${ABS_PATH_TO_SO_LIBS}
  - This is the simulated root; for example, if the .so was originally loaded from /lib64/libc.so.6 and you have it on your host in /tmp/dump/lib64/libc.so.6, then ${ABS_PATH_TO_SO_LIBS} would be /tmp/dump
- Optional, if you have *.debug files: set debug-file-directory ${ABS_PATHS_TO_DIR_WITH_DEBUG_FILES}
- file ${PATH_TO_JAVA_EXECUTABLE_THAT_CREATED_THE_CORE}
- core-file ${PATH_TO_CORE}
Batch execute some gdb commands:
$ gdb --batch --quiet -ex "thread apply all bt" -ex "quit" $EXE $CORE
Common Commands
- Disable pagination:
set pagination off
- Add to
~/.gdbinit
to always run this on load
- Add to
- Print current thread stack:
bt
- Print thread stacks:
thread apply all bt
- Review what's loaded in memory:
info proc mappings
maintenance info sections
info files
- List all threads:
info threads
- Switch to a different thread:
thread N
- Print loaded shared libraries:
info sharedlibrary
- Print register:
p $rax
- Print some number of words starting at a register; for example, 16
words starting from the stack pointer:
x/16wx $esp
- Print current instruction:
x/i $pc
- If available, print source of current function:
list
- Disassemble function at address:
disas 0xff
- Print structure definition:
ptype struct malloc_state
- Print output to a file:
set logging on
- Print data type of variable:
ptype var
- Print symbol information:
info symbol 0xff
- Add a source directory to the source
path:
directory $DIR
Print Virtual Memory
Virtual memory may be printed with the x command:
(gdb) x/32xc 0x00007f3498000000
0x7f3498000000: 32 ' ' 0 '\000' 0 '\000' 28 '\034' 54 '6' 127 '\177' 0 '\000' 0 '\000'
0x7f3498000008: 0 '\000' 0 '\000' 0 '\000' -92 '\244' 52 '4' 127 '\177' 0 '\000' 0 '\000'
0x7f3498000010: 0 '\000' 0 '\000' 0 '\000' 4 '\004' 0 '\000' 0 '\000' 0 '\000' 0 '\000'
0x7f3498000018: 0 '\000' 0 '\000' 0 '\000' 4 '\004' 0 '\000' 0 '\000' 0 '\000' 0 '\000'
Another option is to dump memory to a file and then spawn an xxd process from within gdb to dump that file which is easier to read:
(gdb) define xxd
Type commands for definition of "xxd".
End with a line saying just "end".
>dump binary memory dump.bin $arg0 $arg0+$arg1
>shell xxd dump.bin
>shell rm -f dump.bin
>end
(gdb) xxd 0x00007f3498000000 32
0000000: 2000 001c 367f 0000 0000 00a4 347f 0000 ...6.......4...
0000010: 0000 0004 0000 0000 0000 0004 0000 0000 ................
For large areas, these may be dumped to a file directly:
(gdb) dump binary memory dump.bin 0x00007f3498000000 0x00007f34a0000000
Large VMAs often have a lot of zeroed memory. A simple trick to filter those out is to remove all zero lines:
$ xxd dump.bin | grep -v "0000 0000 0000 0000 0000 0000 0000 0000" | less
Process Virtual Address Space
Gdb can query a core file and produce output about the virtual address space which is similar to /proc/${PID}/smaps, although it is normally a subset of all of the VMAs:
(gdb) info files
Local core dump file:
`core.16721.dmp', file type elf64-x86-64.
...
0x00007f3498000000 - 0x00007f34a0000000 is load51...
The "Local core dump file" stanza of gdb "info files" seems like the best place to look to approximate the virtual address size of the process at the time of the dump. This will not account for everything, especially if coredump_filter is the default value (and even if it has all flags set).
A GDB python script may be used to sum all of these address ranges: https://raw.githubusercontent.com/kgibm/problemdetermination/master/scripts/gdb/gdbinfofiles.py
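As a rough alternative to the script, the load section ranges from gdb's batch output may be summed directly; a sketch assuming GNU awk (for strtonum):
$ gdb --batch -ex "info files" ${PATH_TO_EXECUTABLE} ${PATH_TO_CORE} 2>/dev/null | awk '/is load/ { sum += strtonum($3) - strtonum($1) } END { print sum }'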
Debug a Running Process
You may attach gdb to a running process:
$ gdb ${PATH_TO_EXECUTABLE} ${PID}
This may be useful to set breakpoints. For example, to break on a SIGABRT signal:
(gdb) handle all nostop noprint noignore
(gdb) handle SIGABRT stop print noignore
(gdb) continue
# ... Reproduce the problem ...
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7f232df12700 (LWP 23949)]
0x00000033a400d720 in sem_wait () from /lib64/libpthread.so.0
(gdb) ptype $_siginfo
type = struct {
int si_signo;
int si_errno;
int si_code;
union {
int _pad[28];
struct {...} _kill;...
} _sifields;
}
(gdb) ptype $_siginfo._sifields._kill
type = struct {
__pid_t si_pid;
__uid_t si_uid;
}
(gdb) p $_siginfo._sifields._kill.si_pid
$1 = 22691
(gdb) continue
Next we can search for this PID 22691 and we'll find out who it is (in the following example, we see bash and the user name). If the PID is gone, then it is presumably some sort of script that already finished (you could create a background process that writes ps output to a file periodically to capture this):
$ ps -elf | grep 22691 | grep -v grep
0 S kevin 22691 20866 0 80 0 - 27657 wait 08:16 pts/2 00:00:00 bash
Strictly speaking, you must first consult the signal number to know which union member to print above in $_siginfo._sifields._kill: https://www.kernel.org/doc/man-pages/online/pages/man2/sigaction.2.html
Process glibc malloc free lists
Use the following Python script to sum the total size of free malloc chunks on glibc malloc free lists in a core dump:
# glibc_malloc_info.py is a gdb automation script to count the total size of free malloc chunks in all arenas.
# It requires gdb compiled with Python scripting support, a core dump, the process from which the core dump came,
# and glibc symbols (e.g. glibc-debuginfo) that match those used by that process.
# Background: https://sourceware.org/glibc/wiki/MallocInternals
#
# usage:
# Basic: CORE=path_to_core EXE=path_to_process gdb --batch --command glibc_malloc_info.py
# J9 jextract: JEXTRACTED=true gdb --batch --command glibc_malloc_info.py
# glibc symbols extracted into current directory: JEXTRACTED=true CWDSYMBOLS=true gdb --batch --command glibc_malloc_info.py
#
# Example output:
#
# Total malloced: 25948160
# Total malloced not through mmap: 6696960
# Total malloced through mmap: 19251200
# mmap threshold: 2170880
# Number of arenas: 32
# Processing arena 1 @ 0x155190f4b9e0
# [0]: binaddr: 0x155190f4ba50
# [0]: chunkaddr: 0x155157cfe050
# [0,0]: size: 39952
# [0]: total free in bin: 160784, num: 10, max: 39952, avg: 16078
# [...]
# Total malloced: 25948160
# Total free: 12915264
import os
import sys

if os.environ.get("JEXTRACTED") == "true":
    gdb.execute("set solib-absolute-prefix " + os.getcwd() + "/")
    gdb.execute("set solib-search-path " + os.getcwd() + "/")

if os.environ.get("CWDSYMBOLS") == "true":
    print("Setting debug-file-directory to " + os.getcwd() + "/usr/lib/debug/")
    gdb.execute("set debug-file-directory " + os.getcwd() + "/usr/lib/debug/")

exe = os.environ.get("EXE")
core = os.environ.get("CORE")

for root, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        fullpath = root + "/" + filename
        if core is None and filename.startswith("core") and not filename.endswith("zip") and not filename.endswith(".gz") and not filename.endswith(".tar"):
            print("Found core file: " + fullpath)
            core = fullpath
        elif exe is None and filename == "java":
            print("Found executable: " + fullpath)
            exe = fullpath

if exe is None:
    raise Exception("Could not find executable in current working directory")
if core is None:
    raise Exception("Could not find corefile in current working directory")

gdb.execute("file " + exe)
gdb.execute("core-file " + core)
gdb.execute("set pagination off")

OPTION_DEBUG = os.environ.get("DEBUG") == "true"
OPTION_BIN_ADDR_OFFSET = 16 # see offsetof in bin_at in malloc.c
OPTION_BIN_ADDR_OFFSET2 = 24
OPTION_BIN_COUNT = 253 # see struct malloc_state in malloc.c or `ptype struct malloc_state`, minus 1
OPTION_FASTBIN_COUNT = 9 # see struct malloc_state in malloc.c or `ptype struct malloc_state`, minus 1
OPTION_START = 0 # For debug

def value_to_addr(v):
    return clean_addr(v.address)

def clean_addr(addr):
    addr = str(addr)
    x = addr.find(" ")
    if x != -1:
        addr = addr[0:x]
    return addr

def process_bins(bins, count):
    total_free = 0
    for i in range(OPTION_START, count):
        iteration_free = 0
        binaddr = value_to_addr(bins[i])
        binaddrHexstring = hex(int(binaddr, 16))
        print("[" + str(i) + "]: binaddr: " + binaddr)
        chunkaddr = value_to_addr(bins[i].dereference())
        print("[" + str(i) + "]: chunkaddr: " + str(chunkaddr))
        if chunkaddr == "0x0":
            continue
        fwd = gdb.parse_and_eval("((struct malloc_chunk *)" + chunkaddr + ")->fd")
        if OPTION_DEBUG:
            print("[" + str(i) + "]: fwd: " + str(fwd))
        x = 0
        max = 0
        # If the address of the first chunk equals the fd pointer, then it's an unused bin. See malloc_init_state
        if binaddr != str(fwd):
            firstaddr = chunkaddr
            firstiteration = True
            while chunkaddr != "0x0" and (firstiteration or (str(chunkaddr) != str(firstaddr))):
                if OPTION_DEBUG:
                    print("[" + str(i) + "," + str(x) + "]: chunkaddr: " + chunkaddr)
                checkchunk = hex(int(chunkaddr, 16) + OPTION_BIN_ADDR_OFFSET)
                checkchunk2 = hex(int(chunkaddr, 16) + OPTION_BIN_ADDR_OFFSET2)
                if OPTION_DEBUG:
                    print("[" + str(i) + "," + str(x) + "]: checkchunk : " + checkchunk)
                    print("[" + str(i) + "," + str(x) + "]: checkchunk2: " + checkchunk2)
                if checkchunk == binaddrHexstring or checkchunk2 == binaddrHexstring:
                    break
                try:
                    size = gdb.parse_and_eval("((struct malloc_chunk *)" + chunkaddr + ")->size & ~0x7")
                except gdb.error:
                    size = gdb.parse_and_eval("((struct malloc_chunk *)" + chunkaddr + ")->mchunk_size & ~0x7")
                if size > max:
                    max = size
                if OPTION_DEBUG or firstiteration:
                    print("[" + str(i) + "," + str(x) + "]: size: " + str(size))
                total_free = total_free + size
                iteration_free = iteration_free + size
                chunkaddr = value_to_addr(gdb.parse_and_eval("((struct malloc_chunk *)" + chunkaddr + ")->fd").dereference())
                x = x + 1
                firstiteration = False
        if x > 0:
            print("[" + str(i) + "]: total free in bin: " + str(iteration_free) + ", num: " + str(x) + ", max: " + str(max) + ", avg: " + str(iteration_free / x))
        else:
            print("[" + str(i) + "]: total free in bin: " + str(iteration_free))
    return total_free

total_malloced = gdb.parse_and_eval("mp_.mmapped_mem") + gdb.parse_and_eval("main_arena.system_mem")
print("Total malloced: " + str(total_malloced))
print("Total malloced not through mmap: " + str(gdb.parse_and_eval("main_arena.system_mem")))
print("Total malloced through mmap: " + str(gdb.parse_and_eval("mp_.mmapped_mem")))
print("mmap threshold: " + str(gdb.parse_and_eval("mp_.mmap_threshold")))
print("Number of arenas: " + str(gdb.parse_and_eval("narenas")))

total_free = 0
arena = value_to_addr(gdb.parse_and_eval("main_arena"))
main_arena = arena
process_arena = True
arena_count = 1

while process_arena:
    print("Processing arena " + str(arena_count) + " @ " + str(arena))
    total_free = total_free + process_bins(gdb.parse_and_eval("((struct malloc_state *)" + str(arena) + ")->bins"), OPTION_BIN_COUNT)
    total_free = total_free + process_bins(gdb.parse_and_eval("((struct malloc_state *)" + str(arena) + ")->fastbinsY"), OPTION_FASTBIN_COUNT)
    arena = clean_addr(gdb.parse_and_eval("((struct malloc_state *)" + str(arena) + ")->next"))
    print("Next arena: " + str(arena))
    process_arena = str(arena) != str(main_arena)
    arena_count = arena_count + 1

print("")
print("Total malloced: " + str(total_malloced))
print("Total free: " + str(total_free))
# Example manual analysis:
#
# There was a question about how to determine glibc malloc free chunks:
# Install glibc-debuginfo and load the core dump in gdb. First, we see that the total outstanding malloc'ed at the time of the dump is about 4GB:
# (gdb) print main_arena.system_mem + mp_.mmapped_mem
# $1 = 4338761728
# Start with the starting chunk for a bin:
# (gdb) print main_arena.bins[250]
# $2 = (mchunkptr) 0x7f24c6232000
# Cast to a malloc_chunk, get the size (or mchunk_size, depending on version) and mask with 0x7 to see that it's 1MB (see https://sourceware.org/glibc/wiki/MallocInternals):
# (gdb) print ((struct malloc_chunk *)0x7f24c6232000)->size & ~0x7
# $3 = 1048545
# Then follow the linked list through the fd (forward) pointer:
# (gdb) print ((struct malloc_chunk *)0x7f24c6232000)->fd
# $4 = (struct malloc_chunk *) 0x7f24a7632000
# Continuing:
# (gdb) print ((struct malloc_chunk *)0x7f24a7632000)->fd
# $5 = (struct malloc_chunk *) 0x7f24a3cfb000
# You know you're at the end of the list when the forward pointer is the address of the first chunk plus either 16 or 24:
# (gdb) print &(main_arena.bins[250])
# $6 = (mchunkptr *) 0x7f2525c97f98 <main_arena+2104>
# After about 3,345 of these chunks, you'll see:
# (gdb) print ((struct malloc_chunk *)0x7f24bcb723d0)->fd
# $7 = (struct malloc_chunk *) 0x7f2525c97f88 <main_arena+2088>
# (gdb) print/x 0x7f2525c97f88 + 16
# $8 = 0x7f2525c97f98
# Therefore, 3,345 * ~1MB = ~3.3GB.
# In this case, the workaround is export MALLOC_MMAP_THRESHOLD_=1000000. This fixes the glibc malloc mmap threshold so
# that when someone calls malloc requesting more than that many bytes, malloc actually calls mmap to allocate and then
# when free is called, it calls munmmap so that the memory goes back to the OS instead of getting added to the malloc
# free lists.
gcore
Although it's preferable to use Java's built-in methods of requesting a core dump, gcore may be used, which attaches gdb, dumps most of the memory segments into a core file, and allows the process to continue:
usage: gcore [-o filename] pid
Here is a shell script to help capture more details:
#!/bin/sh
# This script automates taking a core dump using gcore.
# It also updates coredump_filter to maximize core dump contents.
# Usage: ./ibmgcore.sh PID [SEQ] [PREFIX]
# PID - The process ID. You must have permissions (owner or sudo/root).
# SEQ - Optional sequence number. Defaults to 1.
# PREFIX - Optional prefix (e.g. directory and file name). Defaults to ./
PID=$1
SEQ=$2
PREFIX=$3
if [ -z "$PREFIX" ]; then
PREFIX="./"
fi
if [ -z "$SEQ" ]; then
SEQ=1
fi
DT=`date +%Y%m%d.%H%M%S`
LOG="${PREFIX}core.${DT}.$PID.000$SEQ.dmp.log.txt"
COREFILE="${PREFIX}core.${DT}.$PID.000$SEQ.dmp"
echo 0x7f > /proc/$PID/coredump_filter
date > ${LOG}
echo $PID >> ${LOG} 2>&1
cat /proc/$PID/coredump_filter >> ${LOG} 2>&1
echo "maps" >> ${LOG} 2>&1
cat /proc/$PID/maps >> ${LOG} 2>&1
echo "smaps" >> ${LOG} 2>&1
cat /proc/$PID/smaps >> ${LOG} 2>&1
echo "limits" >> ${LOG} 2>&1
cat /proc/$PID/limits >> ${LOG} 2>&1
echo "gcore start" >> ${LOG} 2>&1
date >> ${LOG}
gcore -o $COREFILE $PID >> ${LOG} 2>&1
echo "gcore finish" >> ${LOG} 2>&1
date >> ${LOG}
echo "Gcore complete. Now renaming. This may take a few moments, but your process has now continued running."
# gcore adds the PID to the end of the file, so just remove that
mv $COREFILE.$PID $COREFILE
date >> ${LOG}
echo "Completely finished." >> ${LOG} 2>&1
Shared Libraries
Check if a shared library is stripped of symbols:
$ file $LIBRARY.so
Check the end of the output for "stripped" or "not stripped."
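For example (libfoo.so and the output details are illustrative):
$ file libfoo.so
libfoo.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped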
glibc
malloc
The default Linux native memory allocator on most distributions is Glibc malloc (which is based on ptmalloc and dlmalloc). Glibc malloc either allocates like a classic heap allocator (from sbrk or mmap'ed arenas) or directly using mmap, depending on a sliding threshold (M_MMAP_THRESHOLD). In the former case, the basic idea of a heap allocator is to request a large block of memory from the operating system and dole out chunks of it to the program. When the program frees these chunks, the memory is not returned to the operating system, but instead is saved for future allocations. This generally improves performance by avoiding operating system overhead, including system call time. Techniques such as binning allow the allocator to quickly find a "right sized" chunk for a new memory request.
The major downside of all heap allocators is fragmentation (compaction is not possible because pointer addresses in the program cannot be changed). While heap allocators can coalesce adjacent free chunks, program allocation patterns, malloc configuration, and malloc heap allocator design limitations mean that there are likely to be free chunks of memory that are unlikely to be used in the future. These free chunks are essentially "wasted" space, yet from the operating system point of view, they are still active virtual memory requests ("held" by glibc malloc instead of by the program directly). If no free chunk is available for a new allocation, then the heap must grow to satisfy it.
In the worst case, with certain allocation patterns and enough time, resident memory will grow unbounded. Unlike certain Java garbage collectors, glibc malloc does not have a feature of heap compaction. Glibc malloc does have a feature of trimming (M_TRIM_THRESHOLD); however, this only occurs with contiguous free space at the top of a heap, which is unlikely when a heap is fragmented.
Starting with glibc 2.10 (for example, RHEL 6), the default behavior was changed to be less memory efficient but more performant by creating per-thread arenas to reduce cross-thread malloc contention:
Red Hat Enterprise Linux 6 features version 2.11 of glibc, providing many features and enhancements, including... An enhanced dynamic memory allocation (malloc) behaviour enabling higher scalability across many sockets and cores. This is achieved by assigning threads their own memory pools and by avoiding locking in some situations. The amount of additional memory used for the memory pools (if any) can be controlled using the environment variables MALLOC_ARENA_TEST and MALLOC_ARENA_MAX. MALLOC_ARENA_TEST specifies that a test for the number of cores is performed once the number of memory pools reaches this value. MALLOC_ARENA_MAX sets the maximum number of memory pools used, regardless of the number of cores. (https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.0_Release_Notes/compiler.html)
After a certain number of arenas have already been created (2 on 32-bit and 8 on 64-bit, or the value explicitly set through the environment variable MALLOC_ARENA_TEST), the maximum number of arenas will be set to NUMBER_OF_CPU_CORES multiplied by 2 for 32-bit and 8 for 64-bit. A thread will be assigned to a particular arena. These arenas will also increase virtual memory usage compared to a single heap; however, virtual memory increase by itself is not an issue. There is some evidence that certain workloads combined with these per-thread arenas may cause additional fragmentation, which could be an issue. This behavior may be reverted with the environment variable MALLOC_ARENA_MAX=1.
Glibc malloc does not make it easy to tell if fragmentation is the cause of process size growth, versus program demands or a leak. The malloc_stats function can be called in the running process to print free statistics to stderr. It wouldn't be too hard to write a JVMTI shared library which called this function through a static method or MBean (and this could even be loaded dynamically through Java Surgery). More commonly, you'll have a core dump (whether manually taken or from a crash), and the malloc structures don't track total free space in each arena, so the only way is to walk the arenas and memory chunks and calculate free space (in the same way as malloc_stats); a gdb Python script that does this is shown in the "Process glibc malloc free lists" section above. In general, native heap fragmentation in Java programs is much less likely than native memory program demands or a leak, so I always investigate those first (using techniques described elsewhere).
If you have determined that native heap fragmentation is causing unbounded process size growth, then you have a few options. First, you can change the application by reducing its native memory demands. Second, you can tune glibc malloc to immediately free certain sized allocations back to the operating system. As discussed above, if the requested size of a malloc is greater than M_MMAP_THRESHOLD, then the allocation skips the heaps and is directly allocated from the operating system using mmap. When the program frees this allocation, the chunk is un-mmap'ed and thus given back to the operating system. Beyond the additional cost of system calls and the operating system needing to allocate and free these chunks, mmap has additional costs because it must be zero-filled by the operating system, and it must be sized to the boundary of the page size (e.g. 4KB). This can cause worse performance and more memory waste (ceteris paribus).
If you decide to change the mmap threshold, the first step is to determine the allocation pattern. This can be done through tools such as ltrace (on malloc) or SystemTap, or if you know what is causing most of the allocations (e.g. Java DirectByteBuffers), then you can trace just those allocations. Next, create a histogram of these sizes and choose a threshold just under the smallest yet most frequent allocation. For example, let's say you've found that most allocations are larger than 8KB. In this case, you can set the threshold to 8192:
MALLOC_MMAP_THRESHOLD_=8192
Additionally, glibc malloc has a limit on the number of direct mmaps that it will make, which is 65536 by default. With a smaller threshold and many allocations, this may need to be increased. You can set this to something like 5 million:
MALLOC_MMAP_MAX_=5000000
These are set as environment variables in each Java process. Note that there is a trailing underscore on these variable names.
You can verify these settings and the number and total size of mmaps using a core dump, gdb, and glibc symbols:
(gdb) p mp_
$1 = {trim_threshold = 131072, top_pad = 131072, mmap_threshold = 4096,
arena_test = 0, arena_max = 1, n_mmaps = 1907812, n_mmaps_max = 5000000,
max_n_mmaps = 2093622, no_dyn_threshold = 1, pagesize = 4096,
mmapped_mem = 15744507904, max_mmapped_mem = 17279684608, max_total_mem = 0,
sbrk_base = 0x1e1a000 ""}
In this example, the threshold was set to 4KB (mmap_threshold), there are about 1.9 million active mmaps (n_mmaps), the maximum number is 5 million (n_mmaps_max), and the total amount of memory currently mmap'ped is about 14GB (mmapped_mem).
There is also some evidence that the number of arenas can contribute to fragmentation.
Investigating core dumps
Notes:
- The kernel does not dump everything from the running process (even if 0x7f is set in coredump_filter).
- Testing has shown that gcore dumps less memory content than when the kernel dumps a core during crash processing.
Ideas for dealing with fragmentation:
- Reduce the number and/or frequency of direct or indirect mallocs.
- Create thread local caches of whatever is using the mallocs (e.g. DirectByteBuffers). Set the minimum size of the thread pool doing this equal to the maximum size to avoid thread local destruction.
- Experiment with a limited MALLOC_ARENA_MAX (try 1 to see if there's any effect at first).
- Experiment with MALLOC_MMAP_THRESHOLD and MALLOC_MMAP_MAX, carefully monitoring the performance difference.
- Batch the frees together (e.g. with a stop-the-world type of mechanism) to increase the probability of free chunks coalescing.
- Experiment with M_MXFAST
- malloc_trim is rarely useful outside of academic scenarios. It only trims from the top of the main arena. First, most fragmentation is within an arena not at the top, and second, most programs heavily (and even predominantly) use the non-main arenas.
How much is malloc'ed?
Add mp_.mmapped_mem plus system_mem for each arena starting at main_arena and following the next pointer until next==&main_arena
(gdb) p mp_.mmapped_mem
$1 = 0
(gdb) p &main_arena
$2 = (struct malloc_state *) 0x3c95b8ee80
(gdb) p main_arena.system_mem
$3 = 413696
(gdb) p main_arena.next
$4 = (struct malloc_state *) 0x3c95b8ee80
Exploring Arenas
glibc provides malloc statistics at runtime through a few methods: mallinfo, malloc_info, and malloc_stats. mallinfo is old and not designed for 64-bit; malloc_info is the newer version, which returns an XML blob of information. malloc_stats doesn't return anything, but instead prints total statistics to stderr (http://udrepper.livejournal.com/20948.html).
- malloc_info: https://www.kernel.org/doc/man-pages/online/pages/man3/malloc_info.3.html
- malloc_stats: https://www.kernel.org/doc/man-pages/online/pages/man3/malloc_stats.3.html
- mallinfo: https://www.kernel.org/doc/man-pages/online/pages/man3/mallinfo.3.html
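malloc_stats may also be invoked in a live process from gdb; a sketch (this briefly pauses the process, requires attach permissions and glibc symbols, and the output appears on the target process's stderr):
$ gdb -p ${PID} --batch -ex "call (void) malloc_stats()"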
malloc trace
Malloc supports rudimentary allocation tracing: http://www.gnu.org/software/libc/manual/html_node/Tracing-malloc.html
strace
On how to use strace and ltrace to investigate mmap and malloc calls with callstacks, see the main Linux chapter.
Native Memory Leaks
eBPF
On Linux kernel versions >= 4.1, eBPF is an in-kernel virtual machine that runs programs that access kernel information. eBPF is fully supported starting with, for example, RHEL 8.
Install
- Modern Fedora/RHEL/CentOS/ubi/ubi-init: dnf install -y kernel-devel bcc-tools bpftool
  Then add to PATH: export PATH=/usr/share/bcc/tools/:${PATH}
Alternatively, manually install:
- Install dependencies: https://github.com/iovisor/bcc/blob/master/INSTALL.md#packages
- Clone bcc tools:
git clone https://github.com/iovisor/bcc
bpftool
List running eBPF programs
bpftool prog list
Tracking native memory leaks
Periodically dump any stacks that do not have matching frees:
- Run the memleak script, specifying the process to watch, the
interval in seconds, and, optionally, the number of iterations:
memleak.py -p $PID 30 10 > memleak_$PID.txt
- Analyze the stack output. For example:
[19:41:11] Top 10 stacks with outstanding allocations:
225144 bytes in 159 allocations from stack
    func1+0x16 [process]
    main+0x81 [process]
LinuxNativeTracker
Recent versions of IBM Java include an optional feature to enable advanced native memory tracking: https://www.ibm.com/support/pages/ibm-java-linux-howto-tracking-native-memory-java-8-linux
HotSpot Java has -XX:NativeMemoryTracking
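For example, a sketch of enabling and then querying native memory tracking (MyApp is a hypothetical main class):
$ java -XX:NativeMemoryTracking=summary MyApp
$ jcmd $(pgrep -f MyApp) VM.native_memory summary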
Debug Symbols
In general, it is recommended to compile all executables and libraries with debug symbols (-g):
GCC, the GNU C/C++ compiler, supports '-g' with or without '-O', making it possible to debug optimized code. We recommend that you always use '-g' whenever you compile a program.
Alternatively, symbols may be output into separate files and made available for download to support engineers: http://www.sourceware.org/gdb/current/onlinedocs/gdb/Separate-Debug-Files.html
See instructions for each distribution.
Frame pointer omission
Frame pointer omission (FPO) is a common compiler optimization that makes it more difficult for diagnostic tools to walk stack traces. When compiling with GCC, test the relative performance of -fno-omit-frame-pointer to ensure that frame pointers are not omitted so that backtraces are intact.
To check if an executable uses FPO, dump its assembly and check for instructions that copy the stack pointer into the frame pointer. If there are no such instructions, then FPO is active. For example, on x86 (with objdump using AT&T syntax by default), you might search as follows:
$ objdump -d libzip.so | grep -e "mov.*%esp,.*%ebp" -e "mov.*%rsp,.*%rbp"
Note that some executables contain a mix of FPO and non-FPO functions, so the presence of such instructions alone may not be conclusive.
SystemTap (stap)
SystemTap is largely superseded by eBPF on newer kernels; however, it still works. Examples:
ltrace equivalent: https://sourceware.org/git/?p=systemtap.git;a=blob;f=testsuite/systemtap.examples/process/ltrace.stp;h=151cdb545432b9001bf2416f098b097418d2ccff;hb=refs/heads/master
Network
On Linux, once a socket is listening, there are two queues: a SYN queue and an accept queue (controlled by the backlog passed to listen). Once the handshake is complete, a connection is put on the accept queue if the current number of connections on the accept queue is less than the backlog. The backlog does not affect the SYN queue because if a SYN gets to the server when the accept queue is full, it is still possible that by the time the full handshake completes, the accept queue will have space.

If the handshake completes and the accept queue is full, then the server's socket information is dropped but nothing is sent to the client; when the client tries to send data, the server sends a RST. If SYN cookies are enabled and the SYN queue reaches a high watermark, after the SYN/ACK is sent, the SYN is removed from the queue. When the ACK comes back, the SYN is rebuilt from the information in the ACK and then the handshake is completed.
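To observe these queues, one approach is ss for listening sockets (for which Recv-Q is the current accept queue depth and Send-Q is the backlog) plus the kernel's listen overflow/drop counters; counter names vary by kernel version:
$ ss -ltn
$ nstat -az TcpExtListenOverflows TcpExtListenDrops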
Process CPU Deep Dive
- Create linuxstat.sh:
#!/bin/sh
outputfile="linuxstat_$(date +"%Y%m%d_%H%M%S").log"
echo "linuxstat: $(date +"%Y%m%d %H%M%S %N %Z") : PIDs: ${*}" | tee -a ${outputfile}
while true; do
  cat /proc/stat &>> ${outputfile}
  for PID in ${*}; do
    echo "linuxstat: $(date +"%Y%m%d %H%M%S %N %Z") : iteration for PID: ${PID}" | tee -a ${outputfile}
    cat /proc/${PID}/stat &>> ${outputfile}
    (for i in /proc/${PID}/task/*; do echo -en "$i="; cat $i/stat; done) &>> ${outputfile}
  done
  sleep 15
done
chmod +x linuxstat.sh
- Start:
nohup ./linuxstat.sh PID1 PID2...
- Reproduce the problem
- Ctrl+C to stop linuxstat.sh and gather linuxstat*log
Hung Processes
Gather and review (particularly the output of each kernel stack in /stack):
PID=$1
outputfile="linuxhang_$(date +"%Y%m%d_%H%M%S").log"
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : status" | tee -a ${outputfile}
cat /proc/${PID}/status &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : sched" | tee -a ${outputfile}
cat /proc/${PID}/sched &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : schedstat" | tee -a ${outputfile}
cat /proc/${PID}/schedstat &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : syscall" | tee -a ${outputfile}
cat /proc/${PID}/syscall &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : wchan" | tee -a ${outputfile}
echo -en "/proc/${PID}/wchan=" &>> ${outputfile}
cat /proc/${PID}/wchan &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : task wchan" | tee -a ${outputfile}
(for i in /proc/${PID}/task/*; do echo -en "$i="; cat $i/wchan; echo ""; done) &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : stack" | tee -a ${outputfile}
echo -en "/proc/${PID}/stack=" &>> ${outputfile}
cat /proc/${PID}/stack &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : task stack" | tee -a ${outputfile}
(for i in /proc/${PID}/task/*; do echo -en "$i="; cat $i/stack; echo ""; done) &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : syscall" | tee -a ${outputfile}
echo -en "/proc/${PID}/syscall=" &>> ${outputfile}
cat /proc/${PID}/syscall &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : task syscall" | tee -a ${outputfile}
(for i in /proc/${PID}/task/*; do echo -en "$i="; cat $i/syscall; echo ""; done) &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : task sched" | tee -a ${outputfile}
(for i in /proc/${PID}/task/*; do echo -en "$i="; cat $i/sched; done) &>> ${outputfile}
echo "linuxhang: $(date +"%Y%m%d %H%M%S %N %Z") : task status" | tee -a ${outputfile}
(for i in /proc/${PID}/task/*; do echo -en "$i="; cat $i/status; done) &>> ${outputfile}
echo "Wrote to ${outputfile}"
Review if number of switches is increasing:
PID=8939; PROC=sched; for i in /proc/${PID} /proc/${PID}/task/*; do echo -en "$i/${PROC}="; echo ""; cat $i/${PROC}; echo ""; done | grep -e ${PROC}= -e nr_switches
A simpler script:
PID=...
date >> kernelstacks.txt
for i in /proc/${PID}/task/*; do
echo -en "$i stack=" &>> kernelstacks.txt
cat $i/stack &>> kernelstacks.txt
echo "" &>> kernelstacks.txt
echo -en "$i wchan=" &>> kernelstacks.txt
cat $i/wchan &>> kernelstacks.txt
echo "" &>> kernelstacks.txt
echo -en "$i syscall=" &>> kernelstacks.txt
cat $i/syscall &>> kernelstacks.txt
echo "" &>> kernelstacks.txt
echo -en "$i sched=" &>> kernelstacks.txt
cat $i/sched &>> kernelstacks.txt
echo "" &>> kernelstacks.txt
echo -en "$i status=" &>> kernelstacks.txt
cat $i/status &>> kernelstacks.txt
echo "" &>> kernelstacks.txt
done
Kernel Dumps
crash /usr/lib/debug/lib/modules/<kernel>/vmlinux /var/crash/<timestamp>/vmcore
Note that the <kernel> version should be the same as the one that was captured by kdump. To find out which kernel you are currently running, use the uname -r command.
To display the kernel message buffer, type the log command at the interactive prompt.
To display the kernel stack trace, type the bt command at the interactive prompt. You can use bt <pid> to display the backtrace of a single process.
To display status of processes in the system, type the ps command at the interactive prompt. You can use ps <pid> to display the status of a single process.
To display basic virtual memory information, type the vm command at the interactive prompt. You can use vm <pid> to display information on a single process.
To display information about open files, type the files command at the interactive prompt. You can use files <pid> to display files opened by only one selected process.
kernel object file: A vmlinux kernel object file, often referred to as the namelist in this document, which must have been built with the -g C flag so that it will contain the debug data required for symbolic debugging.
When using the fbt provider, it helps to run through the syscall once with all to see what the call stack is and then hone in.
Change root password
- Type e on the boot menu.
- Add rd.break enforcing=0 to the line that starts with linux (another option is systemd.debug-shell and hit Ctrl+Alt+F9)
- Continue booting: Ctrl+X
- Make the filesystem writable: mount -o remount,rw /sysroot
- Enter root filesystem: chroot /sysroot
- Change root password: passwd
- Continue booting: Ctrl+D
Fix non-working disk
- Type e on the boot menu.
- Add systemd.unit=emergency.target to the line that starts with linux
- Make all filesystems writeable: mount -o remount,rw /
- Try re-mount: mount -a
- Fix any errors in /etc/fstab
- Run systemctl daemon-reload
- Try re-mount: mount -a
- Continue booting: Ctrl+D
journald
Persist all logs
- Set Storage=persistent in /etc/systemd/journald.conf
- Run systemctl reload systemd-journald
- Logs in /var/log/journal
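To verify persistence, check journald disk usage and that logs from a previous boot are available:
$ journalctl --disk-usage
$ journalctl -b -1 | head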
Battery Status
Example:
$ sudo acpi -V | grep ^Battery
Battery 0: Unknown, 79%
Battery 0: design capacity 1886 mAh, last full capacity 1002 mAh = 53%
Battery 1: Charging, 46%, 01:01:20 until charged
Battery 1: design capacity 6166 mAh, last full capacity 5567 mAh = 90%
Administration
Create New Superuser
- Create a user with a home directory: adduser -m $user
- Set the password for the new user: passwd $user
- Add the user to the superuser wheel group: usermod -a -G wheel $user
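To verify the new user (assuming the wheel group is enabled in sudoers):
$ groups $user
$ sudo -l -U $user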
Basic Diagnostics
outputfile="linuxdiag_$(date +"%Y%m%d_%H%M%S").log"
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : uptime" | tee -a ${outputfile}
uptime &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : hostname" | tee -a ${outputfile}
hostname &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : w" | tee -a ${outputfile}
w &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : lscpu" | tee -a ${outputfile}
lscpu &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : dmesg" | tee -a ${outputfile}
(dmesg | tail -50) &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : df" | tee -a ${outputfile}
df -h &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : free" | tee -a ${outputfile}
free -m &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : ps memory" | tee -a ${outputfile}
ps -o pid,vsz,rss,cmd &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : vmstat" | tee -a ${outputfile}
vmstat 1 5 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : top all" | tee -a ${outputfile}
top -b -d 2 -n 2 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : top threads" | tee -a ${outputfile}
top -b -H -d 2 -n 2 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : pidstat" | tee -a ${outputfile}
pidstat -d -h --human -l -r -u -v -w 2 2 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : iostat" | tee -a ${outputfile}
iostat -xm 1 5 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : ss summary" | tee -a ${outputfile}
ss --summary &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : ss all" | tee -a ${outputfile}
ss -amponet &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : nstat" | tee -a ${outputfile}
nstat -asz &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : sar network" | tee -a ${outputfile}
sar -n DEV 1 5 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : sar tcp" | tee -a ${outputfile}
sar -n TCP,ETCP 1 5 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : lnstat" | tee -a ${outputfile}
lnstat -c 1 &>> ${outputfile}
echo "diag: $(date +"%Y%m%d %H%M%S %N %Z") : sysctl" | tee -a ${outputfile}
sysctl -a &>> ${outputfile}
echo "Wrote to ${outputfile}"
Sending a kernel patch
- Review the documentation on submitting patches
- Find the repository of your target subsystem in the MAINTAINERS file. For example, for perf, the repository is: SCM: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core
- Clone this repository. For example, for perf, from above: git clone git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
- Checkout the appropriate branch from the SCM line in the MAINTAINERS file. For example, for perf, from above: git checkout perf/core
- Make your changes and commit them with a prefix of the subsystem. For example, for perf: git commit -sam "perf: DESCRIPTION OF CHANGES"
- Subscribe to the mailing list in the MAINTAINERS file. For example, for perf, it's linux-perf-users@vger.kernel.org and subscription instructions may be found at http://vger.kernel.org/vger-lists.html#linux-perf-users
- Send the patch to the mailing list. For example, for perf: git send-email --from "First Last <email@example.com>" --to linux-perf-users@vger.kernel.org --suppress-cc all -1
- After some time, validate the email was successfully sent to the mailing list by reviewing the archives. For example, for perf, see https://lore.kernel.org/linux-perf-users/
Error Codes (errno.h)
An errno code is used throughout Linux to detail errors when calling functions. Names and numeric values for a particular instance of Linux may be listed with errno -l from the moreutils package. Additional definitions are available in asm-generic/errno-base.h, asm-generic/errno.h, and linux/errno.h.
Example list:
# errno -l | sort -n -k 2 | awk '{printf("%-16s %3s ", $1, $2); for (i=3;i<=NF;i++) printf(" %s", $i); printf("\n");}'
EPERM 1 Operation not permitted
ENOENT 2 No such file or directory
ESRCH 3 No such process
EINTR 4 Interrupted system call
EIO 5 Input/output error
ENXIO 6 No such device or address
E2BIG 7 Argument list too long
ENOEXEC 8 Exec format error
EBADF 9 Bad file descriptor
ECHILD 10 No child processes
EAGAIN 11 Resource temporarily unavailable
EWOULDBLOCK 11 Resource temporarily unavailable
ENOMEM 12 Cannot allocate memory
EACCES 13 Permission denied
EFAULT 14 Bad address
ENOTBLK 15 Block device required
EBUSY 16 Device or resource busy
EEXIST 17 File exists
EXDEV 18 Invalid cross-device link
ENODEV 19 No such device
ENOTDIR 20 Not a directory
EISDIR 21 Is a directory
EINVAL 22 Invalid argument
ENFILE 23 Too many open files in system
EMFILE 24 Too many open files
ENOTTY 25 Inappropriate ioctl for device
ETXTBSY 26 Text file busy
EFBIG 27 File too large
ENOSPC 28 No space left on device
ESPIPE 29 Illegal seek
EROFS 30 Read-only file system
EMLINK 31 Too many links
EPIPE 32 Broken pipe
EDOM 33 Numerical argument out of domain
ERANGE 34 Numerical result out of range
EDEADLK 35 Resource deadlock avoided
EDEADLOCK 35 Resource deadlock avoided
ENAMETOOLONG 36 File name too long
ENOLCK 37 No locks available
ENOSYS 38 Function not implemented
ENOTEMPTY 39 Directory not empty
ELOOP 40 Too many levels of symbolic links
ENOMSG 42 No message of desired type
EIDRM 43 Identifier removed
ECHRNG 44 Channel number out of range
EL2NSYNC 45 Level 2 not synchronized
EL3HLT 46 Level 3 halted
EL3RST 47 Level 3 reset
ELNRNG 48 Link number out of range
EUNATCH 49 Protocol driver not attached
ENOCSI 50 No CSI structure available
EL2HLT 51 Level 2 halted
EBADE 52 Invalid exchange
EBADR 53 Invalid request descriptor
EXFULL 54 Exchange full
ENOANO 55 No anode
EBADRQC 56 Invalid request code
EBADSLT 57 Invalid slot
EBFONT 59 Bad font file format
ENOSTR 60 Device not a stream
ENODATA 61 No data available
ETIME 62 Timer expired
ENOSR 63 Out of streams resources
ENONET 64 Machine is not on the network
ENOPKG 65 Package not installed
EREMOTE 66 Object is remote
ENOLINK 67 Link has been severed
EADV 68 Advertise error
ESRMNT 69 Srmount error
ECOMM 70 Communication error on send
EPROTO 71 Protocol error
EMULTIHOP 72 Multihop attempted
EDOTDOT 73 RFS specific error
EBADMSG 74 Bad message
EOVERFLOW 75 Value too large for defined data type
ENOTUNIQ 76 Name not unique on network
EBADFD 77 File descriptor in bad state
EREMCHG 78 Remote address changed
ELIBACC 79 Can not access a needed shared library
ELIBBAD 80 Accessing a corrupted shared library
ELIBSCN 81 .lib section in a.out corrupted
ELIBMAX 82 Attempting to link in too many shared libraries
ELIBEXEC 83 Cannot exec a shared library directly
EILSEQ 84 Invalid or incomplete multibyte or wide character
ERESTART 85 Interrupted system call should be restarted
ESTRPIPE 86 Streams pipe error
EUSERS 87 Too many users
ENOTSOCK 88 Socket operation on non-socket
EDESTADDRREQ 89 Destination address required
EMSGSIZE 90 Message too long
EPROTOTYPE 91 Protocol wrong type for socket
ENOPROTOOPT 92 Protocol not available
EPROTONOSUPPORT 93 Protocol not supported
ESOCKTNOSUPPORT 94 Socket type not supported
ENOTSUP 95 Operation not supported
EOPNOTSUPP 95 Operation not supported
EPFNOSUPPORT 96 Protocol family not supported
EAFNOSUPPORT 97 Address family not supported by protocol
EADDRINUSE 98 Address already in use
EADDRNOTAVAIL 99 Cannot assign requested address
ENETDOWN 100 Network is down
ENETUNREACH 101 Network is unreachable
ENETRESET 102 Network dropped connection on reset
ECONNABORTED 103 Software caused connection abort
ECONNRESET 104 Connection reset by peer
ENOBUFS 105 No buffer space available
EISCONN 106 Transport endpoint is already connected
ENOTCONN 107 Transport endpoint is not connected
ESHUTDOWN 108 Cannot send after transport endpoint shutdown
ETOOMANYREFS 109 Too many references: cannot splice
ETIMEDOUT 110 Connection timed out
ECONNREFUSED 111 Connection refused
EHOSTDOWN 112 Host is down
EHOSTUNREACH 113 No route to host
EALREADY 114 Operation already in progress
EINPROGRESS 115 Operation now in progress
ESTALE 116 Stale file handle
EUCLEAN 117 Structure needs cleaning
ENOTNAM 118 Not a XENIX named type file
ENAVAIL 119 No XENIX semaphores available
EISNAM 120 Is a named type file
EREMOTEIO 121 Remote I/O error
EDQUOT 122 Disk quota exceeded
ENOMEDIUM 123 No medium found
EMEDIUMTYPE 124 Wrong medium type
ECANCELED 125 Operation canceled
ENOKEY 126 Required key not available
EKEYEXPIRED 127 Key has expired
EKEYREVOKED 128 Key has been revoked
EKEYREJECTED 129 Key was rejected by service
EOWNERDEAD 130 Owner died
ENOTRECOVERABLE 131 State not recoverable
ERFKILL 132 Operation not possible due to RF-kill
EHWPOISON 133 Memory page has hardware error
Sysrq Keys
Check if sysrq enabled
Show if sysrq is enabled (1):
$ sysctl kernel.sysrq
kernel.sysrq = 1
Enable sysrq
Enable:
- Method 1 (temporary): sysctl -w kernel.sysrq=1
- Method 2 (permanent):
  - Add kernel.sysrq=1 to /etc/sysctl.conf
  - Apply with sysctl -p
sysrq characters
Commonly used characters:
- f: Run the OOM Killer. This will kill the process using the most RAM (even if it's not using much).
- r: Take control of keyboard from X.
- e: Send SIGTERM to all processes. Wait for graceful termination.
- i: Send SIGKILL to all processes for forceful termination.
- s: Sync disks.
- u: Remount all filesystems as read-only.
- b: Reboot.
- g: Switch to the kernel console. Otherwise, switch to a console with, e.g. Ctrl+Alt+F3
- l: Show backtrace of all CPUs.
- 0-9: Change the kernel log level.
- d: Display kernel locks.
- m: Show memory information.
- t: Show a list of all processes.
- w: Show a list of blocked processes.
- c: Perform a kernel crash.
A "controlled" reboot is often done with reisub
Execute sysrq
Execute:
- Method 1 (using keyboard): Ctrl + Alt + SysRq (usually PrintScreen) + $CHARACTER
  - All of these keys must be held down at the same time and then released
  - On some keyboards, this only works with the right-side Ctrl/Alt keys
  - On some keyboards, a function (Fn) key must be held for PrintScreen
- Method 2 (as root): echo $CHARACTER > /proc/sysrq-trigger
- Method 3 (with sudo): echo $CHARACTER | sudo tee /proc/sysrq-trigger
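For example, to show memory information in the kernel log and then view it:
$ echo m | sudo tee /proc/sysrq-trigger
$ sudo dmesg | tail -50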