Troubleshooting Operating Systems
Sub-chapters
- Troubleshooting Linux
- Troubleshooting AIX
- Troubleshooting z/OS
- Troubleshooting IBM i
- Troubleshooting Windows
- Troubleshooting macOS
- Troubleshooting Solaris
- Troubleshooting HP-UX
Debug Symbols
Some applications use native libraries (e.g. JNI; .so, .dll, etc.) to perform functions in native code (e.g. C/C++) rather than in Java code. This may involve allocating native memory outside of the Java heap (e.g. malloc, mmap). These libraries must manage and free that memory themselves, and application errors can cause native memory leaks, which can ultimately cause crashes, paging, etc. Such problems are among the most difficult to diagnose, and they are made even more difficult by the fact that native libraries are often "stripped" of symbol information.
Symbols are artifacts produced by the compiler and linker to describe the mapping between executable code and source code. For example, a library may have a function in the source code named "foo" and in the binary, this function's code resides in the address range 0x10000000 - 0x10001000. This function may be executing, in which case the instruction pointer (program counter) is in this address range, or if foo calls another function, foo's return address will be on the call stack. In both cases, a debugger or leak-tracker only has access to raw addresses (e.g. 0x100000a1). If there is nothing to tell it the mapping between foo and the code address ranges, then you'll just get a stack full of numbers, which is rarely useful.
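The mapping described above can be pictured as a table lookup. The following is a minimal sketch, not how any particular debugger stores symbols: the Symbol struct, the resolve and describe functions, and the "bar" entry are hypothetical, and the end of each range is assumed to be exclusive. Only the foo range and the sample address 0x100000a1 come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified symbol table entry: one function's address range. */
typedef struct {
    uintptr_t start;   /* first address of the function's code */
    uintptr_t end;     /* one past the last address (assumed exclusive) */
    const char *name;  /* function name from the source code */
} Symbol;

/* Illustrative table: the foo range is from the text, bar is made up. */
static const Symbol symbols[] = {
    { 0x10000000u, 0x10001000u, "foo" },
    { 0x10001000u, 0x10001800u, "bar" },
};

/* Return the symbol whose range contains addr, or NULL if none does. */
const Symbol *resolve(uintptr_t addr) {
    size_t i;
    for (i = 0; i < sizeof(symbols) / sizeof(symbols[0]); i++) {
        if (addr >= symbols[i].start && addr < symbols[i].end) {
            return &symbols[i];
        }
    }
    return NULL;
}

/* Write "name+0xoffset" if a symbol covers addr, else just the raw address. */
void describe(uintptr_t addr, char *out, size_t outlen) {
    const Symbol *s = resolve(addr);
    assert(out != NULL && outlen > 0);
    if (s != NULL) {
        snprintf(out, outlen, "%s+0x%lx", s->name,
                 (unsigned long)(addr - s->start));
    } else {
        snprintf(out, outlen, "0x%lx", (unsigned long)addr);
    }
}
```

With the table present, describe(0x100000a1, ...) yields foo+0xa1. With a stripped binary there is no table, so every address falls through to the raw-number branch.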
Historically, symbols have been stripped from executables for the following reasons: 1) to reduce the size of libraries, 2) because performance could suffer, and 3) to complicate reverse-engineering efforts. It's important to note that none of these three reasons applies to privately held symbol files. With most modern compilers, you can produce the symbol files and save them off. If there is a problem, you can download the core dump, find the matching symbols locally, and off you go.
Therefore, the first best practice is to always generate and save off symbols, even if you don't ship them with your binaries. When debugging, you should match the symbol files with the exact build that produced the problem. This also means that you need to save the symbols for every build, including one-off or debug builds that customers may be running, and track these symbols with some unique identifier to map to the running build.
The second best practice is to consider shipping symbol files with your binaries if your requirements allow it. Some answers to the objections above include: 1) although the size of the distribution will be larger, this greatly reduces the time to resolve complex problems, 2) most modern compilers can create fully optimized code with symbols [A], and 3) reverse engineering requires insider or hacker access to the binaries and deep product knowledge; also, Java code is just as easy to reverse engineer as native code with symbols, so this is an aspect of modern programming and debugging. Benefits of shipping symbols include: 1) not having to store, manage, and query a symbol store or database each time you need symbols, and 2) allowing "on site" debugging without having to ship large core dumps, since running a simple back trace or post-processing program with symbols, on the same machine where the problem happened, can often produce the desired information immediately.
As always, your mileage may vary and you should fully test such a change, including a performance test.
Eye Catcher
Eye-catchers are generally used to aid in tracking native memory usage or corruption. An eye-catcher, as its name suggests, is some sequence of bytes that has a low probability of randomly appearing in memory. If you see one of your eye-catchers, it's likely that you've found one of your allocations.
For example, below is a simple C program which leaks 10 MyStruct instances into the native heap with the eye catcher 0xDEADFAD0 and then waits indefinitely so that a coredump may be produced:
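#include <stdio.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

#define EYECATCHER_MYSTRUCT 0xDEADFAD0

typedef struct {
    unsigned int eyeCatcher; /* unsigned: 0xDEADFAD0 does not fit in a signed 32-bit int */
    int myData;
} MyStruct;

int main(int argc, char** argv) {
    sigset_t sigmask;
    MyStruct *p;
    int i;

    /* Leak ten tagged structs on purpose: p is overwritten on each iteration. */
    for (i = 0; i < 10; i++) {
        p = (MyStruct*)malloc(sizeof(MyStruct));
        if (p == NULL) {
            return 1;
        }
        printf("Alloced struct @ %p\n", (void*)p);
        p->eyeCatcher = EYECATCHER_MYSTRUCT;
        p->myData = 123*i;
    }
    printf("Hello World. Waiting indefinitely...\n");

    /* Suspend until a signal arrives (SIGCHLD is blocked meanwhile) so a core dump can be taken. */
    sigemptyset(&sigmask);
    sigaddset(&sigmask,SIGCHLD);
    sigsuspend(&sigmask);
    return 0;
}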
Now, we can find all of these structures in a hexdump. In this example, integers are stored in little endian format, so search for D0FAADDE instead of DEADFAD0:
$ hexdump -C core.680 | grep "d0 fa ad de"
00002cb0 00 00 00 00 d0 fa ad de 00 00 00 00 00 00 00 00 |................|
00002cd0 00 00 00 00 d0 fa ad de 7b 00 00 00 00 00 00 00 |........{.......|
00002cf0 00 00 00 00 d0 fa ad de f6 00 00 00 00 00 00 00 |................|
00002d10 00 00 00 00 d0 fa ad de 71 01 00 00 00 00 00 00 |........q.......|
00002d30 00 00 00 00 d0 fa ad de ec 01 00 00 00 00 00 00 |................|
00002d50 00 00 00 00 d0 fa ad de 67 02 00 00 00 00 00 00 |........g.......|
00002d70 00 00 00 00 d0 fa ad de e2 02 00 00 00 00 00 00 |................|
00002d90 00 00 00 00 d0 fa ad de 5d 03 00 00 00 00 00 00 |........].......|
00002db0 00 00 00 00 d0 fa ad de d8 03 00 00 00 00 00 00 |................|
00002dd0 00 00 00 00 d0 fa ad de 53 04 00 00 00 00 00 00 |........S.......|
We can see the ten allocations there. Note: in this dump, every eye catcher happened to fall within the first eight bytes of a hexdump line. An eye catcher can also straddle two output lines or the gap that hexdump -C prints after the eighth byte, in which case this grep pattern would miss it. The best way to search for eye catchers is through some type of automation such as gdb extensions.
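As one illustration of such automation, the sketch below scans a block of memory one byte at a time, so a match is found at any alignment and never depends on how a dump happens to be laid out on screen. The function name find_eyecatcher and its parameters are hypothetical; reading a core file into the buffer is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count occurrences of a byte pattern in buf, testing every byte offset so
 * that matches are found regardless of alignment. If offsets is non-NULL,
 * the positions of the first max_offsets matches are stored in it.
 */
size_t find_eyecatcher(const unsigned char *buf, size_t len,
                       const unsigned char *pattern, size_t plen,
                       size_t *offsets, size_t max_offsets) {
    size_t count = 0;
    size_t i;

    assert(buf != NULL && pattern != NULL);
    if (plen == 0 || len < plen) {
        return 0;
    }
    for (i = 0; i + plen <= len; i++) {
        if (memcmp(buf + i, pattern, plen) == 0) {
            if (offsets != NULL && count < max_offsets) {
                offsets[count] = i;
            }
            count++;
        }
    }
    return count;
}
```

For the integer eye catcher above on a little-endian machine, the pattern would be the four bytes d0 fa ad de. Copying the integer into a byte array with memcpy on the machine that produced the dump gives the bytes in the right order without hard-coding the endianness.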
Strings are often preferable to integers. This solves the problem of big- and little-endianness and it's normally easier to spot these strings:
#define EYECATCHER_MYSTRUCT2 "DEADFAD0"

typedef struct {
    char eyeCatcher[9]; /* 8 characters plus 1 for the null terminator that strcpy copies */
    int myData;
} MyStruct2;
...
for (i = 0; i < 10; i++) {
    p2 = (MyStruct2*)malloc(sizeof(MyStruct2));
    printf("Alloced struct @ %p\n", (void*)p2);
    strcpy(p2->eyeCatcher, EYECATCHER_MYSTRUCT2);
    p2->myData = 123*i;
}
...
$ hexdump -C core.6940 | grep DEADFAD0 | tail -10
00002df0 00 00 00 00 44 45 41 44 46 41 44 30 00 00 00 00 |....DEADFAD0....|
00002e10 00 00 00 00 44 45 41 44 46 41 44 30 7b 00 00 00 |....DEADFAD0{...|
00002e30 00 00 00 00 44 45 41 44 46 41 44 30 f6 00 00 00 |....DEADFAD0....|
00002e50 00 00 00 00 44 45 41 44 46 41 44 30 71 01 00 00 |....DEADFAD0q...|
00002e70 00 00 00 00 44 45 41 44 46 41 44 30 ec 01 00 00 |....DEADFAD0....|
00002e90 00 00 00 00 44 45 41 44 46 41 44 30 67 02 00 00 |....DEADFAD0g...|
00002eb0 00 00 00 00 44 45 41 44 46 41 44 30 e2 02 00 00 |....DEADFAD0....|
00002ed0 00 00 00 00 44 45 41 44 46 41 44 30 5d 03 00 00 |....DEADFAD0]...|
00002ef0 00 00 00 00 44 45 41 44 46 41 44 30 d8 03 00 00 |....DEADFAD0....|
00002f10 00 00 00 00 44 45 41 44 46 41 44 30 53 04 00 00 |....DEADFAD0S...|
Here are some other considerations:
- If you're writing native code that is making dynamic allocations, consider always using eye catchers. Yes, they have a small overhead. It's generally worth it. The evidence for this recommendation is that most large, native products use them.
- You can put an "int size" field after the eye catcher which stores the size of the allocation (sizeof(struct)), which makes it easier to quickly tell how much storage your allocations are using.
- You can wrap all allocations (and deallocations) in common routines so that this is more standard (and foolproof) in your code. This is usually done by having an eye catcher struct and wrapping malloc. In the wrapped malloc, add the sizeof(eyecatcherstruct) to the bytes requested, then put the eye catcher struct at the top of the allocation, and then return a pointer to the first byte after sizeof(eyecatcherstruct) to the user.
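The last two points can be sketched together as follows. This is a minimal illustration under stated assumptions, not a production allocator: the names AllocHeader, my_malloc, my_header, and my_free are hypothetical, the header stores the eye catcher as 8 characters without a null terminator, and a C11 compiler is assumed for max_align_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define EYECATCHER_TEXT "DEADFAD0"

/* Hypothetical header placed at the top of every wrapped allocation. */
typedef union {
    struct {
        char eyeCatcher[8];  /* 8 characters, stored without a null terminator */
        size_t size;         /* number of bytes the caller requested */
    } info;
    max_align_t align;       /* pads the header so the user pointer stays aligned */
} AllocHeader;

/* Allocate size bytes plus a header and return a pointer just past the header. */
void *my_malloc(size_t size) {
    AllocHeader *h;

    if (size > SIZE_MAX - sizeof(AllocHeader)) {
        return NULL;  /* the total size would overflow */
    }
    h = malloc(sizeof(AllocHeader) + size);
    if (h == NULL) {
        return NULL;
    }
    memcpy(h->info.eyeCatcher, EYECATCHER_TEXT, sizeof(h->info.eyeCatcher));
    h->info.size = size;
    return h + 1;  /* first byte after the header */
}

/* Recover the header from a pointer previously returned by my_malloc. */
AllocHeader *my_header(void *p) {
    return (AllocHeader *)p - 1;
}

/* Free an allocation made by my_malloc, checking the eye catcher first. */
void my_free(void *p) {
    AllocHeader *h;

    if (p == NULL) {
        return;
    }
    h = my_header(p);
    /* A mismatch suggests corruption or a pointer that my_malloc did not return. */
    assert(memcmp(h->info.eyeCatcher, EYECATCHER_TEXT,
                  sizeof(h->info.eyeCatcher)) == 0);
    free(h);
}
```

Because the size field sits directly after the eye catcher, a tool that finds the eye catcher in a dump can read the next field to see how many bytes that allocation requested. The check in my_free is compiled out when NDEBUG is defined, so a release build that should keep it would need an explicit test instead of assert.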