Troubleshooting Operating Systems

Debug Symbols

Some applications use native libraries (e.g. JNI; .so, .dll, etc.) to perform functions in native code (e.g. C/C++) rather than in Java code. This may involve allocating native memory outside of the Java heap (e.g. malloc, mmap). These libraries must free that memory themselves, and application errors can cause native memory leaks, which can ultimately cause crashes, paging, etc. Such problems are among the most difficult to diagnose, and they are made even more difficult by the fact that native libraries are often "stripped" of symbol information.

Symbols are artifacts produced by the compiler and linker to describe the mapping between executable code and source code. For example, a library may have a function in the source code named "foo" whose code resides in the binary in the address range 0x10000000 - 0x10001000. If foo is executing, the instruction pointer (program counter) is in this address range; if foo has called another function, foo's return address is on the call stack. In both cases, a debugger or leak-tracker only has access to raw addresses (e.g. 0x100000a1). If there is nothing to tell it the mapping between foo and the code address ranges, then you'll just get a stack full of numbers, which is rarely useful.
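
As a rough illustration of what symbols provide, the following minimal C sketch resolves a raw address to a function name using a small hand-written table. The Symbol type, the symbolize function, the "bar" entry, and the treatment of each range as half-open are assumptions made for this example; real debuggers read this mapping from the symbol data produced by the compiler and linker.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a hypothetical symbol table: a function name and the
   half-open address range [start, end) that its code occupies. */
typedef struct {
  const char *name;
  uintptr_t start;
  uintptr_t end;
} Symbol;

/* Illustrative table only: "foo" uses the range from the text above,
   "bar" is made up for this example. */
static const Symbol symbols[] = {
  { "foo", 0x10000000u, 0x10001000u },
  { "bar", 0x10001000u, 0x10001800u },
};

/* Return the name of the function whose range contains addr, or NULL
   when no symbol covers it. */
const char *symbolize(uintptr_t addr) {
  size_t i;
  for (i = 0; i < sizeof(symbols) / sizeof(symbols[0]); i++) {
    if (addr >= symbols[i].start && addr < symbols[i].end) {
      return symbols[i].name;
    }
  }
  return NULL;
}
```

With this table, symbolize(0x100000a1) returns "foo". An address outside every range returns NULL, which is effectively all that a stripped library can offer.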

Historically, symbols have been stripped from executables for the following reasons: 1) to reduce the size of libraries, 2) because performance could suffer, and 3) to complicate reverse-engineering efforts. First, it's important to note that none of these three reasons applies to privately held symbol files. With most modern compilers, you can produce the symbol files and save them off. If there is a problem, you can download the core dump, find the matching symbols locally, and off you go.

Therefore, the first best practice is to always generate and save off symbols, even if you don't ship them with your binaries. When debugging, you should match the symbol files with the exact build that produced the problem. This also means that you need to save the symbols for every build, including one-off or debug builds that customers may be running, and track these symbols with some unique identifier to map to the running build.

The second best practice is to consider shipping symbol files with your binaries if your requirements allow it. Some answers to the objections above include: 1) although the distribution will be larger, symbols greatly reduce the time to resolve complex problems, 2) most modern compilers can create fully optimized code with symbols [A], and 3) reverse engineering requires insider or hacker access to the binaries and deep product knowledge; also, Java code is just as easy to reverse engineer as native code with symbols, so this exposure is already an accepted part of modern programming and debugging. Benefits of shipping symbols include: 1) not having to store, manage, and query a symbol store or database each time you need symbols, and 2) allowing "on site" debugging without having to ship large core dumps, since running a simple back trace or post-processing program with symbols on the machine where the problem happened can often produce the desired information immediately.

As always, your mileage may vary and you should fully test such a change, including a performance test.

Eye Catcher

Eye catchers are generally used to aid in tracking native memory usage or corruption. An eye catcher, as its name suggests, is some sequence of bytes that has a low probability of randomly appearing in memory. If you see one of your eye catchers, it's likely that you've found one of your allocations.

For example, below is a simple C program which leaks 10 MyStruct instances into the native heap with the eye catcher 0xDEADFAD0 and then waits indefinitely so that a core dump may be produced:

#include <stdio.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

#define EYECATCHER_MYSTRUCT 0xDEADFAD0

typedef struct {
  unsigned int eyeCatcher; /* unsigned: 0xDEADFAD0 does not fit in a signed 32-bit int */
  int myData;
} MyStruct;

int main(int argc, char** argv) {
  sigset_t sigmask;
  MyStruct *p;
  int i;

  /* Intentionally leak 10 structs, each tagged with the eye catcher */
  for (i = 0; i < 10; i++) {
    p = (MyStruct*)malloc(sizeof(MyStruct));
    if (p == NULL) {
      return 1;
    }
    printf("Alloced struct @ %p\n", (void*)p);
    p->eyeCatcher = EYECATCHER_MYSTRUCT;
    p->myData = 123*i;
  }

  printf("Hello World. Waiting indefinitely...\n");

  /* Wait until a signal is delivered; SIGCHLD stays blocked while waiting */
  sigemptyset(&sigmask);
  sigaddset(&sigmask,SIGCHLD);
  sigsuspend(&sigmask);
  return 0;
}

Now, we can find all of these structures in a hexdump. In this example, integers are stored in little endian format, so search for D0FAADDE instead of DEADFAD0:

$ hexdump -C core.680 | grep "d0 fa ad de"
00002cb0  00 00 00 00 d0 fa ad de  00 00 00 00 00 00 00 00  |................|
00002cd0  00 00 00 00 d0 fa ad de  7b 00 00 00 00 00 00 00  |........{.......|
00002cf0  00 00 00 00 d0 fa ad de  f6 00 00 00 00 00 00 00  |................|
00002d10  00 00 00 00 d0 fa ad de  71 01 00 00 00 00 00 00  |........q.......|
00002d30  00 00 00 00 d0 fa ad de  ec 01 00 00 00 00 00 00  |................|
00002d50  00 00 00 00 d0 fa ad de  67 02 00 00 00 00 00 00  |........g.......|
00002d70  00 00 00 00 d0 fa ad de  e2 02 00 00 00 00 00 00  |................|
00002d90  00 00 00 00 d0 fa ad de  5d 03 00 00 00 00 00 00  |........].......|
00002db0  00 00 00 00 d0 fa ad de  d8 03 00 00 00 00 00 00  |................|
00002dd0  00 00 00 00 d0 fa ad de  53 04 00 00 00 00 00 00  |........S.......|

We can see the ten allocations there. Note: here each eye catcher happened to fall entirely within one 8-byte column of a hexdump line. An eye catcher can also straddle the gap between the two 8-byte columns or wrap onto the next line, in which case this grep would miss it. The most reliable way to search for eye catchers is through some type of automation such as gdb extensions.
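
As one possible form of such automation, here is a minimal C sketch that scans a memory buffer for the integer eye catcher at every byte offset, so matches that straddle hexdump columns or lines are not missed. The function name count_eyecatchers is hypothetical, and the sketch assumes the dump contents have already been loaded into memory by the caller and that the analysis machine uses the same byte order as the machine that produced the dump.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define EYECATCHER_MYSTRUCT 0xDEADFAD0

/* Print the offset of every occurrence of the eye catcher in buf and
   return how many were found. Every byte offset is checked, not only
   aligned ones. */
size_t count_eyecatchers(const unsigned char *buf, size_t len) {
  const unsigned int eyeCatcher = EYECATCHER_MYSTRUCT;
  unsigned char pattern[sizeof(eyeCatcher)];
  size_t i, count = 0;

  /* Copy the integer's in-memory bytes so the pattern follows this
     machine's byte order (d0 fa ad de on a little-endian machine). */
  memcpy(pattern, &eyeCatcher, sizeof(pattern));

  for (i = 0; i + sizeof(pattern) <= len; i++) {
    if (memcmp(buf + i, pattern, sizeof(pattern)) == 0) {
      printf("Eye catcher at offset 0x%zx\n", i);
      count++;
    }
  }
  return count;
}
```

Because the pattern is built from the integer itself rather than typed in by hand, there is no need to reverse the bytes manually as in the grep above.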

Strings are often preferable to integers. A string is stored byte by byte, so it avoids the problem of big- and little-endianness, and strings are normally easier to spot in a dump:

#define EYECATCHER_MYSTRUCT2 "DEADFAD0"

typedef struct {
  char eyeCatcher[9]; // Add 1 to the length of the eye catcher, because strcpy will copy in the null terminator
  int myData;
} MyStruct2;

...

for (i = 0; i < 10; i++) {
  p2 = (MyStruct2*)malloc(sizeof(MyStruct2));
  printf("Alloced struct @ %p\n", (void*)p2);
  strcpy(p2->eyeCatcher, EYECATCHER_MYSTRUCT2);
  p2->myData = 123*i;
}

...

$ hexdump -C core.6940 | grep DEADFAD0 | tail -10
00002df0  00 00 00 00 44 45 41 44  46 41 44 30 00 00 00 00  |....DEADFAD0....|
00002e10  00 00 00 00 44 45 41 44  46 41 44 30 7b 00 00 00  |....DEADFAD0{...|
00002e30  00 00 00 00 44 45 41 44  46 41 44 30 f6 00 00 00  |....DEADFAD0....|
00002e50  00 00 00 00 44 45 41 44  46 41 44 30 71 01 00 00  |....DEADFAD0q...|
00002e70  00 00 00 00 44 45 41 44  46 41 44 30 ec 01 00 00  |....DEADFAD0....|
00002e90  00 00 00 00 44 45 41 44  46 41 44 30 67 02 00 00  |....DEADFAD0g...|
00002eb0  00 00 00 00 44 45 41 44  46 41 44 30 e2 02 00 00  |....DEADFAD0....|
00002ed0  00 00 00 00 44 45 41 44  46 41 44 30 5d 03 00 00  |....DEADFAD0]...|
00002ef0  00 00 00 00 44 45 41 44  46 41 44 30 d8 03 00 00  |....DEADFAD0....|
00002f10  00 00 00 00 44 45 41 44  46 41 44 30 53 04 00 00  |....DEADFAD0S...|

Here are some other considerations:

  1. If you're writing native code that is making dynamic allocations, consider always using eye catchers. Yes, they have a small overhead. It's generally worth it. The evidence for this recommendation is that most large, native products use them.
  2. You can put an "int size" field after the eye catcher which stores the size of the allocation (sizeof(struct)), which makes it easier to quickly tell how much storage your allocations are using.
  3. You can wrap all allocations (and deallocations) in common routines so that this is more standard (and foolproof) in your code. This is usually done by having an eye catcher struct and wrapping malloc. In the wrapped malloc, add the sizeof(eyecatcherstruct) to the bytes requested, then put the eye catcher struct at the top of the allocation, and then return a pointer to the first byte after sizeof(eyecatcherstruct) to the user.
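
The wrapping approach in item 3, combined with the size field from item 2, can be sketched roughly as follows. The names AllocHeader, my_malloc, my_free, and my_alloc_size are hypothetical. The sketch also assumes a C11 compiler (for max_align_t) and uses size_t rather than int for the size field.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define EYECATCHER_ALLOC "DEADFAD0"

/* Header placed at the top of every allocation. The union with
   max_align_t pads the header so that the user pointer that follows it
   keeps the alignment malloc guarantees. */
typedef union {
  struct {
    char eyeCatcher[8]; /* copied with memcpy, no null terminator stored */
    size_t size;        /* number of bytes the caller requested */
  } info;
  max_align_t align;
} AllocHeader;

/* Allocate size bytes plus the header; return the first byte after it. */
void *my_malloc(size_t size) {
  AllocHeader *h;
  if (size > SIZE_MAX - sizeof(AllocHeader)) {
    return NULL; /* the total size would overflow */
  }
  h = (AllocHeader *)malloc(sizeof(AllocHeader) + size);
  if (h == NULL) {
    return NULL;
  }
  memcpy(h->info.eyeCatcher, EYECATCHER_ALLOC, sizeof(h->info.eyeCatcher));
  h->info.size = size;
  return h + 1;
}

/* Step back to the header and release the whole allocation. */
void my_free(void *p) {
  if (p != NULL) {
    free((AllocHeader *)p - 1);
  }
}

/* Report the size recorded for an allocation made by my_malloc. */
size_t my_alloc_size(const void *p) {
  return ((const AllocHeader *)p - 1)->info.size;
}
```

Callers use my_malloc and my_free exactly as they would malloc and free. In a dump, each live allocation then begins with the string DEADFAD0 followed by its requested size, at the cost of one header per allocation.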