Operating Systems

Additionally, see the chapter for your particular operating system.

Central Processing Unit (CPU)

A processor is an integrated circuit (also known as a socket or die) with one or more central processing unit (CPU) cores. A CPU core executes program instructions such as arithmetic, logic, and input/output operations. CPU utilization is the percentage of time that programs or the operating system execute as opposed to idle time. A CPU core may support simultaneous multithreading (also known as hardware threads or hyperthreads) which appears to the operating system as additional logical CPU cores. Be aware that simple CPU utilization numbers may be unintuitive in the context of advanced processor features. Examples:

  • Intel:

    The current implementation of [CPU utilization] [...] shows the portion of time slots that the CPU scheduler in the OS could assign to execution of running programs or the OS itself; the rest of the time is idle [...] The advances in computer architecture made this algorithm an unreliable metric because of introduction of multi core and multi CPU systems, multi-level caches, non-uniform memory, simultaneous multithreading (SMT), pipelining, out-of-order execution, etc.

    A prominent example is the non-linear CPU utilization on processors with Intel® Hyper-Threading Technology (Intel® HT Technology). Intel® HT technology is a great performance feature that can boost performance by up to 30%. However, HT-unaware end users get easily confused by the reported CPU utilization: Consider an application that runs a single thread on each physical core. Then, the reported CPU utilization is 50% even though the application can use up to 70%-100% of the execution units. (https://software.intel.com/en-us/articles/intel-performance-counter-monitor)

  • AIX:

    Although it might be somewhat counterintuitive, simultaneous multithreading performs best when the performance of the cache is at its worst.

  • IBM Senior Technical Staff:

    Use care when partitioning [CPU cores] [...] it's important to recognize that [CPU core] partitioning doesn't create more resources, it simply enables you to divide and allocate the [CPU core] capacity [...] At the end of the day, there still needs to be adequate underlying physical CPU capacity to meet response time and throughput requirements when partitioning [CPU cores]. Otherwise, poor performance will result.

It is not necessarily problematic for a machine to have many more program threads than processor cores. This is common with Java and WAS processes that come with many different threads and thread pools by default that may not be used often. Even if the main application thread pool (or the sum of these across processes) exceeds the number of processor cores, this is only concerning if the average unit of work uses the processor heavily. For example, if threads are mostly I/O bound to a database, then it may not be a problem to have many more threads than cores. There are potential costs to threads even if they are usually sleeping, but these may be acceptable. The danger is when the concurrent workload on available threads exceeds processor capacity. There are cases where thread pools are excessively large but there has not been a condition where they have all filled up (whether due to workload or a front-end bottleneck). It is very important that stress tests saturate all commonly used thread pools to observe worst case behavior.

Depending on the environment, number of processes, redundancy, continuous availability and/or high availability requirements, the threshold for %CPU utilization varies. For high availability and continuous availability environments, the threshold can be as low as 50% CPU utilization. For non-critical applications, the threshold could be as high as 95%. Analyze both the non-functional requirements and service level agreements of the application in order to determine appropriate thresholds to indicate a potential health issue.

It is common for some modern processors (including server class) and operating systems to enable processor scaling by default. The purpose of processor scaling is primarily to reduce power consumption. Processor scaling dynamically changes the frequency of the processor(s), and therefore may impact performance. In general, processor scaling should not kick in during periods of high use; however, it does introduce an extra performance variable. Weigh the energy saving benefits versus disabling processor scaling and simply running the processors at maximum speed at all times (usually done in the BIOS).

Test affinitizing processes to processor sets (operating system specific configuration). In general, affinitize within processor boundaries. Also, start each JVM with -XgcthreadsN (IBM Java) or -XX:ParallelGCThreads=N (Oracle/HotSpot Java) where N equals the number of processor core threads in the processor set.

It is sometimes worth understanding the physical architecture of the central processing units (CPUs). Clock speed and number of cores/hyperthreading are the most obviously important metrics, but CPU memory locality, bus speeds, and L2/L3 cache sizes are sometimes worth considering. One strategy for deciding on the number of JVMs is to create one JVM per processor chip (i.e. socket) and bind it to that chip.

It's common for operating systems to dedicate some subset of CPU cores for interrupt processing and this may distort other workloads running on those cores.

Different types of CPU issues (Old Java Diagnostic Guide):

  • Inefficient or looping code is running. A specific thread or a group of threads is taking all the CPU time.
  • Points of contention or delay exist. CPU usage is spread across most threads, but overall CPU usage is low.
  • A deadlock is present. No CPU is being used.

How many CPUs per node?

IBM Senior Technical Staff:

As a starting point, I plan on having at least one CPU [core] per application server JVM; that way I have likely minimized the number of times that a context switch will occur -- at least as far as using up a time slice is concerned (although, as mentioned, there are other factors that can result in a context switch). Unless you run all your servers at 100% CPU, more than likely there are CPU cycles available as application requests arrive at an application server, which in turn are translated into requests for operating system resources. Therefore, we can probably run more application servers than CPUs.

Arriving at the precise number that you can run in your environment, however, brings us back to it depends. This is because that number will in fact depend on the load, application, throughput, and response time requirements, and so on, and the only way to determine a precise number is to run tests in your environment.

How many application processes per node?

IBM Senior Technical Staff:

In general one should tune a single instance of an application server for throughput and performance, then incrementally add [processes] testing performance and throughput as each [process] is added. By proceeding in this manner one can determine what number of [processes] provide the optimal throughput and performance for their environment. In general once CPU utilization reaches 75% little, if any, improvement in throughput will be realized by adding additional [processes].

Registers

CPUs execute instructions (e.g. add, subtract, etc.) from a computer program, also known as an application, executable, binary, shared library, etc. CPUs have a fixed number of registers used to perform these instructions. These registers have variable contents updated by programs as they execute a set of instructions.

Assembly Language

Assembly language (asm) is a low-level programming language with CPU instructions (and other things like constants and comments). It is compiled by an assembler into machine code which is executed. In the following example, the first instruction of the main function is to push a register onto the stack, the second instruction is to copy (mov) one register into another, and so on:

0000000000401126 <main>:
  401126:       55                      push   %rbp
  401127:       48 89 e5                mov    %rsp,%rbp
  40112a:       48 83 ec 10             sub    $0x10,%rsp
  40112e:       89 7d fc                mov    %edi,-0x4(%rbp)
  401131:       48 89 75 f0             mov    %rsi,-0x10(%rbp)
  401135:       bf 10 20 40 00          mov    $0x402010,%edi
  40113a:       e8 f1 fe ff ff          call   401030 <puts@plt>
  40113f:       b8 00 00 00 00          mov    $0x0,%eax
  401144:       c9                      leave  
  401145:       c3                      ret    

Assembly Syntax

The most common forms of syntax for assembly are AT&T and Intel syntax. There are confusing differences between the two. For example, in AT&T syntax, the source of a mov instruction comes first, followed by the destination:

mov %esp, %ebp

Whereas, in Intel syntax, the destination of the mov instruction comes first, followed by the source:

mov ebp, esp

Instruction Pointer

CPUs may have a register that points to the address of the current execution context of a program. This register is called the instruction pointer (IP), program counter (PC), extended instruction pointer (EIP), instruction address register (IAR), relative instruction pointer (RIP), or other names. Depending on the phase in the CPU's execution, this register may be pointing at the currently executing instruction, or one of the instructions that will be subsequently executed.

Program Stack

A program is usually made of functions which are logical groups of instructions with inputs and outputs. In the following example, the program starts in the main function and calls the getCubeVolume function. The getCubeVolume function calculates the volume and returns it to the main function which then prints the calculation along with some text:

#include <stdio.h>

int getCubeVolume(int length) {
  return length * length * length;
}

int main(int argc, char **argv) {
  printf("The volume of a 3x3x3 cube is: %d\n", getCubeVolume(3));
  return 0;
}

When getCubeVolume is ready to return its result, it needs to know how to go back to the main function at the point where getCubeVolume was called. A program stack is used to manage this relationship of function executions. A stack is a data structure in computer science that has push and pop operations. Pushing something onto a stack puts an item on top of all existing items in the stack. Popping something off of a stack removes the top item in the stack.

A real world example is a stack of dishes. As dishes are ready to be washed, they could be pushed on top of a stack of dishes, and a dishwasher could iteratively pop dishes off the top of the stack to wash them. The order in which the dishes are washed is not necessarily the order in which they were used. It might take a while for the dishwasher to get to the bottom plate as long as new dirty plates are constantly added. In this analogy, the dishwasher is the CPU and this is why the main function is always in the stack as long as the program is executing. Only after all program instructions have completed will main be able to complete.

Similarly, a program stack is made of stack frames. Each stack frame represents an executing program function. In the above example, if we paused the program during the getCubeVolume call, the program stack would be made of two frames: the main function would be the stack frame at the bottom of the stack, and the getCubeVolume function would be the stack frame at the top of the stack.

Programs execute in a logical structure called a process which manages memory access, security, and other aspects of a program. Programs have one or more threads which are logical structures that manage what is executing on CPUs. Each thread has a program stack which is an area of memory used to manage function calls. The program stack may also be used for other purposes such as managing temporary variable memory within a function ("local", "stack frame local", or "automatic" variables).

Confusingly, the program stack commonly grows downward in memory. For example, let's say a thread has a stack that is allocated in the memory range 0x100 to 0x200. When the main function starts executing, let's say after some housekeeping, the stack starts at 0x180. As main calls getCubeVolume, the stack will "grow" downward to, for example, 0x150 so that getCubeVolume uses the memory range 0x150 - 0x180 for itself. When getCubeVolume finishes, the stack "pops" by going back from 0x150 to 0x180.

Stack Pointer

CPUs may have a register that points to the top of the program stack for the currently executing thread. This register is called the stack pointer (SP), extended stack pointer (ESP), register stack pointer (RSP), or other names.

Frame Pointer

CPUs may have a register that points to the bottom of the currently executing stack frame where local variables for that function start. This register is called the frame pointer (FP), base pointer (BP), extended base pointer (EBP), register base pointer (RBP), or other names. Originally, this was used because the only other relative address available is the stack pointer which may be constantly moving as local variables are added and removed. Thus, if a function needed access to a local variable passed into the function, it could just use a constant offset from the frame pointer.

Compilers may perform an optimization called frame pointer omission (FPO) (e.g. with gcc with -O or -fomit-frame-pointer, or by default since GCC 4.6) that uses the frame pointer register as a general purpose register instead and embeds the necessary offsets into the program using the stack pointer to avoid the need for frame pointer offsets.

In the unoptimized case (e.g. without -O or with -fno-omit-frame-pointer with gcc), a common calling convention is for each function to first push the previous function's frame pointer onto its stack, copy the current value of the stack pointer into the frame pointer, and then allocate some space for the function's local variables (Intel Syntax):

push ebp
mov ebp, esp
sub esp, $LOCALS

When the function returns, it will remove all the local stack space it used, pop the frame pointer to the parent function's value, and return to the previous function as well as release the amount of stack used for incoming parameters into this function; for example (Intel Syntax):

mov esp, ebp
pop ebp
ret $INCOMING_PARAMETERS_SIZE

When a function calls another function, any parameters are pushed onto the stack, then the instruction pointer plus the size of two instructions is pushed onto the stack, and then a jump instruction starts executing the new function. When the called function returns, it continues executing at two instructions after the call statement; for example (Intel Syntax):

push 1
push 2
push 3
push eip + 2
jmp getCubeVolume

Call Stack Walking

For diagnostic tools to walk a call stack ("unwind" the stack), in the unoptimized case where the frame pointer is used to hold the start of the stack frame, the tool simply has to start from the frame pointer which will allow it to find the pushed frame pointer of the previous function on the stack, and the tool can walk this linked list.
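
As a rough illustration, the following is a minimal sketch (assuming GCC or Clang on x86-64, compiled with -O0 -fno-omit-frame-pointer; the outer and inner functions exist only to create extra frames) that walks its own chain of saved frame pointers and prints the return address recorded in each frame:

#include <stdint.h>
#include <stdio.h>

/* Walk the frame pointer chain: each frame begins with the caller's saved
   frame pointer, immediately followed by the return address pushed by call. */
void walk_stack(void) {
  uintptr_t *fp = (uintptr_t *)__builtin_frame_address(0);
  for (int depth = 0; fp != NULL && depth < 16; depth++) {
    uintptr_t return_address = fp[1]; /* pushed by the call instruction */
    printf("frame %2d: return address %p\n", depth, (void *)return_address);
    fp = (uintptr_t *)fp[0];          /* follow the saved frame pointer */
  }
}

void inner(void) { walk_stack(); }
void outer(void) { inner(); }

int main(int argc, char **argv) {
  outer();
  return 0;
}

Diagnostic tools perform essentially the same walk against another process's stack memory and then resolve each return address to a function name using symbol information.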

If a program is optimized to use frame pointer omission (FPO), then diagnostic tools generally cannot walk the stack since the frame pointer register is used for general purpose computation:

In some systems, where binaries are built with gcc --fomit-frame-pointer, using the "fp" method will produce bogus call graphs

As an alternative, if programs are compiled with debugging information in the form of standards such as the DWARF standard and specification that describe detailed information of the program in instances of a Debugging Information Entry (DIE) and particularly the Call Frame Information, then some tools may be able to unwind the stack using this information (e.g. using libdw and libunwind):

Every processor has a certain way of calling functions and passing arguments, usually defined in the ABI. In the simplest case, this is the same for each function and the debugger knows exactly how to find the argument values and the return address for the function.

For some processors, there may be different calling sequences depending on how the function is written, for example, if there are more than a certain number of arguments. There may be different calling sequences depending on operating systems. Compilers will try to optimize the calling sequence to make code both smaller and faster. One common optimization is when there is a simple function which doesn't call any others (a leaf function) to use its caller stack frame instead of creating its own. Another optimization may be to eliminate a register which points to the current call frame. Some registers may be preserved across the call while others are not.

While it may be possible for the debugger to puzzle out all the possible permutations in calling sequence or optimizations, it is both tedious and error prone. A small change in the optimizations and the debugger may no longer be able to walk the stack to the calling function.

The DWARF Call Frame Information (CFI) provides the debugger with enough information about how a function is called so that it can locate each of the arguments to the function, locate the current call frame, and locate the call frame for the calling function. This information is used by the debugger to "unwind the stack," locating the previous function, the location where the function was called, and the values passed.

Like the line number table, the CFI is encoded as a sequence of instructions that are interpreted to generate a table. There is one row in this table for each address that contains code. The first column contains the machine address while the subsequent columns contain the values of the machine registers when the instruction at that address is executed. Like the line number table, if this table were actually created it would be huge. Luckily, very little changes between two machine instructions, so the CFI encoding is quite compact.

Example usage includes perf record --call-graph dwarf,65528.

Programs such as dwarfdump may be used to print embedded DWARF information in binaries. These are embedded in ELF sections such as .eh_frame, .debug_frame, .eh_frame_hdr, etc.

Non-Volatile Registers

Non-volatile registers are generally required to be saved on the stack before calling a function, and popped off the stack when a function returns thus allowing them to be predictable values within the context of any function call. Such registers may include EBX, EDI, ESI, and EBP.

Approximate Overhead of System Calls (syscalls)

Although there are some historical measurements of system call times (e.g. DOI:10.1145/269005.266660, DOI:10.1145/224057.224075), the overhead of system calls depends on the CPU and kernel and should be benchmarked, for example, with getpid.
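
As an illustration, the following minimal micro-benchmark sketch (assuming Linux with glibc; it invokes syscall(SYS_getpid) directly to bypass any library-level caching of the PID) estimates the average cost of a trivial system call:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv) {
  const long iterations = 1000000;
  struct timespec start, end;

  clock_gettime(CLOCK_MONOTONIC, &start);
  for (long i = 0; i < iterations; i++) {
    syscall(SYS_getpid); /* a minimal syscall; the result is intentionally ignored */
  }
  clock_gettime(CLOCK_MONOTONIC, &end);

  double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                    + (end.tv_nsec - start.tv_nsec);
  printf("Average getpid syscall time: %.1f ns\n", elapsed_ns / iterations);
  return 0;
}

Results vary with the CPU, kernel version, and security mitigations, so measure on the hardware and kernel of interest.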

Random Access Memory (RAM), Physical Memory

Random access memory (RAM) is a high speed, ephemeral data storage circuit located near CPU cores. RAM is often referred to as physical memory to contrast it to virtual memory. Physical memory comprises the physical storage units which support memory usage in a computer (apart from CPU core memory registers), whereas virtual memory is a logical feature that an operating system provides for isolating and simplifying access to physical memory. Strictly speaking, physical memory and RAM are not synonymous because physical memory includes paging space, and paging space is not RAM.

Virtual memory

Modern operating systems are based on the concept of multi-user, time-sharing systems. Operating systems use three key features to isolate users and processes from each other: user mode, virtual address spaces, and process/resource limits. Before these innovations, it was much easier for users and processes to affect each other, whether maliciously or not.

User mode forces processes to use system calls provided by the kernel instead of directly interacting with memory, devices, etc. This feature is ultimately enforced by the processor itself. Operating system kernel code runs in a trusted, unrestricted mode, allowing it to do certain things that a user-mode process cannot do. A user-mode process can make a system call into the kernel to request such functions and this allows the kernel to enforce constraints and share limited resources.

Virtual address spaces allow each process to have its own memory space instead of managing and sharing direct memory accesses. The processor and kernel act in concert to allocate physical memory and paging space and translate virtual addresses to physical addresses at runtime. This provides the ability to restrict which memory a process can access and in what way.

File/Page Cache

The file or page cache is an area of RAM that is used as a write-behind or write-through cache for some virtual file system operations. If a file is created, written to, or read from, the operating system may try to perform some or all of these operations through physical memory and then asynchronously flush any changes to disk. This dramatically improves performance of file I/O at the risk of losing file updates if a machine crashes before the data is flushed to disk.
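
As a small illustration of this trade-off, the following minimal POSIX sketch (the file name is arbitrary) forces a write out of the page cache to stable storage with fsync when durability matters more than speed:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv) {
  int fd = open("example.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) { perror("open"); return 1; }

  const char *data = "important record\n";
  if (write(fd, data, strlen(data)) < 0) { perror("write"); return 1; }

  /* Without this fsync, the data may sit in the page cache and be flushed
     to disk asynchronously at some later time. */
  if (fsync(fd) != 0) { perror("fsync"); return 1; }

  close(fd);
  return 0;
}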

Memory Corruption

RAM bits may be intermittently and unexpectedly flipped by atmospheric radiation such as neutrons. This may lead to strange application behavior and kernel crashes due to unexpected state. Some RAM chips have error-correcting code (ECC) or parity logic to handle one or two invalid bit flips; however, depending on the features of such ECC RAM, and the number of bits flipped (e.g. a lot of radiation), memory corruption is still possible. Most consumer-grade personal computers do not offer ECC RAM and, depending on altitude and other factors, memory corruption rates with non-ECC RAM may reach up to 1 bit error per GB of RAM per 1.8 hours.

You may check whether you are using ECC RAM with operating system specific tools; on Linux, for example, dmidecode --type memory reports the error correction type of the installed memory modules.

Paging, Swapping

Paging space is the portion of physical memory other than RAM, typically backed by disk storage or a solid state drive (SSD), which the operating system uses as a "spillover" when demands for physical memory exceed available RAM. Historically, swapping referred to paging in or out an entire process; however, many use paging and swapping interchangeably today, and both operate on page-sized units of memory (e.g. 4KB).

Overcommitting Memory

Overcommitting memory occurs when less RAM is available than the peak in-use memory demand. This is either done accidentally (undersizing) or consciously with the premise that it is unlikely that all required memory will be accessed at once. Overcommitting is dangerous because the process of paging in and out may be time consuming: RAM operates at tens of GB/s, whereas even the fastest SSDs operate at a maximum of a few GB/s (often the bottleneck is the interface to the SSD, e.g. SATA). Overcommitting memory is particularly dangerous with Java because some types of garbage collection need to read most of the virtual address space of a process in a short period of time. When paging is very heavy, this is called memory thrashing, and it usually results in a performance degradation of the system of multiple orders of magnitude.

Sizing Paging Space

Some people recommend sizing the paging files to some multiple of RAM; however, this recommendation is a rule of thumb that may not be applicable to many workloads. Some people argue that paging is worse than crashing because a system can enter a zombie-like state and the effect can last hours before an administrator is alerted and investigates the issue. Investigation itself may be difficult because connecting to the system may be slow or impossible while it is thrashing. Therefore, some decide to dramatically reduce paging space (e.g. 10 MB) or remove the paging space completely which will force the operating system to crash processes that are using too much memory. This creates clear and immediate symptoms and allows the system to potentially restart the processes and recover. A tiny paging space is probably preferable to no paging space in case the operating system decides to do some benign paging. A tiny paging space can also be monitored as a symptom of problems.

Some workloads may benefit from a decently sized paging space. For example, infrequently used pages may be paged out to make room for filecache, etc.

"Although most do it, basing page file size as a function of RAM makes no sense because the more memory you have, the less likely you are to need to page data out." (Russinovich & Solomon)

Non-uniform Memory Access (NUMA)

Non-uniform Memory Access (NUMA) is a design in which RAM is partitioned so that subsets of RAM (called NUMA nodes) are "local" to certain processors. Consider affinitizing processes to particular NUMA nodes.
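
On Linux, for example, a process can be launched bound to the CPUs and memory of a single NUMA node with something like numactl --cpunodebind=0 --membind=0 <command>; other operating systems provide their own equivalents.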

32-bit vs 64-bit

Whether 32-bit or 64-bit will be faster depends on the application, workload, physical hardware, and other variables. All else being equal, in general, 32-bit will be faster than 64-bit because 64-bit doubles the pointer size, therefore creating more memory pressure (lower CPU cache hit rates, more TLB pressure, etc.). However, all things are rarely equal. For example, 64-bit often provides more CPU registers than 32-bit (this is not always the case; Power is one exception), and in some cases, the benefits of more registers outweigh the memory pressure costs. There are other cases such as some mathematical operations where 64-bit will be faster due to instruction availability (and this may apply with some TLS usage, not just obscure mathematical applications). Java significantly reduces the impact of the larger 64-bit pointers within the Java heap by using compressed references. With all of that said, in general, the industry is moving towards 64-bit and the performance difference for most applications is in the 5% range.

Large Page Support

Several platforms support using memory pages that are larger than the default memory page size. Depending on the platform, large memory page sizes can range from 4 MB (Windows) to 16 MB (AIX) up to 1 GB, versus the default page size of 4 KB. Many applications (including Java-based applications) often benefit from large pages due to a reduction in the CPU overhead of managing a smaller number of larger pages.

Large pages may cause a small throughput improvement (in one benchmark, about 2%).

Some recent benchmarks on very modern hardware have found little benefit to large pages, although no negative consequences either, so they are still a best practice in most cases.

Input/Output (I/O)

Disk

Many problems are caused by exhausted disk space. It is critical that disk space is monitored and alerts are created when usage is very high.

Disk speed may be an important factor in some types of workloads. Some operating systems support mounting physical memory as disk partitions (sometimes called RAMdisks), allowing you to direct certain disk operations whose contents can be recreated to physical memory instead of slower disks.

Network Interface Cards (NICs) and Switches

Ensure that NICs and switches are configured to use their top speeds and full duplex mode. Sometimes this needs to be explicitly done, so you should not assume that this is the case by default. In fact, it has been observed that when the NIC is configured for auto-negotiate, sometimes the NIC and the switch can auto-negotiate very slow speeds and half duplex. This is why setting explicit values is recommended.

If the network components support Jumbo Frames, consider enabling them across the relevant parts of the network.

Check network performance between two hosts. For example, make a 1 GB file (various operating system commands like dd or mkfile). Then test the network throughput by copying it using FTP, SCP, etc.
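
For example, on Linux a 1 GB test file can be created with something like dd if=/dev/zero of=testfile bs=1M count=1024 and then copied between the hosts while timing the transfer.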

Monitor ping latency between hosts, particularly any periodic large deviations.

It is common to have separate NICs for incoming traffic (e.g. HTTP requests) and for backend traffic (e.g. database). In some cases and particularly on some operating systems, this setup may perform worse than a single NIC (as long as it doesn't saturate) probably due to interrupts and L2/L3 cache utilization side-effects.

TCP/IP

TCP/IP is used for most network communications such as HTTP, so understanding and optimizing the operating system TCP/IP stack can have dramatic upstream effects on your application.

TCP/IP is normally used in full-duplex mode, meaning that communication can occur asynchronously in both directions. In such a mode, a distinction between "client" and "server" is arbitrary and sometimes can confuse investigations (for example, if a web browser is uploading a large HTTP POST body, it is first the "server" and then becomes the "client" when accepting the response). You should always think of a set of two sender and receiver channels for each TCP connection.

TCP/IP is a connection oriented protocol, unlike UDP, and so it requires handshakes (sets of packets) to start and close connections. The establishing handshake starts with a SYN packet from sender IP address A on an ephemeral local port X to receiver IP address B on a port Y (every TCP connection is uniquely identified by this 4-tuple). If the connection is accepted by B, then B sends back an acknowledgment (ACK) packet as well as its own SYN packet to establish the full-duplex connection (SYN/ACK). Finally, A sends an ACK packet to acknowledge the established connection. This handshake is commonly referred to as SYN, SYN/ACK, ACK.

A TCP/IPv4 packet has a 40 byte header (20 for TCP and 20 for IPv4).

Bandwidth Delay Product

The Bandwidth-Delay Product (BDP) is the maximum bandwidth times the round trip time:

A fundamental concept in any window-controlled transport protocol: the Bandwidth-Delay Product (BDP). Specifically, suppose that the bottleneck link of a path has a transmission capacity (‘bandwidth’) of C bps and the path between the sender and the receiver has a Round-Trip Time (RTT) of T sec. The connection will be able to saturate the path, achieving the maximum possible throughput C, if its effective window is C*T. This product is historically referred to as BDP. For the effective window to be C*T, however, the smaller of the two socket buffers should be equally large. If the size of that socket buffer is less than C*T, the connection will underutilize the path. If it is more than C*T, the connection will overload the path, and depending on the amount of network buffering, it will cause congestion, packet losses, window reductions, and possibly throughput drops.
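
As a worked example, a path with a 1 Gbps bottleneck (roughly 125 MB/s) and a 40 ms round-trip time has a BDP of about 125 MB/s * 0.04 s = 5 MB, so socket buffers significantly smaller than that would prevent a single connection from saturating the path.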

Flow Control & Receive/Send Buffers

TCP congestion control (or flow control) is a part of the TCP specifications that governs how much data is sent before receiving acknowledgments for outstanding packets. Flow control tries to ensure that a sender does not send data faster than a receiver can handle. There are two main components to flow control:

  • Advertised receiver window size (rwnd): The receiver advertises a "window size" in each acknowledgment packet which tells the sender how much buffer room the receiver has for future packets. The maximum throughput based on the receiver window is rwnd/RTT. If the window size is 0, the sender should stop sending packets until it receives a TCP Window Update packet or an internal retry timer fires. If the window size is non-zero, but it is too small, then the sender may spend unnecessary time waiting for acknowledgments. The window sizes are directly affected by the rate at which the application can produce and consume packets (for example, if CPU is 100% then a program may be very slow at producing and consuming packets) as well as operating system TCP sending and receiving buffer size limits. The buffers are chunks of memory allocated and managed by the operating system to support TCP/IP flow control. It is generally advisable to increase these buffer size limits as much as operating system configuration, physical memory and the network architecture can support. In general, the maximum socket receive and send buffer sizes should be greater than the average bandwidth delay product.
  • Sender congestion window size (cwnd): A throttle that controls the maximum, concurrent, unacknowledged, outstanding sent bytes. The operating system chooses an initial congestion window size and then resizes it dynamically based on rwnd and other conditions. By default, the initial congestion window size is based on the maximum segment size and starts small as part of the slow start component of the specifications and then grows relatively quickly. This is one reason why using persistent connections is valuable (although idle connections may have their congestion windows reset after a period of inactivity which may be tuned on some operating systems). There are many congestion window resize algorithms (reno, cubic, hybla, etc.) that an operating system may use and some operating systems allow changing the algorithm.

Therefore, one dimension of socket throttling is the instantaneous minimum of rwnd and cwnd. An example symptom of congestion control limiting throughput is when a sender has queued X bytes to the network, the current receive window is greater than X, but less than X bytes are sent before waiting for ACKs from the receiver.
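
To illustrate the buffer-sizing side, the following minimal POSIX sketch requests larger per-socket buffers with setsockopt. The 4 MB value is only an example derived from a bandwidth-delay product calculation, and the kernel may clamp or adjust it based on operating system wide limits (e.g. net.core.rmem_max and net.core.wmem_max on Linux):

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) { perror("socket"); return 1; }

  int bufsize = 4 * 1024 * 1024; /* example value based on the bandwidth-delay product */
  if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) != 0)
    perror("setsockopt SO_RCVBUF");
  if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) != 0)
    perror("setsockopt SO_SNDBUF");

  int actual = 0;
  socklen_t len = sizeof(actual);
  getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);
  printf("Effective receive buffer: %d bytes\n", actual); /* Linux typically reports double the requested value */

  close(fd);
  return 0;
}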

CLOSE_WAIT

If a socket is ESTABLISHED, and one side (let's call it side X) calls close, then X sends a FIN packet to the other side (let's call it side Y) and X enters the FIN_WAIT_1 state. At this point, X can no longer write bytes to Y; however, Y may still write bytes to X (each TCP socket has two pipes).

When Y receives the FIN, it sends an ACK back and Y enters the CLOSE_WAIT state. When X receives the ACK, it enters the FIN_WAIT_2 state. Y's CLOSE_WAIT state may be read as "Y is waiting for the application inside Y to call close on its write pipe to X." At this point, the socket could stay in this condition indefinitely with Y writing bytes to X. Although it is a valid TCP use case to have a half-open socket, it is an uncommon one (except for use cases such as Server-Sent Events), so sockets in CLOSE_WAIT state are more commonly simply waiting for Y to close its half of the socket. If the number of sockets in CLOSE_WAIT is high or increasing over time, this may be caused by a leak of the socket object in Y, lack of resources, missing or incorrect logic to close the socket, etc. If sockets in CLOSE_WAIT continuously increase, at some point the process may hit file descriptor exhaustion or other socket errors and the only resolutions are either to restart the process or induce a RST packet.

When Y closes its half of the socket by sending a FIN to X, then Y enters LAST_ACK. When X responds with an ACK on the FIN, then the Y socket is completely closed, and X enters the TIME_WAIT state for a certain period of time.

The above is a normal close; however, it is also possible that RST packets are used to close sockets.

TIME_WAIT

TCP sockets pass through various states such as LISTENING, ESTABLISHED, CLOSED, etc. One particularly misunderstood state is the TIME_WAIT state which can sometimes cause scalability issues. A full duplex close occurs when sender A sends a FIN packet to B to initiate an active close (A enters FIN_WAIT_1 state). When B receives the FIN, it enters CLOSE_WAIT state and responds with an ACK. When A receives the ACK, A enters FIN_WAIT_2 state. Strictly speaking, B does not have to immediately close its channel (if it wanted to continue sending packets to A); however, in most cases it will initiate its own close by sending a FIN packet to A (B now goes into LAST_ACK state). When A receives the FIN, it enters TIME_WAIT and sends an ACK to B. The reason for the TIME_WAIT state is that there is no way for A to know that B received the ACK. The TCP specification defines the maximum segment lifetime (MSL) to be 2 minutes (this is the maximum time a packet can wander the net and stay valid). The operating system should ideally wait 2 times MSL to ensure that a retransmitted packet for the FIN/ACK doesn't collide with a newly established socket on the same port (for instance, if the port had been immediately reused without a TIME_WAIT and if other conditions such as total amount transferred on the packet, sequence number wrap, and retransmissions occur).

This behavior may cause scalability issues:

Because of TIME-WAIT state, a client program should choose a new local port number (i.e., a different connection) for each successive transaction. However, the TCP port field of 16 bits (less the "well-known" port space) provides only 64512 available user ports. This limits the total rate of transactions between any pair of hosts to a maximum of 64512/240 = 268 per second.

Most operating systems do not use 4 minutes as the default TIME_WAIT duration because of the low probability of the wandering packet problem and other mitigating factors. Nevertheless, if you observe socket failures accompanied by large numbers of sockets in TIME_WAIT state, then consider reducing the TIME_WAIT duration. On some operating systems, it is impossible to change the TIME_WAIT duration except by recompiling the kernel. Conversely, if you observe very strange behavior when new sockets are created that can't be otherwise explained, you should use 4 minutes as a test to ensure this is not a problem.

Finally, it's worth noting that some connections will not follow the FIN/ACK, FIN/ACK procedure, but may instead use FIN, FIN/ACK, ACK, or even just a RST packet (abortive close).

Nagle's Algorithm (RFC 896, TCP_NODELAY)

RFC 896:

There is a special problem associated with small packets. When TCP is used for the transmission of single-character messages originating at a keyboard, the typical result is that 41 byte packets (one byte of data, 40 bytes of header) are transmitted for each byte of useful data. This 4000% overhead is annoying but tolerable on lightly loaded networks. On heavily loaded networks, however, the congestion resulting from this overhead can result in lost datagrams and retransmissions, as well as excessive propagation time caused by congestion in switching nodes and gateways.

The solution is to inhibit the sending of new TCP segments when new outgoing data arrives from the user if any previously transmitted data on the connection remains unacknowledged.

In practice, enabling Nagle's algorithm (which is usually enabled by default) means that TCP will not send a new packet if a previously sent packet is still unacknowledged, unless it has "enough" coalesced data for a larger packet.

The native setsockopt option to disable Nagle's algorithm is TCP_NODELAY.
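
For example, a minimal sketch in C (POSIX sockets assumed) that disables Nagle's algorithm on a single socket:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

int main(int argc, char **argv) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) { perror("socket"); return 1; }

  int one = 1; /* non-zero disables Nagle's algorithm on this socket */
  if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) != 0) {
    perror("setsockopt TCP_NODELAY");
    return 1;
  }
  /* Small writes on fd are now sent immediately rather than coalesced. */
  return 0;
}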

This option can usually be set globally at an operating system level.

This option is also exposed in Java's StandardSocketOptions.TCP_NODELAY to allow for setting a particular Java socket option.

In WebSphere Application Server, TCP_NODELAY is explicitly enabled by default for all WAS TCP channel sockets. In the event of needing to enable Nagle's algorithm, use the TCP channel custom property tcpNoDelay=0.

Delayed Acknowledgments (RFC 1122)

TCP delayed acknowledgments was designed in the late 1980s in an environment of slow, low-bandwidth modem links. Delaying acknowledgments was a tactic used when communication over wide area networks was very slow; the delay allowed acknowledgment packets to piggy-back on responses within a window of a few hundred milliseconds. In modern networks, these added delays may cause significant latencies in network communications.

Delayed acknowledgments is a completely separate function from Nagle's algorithm (TCP_NODELAY). Both act to delay packets in certain situations. This can be very subtle; for example, on AIX, the option for the former is tcp_nodelayack and the option for the latter is tcp_nodelay.

The delayed ACK mechanism delays acknowledgments by up to 500 milliseconds (common default maximums are 40 or 200 milliseconds) from when a packet arrives, but no more than every second segment, to reduce the number of ACK-only packets and ACK chatter, because the ACKs may piggy-back on a response packet. Disabling delayed ACKs increases network chatter and utilization (if an ACK-only packet is sent where it used to piggy-back on a data packet, total bytes sent increase because of the additional packets and TCP header bytes), but it may improve throughput and responsiveness. However, there are also cases where delayed ACKs perform better. It is best to test the difference.

RFC 1122:

A host that is receiving a stream of TCP data segments can increase efficiency in both the Internet and the hosts by sending fewer than one ACK (acknowledgment) segment per data segment received; this is known as a "delayed ACK" [TCP:5].

A TCP SHOULD implement a delayed ACK, but an ACK should not be excessively delayed; in particular, the delay MUST be less than 0.5 seconds, and in a stream of full-sized segments there SHOULD be an ACK for at least every second segment.

A delayed ACK gives the application an opportunity to update the window and perhaps to send an immediate response. In particular, in the case of character-mode remote login, a delayed ACK can reduce the number of segments sent by the server by a factor of 3 (ACK, window update, and echo character all combined in one segment).

In addition, on some large multi-user hosts, a delayed ACK can substantially reduce protocol processing overhead by reducing the total number of packets to be processed [TCP:5]. However, excessive delays on ACK's can disturb the round-trip timing and packet "clocking" algorithms [TCP:7].

Delayed acknowledgments interact poorly with Nagle's algorithm. For example, if A sent a packet to B, and B is waiting to send an acknowledgment to A until B has some data to send (delayed acknowledgments), and A is waiting for the acknowledgment (Nagle's algorithm), then a delay is introduced. To find if this may be the case:

In Wireshark, you can look for the "Time delta from previous packet" entry for the ACK packet to determine the amount of time elapsed waiting for the ACK... Although delayed acknowledgment may adversely affect some applications [...], it can improve performance for other network connections.

The pros of delayed acknowledgments are:

  1. Reduce network chatter
  2. Reduce potential network congestion
  3. Reduce network interrupt processing (CPU)

The cons of delayed acknowledgments are:

  1. Potentially reduce response times and throughput

In general, if two hosts are communicating on a LAN and there is sufficient additional network capacity and there is sufficient additional CPU interrupt processing capacity, then disabling delayed acknowledgments will tend to improve performance and throughput. However, this option is normally set at an operating system level, so if there are any sockets on the box that may go out to a WAN, then their performance and throughput may potentially be affected negatively. Even on a WAN, for 95% of modern internet connections, disabling delayed acknowledgments may prove beneficial. The most important thing to do is to test the change with real world traffic, and also include tests emulating users with very slow internet connections and very far distances to the customer data center (e.g. second long ping times) to understand any impact. The other potential impact of disabling delayed acknowledgments is that there will be more packets which just have the acknowledgment bit set but still have the TCP/IP header (40 or more bytes). This may cause higher network utilization and network CPU interrupts (and thus CPU usage). These two factors should be monitored before and after the change.

John Nagle -- the person who created Nagle's algorithm -- generally recommends disabling delayed ACKs by default.

Selective Acknowledgments (SACK, RFC 2018)

RFC 2018:

"With the limited information available from cumulative acknowledgments, a TCP sender can only learn about a single lost packet per round trip time... [With a] Selective Acknowledgment (SACK) mechanism... the receiving TCP sends back SACK packets to the sender informing the sender of data that has been received. The sender can then retransmit only the missing data segments."

Listen Back Log

The listen back log is a limited-size queue for each listening socket that holds pending connections for which a SYN packet has arrived but that the process has not yet "accepted" (therefore they are not yet established). This back log is used as an overflow for sudden spikes of connections. If the listen back log fills up, any new connection attempts (SYN packets) will be rejected by the operating system (i.e. they'll fail). As with all queues, you should size them just big enough to handle a temporary but sudden spike, but not so large that too many operating system resources are used; a bounded back log also means that new connection attempts fail fast when there is a backend problem. There is no science to this, but 511 is a common value.
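
As a minimal sketch (POSIX sockets assumed; port 8080 is an arbitrary example), the back log is the second argument to listen, and the kernel may cap the requested value (e.g. net.core.somaxconn on Linux):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(int argc, char **argv) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) { perror("socket"); return 1; }

  struct sockaddr_in addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = INADDR_ANY;
  addr.sin_port = htons(8080); /* arbitrary example port */

  if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) { perror("bind"); return 1; }

  /* Pending, not-yet-accepted connections queue up to this back log. */
  if (listen(fd, 511) != 0) { perror("listen"); return 1; }

  return 0;
}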

Keep-alive

RFC 1122 defines a "keep-alive" mechanism to periodically send packets for idle connections to make sure they're still alive:

A "keep-alive" mechanism periodically probes the other end of a connection when the connection is otherwise idle, even when there is no data to be sent. The TCP specification does not include a keep-alive mechanism because it could:

  1. cause perfectly good connections to break during transient Internet failures;
  2. consume unnecessary bandwidth ("if no one is using the connection, who cares if it is still good?"); and
  3. cost money for an Internet path that charges for packets.

Some TCP implementations, however, have included a keep-alive mechanism. To confirm that an idle connection is still active, these implementations send a probe segment designed to elicit a response from the peer TCP.

By default, keep-alive (SO_KEEPALIVE in POSIX) is disabled:

If keep-alives are included, the application MUST be able to turn them on or off for each TCP connection, and they MUST default to off.

Java defaults Keep-alive to off:

The initial value of this socket option is FALSE.

Major products such as WAS traditional, WebSphere Liberty, the DB2 JDBC driver, etc. enable keep-alive on TCP sockets by default.
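
For a process that manages its own sockets, the following is a minimal sketch (POSIX sockets assumed) of enabling keep-alive; probe timing is then controlled by operating system settings or, on Linux, per-socket options such as TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT:

#include <stdio.h>
#include <sys/socket.h>

int main(int argc, char **argv) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) { perror("socket"); return 1; }

  int enable = 1; /* SO_KEEPALIVE defaults to off */
  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof(enable)) != 0) {
    perror("setsockopt SO_KEEPALIVE");
    return 1;
  }
  /* The OS will now periodically probe the peer once the connection is idle. */
  return 0;
}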

Monitor TCP Retransmits

Monitor the number of TCP retransmits in your operating system and be aware of the timeout values. The reason: they may explain random response time fluctuations or maximums of up to a few seconds.

The concept of TCP retransmission is one of the fundamental reasons why TCP is reliable. After a packet is sent, if it's not ACKed within the retransmission timeout, then the sender assumes there was a problem (e.g. packet loss, OS saturation, etc.) and retransmits the packet. From TCP RFC 793:

When the TCP transmits a segment containing data, it puts a copy on a retransmission queue and starts a timer; when the acknowledgment for that data is received, the segment is deleted from the queue. If the acknowledgment is not received before the timer runs out, the segment is retransmitted.

For applications such as Java/WAS, retransmissions occur transparently in the operating system. Retransmission is not considered an error condition and it is not reported up through libc. So a Java application may do a socket write, and every once in a while, a packet is lost and there is a delay during the retransmission. Unless you've gathered a network trace, this is difficult to prove. A retransmission may also cause the socket to switch into "slow start" mode which may affect subsequent packet performance/throughput.

Correlating increases in TCP retransmits with the times of response time increases is much easier than performing an end-to-end network trace with TCP port correlation (which often isn't available in low-overhead tracing).

Domain Name Servers (DNS)

Ensure that Domain Name Servers (DNS) are very responsive.

Consider setting high Time To Live (TTL) values for hosts that are unlikely to change.

If performance is very important or DNS response times have high variability, consider adding all major DNS lookups to each operating system's local DNS lookup file (e.g. /etc/hosts).

Troubleshooting Network Issues

One of the troubleshooting steps for slow response time issues is to sniff the network between all the network elements (e.g. HTTP server, application server, database, etc.). The most popular tool for sniffing and analyzing network data is Wireshark which is covered in the Major Tools chapter. Common errors are frequent retransmission requests (sometimes due to a bug in the switch or bad cabling).

The Importance of Gathering Network Trace on Both Sides

Here is an example where it turned out that a network packet was truly lost in transmission (root cause not determined, but probably in the operating system or some security software). This happened between IHS and WAS and it caused IHS to mark the WAS server down. The symptom in the logs was a connection reset error. There are two points that are interesting to cover: 1) If the customer had only gathered network trace from the IHS side, they might have concluded the wrong thing, and 2) It may be interesting to look at the MAC address of a RST packet:

First, some background: the IHS server is 10.20.30.100 and the WAS server is 10.20.36.100.

Next, if we look at just the IHS packet capture and narrow down to the suspect stream:

663 ... 10.20.30.100 10.20.36.100 TCP 76 38898 > 9086 [SYN] Seq=0 Win=5840...
664 ... 10.20.36.100 10.20.30.100 TCP 62 9086 > 38898 [SYN, ACK] Seq=0 Ack=1 Win=65535...
665 ... 10.20.30.100 10.20.36.100 TCP 56 38898 > 9086 [ACK] Seq=1 Ack=1 Win=5840 Len=0
666 ... 10.20.30.100 10.20.36.100 TCP 534 [TCP Previous segment lost] 38898 > 9086 [PSH, ACK] Seq=1381...
667 ... 10.20.36.100 10.20.30.100 TCP 62 [TCP Dup ACK 664#1] 9086 > 38898 [ACK] Seq=1...
678 ... 10.20.36.100 10.20.30.100 TCP 56 9086 > 38898 [RST] Seq=1 Win=123 Len=0
679 ... 10.20.30.100 10.20.36.100 TCP 56 38898 > 9086 [RST] Seq=1 Win=123 Len=0

So 663-665 are just a normal handshake, but then 666 where we'd expect IHS to send the GET/POST to WAS shows TCP Previous Segment lost. A few packets later, we see what appears to be a RST packet coming from WAS. By doing Follow TCP Stream, this is what it looks like from the WAS application point of view:

# Wireshark: tcp.stream eq 52
[1380 bytes missing in capture file]i-Origin-Hop: 1
Via: 1.1 ...
X-Forwarded-For: ...
True-Client-IP: ...
Host: ...
Pragma: no-cache
Cache-Control: no-cache, max-age=0
$WSCS: AES256-SHA
$WSIS: true
$WSSC: https
$WSPR: HTTP/1.1
$WSRA: ...
$WSRH: ...
$WSSN: ...
$WSSP: 443
$WSSI: ...
Surrogate-Capability: WS-ESI="ESI/1.0+"
_WS_HAPRT_WLMVERSION: -1

So of course the request failed -- the front half is cut off (due to the "previous segment lost")!

From this packet trace alone, one would highly suspect that it's the WAS side or the network path between IHS and WAS because it's the one sending the RST. But, let's look at the trace from the WAS server:

146 ... 10.20.30.100 10.20.36.100 TCP 74 38898 > 9086 [SYN] Seq=0...
147 ... 10.20.36.100 10.20.30.100 TCP 58 9086 > 38898 [SYN, ACK] Seq=0...
148 ... 10.20.30.100 10.20.36.100 TCP 60 38898 > 9086 [ACK] Seq=1...
149 ... 10.20.30.100 10.20.36.100 TCP 532 [TCP Previous segment lost] 38898 > 9086 [PSH, ACK] Seq=1381...
150 ... 10.20.36.100 10.20.30.100 TCP 54 [TCP Dup ACK 147#1] 9086 > 38898 [ACK] Seq=1...
151 ... 10.20.30.100 10.20.36.100 TCP 60 38898 > 9086 [RST] Seq=1...

This is similar -- the first segment with the GET/POST line is lost -- but we only see one RST, coming from IHS. It doesn't seem like the WAS box sent out the RST packet (#678 above). This points back to the IHS side and highlights the fact that getting simultaneous packet captures from both sides is critical. (Note: The RST coming from IHS is just IHS closing its half of the stream in packet 679)

One final point that we found looking back on the IHS side was that if we look at frame 664, for example, in the handshake, we can see a good MAC address of d8:3c:85:41:4e:95. However, in the suspect RST frame #678, the MAC address is blank. This is what helped hone the investigation into the IHS OS and network software.

Antivirus / Security Products

We have seen increasing cases of antivirus software leading to significant performance problems. Companies increasingly run quite intrusive antivirus even on critical, production machines. The antivirus settings are usually corporate-wide and may be inappropriate or insufficiently tuned for particular applications or workloads. In some cases, even when an antivirus administrator states that antivirus has been "disabled," there may still be kernel level modules that are still operational. In some cases, slowdowns are truly difficult to understand; for example, in one case a slowdown occurred because of a network issue communicating with the antivirus hub, but this occurred in a kernel-level driver in fully native code, so it was very difficult even to hypothesize that it was antivirus. You can use operating system level tools and sampling profilers to check for such cases, but they may not always be obvious. Watch for signs of antivirus and consider running a benchmark comparison with and without antivirus (completely disabled, perhaps even uninstalled).

Another, somewhat orthogonal class of products is security software that provides integrity, security, and data scrubbing capabilities for sensitive data. For example, such products hook into the kernel so that any time a file is copied onto a USB key, a prompt asks whether the information is confidential (and if so, performs encryption). This highlights the point that it is important to gather data on which kernel modules are active (e.g. using CPU during the time of the problem).

Clocks

To ensure that all clocks are synchronized on all nodes use something like the Network Time Protocol (NTP). This helps with correlating diagnostics and it's required for certain functions in products.

Consider setting one standardized time zone for all nodes, regardless of their physical location. Some consider it easier to standardize on the UTC/GMT/Zulu time zone.

POSIX

The Portable Operating System Interface (POSIX) is the public standard for Unix-like operating systems, including things like APIs, commands, utilities, threading libraries, etc. It is implemented in part or in full by: Linux, AIX, Solaris, z/OS USS, HP-UX, etc.

Process limits (Ulimits)

On POSIX-based operating systems such as Linux, AIX, etc., process limits (a.k.a. ulimits, or user limits) are operating system restrictions on what a process may do with certain resources. These are designed to protect the kernel, protect memory, protect users from consuming an entire box, and reduce the risks of Denial-of-Service (DoS) attacks. For example, a file descriptor ulimit restricts the maximum number of open file descriptors in the process at any one time. Since a network socket is represented by a file descriptor, this also limits the maximum number of open network sockets at any one time.

Process limits come in two flavors: soft and hard. A process limit starts at the soft limit but it may be increased up to the hard limit at runtime.

Choosing ulimits

The default process limits are somewhat arbitrary and often historical artifacts. Similarly, deciding what ulimits to use is also somewhat arbitrary. If a box is mostly dedicated to running a particular process, some people use the philosophy of setting everything to unlimited for those processes. As always, testing (and in particular, stress testing) the ulimit values is advised.

Setting ulimits

Ulimits are most commonly modified through global, operating system specific configuration files, using the ulimit command in the shell that launches a process (or its parent process), or at runtime by the process itself (e.g. through setrlimit).

In general, we recommend using operating system specific configuration files.

Processes will need to be restarted after ulimit settings are changed.
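
As a minimal sketch of the runtime approach (POSIX assumed), a process may raise its own soft limit up to, but not beyond, its hard limit:

#include <stdio.h>
#include <sys/resource.h>

int main(int argc, char **argv) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("getrlimit"); return 1; }

  printf("open files: soft=%llu hard=%llu\n",
         (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

  rl.rlim_cur = rl.rlim_max; /* only a privileged process may raise the hard limit */
  if (setrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("setrlimit"); return 1; }

  return 0;
}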

Maximum number of open file descriptors

The ulimit for the maximum number of open file descriptors (a.k.a. "maximum number of open files", "max number of open files", "open files", ulimit -n, nofile, or RLIMIT_NOFILE) limits both the number of open files and open network sockets.

For example, WebSphere Application Server traditional defaults to a maximum of up to 20,000 incoming TCP connections and Liberty defaults to up to 128,000. In addition, Java will have various open files to JARs, and applications will likely drive other sockets to backend connections such as databases, web services, and so on; therefore, if such load may be reasonably reached, a ulimit -n value such as 1048576 (or more), or unlimited, may be considered.

Maximum number of processes

The ulimit for the maximum number of processes (a.k.a. "max user processes", "max number of processes", ulimit -u, nproc, or RLIMIT_NPROC) limits the maximum number of threads spawned by a particular user. This is slightly different than other ulimits which apply on a per-process basis. For example, on Linux:

RLIMIT_NPROC This is a limit on the number of extant process (or, more precisely on Linux, threads) for the real user ID of the calling process.

On some versions of Linux, this is configured globally through a separate mechanism, /etc/security/limits.d/90-nproc.conf.

It is common for a Java process to use hundreds or thousands of threads. In addition, given that this limit is accounted for at the user level and multiple processes may run under the same user, if such load may reasonably be reached, a ulimit -u value such as 131072 (or more), or unlimited, may be considered.
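Since the limit is accounted per user, the current number of threads owned by a user can be compared against the nproc limit; assuming Linux (procps ps) and an illustrative user name wasadmin:

# Count all threads (light-weight processes) owned by user wasadmin
$ ps -L -u wasadmin --no-headers | wc -l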

Maximum data segment size

The ulimit for the maximum data segment size (a.k.a. "max data size", "maximum data size", "data seg size", ulimit -d, or RLIMIT_DATA) limits the total native memory requested by malloc, and, in some operating systems and versions, mmap (for example, since Linux 4.7). In general, this ulimit should be unlimited.

How do you confirm that ulimits are set correctly?

  • Recent versions of Linux: cat /proc/$PID/limits.
  • If using IBM Java/Semeru/OpenJ9, the javacore produced with kill -3 $PID includes a process limits section. For example:
    1CIUSERLIMITS  User Limits (in bytes except for NOFILE and NPROC)
    NULL           ------------------------------------------------------------------------
    NULL           type                            soft limit           hard limit
    2CIUSERLIMIT   RLIMIT_AS                        unlimited            unlimited
    2CIUSERLIMIT   RLIMIT_CORE                      unlimited            unlimited
    2CIUSERLIMIT   RLIMIT_CPU                       unlimited            unlimited
    2CIUSERLIMIT   RLIMIT_DATA                      unlimited            unlimited
    2CIUSERLIMIT   RLIMIT_FSIZE                     unlimited            unlimited
    2CIUSERLIMIT   RLIMIT_LOCKS                     unlimited            unlimited
    2CIUSERLIMIT   RLIMIT_MEMLOCK                   unlimited            unlimited
    2CIUSERLIMIT   RLIMIT_NOFILE                      1048576              1048576
    2CIUSERLIMIT   RLIMIT_NPROC                     unlimited            unlimited
    2CIUSERLIMIT   RLIMIT_RSS                       unlimited            unlimited
    2CIUSERLIMIT   RLIMIT_STACK                     unlimited            unlimited
    2CIUSERLIMIT   RLIMIT_MSGQUEUE                     819200               819200
    2CIUSERLIMIT   RLIMIT_NICE                              0                    0
    2CIUSERLIMIT   RLIMIT_RTPRIO                            0                    0
    2CIUSERLIMIT   RLIMIT_SIGPENDING                    47812                47812

Process core dumps

A process core dump is a file that represents metadata about a process and its virtual memory at a particular point in time.

Core dump security implications

In general, a process core dump contains most of the virtual memory areas of a process. If a sensitive operation was occurring at the time when the core dump was produced -- for example, a user completing a bank transaction -- then it is possible for someone with access to the core dump to discover sensitive information about that operation -- for example, the name of the user and the details of the bank transaction. For this reason, in general and particularly for production environments, core dumps should be treated sensitively. This is normally done either using filesystem permissions to restrict who can read the core dump or by disabling core dumps (e.g. a core ulimit of 0).

In general, we do not recommend disabling core dumps because it may then be very difficult or impossible to understand the causes of crashes, OutOfMemoryErrors, and other production problems. Instead, we recommend carefully planning how core dumps are produced, stored, and shared. For example, a core dump may be encrypted and transferred off the box to a controlled location (as sketched below); even if sensitive information prevents the core dump from being shared with IBM Support, as long as the core dump exists, it can be investigated remotely using screen sharing or iterative debug commands.
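As a minimal sketch of securing and moving a core dump (the file name, destination host, and directory are illustrative; follow your organization's key management and transfer policies):

# Restrict access to the dump
$ chmod 600 core.12345
# Compress and symmetrically encrypt it (gpg prompts for a passphrase)
$ tar czf - core.12345 | gpg --symmetric --output core.12345.tar.gz.gpg
# Transfer it to a controlled diagnostics host
$ scp core.12345.tar.gz.gpg diaguser@diaghost:/secure/dumps/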

Core dump disk implications

The size of a core dump is approximately the virtual size of the process, and a common default is to create the core dump in the current working directory of the process. Therefore, if a process has requested a lot of memory (e.g. Java with a large maximum heap size), the core dump will be very large. If a process automatically restarts after a crash and the crash keeps recurring, it may produce core dumps continuously, which can fill up disk space and cause application issues if an application needs disk space on the same filesystem (e.g. for transaction logs).

Some operating systems provide ways to limit this impact: specifying the directory where core dumps go (which can be mounted on a filesystem dedicated to diagnostics so that its exhaustion does not impact applications), truncating core dumps to a maximum size, and/or limiting the total disk space used by core dumps by deleting older ones (e.g. systemd-coredump on Linux).
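As an illustrative example on Linux distributions that use systemd-coredump (the values are arbitrary; check your distribution's coredump.conf documentation for the available options):

# /etc/systemd/coredump.conf
[Coredump]
Storage=external        # store dumps under /var/lib/systemd/coredump
Compress=yes            # compress stored dumps
ExternalSizeMax=32G     # maximum size of an individual stored dump
MaxUse=100G             # cap on total disk space used; older dumps are removed first

Stored dumps can then be listed and extracted with coredumpctl list and coredumpctl dump.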

You can determine the virtual address space size in various ways (where VSZ is normally in KB):

  • Linux: ps -o pid,vsz -p PID
  • AIX: ps -o pid,vsz -L PID
  • Solaris: ps -o pid,vsz -p PID
  • HP-UX: UNIX95="" ps -o pid,vsz -p PID

Destructive core dumps

Process core dumps are most often associated with destructive events such as process crashes and they're used to find the cause of a crash. In such an event, after the core dump is produced, the process is killed by the operating system. The core dump may then be loaded into a debugger by a developer to inspect the cause of the crash.

Non-destructive core dumps

Core dumps may be produced non-destructively as a diagnostic aid to investigate things such as memory usage (e.g. OutOfMemoryError) or something happening on a thread. In this case, a diagnostic tool attaches to the process, pauses the process, writes out the core dump, and then detaches and the process continues running.
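For example, on Linux with gdb installed, a non-destructive core dump of a running process can be requested with gcore; the PID and output prefix are illustrative, and the process is paused while the dump is written:

# Write a core dump of process 12345 without terminating it; produces /tmp/core.12345
$ gcore -o /tmp/core 12345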

IBM Java/Semeru/OpenJ9 commonly use non-destructive core dumps as diagnostics (confusingly, these artifacts are called "System dumps" even though they are process dumps). A non-destructive core dump is requested on the first OutOfMemoryError, and non-destructive core dumps may be requested on various Java events, method entry/exit, manually dumping memory for system sizing, and so on.

Performance implications of non-destructive core dumps

Unlike diagnostics such as thread dumps which are generally very lightweight in the range of 10s or 100s of milliseconds, non-destructive core dumps may have a significant performance impact in the range of dozens of seconds during which the process is completely frozen. In general, this duration is proportional to the virtual size of the process, the speed of CPUs and RAM (to read all of the virtual memory), the speed of the disk where the core dump is written, and the free RAM at the time of the core dump (since some operating systems will write the core dump to RAM and then asynchronously flush that to disk).

These performance implications generally don't matter for destructive core dumps because the process does not live after the core dump is produced, and the core dump is often needed to find the cause of the crash.

Core dumps and ulimits

After reviewing the security and disk implications of core dumps, the way to ensure core dumps are produced and not truncated starts by setting core (a.k.a. "core file size", ulimit -c, core, or RLIMIT_CORE) and file size (a.k.a. "maximum filesize", ulimit -f, fsize, or RLIMIT_FSIZE) ulimits to unlimited.

However, such configuration may not be sufficient. Further operating-system specific changes may be needed, such as configuring where and how the kernel writes core dumps; a Linux example is sketched below.
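As an illustrative Linux sketch (the pattern and directory are arbitrary, and this applies to systems not piping core dumps to a handler such as systemd-coredump), the kernel core pattern controls where and how core dumps are written:

# Write core dumps to a dedicated diagnostics filesystem with descriptive names
# (%e = executable name, %p = PID, %t = timestamp)
$ sysctl -w kernel.core_pattern=/var/diag/cores/core.%e.%p.%t
# Persist across reboots in a sysctl configuration file (hypothetical file name)
$ echo 'kernel.core_pattern=/var/diag/cores/core.%e.%p.%t' > /etc/sysctl.d/99-coredumps.conf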

Ulimit Summary

Based on the above sections on ulimits and process core dumps, a summary of a common starting point for customized ulimits may be something like the following (and generally best applied through global, operating system specific configuration instead of such explicit ulimit commands):

ulimit -c unlimited
ulimit -f unlimited
ulimit -u 131072
ulimit -n 1048576
ulimit -d unlimited

If using the unlimited philosophy:

ulimit -c unlimited
ulimit -f unlimited
ulimit -u unlimited
ulimit -n unlimited
ulimit -d unlimited

Further configuration may be needed, such as for process core dumps.

SSH Keys

As environments continue to grow, automation becomes more important. On POSIX operating systems, SSH keys may be used to automate running commands, gathering logs, etc. A 30 minute investment to configure SSH keys will save countless hours and mistakes.

Step #1: Generate an "orchestrator" SSH key

  1. Choose one of the machines that will be the orchestrator (or a Linux, Mac, or Windows cygwin machine)
  2. Ensure the SSH key directory exists:
    $ cd ~/.ssh/
    If this directory does not exist:
    $ mkdir ~/.ssh && chmod 700 ~/.ssh && cd ~/.ssh/
  3. Generate an SSH key:
    $ ssh-keygen -t rsa -b 4096 -f ~/.ssh/orchestrator

Step #2: Distribute "orchestrator" SSH key to all machines

If using Linux:

  1. Run the following command for each machine:
    $ ssh-copy-id -i ~/.ssh/orchestrator user@host

For other POSIX operating systems

  1. Log in to each machine as a user that has access to all logs (e.g. root):

    $ ssh user@host
  2. Ensure the SSH key directory exists:

    $ cd ~/.ssh/

    If this directory does not exist:

    $ mkdir ~/.ssh && chmod 700 ~/.ssh && cd ~/.ssh/
  3. If the file ~/.ssh/authorized_keys does not exist:

    $ touch ~/.ssh/authorized_keys && chmod 700 ~/.ssh/authorized_keys
  4. Append the public key from ~/.ssh/orchestrator.pub above to the authorized_keys file:

    $ cat >> ~/.ssh/authorized_keys
    Paste your clipboard and press ENTER
    Ctrl+D to save

Step #3: Now you are ready to automate things

Go back to the orchestrator machine and test the key:

  1. Log into orchestrator machine and try to run a simple command on another machine:
    $ ssh -i ~/.ssh/orchestrator root@machine2 "hostname"
  2. If your SSH key has a password, then you'll want to use ssh-agent so that it's cached for some time:
    $ ssh-add ~/.ssh/orchestrator
  3. If this gives an error, ssh-agent may not be running; start it in the current shell and re-run ssh-add:
    $ eval "$(ssh-agent)"
    $ ssh-add ~/.ssh/orchestrator
  4. Now try the command again and it should give you a result without password:
    $ ssh -i ~/.ssh/orchestrator root@machine2 "hostname"

Now we can create scripts on the orchestrator machine to stop servers, clear logs, start servers, start mustgathers, gather logs, etc.

Example Scripts

In all of the example scripts below, we iterate over a list of hosts and execute commands on each of them. Remember that if the orchestrator machine is also one of these hosts, it should be included in the list (it will connect to "itself"). You will need to modify these scripts to match your environment.

Example Script to Stop Servers

#!/bin/sh
USER=root
# Stop the IBM HTTP Server instances and any running tcpdump captures
for i in ihs1hostname ihs2hostname; do
  ssh -i ~/.ssh/orchestrator $USER@$i "/opt/IBM/HTTPServer/bin/apachectl -k stop"
  # Single quotes so that pgrep runs on the remote host, not on the orchestrator
  ssh -i ~/.ssh/orchestrator $USER@$i 'kill -INT $(pgrep tcpdump)'
done
# Stop the Liberty servers and any running tcpdump captures
for i in wl1hostname wl2hostname; do
  ssh -i ~/.ssh/orchestrator $USER@$i "/opt/liberty/bin/server stop ProdSrv01"
  ssh -i ~/.ssh/orchestrator $USER@$i 'kill -INT $(pgrep tcpdump)'
done

Example Script to Clear Logs

#!/bin/sh
USER=root
# Clear IHS logs and start a rolling tcpdump capture on each web server
for i in ihs1hostname ihs2hostname; do
  ssh -i ~/.ssh/orchestrator $USER@$i "rm -rf /opt/IBM/HTTPServer/logs/*"
  ssh -i ~/.ssh/orchestrator $USER@$i "rm -rf /opt/IBM/HTTPServer/Plugin/webserver1/logs/*"
  # Single quotes so that date runs on the remote host; redirect output so ssh returns immediately
  ssh -i ~/.ssh/orchestrator $USER@$i 'nohup tcpdump -nn -v -i any -C 100 -W 10 -Z root -w /tmp/capture$(date +"%Y%m%d_%H%M").pcap > /dev/null 2>&1 &'
done
# Clear Liberty logs and start a rolling tcpdump capture on each application server
for i in wl1hostname wl2hostname; do
  ssh -i ~/.ssh/orchestrator $USER@$i "rm -rf /opt/liberty/usr/servers/*/logs/*"
  ssh -i ~/.ssh/orchestrator $USER@$i 'nohup tcpdump -nn -v -i any -C 100 -W 10 -Z root -w /tmp/capture$(date +"%Y%m%d_%H%M").pcap > /dev/null 2>&1 &'
done

Example Script to Execute perfmustgather

#!/bin/sh
USER=root
# Run perfMustGather on each Liberty host against the running server PID
for i in wl1hostname wl2hostname; do
  # Single quotes so that the PID file is read on the remote host; redirect output so ssh returns immediately
  ssh -i ~/.ssh/orchestrator $USER@$i 'nohup /opt/perfMustGather.sh --outputDir /tmp/ --iters 6 $(cat /opt/liberty/usr/servers/.pid/*.pid) > /dev/null 2>&1 &'
done

Example Script to Gather Logs

#!/bin/sh
USER=root
LOGS=logs`date +"%Y%m%d_%H%M"`
mkdir $LOGS
# Gather IHS logs, configuration, and packet captures
for i in ihs1hostname ihs2hostname; do
  mkdir -p $LOGS/ihs/$i/
  scp -r -i ~/.ssh/orchestrator $USER@$i:/opt/IBM/HTTPServer/logs/* $LOGS/ihs/$i/
  scp -r -i ~/.ssh/orchestrator $USER@$i:/opt/IBM/HTTPServer/conf/httpd.conf $LOGS/ihs/$i/
  scp -r -i ~/.ssh/orchestrator $USER@$i:/opt/IBM/HTTPServer/Plugin/config/*/plugin-cfg.xml $LOGS/ihs/$i/
  scp -r -i ~/.ssh/orchestrator $USER@$i:/opt/IBM/HTTPServer/Plugin/webserver1/logs/* $LOGS/ihs/$i/
  scp -r -i ~/.ssh/orchestrator $USER@$i:/tmp/capture*.pcap* $LOGS/ihs/$i/
done
# Gather Liberty logs, configuration, packet captures, and mustgather results
for i in wl1hostname wl2hostname; do
  mkdir -p $LOGS/liberty/$i/
  scp -r -i ~/.ssh/orchestrator $USER@$i:/opt/liberty/usr/servers/*/logs/ $LOGS/liberty/$i/
  scp -r -i ~/.ssh/orchestrator $USER@$i:/opt/liberty/usr/servers/*/server.xml $LOGS/liberty/$i/
  scp -r -i ~/.ssh/orchestrator $USER@$i:/tmp/capture*.pcap* $LOGS/liberty/$i/
  scp -r -i ~/.ssh/orchestrator $USER@$i:/tmp/mustgather_RESULTS.tar.gz $LOGS/liberty/$i/
done
tar czvf $LOGS.tar.gz $LOGS