MustGather: Hangs

This document describes common problems that lead to hang-like conditions and gives instructions for running the executable collector on Unix platforms. The GatherHangDoc collector should only be run on a system actively experiencing high CPU.

  1. Review known issues

  2. Run the collector

For Windows, see the Windows MustGather.

Known issues to check for first

Review the issues below before running the mustgather or contacting support. New/contemporary issues are added to the top.

Hangs at startup on AMD EPYC processors

See https://www.ibm.com/support/pages/node/7176444.

IHS stops accepting connections with MaxRequestsPerChild non-zero

Prior to 9.0.5.11, IHS 9.0 on Linux or z/OS with a non-zero MaxRequestsPerChild can stop accepting new connections after some time. The normal set of processes will not be visible with ps -ef, only the main parent process and 1 or 2 utility processes. See APAR PH41945 here: https://www.ibm.com/support/pages/node/6539882.
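
As a quick way to see whether a configuration is exposed to this issue, the sketch below lists active MaxRequestsPerChild directives with a non-zero value. The function name and the conf/httpd.conf location are assumptions for illustration, and files pulled in through Include directives are not followed.

```shell
#!/bin/sh
# Print active (uncommented) MaxRequestsPerChild directives whose value is not 0.
# $1: path to an httpd.conf-style file. Included files are NOT followed.
nonzero_max_requests() {
    awk 'tolower($1) == "maxrequestsperchild" && $2 + 0 != 0 {
             print FILENAME ":" FNR ": " $0
         }' "$1"
}

# Example (hypothetical install path):
#   nonzero_max_requests /usr/HTTPServer/conf/httpd.conf
#   ps -ef | grep '[h]ttpd'    # compare the process list with a healthy server
```

No output from the function means no active non-zero setting was found in that one file.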

Hangs and delays on z/OS

If, based on other observations, IHS takes an unreasonable amount of time to read requests or send responses, ensure that none of IHS, WAS, or the client is running in a WLM service class that uses DISCRETIONARY as its goal.

All Platforms; WebSphere Application Server MustGathers

In many cases, the ServerDoc tool will simply show that the WebSphere Plugin is waiting for a response from a backend Application Server that is not responding in a timely fashion. For this reason, we encourage customers to proactively gather the WebSphere Application Server MustGather for Hangs and submit it along with the IBM HTTP Server MustGather.

For WebSphere Application Server MustGathers for your platform, see here

Startup hangs with mod_proxy_balancer

mod_proxy_balancer requires 2 bytes of random data during start/restart which will result in a blocking read of /dev/random. On some virtualized systems, or systems where some other process is exhausting /dev/random, startup may hang until enough entropy can be gathered by the system.
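
On Linux, one way to check for this condition is to read the kernel's entropy estimate while startup is stalled. The sketch below assumes the Linux procfs path shown; the function name is hypothetical, and other platforms expose this differently or not at all.

```shell
#!/bin/sh
# Print the kernel's current entropy estimate (Linux only).
# $1: optional override of the procfs path, mainly useful for testing.
entropy_avail() {
    cat "${1:-/proc/sys/kernel/random/entropy_avail}"
}

# To see which processes currently hold /dev/random open (requires lsof, run as root):
#   lsof /dev/random
```

A value that stays very low while the server is starting would be consistent with the blocking read described above.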

Solaris

For web server hangs with IHS 2.0 on Solaris, please see this document first.

AIX 5.3

For web server hangs with any release of IHS on AIX 5.3:

  • AIX APAR IY58143 is required. If that APAR is not currently installed, you must install that APAR and reproduce the problem (if possible) before sending hang documentation to IBM HTTP Server support. You can check for the installation of this APAR in the following manner:

$ /usr/sbin/instfix -vik IY58143
IY58143 Abstract: Required fixes for AIX 5.3

    Fileset X11.Dt.lib:5.3.0.1 is applied on the system.
    Fileset X11.Dt.rte:5.3.0.1 is applied on the system.
    Fileset X11.base.rte:5.3.0.1 is applied on the system.
    Fileset X11.fnt.ucs.ttf is not applied on the system.
    Fileset X11.fnt.ucs.ttf_extb:5.3.0.1 is applied on the system.
    *...*
    Fileset devices.vdevice.hvterm1.rte:5.3.0.1 is applied on the system.
    Fileset devices.vtdev.scsi.rte:5.3.0.1 is applied on the system.
    Fileset sysmgt.websm.apps:5.3.0.1 is applied on the system.
    Fileset sysmgt.websm.framework:5.3.0.1 is applied on the system.
    Fileset sysmgt.websm.rte:5.3.0.1 is applied on the system.
    Fileset sysmgt.websm.webaccess:5.3.0.1 is applied on the system.
    All filesets for IY58143 were found.
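
If this check needs to be scripted, the summary line at the end of the output above can be tested directly. This is a minimal sketch that relies only on the "All filesets for ... were found." line shown in the sample; the function name is hypothetical.

```shell
#!/bin/sh
# Succeed (exit 0) if instfix output on stdin reports that every fileset
# for the given APAR was found. $1: APAR number.
apar_complete() {
    grep -q "^[[:space:]]*All filesets for $1 were found"
}

# Example (AIX):
#   /usr/sbin/instfix -vik IY58143 | apar_complete IY58143 && echo "IY58143 is installed"
```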

What we expect to learn from this information

Common web server hang conditions can be categorized as follows:

  • IHS is not responding because all IHS threads are waiting on an external application to respond

  • IHS is not responding because of a problem preventing it from processing new client connections

As discussed in the following sections, the root cause of the problem may not reside in IHS, so analysis of the IHS hang documentation may indicate that a different type of information is necessary.

IHS is waiting on an external application

A primary use of IHS is as a front-end to the WebSphere Application Server. It is possible for applications running in WebSphere to have delayed response, or no response at all, so that all IHS threads are waiting for an application server response and no free IHS threads are available to handle new client connections.

Some authentication mechanisms for IHS, such as LDAP authentication capability provided with IHS or by a third party vendor, must contact a server over the network as part of IHS request processing. If that communication stalls, it is possible that after some time all IHS threads are waiting on an authentication response and no free IHS threads are available to handle new client connections.

In any situation where all IHS threads are waiting on an external application, the IHS hang documentation will show which component is waiting but it cannot determine the root cause for why the application is not responding.

Note: If the IHS hang documentation shows that IHS is waiting for a WebSphere response, related documentation for WebSphere will need to be gathered. Instructions for this WebSphere documentation can be found here. It is possible to collect this documentation at the same time as the IHS hang documentation, so that the required WebSphere information is available to IBM support if it is needed.

Vendors of third-party components which run inside IHS may provide similar information for gathering documentation on problems that can cause the component to hang or stall; contact the vendor for more information.

IHS has a problem with processing new client connections

For this type of problem, IBM support anticipates being able to determine the failing component, as well as whether or not this is a known problem. Occasionally there are operating system issues which prevent IHS from finding out about new client connections. If analysis of the IHS hang documentation shows such a problem, network traces may be necessary and operating system support may suggest further diagnostic information.

making sure required support programs are available

Please refer to these instructions for verifying that required support programs are installed.
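
A generic pre-check can confirm that the programs named in those instructions are on the PATH before a hang occurs. The sketch below does not reproduce the per-platform list; the caller supplies the tool names, and the function name is hypothetical.

```shell
#!/bin/sh
# Print each named program that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example (tool names depend on the platform; see the linked instructions):
#   missing_tools java strace
```

No output means every named program was found.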

Running the tool

You will need to download the collector.

Warning

This tool uses native tools such as strace and truss to obtain system call traces, which include the contents of buffers used to read and write data from the network.

Note: This executable mustgather is not used on Windows nor on z/OS.

  • On Windows, refer to win32_hang_doc.html

  • On z/OS, system dumps should be collected that include the httpd address spaces. Previously, the executable mustgather was provided on z/OS but it depends on the dbx debugger which conflicts with the very common SAFRunAs directive.

Run the tool as root to avoid any permissions problems with obtaining backtraces or reading files, such as log files and configuration files. (More information about the requirement to run this tool as root is available here.)

ServerDoc takes four parameters for gathering hang documentation:

  • GatherHangDoc

  • the name of the IHS installation directory (e.g., /usr/HTTPServer)

  • the web server parent process id, or "auto" if the parent process has exited and left stranded child processes or if CGI processes are stalling

  • the address of a non-SSL port handled by the web server (e.g., 127.0.0.1:80), or "-" if there is no non-SSL port

# java -jar ServerDoc.jar GatherHangDoc /path/to/IHS 1398 127.0.0.1:80

The tool creates a new directory which contains a timestamp in the name, and the hang documentation will be saved in that directory.

determining the value of the non-SSL address parameter

If the IHS installation only supports SSL, then use - (hyphen) for this parameter. Otherwise, specify an IP address and port which can be used to reach the server from the local machine without using SSL.

  • If the server has no non-SSL listening ports, use -

  • If the server has a typical Listen 80 or Listen 0.0.0.0:80, use 127.0.0.1:80

  • If the server listens on a particular interface and port like Listen 192.168.1.15:81, use that particular interface and port verbatim, like 192.168.1.15:81
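
The three rules above can be sketched as a small helper that turns the argument of a non-SSL Listen directive into the address parameter. The function name is hypothetical; the helper cannot tell which Listen directives are SSL-enabled, so the caller must choose a non-SSL one, and IPv6 forms are not handled.

```shell
#!/bin/sh
# Map the argument of a non-SSL Listen directive to the ServerDoc address parameter.
# $1: Listen argument (e.g. "80", "0.0.0.0:80", "192.168.1.15:81"), or empty if none.
nonssl_addr() {
    case "${1:-}" in
        "")         echo "-" ;;                          # no non-SSL port at all
        0.0.0.0:*)  echo "127.0.0.1:${1#0.0.0.0:}" ;;    # wildcard address with port
        *:*)        echo "$1" ;;                         # specific interface: use verbatim
        *)          echo "127.0.0.1:$1" ;;               # bare port number
    esac
}

# Example:
#   nonssl_addr 192.168.1.15:81
```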

a sample run

For this example, IHS is installed in /scratch/IHS, the parent process id is stored in file /scratch/IHS/logs/httpd.pid, the non-SSL port can be reached from the web server machine on address 127.0.0.1:8080, and ihsdiag was unpacked into directory /root/ihsdiag-1.3.0.

# cd /tmp
# java -jar /root/ihsdiag-1.3.0/ServerDoc.jar GatherHangDoc \
/scratch/IHS `cat /scratch/IHS/logs/httpd.pid` 127.0.0.1:8080
Gathering doc on 4 web server processes...
5985  5986  5988  5984

Seconds remaining before gathering information again:
60...54...48...42...36...30...24...18...12...6...

Gathering doc on 4 web server processes...
5985  5986  5988  5984

Seconds remaining before gathering information again:
30...27...24...21...18...15...12...9...6...3...

Gathering doc on 4 web server processes...
5985  5986  5988  5984

Reports, log files, and configuration files have been saved to
directory
  HangDoc.200408310607
If you have additional log files or configuration files, copy them
there before packing up the directory.
Web server log and conf files other than the default will have to be
copied manually.  WebSphere plug-in conf and log files will have to be copied manually.

Hint for packing up the directory:
  tar -cf HangDoc.200408310607.tar HangDoc.200408310607
  gzip HangDoc.200408310607.tar
# ls -l HangDoc.200408310607/
total 772
-rw-rw-r--    1 trawick  trawick         0 Aug 31 06:07 access_log
-rw-rw-r--    1 trawick  trawick      5358 Aug 31 06:07 apachectl
-rw-rw-r--    1 trawick  trawick       118 Aug 31 06:07 error_log
-rw-rw-r--    1 trawick  trawick    462978 Aug 31 06:07 httpd
-rw-rw-r--    1 trawick  trawick     28790 Aug 31 06:07 httpd.conf
-rw-rw-r--    1 trawick  trawick    255056 Aug 31 06:08 log
-rw-rw-r--    1 trawick  trawick        56 Aug 31 06:07 redhat-release
-rw-rw-r--    1 trawick  trawick      5453 Aug 31 06:08 report

what if the HangDoc tool is taking a very long time?

If you need to interrupt the tool so the web server can be restarted (to try to resolve the hang condition), the best place to interrupt it is when it is counting down the number of seconds until it checks the web server state again. The last lines of output on the display will look like this:

Seconds remaining before gathering information again:
60...54...48...42...36...30.

If the tool is interrupted at a different time, incomplete information will be gathered on the state of the web server. This will introduce some risk into our analysis of the problem, but as long as a meaningful percentage of the web server processes have been examined (>30%), it is usually possible to find a probable cause of the hang.

  • the display is not being updated after several minutes

If the IHS child processes have a very large number of threads (e.g., ThreadsPerChild is higher than 200), the likely cause is that the system debugger is slow to analyze such processes.

It is also possible that the HangDoc tool has a problem interacting with the system debugger, and it will never finish.

To find out more information about the cause of the delay, take these steps:

  • Make sure you've waited at least four minutes from the time that the display was last updated.

  • From another terminal window, save the output of ps -ef to a file. This must be done before interrupting the HangDoc tool.

  • Interrupt the HangDoc tool and find the most recent HangDoc.xxxx directory, which is what it was using when it stalled.

  • Cut and paste the HangDoc display to a file.

  • Send in the ps listing, the HangDoc display, and the HangDoc.xxxx directory to IHS support for analysis.
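
The steps above can be sketched as follows. The helper name and the /tmp/ps-ef.txt output file are assumptions for illustration; the helper simply picks the most recently modified HangDoc.* directory.

```shell
#!/bin/sh
# Print the most recently modified HangDoc.* directory under $1
# (default: the current directory). Prints nothing if none exists.
latest_hangdoc() {
    ls -dt "${1:-.}"/HangDoc.* 2>/dev/null | head -n 1
}

# From a second terminal, BEFORE interrupting the tool:
#   ps -ef > /tmp/ps-ef.txt
# After interrupting it, in the directory where the tool was started:
#   latest_hangdoc
```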

copying other web server and plug-in files

The next step is to copy any other web server or plug-in configuration files and logs into the new HangDoc directory. Here is a list of files to copy if they are being used:

  • any IHS configuration file other than httpd.conf in the IHS install directory

  • any additional web server error or access log files, such as log files specific to each virtual host or log files created by rotatelogs

  • the WebSphere plug-in configuration file

  • the WebSphere plug-in log file

  • any platform-specific WebSphere Application Server MustGathers that were proactively collected.
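
Copying the files listed above can be sketched with a small helper. The function name and every path in the example are hypothetical; actual locations depend on the installation. Files that share a base name (for example, several error_log files) would overwrite each other, so rename those before copying.

```shell
#!/bin/sh
# Copy extra files into a HangDoc directory, preserving timestamps.
# $1: HangDoc directory; remaining arguments: files to copy.
copy_to_hangdoc() {
    dest=$1
    shift
    for f in "$@"; do
        if [ -f "$f" ]; then
            cp -p "$f" "$dest/"
        else
            echo "skipping (not found): $f" >&2
        fi
    done
}

# Example (hypothetical paths):
#   copy_to_hangdoc HangDoc.200408310607 \
#       /scratch/IHS/conf/vhosts.conf /scratch/plugin/plugin-cfg.xml /scratch/plugin/http_plugin.log
```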

saving the documentation directory

The last step is to pack up and compress the documentation directory using zip, tar followed by gzip, or pax followed by compress. The easiest way is to cut and paste the messages displayed by ServerDoc previously which showed the commands to use. The suggested commands will vary by platform. On z/OS, for example, pax and compress will be suggested instead of tar and gzip.

Don't forget to collect the corresponding WebSphere Application Server MustGathers and include them in your submission.

  • a sample run

 tar -cf HangDoc.200408310607.tar HangDoc.200408310607
 gzip HangDoc.200408310607.tar

The resulting compressed file is the file to send to IBM support.

understanding the root requirement

When gathering information on web server hangs, the tool must attach to live web server processes to obtain information about the state of those processes.

If the web server is started as root, then at least one of these processes will be owned by root and other processes will be owned by the web server user id (e.g., nobody or www). Only root has the authority to attach to all of the processes, so the tool itself must be run as root. If the web server administrator does not have authority to log in or switch user to root, a simple script can be created to gather the hang documentation, and the system administrator can give the web server administrator sudo access to that script. sudo is a third-party tool available without cost for all appropriate platforms.
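
A minimal sketch of such a script follows. The script name, the function name, the sudoers entry, and the JAVA override (included only so the function can be exercised without a JVM) are all assumptions for illustration. It also assumes the parent process id is stored in logs/httpd.pid under the install directory, as in the sample run; adjust this if PidFile points elsewhere.

```shell
#!/bin/sh
# gather-hangdoc.sh (hypothetical name): wrapper intended to be run through sudo.

# Run ServerDoc against one IHS install.
# $1: IHS install directory, $2: path to ServerDoc.jar, $3: non-SSL address or "-"
run_hangdoc() {
    pid=$(cat "$1/logs/httpd.pid") || return 1
    "${JAVA:-java}" -jar "$2" GatherHangDoc "$1" "$pid" "$3"
}

# In the real script, call the function with fixed values so that the sudo
# user cannot pass arbitrary arguments, for example:
#   cd /tmp && run_hangdoc /usr/HTTPServer /root/ihsdiag-1.3.0/ServerDoc.jar 127.0.0.1:80
#
# Hypothetical sudoers entry (edit with visudo):
#   webadmin ALL=(root) /usr/local/sbin/gather-hangdoc.sh
```

Hard-coding the arguments keeps the sudo grant limited to this one task.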

If the web server is not started as root, there are no such concerns, and the hang documentation tool may be run by the user id which starts the web server.

If the tool is run as non-root and it is unable to gather the required information, the problem will have to be recreated. It may not be possible to determine if this problem occurred until the documentation has been analyzed by IBM HTTP Server support.

interpreting the report output

  • Why is gsk_get_last_validation_error in the call stack in unexpected places?

Sometimes a series of IBM Global Security Kit (GSKit) internal functions show up as gsk_get_last_validation_error in the backtraces, but this is not a cause for concern. Usually the lowest call in the stack is a properly displayed GSKit function (such as gsk_secure_soc_init), and higher in the stack will be the IHS or WebSphere Plugin I/O callbacks (secure_read or plugin_ssl_read).

  • Why do some processes have a number of threads less than ThreadsPerChild?

This is normal for some utility processes created by IHS, such as the IHS parent process, sidd, or the CGI daemon. These processes will be decorated as such in the report.

If one of the threads is decorated as the "IHS main thread waiting for process to exit...", then the remaining threads are likely to be hung or stuck in a loop (causing high CPU). We'd typically look for a blocking system call (read, poll, select, mutex/lock related) in the first few frames for a hung thread, then look to identify the module owning the hung code by looking farther down in the stack to see where control was handed off from the core of Apache.