MustGather: Hangs¶
This document describes common problems that lead to hang-like conditions, as well as instructions for running the executable collector on Unix platforms.
The GatherHangDoc collector should only be run on a system actively experiencing high CPU.
For Windows, see the Windows MustGather.
Known issues to check for first¶
Review the issues below before running the mustgather or contacting support. New/contemporary issues are added to the top.
IHS stops accepting connections with MaxRequestsPerChild non-zero¶
Prior to 9.0.5.11, IHS 9.0 on Linux or z/OS with a non-zero MaxRequestsPerChild can stop accepting new connections after some time. The normal set of processes will not be visible with ps -ef, only the main parent process and 1 or 2 more utility processes. See APAR PH41945 here: https://www.ibm.com/support/pages/node/6539882.
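As a rough way to check for this symptom, the number of web server processes can be counted from a process listing. The sketch below assumes the IHS processes show up with `httpd` in their command line; the function name `count_httpd_procs` is made up for illustration.

```shell
# Count lines of `ps -ef`-style input that mention httpd.
# Assumption: IHS processes appear with "httpd" in their command line.
count_httpd_procs() {
    # The [h] bracket keeps a grep process running in the same pipeline
    # from matching its own command line in the listing.
    # `|| true` keeps the exit status zero when the count is 0.
    grep -c '[h]ttpd' || true
}

# Usage: ps -ef | count_httpd_procs
```

A count of only two or three on a server that normally runs more child processes would be consistent with the condition described above, but it is not proof of it.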
Hangs and delays on z/OS¶
If IHS takes an unreasonable amount of time to read requests or send responses based on other observations, ensure neither IHS nor WAS nor the client are running in a WLM service class that uses DISCRETIONARY as a goal.
High CPU with mod_ldap and virtualhosts¶
mod_ldap prior to PI94050 could cause loops/hangs/high cpu when used with virtual hosts.
Hangs, delays, or frontend timeouts with IHS on Linux/PPC or AIX¶
z/OS Only: server stops accepting new connections under light load¶
If the server stops accepting new connections on z/OS, make sure LE APAR PM90528 is present.
All Platforms; WebSphere Application Server MustGathers ¶
In many cases, the ServerDoc tool will simply show that the WebSphere Plugin is waiting for a response from a backend Application Server that is not responding in a timely fashion. For this reason, we encourage customers to proactively gather the WebSphere Application Server MustGather for Hangs and submit it along with the IBM HTTP Server MustGather.
For WebSphere Application Server MustGathers for your platform, see here
Startup hangs with mod_proxy_balancer¶
mod_proxy_balancer requires 2 bytes of random data during start/restart, which will result in a blocking read of /dev/random.
On some virtualized systems, or systems where some other process is exhausting /dev/random, startup may hang until enough entropy can be gathered by the system.
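On Linux, the kernel's entropy estimate can be read from `/proc/sys/kernel/random/entropy_avail`. The sketch below is only a quick check under that assumption; the function name `entropy_avail` is hypothetical, the path is Linux-specific, and how informative the value is varies by kernel version.

```shell
# Print the kernel's entropy estimate, or "unknown" if it cannot be read.
# Assumption: Linux-style /proc layout; the path can be overridden via $1.
entropy_avail() {
    f="${1:-/proc/sys/kernel/random/entropy_avail}"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown"
    fi
}

# Usage: entropy_avail
```

A very low value while IHS is stuck in startup would point toward the blocking read described above.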
Solaris¶
For web server hangs with IHS 2.0 on Solaris, please see this document first.
AIX 5.3¶
For web server hangs with any release of IHS on AIX 5.3:
AIX APAR IY58143 is required. If that APAR is not currently installed, you must install that APAR and reproduce the problem (if possible) before sending hang documentation to IBM HTTP Server support. You can check for the installation of this APAR in the following manner:
$ /usr/sbin/instfix -vik IY58143
IY58143 Abstract: Required fixes for AIX 5.3
Fileset X11.Dt.lib:5.3.0.1 is applied on the system.
Fileset X11.Dt.rte:5.3.0.1 is applied on the system.
Fileset X11.base.rte:5.3.0.1 is applied on the system.
Fileset X11.fnt.ucs.ttf is not applied on the system.
Fileset X11.fnt.ucs.ttf_extb:5.3.0.1 is applied on the system.
*...*
Fileset devices.vdevice.hvterm1.rte:5.3.0.1 is applied on the system.
Fileset devices.vtdev.scsi.rte:5.3.0.1 is applied on the system.
Fileset sysmgt.websm.apps:5.3.0.1 is applied on the system.
Fileset sysmgt.websm.framework:5.3.0.1 is applied on the system.
Fileset sysmgt.websm.rte:5.3.0.1 is applied on the system.
Fileset sysmgt.websm.webaccess:5.3.0.1 is applied on the system.
All filesets for IY58143 were found.
What we expect to learn from this information¶
Common web server hang conditions can be categorized as follows:
IHS is not responding because all IHS threads are waiting on an external application to respond
IHS is not responding because of a problem preventing it from processing new client connections
As discussed in the following sections, the root cause of the problem may not reside in IHS, so analysis of the IHS hang documentation may indicate that a different type of information is necessary.
IHS is waiting on an external application¶
A primary use of IHS is as a front-end to the WebSphere Application Server. It is possible for applications running in WebSphere to have a delayed response, or no response at all, so that all IHS threads are waiting for an application server response and no free IHS threads are available to handle new client connections.
Some authentication mechanisms for IHS, such as LDAP authentication capability provided with IHS or by a third party vendor, must contact a server over the network as part of IHS request processing. If that communication stalls, it is possible that after some time all IHS threads are waiting on an authentication response and no free IHS threads are available to handle new client connections.
In any situation where all IHS threads are waiting on an external application, the IHS hang documentation will show which component is waiting but it cannot determine the root cause for why the application is not responding.
Note: If the IHS hang documentation shows that IHS is waiting for a WebSphere response, related documentation for WebSphere will need to be gathered. Instructions for this WebSphere documentation can be found here. It is possible to collect this documentation at the same time as the IHS hang documentation, so that the required WebSphere information is available to IBM support if it is necessary.
Vendors of third-party components which run inside IHS may provide similar information for gathering documentation on problems that can cause the component to hang or stall; contact the vendor for more information.
IHS has a problem with processing new client connections¶
For this type of problem, IBM support anticipates being able to determine the failing component, as well as whether or not this is a known problem. Occasionally there are operating system issues which prevent IHS from finding out about new client connections. If analysis of the IHS hang documentation shows such a problem, network traces may be necessary and operating system support may suggest further diagnostic information.
making sure required support programs are available¶
Please refer to these instructions for verifying that required support programs are installed.
Running the tool¶
You will need to download the collector
Warning¶
This tool uses native tools such as strace and truss to obtain system call traces, which include the contents of buffers used to read and write data from the network.
Note: This executable mustgather is not used on Windows nor on z/OS.
On Windows, refer to win32_hang_doc.html
On z/OS, system dumps should be collected that include the httpd address spaces. Previously, the executable mustgather was provided on z/OS but it depends on the dbx debugger which conflicts with the very common SAFRunAs directive.
Run the tool as root to avoid any permissions problems with obtaining backtraces or reading files, such as log files and configuration files. (More information about the requirement to run this tool as root is available here.)
ServerDoc takes four parameters for gathering hang documentation:
GatherHangDoc
the name of the IHS installation directory (e.g., /usr/HTTPServer)
the web server parent process id, or "auto" if the parent process has exited and left stranded child processes or if CGI processes are stalling
the address of a non-SSL port handled by the web server (e.g., 127.0.0.1:80), or "-" if there is no non-SSL port
# java -jar ServerDoc.jar GatherHangDoc /path/to/IHS 1398 127.0.0.1:80
The tool creates a new directory which contains a timestamp in the name, and the hang documentation will be saved in that directory.
determining the value of the non-SSL address parameter¶
If the IHS installation only supports SSL, then use - (hyphen) for this parameter. Otherwise, specify an IP address and port which can be used to reach the server from the local machine without using SSL.
If the server has no non-SSL listening ports, use -
If the server has a typical Listen 80 or Listen 0.0.0.0:80, use 127.0.0.1:80
If the server listens on a particular interface and port like Listen 192.168.1.15:81, use that particular interface and port verbatim, like 192.168.1.15:81
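The three cases above can be sketched as a small shell function. The name `listen_to_addr` is hypothetical and not part of ServerDoc; it takes the argument of a non-SSL Listen directive (or an empty string if there is none) and prints the value to pass as the address parameter. Only the forms listed above are handled; IPv6 or other Listen forms would need extra cases.

```shell
# Map the argument of a non-SSL Listen directive to the address parameter.
# Hypothetical helper; covers only the three cases described in the text.
listen_to_addr() {
    case "$1" in
        "")         echo "-" ;;                        # no non-SSL port
        0.0.0.0:*)  echo "127.0.0.1:${1#0.0.0.0:}" ;;  # all interfaces
        *:*)        echo "$1" ;;                       # specific interface
        *)          echo "127.0.0.1:$1" ;;             # bare port number
    esac
}

# Usage: listen_to_addr 192.168.1.15:81
```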
a sample run
For this example, IHS is installed in /scratch/IHS, the parent process id is stored in file /scratch/IHS/logs/httpd.pid, the non-SSL port can be reached from the web server machine on address 127.0.0.1:8080, and ihsdiag was unpacked into directory /root/ihsdiag-1.3.0.
# cd /tmp
# java -jar /root/ihsdiag-1.3.0/ServerDoc.jar GatherHangDoc \
/scratch/IHS `cat /scratch/IHS/logs/httpd.pid` 127.0.0.1:8080
Gathering doc on 4 web server processes...
5985 5986 5988 5984
Seconds remaining before gathering information again:
60...54...48...42...36...30...24...18...12...6...
Gathering doc on 4 web server processes...
5985 5986 5988 5984
Seconds remaining before gathering information again:
30...27...24...21...18...15...12...9...6...3...
Gathering doc on 4 web server processes...
5985 5986 5988 5984
Reports, log files, and configuration files have been saved to
directory
HangDoc.200408310607
If you have additional log files or configuration files, copy them
there before packing up the directory.
Web server log and conf files other than the default will have to be
copied manually. WebSphere plug-in conf and log files will have to be copied manually.
Hint for packing up the directory:
tar -cf HangDoc.200408310607.tar HangDoc.200408310607
gzip HangDoc.200408310607.tar
# ls -l HangDoc.200408310607/
total 772
-rw-rw-r-- 1 trawick trawick 0 Aug 31 06:07 access_log
-rw-rw-r-- 1 trawick trawick 5358 Aug 31 06:07 apachectl
-rw-rw-r-- 1 trawick trawick 118 Aug 31 06:07 error_log
-rw-rw-r-- 1 trawick trawick 462978 Aug 31 06:07 httpd
-rw-rw-r-- 1 trawick trawick 28790 Aug 31 06:07 httpd.conf
-rw-rw-r-- 1 trawick trawick 255056 Aug 31 06:08 log
-rw-rw-r-- 1 trawick trawick 56 Aug 31 06:07 redhat-release
-rw-rw-r-- 1 trawick trawick 5453 Aug 31 06:08 report
what if the HangDoc tool is taking a very long time?¶
If you need to interrupt the tool so the web server can be restarted (to try to resolve the hang condition), the best place to interrupt it is when it is counting down the number of seconds until it checks the web server state again. The last lines of output on the display will look like this:
Seconds remaining before gathering information again:
60...54...48...42...36...30.
If the tool is interrupted at a different time, incomplete information will be gathered on the state of the web server. This will introduce some risk into our analysis of the problem, but as long as a meaningful percentage of the web server processes have been examined (>30%), it is usually possible to find a probable cause of the hang.
the display is not being updated after several minutes
If the IHS child processes have a very large number of threads (e.g., ThreadsPerChild is higher than 200), the most likely cause is that the system debugger is slow to analyze such processes.
It is also possible that the HangDoc tool has a problem interacting with the system debugger, and it will never finish.
To find out more information about the cause of the delay, take these steps:
Make sure you've waited at least four minutes from the time that the display was last updated.
From another terminal window, save the output of ps -ef to a file. This must be done before interrupting the HangDoc tool.
Interrupt the HangDoc tool and find the most recent HangDoc.xxxx directory, which is what it was using when it stalled.
Cut and paste the HangDoc display to a file.
Send in the ps listing, the HangDoc display, and the HangDoc.xxxx directory to IHS support for analysis.
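The steps above can be sketched with two small helpers, run from a second terminal. Both function names are hypothetical, and the sketch assumes the tool was started in the directory that is passed in (or the current directory).

```shell
# Save a process listing; run this BEFORE interrupting the HangDoc tool.
# $1: output file (hypothetical default name).
save_ps_listing() {
    ps -ef > "${1:-ps_listing.txt}"
}

# Print the name of the most recently modified HangDoc.* directory.
# $1: directory in which the tool was started (defaults to the current one).
latest_hangdoc_dir() {
    # The trailing slash in the glob restricts matches to directories,
    # so packed-up HangDoc.*.tar.gz files are ignored.
    ( cd "${1:-.}" && ls -dt HangDoc.*/ 2>/dev/null | head -n 1 | sed 's:/$::' )
}

# Usage:
#   save_ps_listing /tmp/ps_listing.txt
#   latest_hangdoc_dir /tmp
```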
copying other web server and plug-in files¶
The next step is to copy any other web server or plug-in configuration files and logs into the new HangDoc directory. Here is a list of files to copy if they are being used:
any IHS configuration file other than httpd.conf in the IHS install directory
any additional web server error or access log files, such as log files specific to each virtual host or log files created by rotatelogs
the WebSphere plug-in configuration file
the WebSphere plug-in log file
any platform-specific WebSphere Application Server MustGathers that were proactively collected.
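Copying the files listed above could look like the following sketch. The function name `copy_into_hangdoc` and every path in the usage line are placeholders for illustration; the actual locations depend on the local configuration.

```shell
# Copy each readable file into the HangDoc directory, preserving timestamps.
# $1: destination HangDoc directory; remaining arguments: files to copy.
copy_into_hangdoc() {
    dest="$1"
    shift
    for f in "$@"; do
        if [ -r "$f" ]; then
            cp -p "$f" "$dest"/
        else
            echo "skipped (not readable): $f" >&2
        fi
    done
}

# Usage (placeholder paths):
#   copy_into_hangdoc HangDoc.200408310607 /path/to/plugin-cfg.xml /path/to/http_plugin.log
```

Files with the same base name overwrite each other in the destination, so same-named files from different directories should be renamed or copied into subdirectories first.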
saving the documentation directory¶
The last step is to pack up and compress the documentation directory using zip, tar followed by gzip, or pax followed by compress. The easiest way is to cut and paste the messages displayed by ServerDoc previously which showed the commands to use. The suggested commands will vary by platform. On z/OS, for example, pax and compress will be suggested instead of tar and gzip.
Don't forget to collect the corresponding WebSphere Application Server MustGathers and include them in your submission.
a sample run
tar -cf HangDoc.200408310607.tar HangDoc.200408310607
gzip HangDoc.200408310607.tar
The resulting compressed file is the file to send to IBM support.
understanding the root requirement ¶
When gathering information on web server hangs, the tool must attach to live web server processes to obtain information about the state of those processes.
If the web server is started as root, then at least one of these processes will be owned by root and other processes will be owned by the web server user id (e.g., nobody or www). Only root has the authority to attach to all of the processes, so the tool itself must be run as root. If the web server administrator does not have authority to log in or switch user to root, a simple script can be created to gather the hang documentation, and the system administrator can give the web server administrator sudo access to that script. sudo is a third-party tool available without cost for all appropriate platforms.
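A wrapper script of the kind mentioned above might look like the following sketch. It is not shipped with ihsdiag; the function name, the `DRY_RUN` switch, the default jar location, and the script path and user name in the comments are all assumptions to adapt locally.

```shell
# Hypothetical wrapper around ServerDoc, intended to be run through sudo.
# Assumption: ServerDoc.jar was unpacked to this location.
SERVERDOC_JAR="${SERVERDOC_JAR:-/root/ihsdiag-1.3.0/ServerDoc.jar}"

gather_hangdoc() {
    ihs_root="$1"    # IHS installation directory
    addr="${2:--}"   # non-SSL address, or "-" if there is none
    pidfile="$ihs_root/logs/httpd.pid"
    if [ ! -r "$pidfile" ]; then
        echo "cannot read $pidfile" >&2
        return 1
    fi
    pid=$(cat "$pidfile")
    # When DRY_RUN is set, the command is printed instead of executed.
    ${DRY_RUN:+echo} java -jar "$SERVERDOC_JAR" GatherHangDoc "$ihs_root" "$pid" "$addr"
}

# Usage inside the script, e.g. /usr/local/sbin/gather-ihs-hangdoc (hypothetical):
#   cd /tmp && gather_hangdoc /usr/HTTPServer 127.0.0.1:80
# A sudoers rule for a hypothetical user "webadmin" could then be:
#   webadmin ALL=(root) /usr/local/sbin/gather-ihs-hangdoc
```

Hard-coding the installation directory and address inside the script, rather than accepting them from the caller, keeps the sudo grant limited to this one task.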
If the web server is not started as root, there are no such concerns, and the hang documentation tool may be run by the user id which starts the web server.
If the tool is run as non-root and it is unable to gather the required information, the problem will have to be recreated. It may not be possible to determine if this problem occurred until the documentation has been analyzed by IBM HTTP Server support.
interpreting the report output ¶
Why is gsk_get_last_validation_error in the call stack in unexpected places?
Sometimes a series of IBM Global Security Kit (GSKit) internal functions show up as gsk_get_last_validation_error in the backtraces, but they are not a cause for concern. Usually the lowest call in the stack is a properly displayed GSKit function (such as gsk_secure_soc_init) and higher in the stack will be the IHS or WebServer Plugin I/O callbacks (secure_read or plugin_ssl_read).
Why do some processes have a number of threads less than ThreadsPerChild?
This is normal for some utility processes created by IHS, such as the IHS parent process, sidd, or the CGI daemon. These processes will be decorated as such in the report.
If one of the threads is decorated as the "IHS main thread waiting for process to exit...", then the remaining threads are likely to be hung or stuck in a loop (causing high CPU). We'd typically look for a blocking system call (read, poll, select, mutex/lock related) in the first few frames for a hung thread, then look to identify the module owning the hung code by looking farther down in the stack to see where control was handed off from the core of Apache.